Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:121889
MIME-Version: 1.0
Reply-To: Stefan Schiller <stefan.schiller@sonarsource.com>
Date: Fri, 1 Dec 2023 10:31:29 +0100
Message-ID: <CAMZNyFfiP5y8QAcNBCo235xG30Ra-PgCKcYfsZVuvgwy9Re-8w@mail.gmail.com>
To: internals@lists.php.net
Content-Type: text/plain; charset="UTF-8"
Subject: Inconsistency mbstring functions
From: internals@lists.php.net ("Stefan Schiller via internals")

Hi,

I would like to draw attention to an inconsistency in how mbstring
functions handle invalid multibyte sequences. When, for example,
mb_strpos encounters a UTF-8 leading byte, it parses the continuation
bytes that follow until the full byte sequence has been read. If an
invalid byte is encountered, all bytes read so far are considered one
character, and parsing restarts at the invalid byte. Consider the
following example:

mb_strpos("\xf0\x9fABCD", "B"); // int(2)

The leading byte 0xf0 initiates a 4-byte UTF-8 sequence. The following
byte (0x9f) is a valid continuation byte. The next byte (0x41) is not
a valid continuation byte. Thus, 0xf0 and 0x9f are considered one
character, and 0x41 is regarded as another character. Accordingly, the
resulting index of "B" is 2.
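
The parsing strategy described above can be modelled in plain PHP,
without calling mbstring at all. This is only a rough sketch of the
behavior as described in this message, not the actual mbstring
implementation; the helper names utf8_expected_length and
split_restarting are made up for illustration.

```php
<?php
// Hypothetical helper: total sequence length announced by a leading byte.
function utf8_expected_length(int $byte): int
{
    if (($byte & 0xF8) === 0xF0) { return 4; } // 11110xxx
    if (($byte & 0xF0) === 0xE0) { return 3; } // 1110xxxx
    if (($byte & 0xE0) === 0xC0) { return 2; } // 110xxxxx
    return 1; // ASCII, or a stray byte counted on its own
}

// Splits a byte string into "characters": a sequence ends early as soon
// as a non-continuation byte appears, and parsing restarts at that byte.
function split_restarting(string $bytes): array
{
    $chars = [];
    $i = 0;
    $n = strlen($bytes);
    while ($i < $n) {
        $expected = utf8_expected_length(ord($bytes[$i]));
        $len = 1;
        // Consume continuation bytes (10xxxxxx) only while they are valid.
        while ($len < $expected && $i + $len < $n
            && (ord($bytes[$i + $len]) & 0xC0) === 0x80) {
            $len++;
        }
        $chars[] = substr($bytes, $i, $len);
        $i += $len;
    }
    return $chars;
}

$chars = split_restarting("\xf0\x9fABCD");
var_dump(array_search("B", $chars, true)); // int(2)
```

In this model the string is split into "\xf0\x9f", "A", "B", "C", "D",
which puts "B" at index 2.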

On the other hand, mb_substr, for example, simply skips over
continuation bytes when encountering a leading byte. Let's consider
the following example, which uses mb_substr to cut the first two
characters from the string used in the previous example:

mb_substr("\xf0\x9fABCD", 2); // string(1) "D"

Again, the leading byte 0xf0 initiates a 4-byte UTF-8 sequence. This
time, mb_substr just skips over the next three bytes without checking
whether they are valid continuation bytes, and considers all 4 bytes
one character. It then continues processing at byte 0x43 ("C"), which
is regarded as another character. Thus, the resulting string is "D".
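
For comparison, the skipping strategy can be modelled the same way.
Again, this is a plain-PHP sketch of the behavior described in this
message rather than the real mbstring code, and the names
lead_byte_width and substr_skipping are hypothetical.

```php
<?php
// Hypothetical helper: width implied by the leading byte alone.
function lead_byte_width(int $byte): int
{
    if (($byte & 0xF8) === 0xF0) { return 4; } // 11110xxx
    if (($byte & 0xF0) === 0xE0) { return 3; } // 1110xxxx
    if (($byte & 0xE0) === 0xC0) { return 2; } // 110xxxxx
    return 1;
}

// Drops the first $start "characters" by jumping ahead by the width the
// leading byte announces, without validating the bytes that are skipped.
function substr_skipping(string $bytes, int $start): string
{
    $i = 0;
    $n = strlen($bytes);
    for ($skipped = 0; $skipped < $start && $i < $n; $skipped++) {
        $i += lead_byte_width(ord($bytes[$i]));
    }
    return $i < $n ? substr($bytes, $i) : '';
}

var_dump(substr_skipping("\xf0\x9fABCD", 2)); // string(1) "D"
```

Here the first "character" swallows "\xf0\x9fAB", the second is "C",
and only "D" remains.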

This inconsistency in handling invalid multibyte sequences exists not
only between different functions but also within a single function.
Consider the following example, which uses mb_strstr to find the first
occurrence of the string "B" in the same string:

mb_strstr("\xf0\x9fABCD", "B"); // string(1) "D"

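The principle is the same, just within a single function call: "B" is
located at index 2, as in the mb_strpos example, but the substring is
then cut at that index as in the mb_substr example, so the result is
"D" rather than "BCD".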

This inconsistency may not only lead to unexpected behavior but can
also have a security impact when the affected functions are used to
filter input.
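
As an illustration of the filtering concern, one possible precaution
(not something proposed in this message) is to reject input that is
not well-formed UTF-8 before passing it to any mb_* function. The
sketch below assumes the mbstring extension is loaded; the function
name is_safe_to_filter is hypothetical.

```php
<?php
// Hypothetical guard: only well-formed UTF-8 is handed to the filter,
// so the functions never see a truncated multibyte sequence.
function is_safe_to_filter(string $input): bool
{
    return mb_check_encoding($input, 'UTF-8');
}

var_dump(is_safe_to_filter("\xf0\x9fABCD")); // bool(false)
var_dump(is_safe_to_filter("ABCD"));         // bool(true)
```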


Best Regards,
Stefan Schiller

[1]: https://www.php.net/manual/en/function.mb-strpos.php
[2]: https://www.php.net/manual/de/function.mb-substr.php
[3]: https://www.php.net/manual/de/function.mb-strstr.php