Hi,
I would like to draw attention to an inconsistency in how mbstring
functions handle invalid multibyte sequences. When, for example,
mb_strpos encounters a UTF-8 leading byte, it tries to parse the
following continuation bytes until the full byte sequence has been
read. If an invalid byte is encountered, all bytes read so far are
considered one character, and parsing starts over at the invalid
byte. Consider the following example:
mb_strpos("\xf0\x9fABCD", "B"); // int(2)
The leading byte 0xf0 initiates a 4-byte UTF-8 sequence. The following
byte (0x9f) is a valid continuation byte. The next byte (0x41) is not
a valid continuation byte. Thus, 0xf0 and 0x9f are considered one
character, and 0x41 is regarded as another character. Accordingly, the
resulting index of "B" is 2.
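A minimal sketch of this parsing strategy in plain PHP, for illustration
only. It is not mbstring's actual implementation; the helper name
split_validating and the simplified length table are assumptions made for
the example.

```php
<?php
// Hypothetical helper (not part of mbstring): splits a byte string into
// "characters" by reading continuation bytes after a leading byte and
// stopping at the first byte that is not a continuation byte.
function split_validating(string $s): array
{
    $chars = [];
    $n = strlen($s);
    $i = 0;
    while ($i < $n) {
        $b = ord($s[$i]);
        // Sequence length announced by the leading byte (simplified table).
        if ($b >= 0xC0 && $b <= 0xDF) {
            $len = 2;
        } elseif ($b >= 0xE0 && $b <= 0xEF) {
            $len = 3;
        } elseif ($b >= 0xF0 && $b <= 0xF7) {
            $len = 4;
        } else {
            $len = 1; // ASCII, stray continuation byte, or invalid leading byte
        }
        // Accept bytes only while they look like 10xxxxxx continuation bytes.
        $j = $i + 1;
        while ($j < $i + $len && $j < $n && (ord($s[$j]) & 0xC0) === 0x80) {
            $j++;
        }
        // Everything read so far counts as one character; parsing restarts
        // at the byte that ended the sequence.
        $chars[] = substr($s, $i, $j - $i);
        $i = $j;
    }
    return $chars;
}

var_dump(array_search("B", split_validating("\xf0\x9fABCD"), true)); // int(2)
```

Here the truncated sequence "\xf0\x9f" is counted as a single character,
so "A" sits at index 1 and "B" at index 2.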
On the other hand, mb_substr, for example, does not validate the bytes
that follow a leading byte: it simply skips over as many bytes as the
leading byte announces. Consider the following example, which uses
mb_substr to cut the first two characters from the string used in the
previous example:
mb_substr("\xf0\x9fABCD", 2); // string(1) "D"
Again, the leading byte 0xf0 initiates a 4-byte UTF-8 sequence. This
time, mb_substr just skips over the next three bytes and considers all
4 bytes one character. Next, it continues to process at byte 0x43
("C"), which is regarded as another character. Thus, the resulting
string is "D".
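The skipping strategy can be sketched the same way. Again, this is only
an illustration in plain PHP under the description above, not mbstring's
actual code; split_skipping and its simplified length table are
assumptions made for the example.

```php
<?php
// Hypothetical helper (not part of mbstring): the leading byte alone decides
// how many bytes form a character; the following bytes are never checked.
function split_skipping(string $s): array
{
    $chars = [];
    $n = strlen($s);
    $i = 0;
    while ($i < $n) {
        $b = ord($s[$i]);
        // Sequence length announced by the leading byte (simplified table).
        if ($b >= 0xC0 && $b <= 0xDF) {
            $len = 2;
        } elseif ($b >= 0xE0 && $b <= 0xEF) {
            $len = 3;
        } elseif ($b >= 0xF0 && $b <= 0xF7) {
            $len = 4;
        } else {
            $len = 1;
        }
        // substr() truncates at the end of the string if fewer bytes remain.
        $chars[] = substr($s, $i, $len);
        $i += $len;
    }
    return $chars;
}

// Drop the first two "characters", as mb_substr($s, 2) would.
var_dump(implode('', array_slice(split_skipping("\xf0\x9fABCD"), 2))); // string(1) "D"
```

With this strategy "A" and "B" are swallowed by the 4-byte sequence, so
the string consists of only three characters: "\xf0\x9fAB", "C" and "D".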
This inconsistency in handling invalid multibyte sequences not only
exists between different functions but also affects single functions.
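Consider the following example, which uses mb_strstr to return the part
of the same string that starts at the first occurrence of "B":
mb_strstr("\xf0\x9fABCD", "B"); // string(1) "D"
The principle is the same, just within a single function call: the
position of "B" is determined as in the mb_strpos example (index 2), and
the substring is then cut as in the mb_substr example, so the result is
"D" rather than the expected "BCD".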
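The mismatch inside a single call can be reproduced with a small
plain-PHP sketch. It assumes that the search step counts characters with
the validating strategy and the cut step counts them with the skipping
strategy; the names seq_len, split_chars and strstr_mixed are
hypothetical, and the needle is restricted to a single character to keep
the example short. It is not mbstring's actual code.

```php
<?php
// Length announced by a UTF-8 leading byte (1 for any other byte).
function seq_len(int $b): int
{
    if ($b >= 0xC0 && $b <= 0xDF) return 2;
    if ($b >= 0xE0 && $b <= 0xEF) return 3;
    if ($b >= 0xF0 && $b <= 0xF7) return 4;
    return 1;
}

// Splits $s into characters. With $validate = true a sequence ends at the
// first non-continuation byte; with false the announced length is trusted.
function split_chars(string $s, bool $validate): array
{
    $chars = [];
    $n = strlen($s);
    $i = 0;
    while ($i < $n) {
        $end = min($i + seq_len(ord($s[$i])), $n);
        if ($validate) {
            $j = $i + 1;
            while ($j < $end && (ord($s[$j]) & 0xC0) === 0x80) {
                $j++;
            }
            $end = $j;
        }
        $chars[] = substr($s, $i, $end - $i);
        $i = $end;
    }
    return $chars;
}

// Finds a single-character needle with one strategy, cuts with the other.
function strstr_mixed(string $haystack, string $needle)
{
    $index = array_search($needle, split_chars($haystack, true), true);
    if ($index === false) {
        return false;
    }
    return implode('', array_slice(split_chars($haystack, false), $index));
}

var_dump(strstr_mixed("\xf0\x9fABCD", "B")); // string(1) "D"
var_dump(strstr_mixed("xABCD", "B"));        // string(3) "BCD"
```

For well-formed input both strategies agree on the character boundaries,
which is why the second call returns the expected "BCD".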
This inconsistency may not only lead to unexpected behavior but can
also have a security impact when the affected functions are used to
filter input.
Best Regards,
Stefan Schiller
On Fri, 1 Dec 2023 at 09:31, Stefan Schiller via internals <
internals@lists.php.net> wrote:
[...]
This might have been better raised as a bug, but in any case I am CCing
Alex, who's the main maintainer of the mbstring extension, so that he's
aware of this and can possibly provide some explanations.
Best regards,
Gina P. Banyard
On Fri, 1 Dec 2023 at 18:48, G. P. B. <george.banyard@gmail.com> wrote:
On Fri, 1 Dec 2023 at 09:31, Stefan Schiller via internals <
internals@lists.php.net> wrote:
Hi,
I would like to raise attention to an inconsistency in how mbstring
functions handle invalid multibyte sequences. When, for example,
mb_strpos encounters a UTF-8 leading byte, it tries to parse the
following continuation bytes until the full byte sequence is read. If
an invalid byte is encountered, all previously read bytes are
considered one character, and the parsing is started over again at the
invalid byte. Let's consider the following example:
mb_strpos("\xf0\x9fABCD", "B"); // int(2)
Yes, that's true. This is because mb_strpos converts the string to UTF-8
internally, whereas other mbstring functions temporarily convert it to
UTF-32 and then convert it back to the original character encoding.
Anyway, I'll wait for Alex's reply.
Regards
Yuya
--
Yuya Hamada (tekimen)