Inconsistency mbstring functions

2 years ago by Stefan Schiller via internals — view source — reply

unread

Hi,

I would like to raise attention to an inconsistency in how mbstring
functions handle invalid multibyte sequences. When, for example,
mb_strpos encounters a UTF-8 leading byte, it tries to parse the
following continuation bytes until the full byte sequence is read. If
an invalid byte is encountered, all previously read bytes are
considered one character, and the parsing is started over again at the
invalid byte. Let's consider the following example:

mb_strpos("\xf0\x9fABCD", "B"); // int(2)

The leading byte 0xf0 initiates a 4-byte UTF-8 sequence. The following
byte (0x9f) is a valid continuation byte. The next byte (0x41) is not
a valid continuation byte. Thus, 0xf0 and 0x9f are considered one
character, and 0x41 is regarded as another character. Accordingly, the
resulting index of "B" is 2.

On the other hand, mb_substr, for example, simply skips over
continuation bytes when encountering a leading byte. Let's consider
the following example, which uses mb_substr to cut the first two
characters from the string used in the previous example:

mb_substr("\xf0\x9fABCD", 2); // string(1) "D"

Again, the leading byte 0xf0 initiates a 4-byte UTF-8 sequence. This
time, mb_substr just skips over the next three bytes and considers all
4 bytes one character. Next, it continues to process at byte 0x43
("C"), which is regarded as another character. Thus, the resulting
string is "D".

This inconsistency in handling invalid multibyte sequences not only
exists between different functions but also affects single functions.
Let's consider the following example, which uses mb_strstr to
determine the first occurrence of the string "B" in the same string:

mb_strstr("\xf0\x9fABCD", "B"); // string(1) "D"

The principle is the same, just in a single function call.

This inconsistency may not only lead to an unexpected behavior but can
also have a security impact when the affected functions are used to
filter input.

Best Regards,
Stefan Schiller

2 years ago by G. P. B. — view source — reply

unread

On Fri, 1 Dec 2023 at 09:31, Stefan Schiller via internals <
internals@lists.php.net> wrote:

Hi,

I would like to raise attention to an inconsistency in how mbstring
functions handle invalid multibyte sequences. When, for example,
mb_strpos encounters a UTF-8 leading byte, it tries to parse the
following continuation bytes until the full byte sequence is read. If
an invalid byte is encountered, all previously read bytes are
considered one character, and the parsing is started over again at the
invalid byte. Let's consider the following example:

mb_strpos("\xf0\x9fABCD", "B"); // int(2)

The leading byte 0xf0 initiates a 4-byte UTF-8 sequence. The following
byte (0x9f) is a valid continuation byte. The next byte (0x41) is not
a valid continuation byte. Thus, 0xf0 and 0x9f are considered one
character, and 0x41 is regarded as another character. Accordingly, the
resulting index of "B" is 2.

On the other hand, mb_substr, for example, simply skips over
continuation bytes when encountering a leading byte. Let's consider
the following example, which uses mb_substr to cut the first two
characters from the string used in the previous example:

mb_substr("\xf0\x9fABCD", 2); // string(1) "D"

Again, the leading byte 0xf0 initiates a 4-byte UTF-8 sequence. This
time, mb_substr just skips over the next three bytes and considers all
4 bytes one character. Next, it continues to process at byte 0x43
("C"), which is regarded as another character. Thus, the resulting
string is "D".

This inconsistency in handling invalid multibyte sequences not only
exists between different functions but also affects single functions.
Let's consider the following example, which uses mb_strstr to
determine the first occurrence of the string "B" in the same string:

mb_strstr("\xf0\x9fABCD", "B"); // string(1) "D"

The principle is the same, just in a single function call.

This inconsistency may not only lead to an unexpected behavior but can
also have a security impact when the affected functions are used to
filter input.

Best Regards,
Stefan Schiller

This might have been better to raise as a bug, but in any case I am CCing
Alex who's the main maintainer of the mbstring extension so he's aware of
this and can possibly provide some explanations.

Best regards,

Gina P. Banyard

2 years ago by youkidearitai — view source — reply

unread

2023年12月1日(金) 18:48 G. P. B. george.banyard@gmail.com:

On Fri, 1 Dec 2023 at 09:31, Stefan Schiller via internals <
internals@lists.php.net> wrote:

Hi,

I would like to raise attention to an inconsistency in how mbstring
functions handle invalid multibyte sequences. When, for example,
mb_strpos encounters a UTF-8 leading byte, it tries to parse the
following continuation bytes until the full byte sequence is read. If
an invalid byte is encountered, all previously read bytes are
considered one character, and the parsing is started over again at the
invalid byte. Let's consider the following example:

mb_strpos("\xf0\x9fABCD", "B"); // int(2)

The leading byte 0xf0 initiates a 4-byte UTF-8 sequence. The following
byte (0x9f) is a valid continuation byte. The next byte (0x41) is not
a valid continuation byte. Thus, 0xf0 and 0x9f are considered one
character, and 0x41 is regarded as another character. Accordingly, the
resulting index of "B" is 2.

On the other hand, mb_substr, for example, simply skips over
continuation bytes when encountering a leading byte. Let's consider
the following example, which uses mb_substr to cut the first two
characters from the string used in the previous example:

mb_substr("\xf0\x9fABCD", 2); // string(1) "D"

Again, the leading byte 0xf0 initiates a 4-byte UTF-8 sequence. This
time, mb_substr just skips over the next three bytes and considers all
4 bytes one character. Next, it continues to process at byte 0x43
("C"), which is regarded as another character. Thus, the resulting
string is "D".

This inconsistency in handling invalid multibyte sequences not only
exists between different functions but also affects single functions.
Let's consider the following example, which uses mb_strstr to
determine the first occurrence of the string "B" in the same string:

mb_strstr("\xf0\x9fABCD", "B"); // string(1) "D"

The principle is the same, just in a single function call.

This inconsistency may not only lead to an unexpected behavior but can
also have a security impact when the affected functions are used to
filter input.

Best Regards,
Stefan Schiller

This might have been better to raise as a bug, but in any case I am CCing
Alex who's the main maintainer of the mbstring extension so he's aware of
this and can possibly provide some explanations.

Best regards,

Gina P. Banyard

Hi,

I would like to raise attention to an inconsistency in how mbstring
functions handle invalid multibyte sequences. When, for example,
mb_strpos encounters a UTF-8 leading byte, it tries to parse the
following continuation bytes until the full byte sequence is read. If
an invalid byte is encountered, all previously read bytes are
considered one character, and the parsing is started over again at the
invalid byte. Let's consider the following example:

mb_strpos("\xf0\x9fABCD", "B"); // int(2)

Yes, that's true. Because mb_strpos is convert to UTF-8 in internal.
However, other mbstring function is temporary convert to UTF-32, then
reconvert to original character encoding.

Anyway, I'll wait Alex's reply.

Regards
Yuya

--

Yuya Hamada (tekimen)

2 years ago by Stefan Schiller via internals — view source — reply

unread

Dear Stefan, and Dear Gina,

Thanks for the message. Yes, Stefan has rediscovered an interesting quirk of the mbstring library. I have been aware of this for a long time, and other mbstring developers have too. It dates back to the origin of the library; actually, even before the origin of mbstring, since mbstring was based on another library called libmbfl, and this behavior originates from libmbfl.

Pull your chair up around the fire and let me tell you the tale of libmbfl. Once upon a time, there was a text-processing library called libmbfl. libmbfl was based on a collection of text-decoding routines (which converted bytes to codepoints) and text-encoding routines (which converted codepoints to bytes). Each such routine was structured as a stateful "filter". These filters could be assembled into "chains", whereby the output values generated by one routine would automatically be passed to the next. libmbfl could perform many wonderful text-processing tasks by substituting a different final filter at the end of the chain.

But all was not well. Since libmbfl's filters processed text only one byte or codepoint at a time, and each routine had to save its state before returning, and restore its state upon entry, libmbfl was slow. Slow as a turtle, slow as a snail, slow as whatever-slowly-moving-thing-you-can-think-of. Oh, what was libmbfl to do? A clever plan was hatched: give libmbfl a 256-entry table called a "mblen_table" for each supported text encoding with the property that the byte length of a character can be determined from its first byte. Then, text-processing tasks which were not dependent on the actual content of a string, but only on the number of codepoints, could be performed without ever invoking those wonderful, but painfully slow filters! libmbfl could skip through a string while just examining the first byte of each character. (Of course, this only worked for text encodings with an mblen_table.) For valid strings, the new method worked identically to the previous one. For invalid strings, there were significant differences in behavior, but libmbfl tried to ignore these and bravely pressed on.

The story ends with an ironic twist. Many years later, I became interested in mbstring and reimplemented its internals, replacing the libmbfl code with fresh new code which ran many times faster. The new code was so much faster that in some cases, the mblen_table optimization actually became a pessimization! In other cases, the mblen_table-based code is still faster, but not by a large amount. But now mbstring was haunted by the spectre of Hyrum's Law (https://www.hyrumslaw.com/). With a huge body of legacy code relying on mbstring, almost any observable behavior change runs a significant risk of breaking someone's code. And when this happens, they will not hesitate to vent their rage on the hapless maintainers.

Notwithstanding the rage of the users, about a year ago, I did remove the mblen_table-based code in one place where benchmarks clearly showed it was acting as a pessimization. I don't remember which mbstring function was affected and would need to check the commit log to confirm.

Hi Alex,

Thank you very much for sharing this background context.

Personally, I think the real issue here is not the inconsistency between mbstring functions which are based on the mblen_tables and those which are not. I think a lot of mbstring operations should not be used on invalid strings at all, and that for such operations, mbstring would do well to throw an exception if it receives invalid input. (Like mb_strpos; how do you define the "position of a UTF-8 substring" when the parent string is not UTF-8 at all?) But that would be a huge BC break.

My biggest concern is that this quirk can cause security issues in
user code. I came across this in the first place when discovering an
exploitable security vulnerability in an application. From my point of
view, this is not only about inconsistent behavior but also violates
the documentation for specific functions like mb_strstr. I agree that
a lot of mbstring operations should not be used on invalid strings,
and an exception seems to be an appropriate answer despite the huge BC
impact.

Best Regards,
Stefan Schiller

2 years ago by Robert Landers — view source — reply

unread

On Mon, Dec 4, 2023 at 1:51 PM Stefan Schiller via internals
internals@lists.php.net wrote:

Dear Stefan, and Dear Gina,

Thanks for the message. Yes, Stefan has rediscovered an interesting quirk of the mbstring library. I have been aware of this for a long time, and other mbstring developers have too. It dates back to the origin of the library; actually, even before the origin of mbstring, since mbstring was based on another library called libmbfl, and this behavior originates from libmbfl.

Pull your chair up around the fire and let me tell you the tale of libmbfl. Once upon a time, there was a text-processing library called libmbfl. libmbfl was based on a collection of text-decoding routines (which converted bytes to codepoints) and text-encoding routines (which converted codepoints to bytes). Each such routine was structured as a stateful "filter". These filters could be assembled into "chains", whereby the output values generated by one routine would automatically be passed to the next. libmbfl could perform many wonderful text-processing tasks by substituting a different final filter at the end of the chain.

But all was not well. Since libmbfl's filters processed text only one byte or codepoint at a time, and each routine had to save its state before returning, and restore its state upon entry, libmbfl was slow. Slow as a turtle, slow as a snail, slow as whatever-slowly-moving-thing-you-can-think-of. Oh, what was libmbfl to do? A clever plan was hatched: give libmbfl a 256-entry table called a "mblen_table" for each supported text encoding with the property that the byte length of a character can be determined from its first byte. Then, text-processing tasks which were not dependent on the actual content of a string, but only on the number of codepoints, could be performed without ever invoking those wonderful, but painfully slow filters! libmbfl could skip through a string while just examining the first byte of each character. (Of course, this only worked for text encodings with an mblen_table.) For valid strings, the new method worked identically to the previous one. For invalid strings, there were significant differences in behavior, but libmbfl tried to ignore these and bravely pressed on.

The story ends with an ironic twist. Many years later, I became interested in mbstring and reimplemented its internals, replacing the libmbfl code with fresh new code which ran many times faster. The new code was so much faster that in some cases, the mblen_table optimization actually became a pessimization! In other cases, the mblen_table-based code is still faster, but not by a large amount. But now mbstring was haunted by the spectre of Hyrum's Law (https://www.hyrumslaw.com/). With a huge body of legacy code relying on mbstring, almost any observable behavior change runs a significant risk of breaking someone's code. And when this happens, they will not hesitate to vent their rage on the hapless maintainers.

Notwithstanding the rage of the users, about a year ago, I did remove the mblen_table-based code in one place where benchmarks clearly showed it was acting as a pessimization. I don't remember which mbstring function was affected and would need to check the commit log to confirm.

Hi Alex,

Thank you very much for sharing this background context.

Personally, I think the real issue here is not the inconsistency between mbstring functions which are based on the mblen_tables and those which are not. I think a lot of mbstring operations should not be used on invalid strings at all, and that for such operations, mbstring would do well to throw an exception if it receives invalid input. (Like mb_strpos; how do you define the "position of a UTF-8 substring" when the parent string is not UTF-8 at all?) But that would be a huge BC break.

My biggest concern is that this quirk can cause security issues in
user code. I came across this in the first place when discovering an
exploitable security vulnerability in an application. From my point of
view, this is not only about inconsistent behavior but also violates
the documentation for specific functions like mb_strstr. I agree that
a lot of mbstring operations should not be used on invalid strings,
and an exception seems to be an appropriate answer despite the huge BC
impact.

I think it is only a security issue when people accidentally think
mb_* functions should be used if it is available. I've seen people do
mb_strlen() on binary data, for example, not realizing the differences
between mb_strlen and strlen. Or using mb_* functions and then passing
them off to cryptographic functions.

Robert Landers
Software Engineer
Utrecht NL

2 years ago by youkidearitai — view source — reply

unread

2023年12月4日(月) 22:25 Robert Landers landers.robert@gmail.com:

On Mon, Dec 4, 2023 at 1:51 PM Stefan Schiller via internals
internals@lists.php.net wrote:

Dear Stefan, and Dear Gina,

Thanks for the message. Yes, Stefan has rediscovered an interesting quirk of the mbstring library. I have been aware of this for a long time, and other mbstring developers have too. It dates back to the origin of the library; actually, even before the origin of mbstring, since mbstring was based on another library called libmbfl, and this behavior originates from libmbfl.

Pull your chair up around the fire and let me tell you the tale of libmbfl. Once upon a time, there was a text-processing library called libmbfl. libmbfl was based on a collection of text-decoding routines (which converted bytes to codepoints) and text-encoding routines (which converted codepoints to bytes). Each such routine was structured as a stateful "filter". These filters could be assembled into "chains", whereby the output values generated by one routine would automatically be passed to the next. libmbfl could perform many wonderful text-processing tasks by substituting a different final filter at the end of the chain.

But all was not well. Since libmbfl's filters processed text only one byte or codepoint at a time, and each routine had to save its state before returning, and restore its state upon entry, libmbfl was slow. Slow as a turtle, slow as a snail, slow as whatever-slowly-moving-thing-you-can-think-of. Oh, what was libmbfl to do? A clever plan was hatched: give libmbfl a 256-entry table called a "mblen_table" for each supported text encoding with the property that the byte length of a character can be determined from its first byte. Then, text-processing tasks which were not dependent on the actual content of a string, but only on the number of codepoints, could be performed without ever invoking those wonderful, but painfully slow filters! libmbfl could skip through a string while just examining the first byte of each character. (Of course, this only worked for text encodings with an mblen_table.) For valid strings, the new method worked identically to the previous one. For invalid strings, there were significant differences in behavior, but libmbfl tried to ignore these and bravely pressed on.

The story ends with an ironic twist. Many years later, I became interested in mbstring and reimplemented its internals, replacing the libmbfl code with fresh new code which ran many times faster. The new code was so much faster that in some cases, the mblen_table optimization actually became a pessimization! In other cases, the mblen_table-based code is still faster, but not by a large amount. But now mbstring was haunted by the spectre of Hyrum's Law (https://www.hyrumslaw.com/). With a huge body of legacy code relying on mbstring, almost any observable behavior change runs a significant risk of breaking someone's code. And when this happens, they will not hesitate to vent their rage on the hapless maintainers.

Notwithstanding the rage of the users, about a year ago, I did remove the mblen_table-based code in one place where benchmarks clearly showed it was acting as a pessimization. I don't remember which mbstring function was affected and would need to check the commit log to confirm.

Hi Alex,

Thank you very much for sharing this background context.

Personally, I think the real issue here is not the inconsistency between mbstring functions which are based on the mblen_tables and those which are not. I think a lot of mbstring operations should not be used on invalid strings at all, and that for such operations, mbstring would do well to throw an exception if it receives invalid input. (Like mb_strpos; how do you define the "position of a UTF-8 substring" when the parent string is not UTF-8 at all?) But that would be a huge BC break.

My biggest concern is that this quirk can cause security issues in
user code. I came across this in the first place when discovering an
exploitable security vulnerability in an application. From my point of
view, this is not only about inconsistent behavior but also violates
the documentation for specific functions like mb_strstr. I agree that
a lot of mbstring operations should not be used on invalid strings,
and an exception seems to be an appropriate answer despite the huge BC
impact.

I think it is only a security issue when people accidentally think
mb_* functions should be used if it is available. I've seen people do
mb_strlen() on binary data, for example, not realizing the differences
between mb_strlen and strlen. Or using mb_* functions and then passing
them off to cryptographic functions.

Robert Landers
Software Engineer
Utrecht NL

--

To unsubscribe, visit: https://www.php.net/unsub.php

Hi, Internals.

Sorry if I'm off topic.
I don't know if it will be helpful, Japanese mbstring user if use
these mb_* functions, we use mb_check_encoding.
If character encoding is invalid, then occur error.

&lt;?php

if (mb_check_encoding($string, $encoding) !== true)) {
    // throw Exception or exit(not zero)
}

mb_strlen($string);
// do something

Anyway, I agree if invalid encoding then mbstring is raise exception.
Because not only people who are familiar with character codes.
However, this is BC break, I think need more discussion.

Regards
Yuya

--

Yuya Hamada (tekimen)

2 years ago by Stefan Schiller via internals — view source — reply

unread

Stefan,

My biggest concern is that this quirk can cause security issues in
user code. I came across this in the first place when discovering an
exploitable security vulnerability in an application. From my point of
view, this is not only about inconsistent behavior but also violates
the documentation for specific functions like mb_strstr. I agree that
a lot of mbstring operations should not be used on invalid strings,
and an exception seems to be an appropriate answer despite the huge BC
impact.

I'm sure the vulnerable application is proprietary, but can you at least tell us which mbstring functions were directly involved in this vulnerability, give an overview of the mechanism by which the exploit worked (scrubbed of any specific details about the application), and let us know the general nature of the resulting compromise? (i.e. denial-of-service, information disclosure, allowing users to impersonate other users, etc)

Real-life examples of actual security impact provide far more compelling grounds for a BC break than hypothetical security impact.

The case I am referring to involves mb_strpos, which is used to
determine the index of a specific character, and mb_substr, which is
used to extract the substring until the first occurrence of this
character. For this specific case, this results in a Cross-Site
Scripting (XSS) vulnerability, which would allow an attacker to
perform actions on behalf of an authenticated user. The actual code is
more complex, but the following lines should be a suitable
representation of the issue:

$data = $_GET['data'];
$pos = mb_strpos($data, "<");
$out = $pos ? mb_substr($data, 0, $pos) : $data;
echo $out;

The developer's assumption here is that $out should never contain an
opening HTML tag, and it is thus safe to echo it back in the response.
Despite the fact that e.g. htmlspecialchars should be used to encode
the output, this assumption seems reasonable to me. However, due to
the inconsistent behavior, an attacker can break this assumption. If
$data, e.g., contains the sequence "\xf0AAA<b>", str_pos will consider
the index of "<" to be 4. However, mb_substr assumes the full sequence
to have a length of 4 (1 x 4-byte Unicode character + 3 x 1-byte
Unicode characters). Accordingly, the full sequence, including the
"<b>" tag, is reflected in the response.

Your point about documentation is appreciated. However, you should consider the context. We are dealing with a 20+-year old library, with a large installed base, which has (historically) always had inconsistent, poorly defined semantics in edge cases, and which (historically) has always had poor documentation. Don't forget Hyrum's Law, Stefan. Never forget Hyrum's Law.

In view of all this, we are usually much more likely to amend the mbstring documentation to match behavior than to modify behavior to match documentation. Now, that's not to say we can never modify mbstring behavior to match documentation (I have done it myself several times), but there should always be good reasons to prefer the new behavior (not just "because that's what the documentation says").

I can fully understand your concerns, especially taking into account
the long-standing history of the library and Hyrum's Law. While the
above-mentioned example might even be arguable since two different
functions are involved, the behavior of, e.g., mb_strstr just feels
wrong to me:

$data = "\xf0start";
var_dump(mb_strstr($data, "start")); // string(2) "rt"

Modifying mbstring behavior to make the handling of invalid strings more consistent is a smaller BC break than raising an exception. However, completely refusing to operate on invalid strings (by raising an exception) will be more effective in guarding against potential security vulnerabilities.

Considering BC, completely refusing to operate on invalid strings
seems like a huge break. From a
preventing-security-issues-in-user-code point of view, either of these
solutions seems appropriate.

Hamada-san is absolutely correct that all mbstring users should check strings derived from user input using mb_check_encoding() before operating on them. Unfortunately, "should" doesn't count for much. Most users will not do what they "should".

Amending the documentation to call out these dangers more clearly would be a positive step, though probably not enough.

In general, I like the idea of warning against pitfalls like this in
the documentation, although I estimate the effect, especially on
existing applications to be low.

Best Regards,
Stefan Schiller

2 years ago by Robert Landers — view source — reply

unread

On Tue, Dec 5, 2023 at 9:43 AM Stefan Schiller via internals
internals@lists.php.net wrote:

Stefan,

My biggest concern is that this quirk can cause security issues in
user code. I came across this in the first place when discovering an
exploitable security vulnerability in an application. From my point of
view, this is not only about inconsistent behavior but also violates
the documentation for specific functions like mb_strstr. I agree that
a lot of mbstring operations should not be used on invalid strings,
and an exception seems to be an appropriate answer despite the huge BC
impact.

I'm sure the vulnerable application is proprietary, but can you at least tell us which mbstring functions were directly involved in this vulnerability, give an overview of the mechanism by which the exploit worked (scrubbed of any specific details about the application), and let us know the general nature of the resulting compromise? (i.e. denial-of-service, information disclosure, allowing users to impersonate other users, etc)

Real-life examples of actual security impact provide far more compelling grounds for a BC break than hypothetical security impact.

The case I am referring to involves mb_strpos, which is used to
determine the index of a specific character, and mb_substr, which is
used to extract the substring until the first occurrence of this
character. For this specific case, this results in a Cross-Site
Scripting (XSS) vulnerability, which would allow an attacker to
perform actions on behalf of an authenticated user. The actual code is
more complex, but the following lines should be a suitable
representation of the issue:

$data = $_GET['data'];
$pos = mb_strpos($data, "<");
$out = $pos ? mb_substr($data, 0, $pos) : $data;
echo $out;

The developer's assumption here is that $out should never contain an
opening HTML tag, and it is thus safe to echo it back in the response.
Despite the fact that e.g. htmlspecialchars should be used to encode
the output, this assumption seems reasonable to me. However, due to
the inconsistent behavior, an attacker can break this assumption. If
$data, e.g., contains the sequence "\xf0AAA<b>", str_pos will consider
the index of "<" to be 4. However, mb_substr assumes the full sequence
to have a length of 4 (1 x 4-byte Unicode character + 3 x 1-byte
Unicode characters). Accordingly, the full sequence, including the
"<b>" tag, is reflected in the response.

This feels like an exact case of where mb_* functions shouldn't be
used. HTML spec very clearly states that only "<" (\x3C LESS-THAN
SIGN) is considered for html opening tags. I don't think there is a
reason to consider multibyte strings here.