Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:121893
MIME-Version: 1.0
References: <CAMZNyFfiP5y8QAcNBCo235xG30Ra-PgCKcYfsZVuvgwy9Re-8w@mail.gmail.com>
 <CAFPFaMJeLD1WVzczjfXa4iKg0QtgKxhH0e3mzOMjsa9POf=C9w@mail.gmail.com>
In-Reply-To: <CAFPFaMJeLD1WVzczjfXa4iKg0QtgKxhH0e3mzOMjsa9POf=C9w@mail.gmail.com>
Date: Fri, 1 Dec 2023 18:58:45 +0900
Message-ID: <CAEPPVa1M56cPBu1r9NZcWamDqTx0tFR-DsahVUqSAG47Puifng@mail.gmail.com>
To: internals@lists.php.net
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Subject: Re: [PHP-DEV] Inconsistency mbstring functions
From: youkidearitai@gmail.com (youkidearitai)

On Fri, 1 Dec 2023 at 18:48, G. P. B. <george.banyard@gmail.com> wrote:
>
> On Fri, 1 Dec 2023 at 09:31, Stefan Schiller via internals <
> internals@lists.php.net> wrote:
>
> > Hi,
> >
> > I would like to raise attention to an inconsistency in how mbstring
> > functions handle invalid multibyte sequences. When, for example,
> > mb_strpos encounters a UTF-8 leading byte, it tries to parse the
> > following continuation bytes until the full byte sequence is read. If
> > an invalid byte is encountered, all previously read bytes are
> > considered one character, and the parsing is started over again at the
> > invalid byte. Let's consider the following example:
> >
> > mb_strpos("\xf0\x9fABCD", "B"); // int(2)
> >
> > The leading byte 0xf0 initiates a 4-byte UTF-8 sequence. The following
> > byte (0x9f) is a valid continuation byte. The next byte (0x41) is not
> > a valid continuation byte. Thus, 0xf0 and 0x9f are considered one
> > character, and 0x41 is regarded as another character. Accordingly, the
> > resulting index of "B" is 2.
> >
> > On the other hand, mb_substr, for example, simply skips over
> > continuation bytes when encountering a leading byte. Let's consider
> > the following example, which uses mb_substr to cut the first two
> > characters from the string used in the previous example:
> >
> > mb_substr("\xf0\x9fABCD", 2); // string(1) "D"
> >
> > Again, the leading byte 0xf0 initiates a 4-byte UTF-8 sequence. This
> > time, mb_substr just skips over the next three bytes and considers all
> > 4 bytes one character. Next, it continues to process at byte 0x43
> > ("C"), which is regarded as another character. Thus, the resulting
> > string is "D".
> >
> > This inconsistency in handling invalid multibyte sequences not only
> > exists between different functions but also affects single functions.
> > Let's consider the following example, which uses mb_strstr to
> > determine the first occurrence of the string "B" in the same string:
> >
> > mb_strstr("\xf0\x9fABCD", "B"); // string(1) "D"
> >
> > The principle is the same, just in a single function call.
> >
> > This inconsistency may not only lead to an unexpected behavior but can
> > also have a security impact when the affected functions are used to
> > filter input.
> >
> >
> > Best Regards,
> > Stefan Schiller
> >
> > [1]: https://www.php.net/manual/en/function.mb-strpos.php
> > [2]: https://www.php.net/manual/de/function.mb-substr.php
> > [3]: https://www.php.net/manual/de/function.mb-strstr.php
> >
>
> This might have been better to raise as a bug, but in any case I am CCing
> Alex who's the main maintainer of the mbstring extension so he's aware of
> this and can possibly provide some explanations.
>
> Best regards,
>
> Gina P. Banyard

Hi,

> >
> > I would like to raise attention to an inconsistency in how mbstring
> > functions handle invalid multibyte sequences. When, for example,
> > mb_strpos encounters a UTF-8 leading byte, it tries to parse the
> > following continuation bytes until the full byte sequence is read. If
> > an invalid byte is encountered, all previously read bytes are
> > considered one character, and the parsing is started over again at the
> > invalid byte. Let's consider the following example:
> >
> > mb_strpos("\xf0\x9fABCD", "B"); // int(2)

Yes, that's true. This is because mb_strpos converts the string to UTF-8
internally, whereas the other mbstring functions temporarily convert it to
UTF-32 and then convert the result back to the original character encoding.
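
To make the difference in the report concrete, here is a minimal sketch in
plain PHP. It only models the two behaviours described in this thread and is
not the actual mbstring code. The helper names utf8_len, split_resync and
split_skip are hypothetical, and utf8_len is deliberately simplified (it does
not reject bytes 0xF8..0xFF).

```php
<?php
// Hypothetical helper: expected sequence length for a UTF-8 lead byte.
// Simplified on purpose; ASCII and stray continuation bytes count as 1.
function utf8_len(int $byte): int
{
    if ($byte >= 0xF0) return 4;
    if ($byte >= 0xE0) return 3;
    if ($byte >= 0xC0) return 2;
    return 1;
}

// Model A: stop at the first byte that is not a continuation byte and
// restart parsing there (the behaviour reported for mb_strpos).
function split_resync(string $s): array
{
    $chars = [];
    $n = strlen($s);
    $i = 0;
    while ($i < $n) {
        $len = utf8_len(ord($s[$i]));
        $j = $i + 1;
        while ($j < $n && $j < $i + $len && (ord($s[$j]) & 0xC0) === 0x80) {
            $j++;
        }
        $chars[] = substr($s, $i, $j - $i); // bytes read so far = one character
        $i = $j;
    }
    return $chars;
}

// Model B: trust the lead byte and skip that many bytes without checking
// them (the behaviour reported for mb_substr).
function split_skip(string $s): array
{
    $chars = [];
    $n = strlen($s);
    for ($i = 0; $i < $n; $i += $len) {
        $len = utf8_len(ord($s[$i]));
        $chars[] = substr($s, $i, $len);
    }
    return $chars;
}

$s = "\xf0\x9fABCD";
var_dump(array_search("B", split_resync($s), true));   // int(2)
var_dump(implode('', array_slice(split_skip($s), 2))); // string(1) "D"
```

With model A the string splits into five characters, so "B" is at index 2.
With model B the first "character" swallows "A" and "B", so only "D" is left
after skipping two characters.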

Anyway, I'll wait for Alex's reply.

Regards
Yuya

-- 
---------------------------
Yuya Hamada (tekimen)
- https://tekitoh-memdhoi.info
- https://github.com/youkidearitai
-----------------------------