Newsgroups: php.internals
Xref: news.php.net php.internals:121893
Date: Fri, 1 Dec 2023 18:58:45 +0900
To: internals@lists.php.net
Subject: Re: [PHP-DEV] Inconsistency mbstring functions
From: youkidearitai@gmail.com (youkidearitai)

On Fri, 1 Dec 2023 at 18:48, G. P. B. wrote:
>
> On Fri, 1 Dec 2023 at 09:31, Stefan Schiller via internals <
> internals@lists.php.net> wrote:
>
> > Hi,
> >
> > I would like to raise attention to an inconsistency in how mbstring
> > functions handle invalid multibyte sequences. When, for example,
> > mb_strpos encounters a UTF-8 leading byte, it tries to parse the
> > following continuation bytes until the full byte sequence is read.
> > If an invalid byte is encountered, all previously read bytes are
> > considered one character, and the parsing is started over again at the
> > invalid byte. Let's consider the following example:
> >
> > mb_strpos("\xf0\x9fABCD", "B"); // int(2)
> >
> > The leading byte 0xf0 initiates a 4-byte UTF-8 sequence. The following
> > byte (0x9f) is a valid continuation byte. The next byte (0x41) is not
> > a valid continuation byte. Thus, 0xf0 and 0x9f are considered one
> > character, and 0x41 is regarded as another character. Accordingly, the
> > resulting index of "B" is 2.
> >
> > On the other hand, mb_substr, for example, simply skips over
> > continuation bytes when encountering a leading byte. Let's consider
> > the following example, which uses mb_substr to cut the first two
> > characters from the string used in the previous example:
> >
> > mb_substr("\xf0\x9fABCD", 2); // string(1) "D"
> >
> > Again, the leading byte 0xf0 initiates a 4-byte UTF-8 sequence. This
> > time, mb_substr just skips over the next three bytes and considers all
> > 4 bytes one character. Next, it continues to process at byte 0x43
> > ("C"), which is regarded as another character. Thus, the resulting
> > string is "D".
> >
> > This inconsistency in handling invalid multibyte sequences not only
> > exists between different functions but also affects single functions.
> > Let's consider the following example, which uses mb_strstr to
> > determine the first occurrence of the string "B" in the same string:
> >
> > mb_strstr("\xf0\x9fABCD", "B"); // string(1) "D"
> >
> > The principle is the same, just in a single function call.
> >
> > This inconsistency may not only lead to an unexpected behavior but can
> > also have a security impact when the affected functions are used to
> > filter input.
> >
> >
> > Best Regards,
> > Stefan Schiller
> >
> > [1]: https://www.php.net/manual/en/function.mb-strpos.php
> > [2]: https://www.php.net/manual/de/function.mb-substr.php
> > [3]: https://www.php.net/manual/de/function.mb-strstr.php
> >
>
> This might have been better to raise as a bug, but in any case I am CCing
> Alex who's the main maintainer of the mbstring extension so he's aware of
> this and can possibly provide some explanations.
>
> Best regards,
>
> Gina P. Banyard

Hi,

> > I would like to raise attention to an inconsistency in how mbstring
> > functions handle invalid multibyte sequences. When, for example,
> > mb_strpos encounters a UTF-8 leading byte, it tries to parse the
> > following continuation bytes until the full byte sequence is read. If
> > an invalid byte is encountered, all previously read bytes are
> > considered one character, and the parsing is started over again at the
> > invalid byte. Let's consider the following example:
> >
> > mb_strpos("\xf0\x9fABCD", "B"); // int(2)

Yes, that's true. This is because mb_strpos converts the string to UTF-8
internally, whereas the other mbstring functions temporarily convert it to
UTF-32 and then convert it back to the original character encoding.

Anyway, I'll wait for Alex's reply.

Regards
Yuya

-- 
---------------------------
Yuya Hamada (tekimen)
- https://tekitoh-memdhoi.info
- https://github.com/youkidearitai
-----------------------------
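
To make the two behaviours discussed in this thread concrete, here is a minimal sketch in plain PHP that works on raw bytes only. It is not mbstring's actual implementation. The function names (utf8SeqLength, splitResync, splitSkip) are hypothetical, and the lead-byte table is deliberately simplified (no checks for overlong or out-of-range sequences). It only models the two splitting strategies Stefan describes: restarting at the first invalid byte, versus skipping the number of bytes the leading byte announces.

```php
<?php
declare(strict_types=1);

// Length announced by a UTF-8 leading byte. Simplified assumption for this
// sketch: anything that is not a 2/3/4-byte leader counts as length 1.
function utf8SeqLength(int $byte): int
{
    if ($byte >= 0xC0 && $byte <= 0xDF) { return 2; }
    if ($byte >= 0xE0 && $byte <= 0xEF) { return 3; }
    if ($byte >= 0xF0 && $byte <= 0xF7) { return 4; }
    return 1;
}

// Strategy described for mb_strpos: read continuation bytes (10xxxxxx) until
// the sequence is complete or an invalid byte appears; the bytes read so far
// form one character and parsing restarts at the invalid byte.
function splitResync(string $s): array
{
    $chars = [];
    $n = strlen($s);
    $i = 0;
    while ($i < $n) {
        $len = utf8SeqLength(ord($s[$i]));
        $j = $i + 1;
        while ($j < $n && $j - $i < $len && (ord($s[$j]) & 0xC0) === 0x80) {
            $j++;
        }
        $chars[] = substr($s, $i, $j - $i);
        $i = $j;
    }
    return $chars;
}

// Strategy described for mb_substr: trust the leading byte and skip the
// announced number of bytes without validating them.
function splitSkip(string $s): array
{
    $chars = [];
    $n = strlen($s);
    $i = 0;
    while ($i < $n) {
        $len = utf8SeqLength(ord($s[$i]));
        $chars[] = substr($s, $i, $len);
        $i += $len;
    }
    return $chars;
}

$input = "\xf0\x9fABCD";

// Index of "B" under the resync strategy: ["\xf0\x9f", "A", "B", "C", "D"]
var_dump(array_search("B", splitResync($input), true)); // int(2)

// Dropping two "characters" under the skip strategy: ["\xf0\x9fAB", "C", "D"]
var_dump(implode('', array_slice(splitSkip($input), 2))); // string(1) "D"
```

Combining an index computed with one strategy and a cut performed with the other reproduces the mb_strstr result quoted above: "B" is found at index 2, but cutting at character 2 under the skip strategy yields "D".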