Newsgroups: php.internals
Xref: news.php.net php.internals:121892
Date: Fri, 1 Dec 2023 09:47:52 +0000
From: george.banyard@gmail.com ("G. P. B.")
To: Stefan Schiller
Cc: internals@lists.php.net, alexinbeijing@gmail.com
Subject: Re: [PHP-DEV] Inconsistency mbstring functions

On Fri, 1 Dec 2023 at 09:31, Stefan Schiller via internals <internals@lists.php.net> wrote:

> Hi,
>
> I would like to raise attention to an inconsistency in how mbstring
> functions handle invalid multibyte sequences. When, for example,
> mb_strpos encounters a UTF-8 leading byte, it tries to parse the
> following continuation bytes until the full byte sequence is read.
> If an invalid byte is encountered, all previously read bytes are
> considered one character, and the parsing is started over again at the
> invalid byte. Let's consider the following example:
>
> mb_strpos("\xf0\x9fABCD", "B"); // int(2)
>
> The leading byte 0xf0 initiates a 4-byte UTF-8 sequence. The following
> byte (0x9f) is a valid continuation byte. The next byte (0x41) is not
> a valid continuation byte. Thus, 0xf0 and 0x9f are considered one
> character, and 0x41 is regarded as another character. Accordingly, the
> resulting index of "B" is 2.
>
> On the other hand, mb_substr, for example, simply skips over
> continuation bytes when encountering a leading byte. Let's consider
> the following example, which uses mb_substr to cut the first two
> characters from the string used in the previous example:
>
> mb_substr("\xf0\x9fABCD", 2); // string(1) "D"
>
> Again, the leading byte 0xf0 initiates a 4-byte UTF-8 sequence. This
> time, mb_substr just skips over the next three bytes and considers all
> 4 bytes one character. Next, it continues to process at byte 0x43
> ("C"), which is regarded as another character. Thus, the resulting
> string is "D".
>
> This inconsistency in handling invalid multibyte sequences not only
> exists between different functions but also affects single functions.
> Let's consider the following example, which uses mb_strstr to
> determine the first occurrence of the string "B" in the same string:
>
> mb_strstr("\xf0\x9fABCD", "B"); // string(1) "D"
>
> The principle is the same, just in a single function call.
>
> This inconsistency may not only lead to an unexpected behavior but can
> also have a security impact when the affected functions are used to
> filter input.
>
>
> Best Regards,
> Stefan Schiller
>
> [1]: https://www.php.net/manual/en/function.mb-strpos.php
> [2]: https://www.php.net/manual/de/function.mb-substr.php
> [3]: https://www.php.net/manual/de/function.mb-strstr.php
>

This might have been better to raise as a bug, but in any case I am CCing
Alex, who's the main maintainer of the mbstring extension, so he's aware of
this and can possibly provide some explanations.

Best regards,

Gina P. Banyard
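
For readers following the thread, the two parsing strategies described in the quoted report can be sketched in plain PHP without calling mbstring at all. This is only an illustration of the described behavior, not the extension's actual code: the helper names utf8_expected_length, split_validating and split_skipping are hypothetical, and the real functions may behave differently depending on the PHP version.

```php
<?php
// Hypothetical helpers for illustration only; none of these exist in mbstring.

/** Sequence length implied by a UTF-8 leading byte (1 for any other byte). */
function utf8_expected_length(int $byte): int
{
    if ($byte >= 0xC0 && $byte <= 0xDF) { return 2; }
    if ($byte >= 0xE0 && $byte <= 0xEF) { return 3; }
    if ($byte >= 0xF0 && $byte <= 0xF7) { return 4; }
    return 1;
}

/** Strategy described for mb_strpos: end the character at the first non-continuation byte. */
function split_validating(string $s): array
{
    $chars = [];
    $n = strlen($s);
    $i = 0;
    while ($i < $n) {
        $len = utf8_expected_length(ord($s[$i]));
        $j = $i + 1;
        // Consume continuation bytes (10xxxxxx) up to the expected length.
        while ($j < $n && $j < $i + $len && (ord($s[$j]) & 0xC0) === 0x80) {
            $j++;
        }
        $chars[] = substr($s, $i, $j - $i);
        $i = $j; // parsing restarts at the byte that broke the sequence
    }
    return $chars;
}

/** Strategy described for mb_substr: trust the leading byte and skip ahead unchecked. */
function split_skipping(string $s): array
{
    $chars = [];
    $n = strlen($s);
    $i = 0;
    while ($i < $n) {
        $len = utf8_expected_length(ord($s[$i]));
        $chars[] = substr($s, $i, $len);
        $i += $len;
    }
    return $chars;
}

$input = "\xf0\x9fABCD";

$a = split_validating($input); // "\xf0\x9f", "A", "B", "C", "D"
$b = split_skipping($input);   // "\xf0\x9fAB", "C", "D"

$posA  = array_search("B", $a, true);      // index of "B" under the validating split
$restB = implode('', array_slice($b, 2));  // everything after two "characters" under the skipping split

var_dump($posA, $restB); // int(2) and string(1) "D"
```

Under these assumptions the validating split places "B" at index 2, while the skipping split swallows "A" and "B" into the first character, so cutting two characters leaves only "D". That is the same mismatch the report shows between mb_strpos and mb_substr.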