Newsgroups: php.internals
Date: Fri, 1 Dec 2023 10:31:29 +0100
From: internals@lists.php.net ("Stefan Schiller via internals")
Reply-To: Stefan Schiller
To: internals@lists.php.net
Subject: Inconsistency mbstring functions

Hi,

I would like to draw attention to an inconsistency in how mbstring
functions handle invalid multibyte sequences.

When, for example, mb_strpos [1] encounters a UTF-8 leading byte, it
tries to parse the following continuation bytes until the full byte
sequence is read. If an invalid byte is encountered, all previously read
bytes are considered one character, and parsing starts over at the
invalid byte. Let's consider the following example:

    mb_strpos("\xf0\x9fABCD", "B"); // int(2)

The leading byte 0xf0 initiates a 4-byte UTF-8 sequence. The following
byte (0x9f) is a valid continuation byte. The next byte (0x41, "A") is
not a valid continuation byte.
Thus, 0xf0 and 0x9f are considered one character, and 0x41 is regarded
as another character. Accordingly, the resulting index of "B" is 2.

On the other hand, mb_substr [2], for example, simply skips over
continuation bytes when encountering a leading byte. Let's consider the
following example, which uses mb_substr to cut the first two characters
from the string used in the previous example:

    mb_substr("\xf0\x9fABCD", 2); // string(1) "D"

Again, the leading byte 0xf0 initiates a 4-byte UTF-8 sequence. This
time, mb_substr just skips over the next three bytes and considers all
4 bytes one character. It then continues processing at byte 0x43 ("C"),
which is regarded as another character. Thus, the resulting string is
"D".

This inconsistency in handling invalid multibyte sequences not only
exists between different functions but also occurs within a single
function. Let's consider the following example, which uses mb_strstr [3]
to determine the first occurrence of the string "B" in the same string:

    mb_strstr("\xf0\x9fABCD", "B"); // string(1) "D"

The principle is the same, just in a single function call: the position
of "B" is apparently determined as in the mb_strpos example (index 2),
while the returned substring is cut as in the mb_substr example, so the
result is "D" rather than "BCD".

This inconsistency may not only lead to unexpected behavior but can also
have a security impact when the affected functions are used to filter
input.

Best Regards,
Stefan Schiller

[1]: https://www.php.net/manual/en/function.mb-strpos.php
[2]: https://www.php.net/manual/de/function.mb-substr.php
[3]: https://www.php.net/manual/de/function.mb-strstr.php
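
For illustration, the two decoding strategies described in this message
can be modelled in plain PHP without calling mbstring at all. This is
only a sketch of the behavior as described above, not the actual
mbstring implementation; the helper names utf8_expected_length,
split_validating and split_skipping are hypothetical, and the
leading-byte ranges are simplified.

```php
<?php
// Hypothetical helpers modelling the two strategies described in the
// message. They are NOT taken from the mbstring source code.

/** Sequence length implied by a UTF-8 leading byte (1 for any other byte). */
function utf8_expected_length(int $byte): int
{
    if ($byte >= 0xF0 && $byte <= 0xF7) return 4; // 11110xxx
    if ($byte >= 0xE0 && $byte <= 0xEF) return 3; // 1110xxxx
    if ($byte >= 0xC0 && $byte <= 0xDF) return 2; // 110xxxxx
    return 1; // ASCII or a stray continuation byte
}

/** Strategy A: read continuation bytes, stop at the first invalid one. */
function split_validating(string $s): array
{
    $chars = [];
    $n = strlen($s);
    $i = 0;
    while ($i < $n) {
        $len = utf8_expected_length(ord($s[$i]));
        $j = $i + 1;
        // Consume only bytes of the form 10xxxxxx, up to the expected length.
        while ($j < $n && $j < $i + $len && (ord($s[$j]) & 0xC0) === 0x80) {
            $j++;
        }
        $chars[] = substr($s, $i, $j - $i); // bytes read so far form one character
        $i = $j;                            // restart at the invalid byte
    }
    return $chars;
}

/** Strategy B: trust the leading byte and skip the implied number of bytes. */
function split_skipping(string $s): array
{
    $chars = [];
    $n = strlen($s);
    $i = 0;
    while ($i < $n) {
        $len = utf8_expected_length(ord($s[$i]));
        $chars[] = substr($s, $i, $len);
        $i += $len;
    }
    return $chars;
}

$input = "\xf0\x9fABCD";
$a = split_validating($input); // ["\xf0\x9f", "A", "B", "C", "D"]
$b = split_skipping($input);   // ["\xf0\x9fAB", "C", "D"]

// Find "B" with strategy A, then cut at that index with strategy B.
$index = array_search("B", $a, true);
$tail  = implode('', array_slice($b, $index));

var_dump($index); // int(2)
var_dump($tail);  // string(1) "D"
```

Under this model, searching with one strategy and cutting with the other
reproduces the results quoted in the examples: index 2 for "B", and "D"
as the remaining string.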