Date: Mon, 4 Dec 2023 23:05:32 +0900
To: internals@lists.php.net
Cc: Alex
Subject: Re: [PHP-DEV] Inconsistency mbstring functions
From: youkidearitai@gmail.com (youkidearitai)

On Mon, Dec 4, 2023 at 22:25, Robert Landers wrote:
>
> On Mon, Dec 4, 2023 at 1:51 PM Stefan Schiller via internals wrote:
> >
> > On Sat, Dec 2, 2023 at 6:13 AM Alex wrote:
> > >
> > > Dear Stefan, and Dear Gina,
> > >
> > > Thanks for the message. Yes, Stefan has rediscovered an interesting quirk of the mbstring library. I have been aware of this for a long time, and other mbstring developers have too.
> > > It dates back to the origin of the library; actually, even before the origin of mbstring, since mbstring was based on another library called libmbfl, and this behavior originates from libmbfl.
> > >
> > > Pull your chair up around the fire and let me tell you the tale of libmbfl. Once upon a time, there was a text-processing library called libmbfl. libmbfl was based on a collection of text-decoding routines (which converted bytes to codepoints) and text-encoding routines (which converted codepoints to bytes). Each such routine was structured as a stateful "filter". These filters could be assembled into "chains", whereby the output values generated by one routine would automatically be passed to the next. libmbfl could perform many wonderful text-processing tasks by substituting a different final filter at the end of the chain.
> > >
> > > But all was not well. Since libmbfl's filters processed text only one byte or codepoint at a time, and each routine had to save its state before returning, and restore its state upon entry, libmbfl was slow. Slow as a turtle, slow as a snail, slow as whatever-slowly-moving-thing-you-can-think-of. Oh, what was libmbfl to do? A clever plan was hatched: give libmbfl a 256-entry table called a "mblen_table" for each supported text encoding with the property that the byte length of a character can be determined from its first byte. Then, text-processing tasks which were not dependent on the actual content of a string, but only on the number of codepoints, could be performed without ever invoking those wonderful, but painfully slow filters! libmbfl could skip through a string while just examining the first byte of each character. (Of course, this only worked for text encodings with an mblen_table.) For valid strings, the new method worked identically to the previous one.
> > > For invalid strings, there were significant differences in behavior, but libmbfl tried to ignore these and bravely pressed on.
> > >
> > > The story ends with an ironic twist. Many years later, I became interested in mbstring and reimplemented its internals, replacing the libmbfl code with fresh new code which ran many times faster. The new code was so much faster that in some cases, the mblen_table optimization actually became a pessimization! In other cases, the mblen_table-based code is still faster, but not by a large amount. But now mbstring was haunted by the spectre of Hyrum's Law (https://www.hyrumslaw.com/). With a huge body of legacy code relying on mbstring, almost any observable behavior change runs a significant risk of breaking someone's code. And when this happens, they will not hesitate to vent their rage on the hapless maintainers.
> > >
> > > Notwithstanding the rage of the users, about a year ago, I did remove the mblen_table-based code in one place where benchmarks clearly showed it was acting as a pessimization. I don't remember which mbstring function was affected and would need to check the commit log to confirm.
> >
> > Hi Alex,
> >
> > Thank you very much for sharing this background context.
> >
> > >
> > > Personally, I think the real issue here is not the inconsistency between mbstring functions which are based on the mblen_tables and those which are not. I think a lot of mbstring operations should not be used on invalid strings at all, and that for such operations, mbstring would do well to throw an exception if it receives invalid input. (Like mb_strpos; how do you define the "position of a UTF-8 substring" when the parent string is not UTF-8 at all?) But that would be a huge BC break.
> > >
> >
> > My biggest concern is that this quirk can cause security issues in user code. I came across this in the first place when discovering an exploitable security vulnerability in an application.
> > From my point of view, this is not only about inconsistent behavior but also violates the documentation for specific functions like mb_strstr. I agree that a lot of mbstring operations should not be used on invalid strings, and an exception seems to be an appropriate answer despite the huge BC impact.
>
> I think it is only a security issue when people accidentally think mb_* functions should be used if it is available. I've seen people do mb_strlen() on binary data, for example, not realizing the differences between mb_strlen and strlen. Or using mb_* functions and then passing them off to cryptographic functions.
>
> Robert Landers
> Software Engineer
> Utrecht NL

Hi, Internals.

Sorry if this is off topic. I don't know if it will be helpful, but when Japanese mbstring users use these mb_* functions, we first call mb_check_encoding. If the character encoding is invalid, we treat it as an error.

```