Newsgroups: php.internals Path: news.php.net Xref: news.php.net php.internals:121921 Return-Path: Delivered-To: mailing list internals@lists.php.net Received: (qmail 68377 invoked from network); 4 Dec 2023 13:25:50 -0000 Received: from unknown (HELO php-smtp4.php.net) (45.112.84.5) by pb1.pair.com with SMTP; 4 Dec 2023 13:25:50 -0000 Received: from php-smtp4.php.net (localhost [127.0.0.1]) by php-smtp4.php.net (Postfix) with ESMTP id 580C5180031 for ; Mon, 4 Dec 2023 05:26:00 -0800 (PST) X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on php-smtp4.php.net X-Spam-Level: X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,DMARC_PASS,FREEMAIL_FROM, RCVD_IN_DNSWL_NONE,RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL,SPF_HELO_NONE, SPF_PASS autolearn=no autolearn_force=no version=4.0.0 X-Spam-Virus: No X-Envelope-From: Received: from mail-oo1-f51.google.com (mail-oo1-f51.google.com [209.85.161.51]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by php-smtp4.php.net (Postfix) with ESMTPS for ; Mon, 4 Dec 2023 05:25:59 -0800 (PST) Received: by mail-oo1-f51.google.com with SMTP id 006d021491bc7-58db7d8f2ebso2976974eaf.0 for ; Mon, 04 Dec 2023 05:25:48 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1701696348; x=1702301148; darn=lists.php.net; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=JbQBXyyhOGWfRKpDmi7tKSrr6U3IBLShPLnPGQSoM9U=; b=U4WY14CdrQwslGKmniKQAjF4F8NBYUuNGaogm1EJFV5TNdFj4+tM9zV9UeOWjWW4Pn 64Np4n/xrLfZh9cI1TbSgy/SYGp7VMEZE9Z6y6rVVzSuz3a9os+Ps2Y4GAtH9BNAt9Ot 6M/uS0JP7s4Gozn1DAaES2bSBOi0YOM2sGn3sdQi1IhILYU8k8amZH8nBcrhKUdoZvWI +sHJ5JnNAsE/MAg7/Qudg5Ipdk5GdSxckM7qHWdMo3evtcMJv2k3iBy+whPg3fEbSXkF oqtKAFGdclnzkfaS4YtgG4RwZZKHHWm1Dut71SWMNHCuAhbnMaBRmRlQ8RyALSwpBEnC 3FTA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1701696348; x=1702301148; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=JbQBXyyhOGWfRKpDmi7tKSrr6U3IBLShPLnPGQSoM9U=; b=i2PC6sTqQJEYSEMT0jsPKwI9NyjsqPhWe6NLJ0dfk/b1VzLpSPSalh804CSWF++F4w iL93Ac/waxbXXH8ZiWGx9gO4Lb7r5IuztcFE/vhUGS9qtWgNb7gK7KatJuqVhxgXwRwk 7LTtna2G1A1ynM/bb0p7hCkSdlLRB7OsjEGSJNRwrXhNCX329jnqYtx5brAV9HuajASk kgv3siAOL6pty/mpGtbLvclgWpQEXtDVeDs/qjh4f1be7naYovPXOrdZeCnHjhszywFp 8nbi8y6jMCKVoJGX+Uc5ZGIrC2RLDLbB4wVNRTj6w1go4t73d5MCKcISsHOewhbZy7s4 KY2w== X-Gm-Message-State: AOJu0Yyyj603Eruj++lZzhmwFA+daYlYn+p0/hsVRXx3OK7Ac9PdCjHu A0Tn4xBTitoEJL2onfrBBu0OJ+SiCANbx5/D7zc= X-Google-Smtp-Source: AGHT+IErtH+guZ2CB4XnQiF3RwIagmyuXMfBygK1wQfVN9a+D2sG4pqsUwtputymR91SdY99FIi53jgQczkH+xwcT3U= X-Received: by 2002:a05:6870:9720:b0:1fb:75a:c41a with SMTP id n32-20020a056870972000b001fb075ac41amr3674301oaq.67.1701696347694; Mon, 04 Dec 2023 05:25:47 -0800 (PST) MIME-Version: 1.0 References: In-Reply-To: Date: Mon, 4 Dec 2023 14:25:36 +0100 Message-ID: To: Stefan Schiller Cc: Alex , "G. P. B." , internals@lists.php.net Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Subject: Re: [PHP-DEV] Inconsistency mbstring functions From: landers.robert@gmail.com (Robert Landers) On Mon, Dec 4, 2023 at 1:51=E2=80=AFPM Stefan Schiller via internals wrote: > > On Sat, Dec 2, 2023 at 6:13=E2=80=AFAM Alex wro= te: > > > > Dear Stefan, and Dear Gina, > > > > Thanks for the message. Yes, Stefan has rediscovered an interesting qui= rk of the mbstring library. I have been aware of this for a long time, and = other mbstring developers have too. It dates back to the origin of the libr= ary; actually, even before the origin of mbstring, since mbstring was based= on another library called libmbfl, and this behavior originates from libmb= fl. > > > > Pull your chair up around the fire and let me tell you the tale of libm= bfl. Once upon a time, there was a text-processing library called libmbfl. = libmbfl was based on a collection of text-decoding routines (which converte= d bytes to codepoints) and text-encoding routines (which converted codepoin= ts to bytes). Each such routine was structured as a stateful "filter". Thes= e filters could be assembled into "chains", whereby the output values gener= ated by one routine would automatically be passed to the next. libmbfl coul= d perform many wonderful text-processing tasks by substituting a different = final filter at the end of the chain. > > > > But all was not well. Since libmbfl's filters processed text only one b= yte or codepoint at a time, and each routine had to save its state before r= eturning, and restore its state upon entry, libmbfl was slow. Slow as a tur= tle, slow as a snail, slow as whatever-slowly-moving-thing-you-can-think-of= . Oh, what was libmbfl to do? A clever plan was hatched: give libmbfl a 256= -entry table called a "mblen_table" for each supported text encoding with t= he property that the byte length of a character can be determined from its = first byte. Then, text-processing tasks which were not dependent on the act= ual content of a string, but only on the number of codepoints, could be per= formed without ever invoking those wonderful, but painfully slow filters! l= ibmbfl could skip through a string while just examining the first byte of e= ach character. (Of course, this only worked for text encodings with an mble= n_table.) For valid strings, the new method worked identically to the previ= ous one. For invalid strings, there were significant differences in behavio= r, but libmbfl tried to ignore these and bravely pressed on. > > > > The story ends with an ironic twist. Many years later, I became interes= ted in mbstring and reimplemented its internals, replacing the libmbfl code= with fresh new code which ran many times faster. The new code was so much = faster that in some cases, the mblen_table optimization actually became a p= essimization! In other cases, the mblen_table-based code is still faster, b= ut not by a large amount. But now mbstring was haunted by the spectre of Hy= rum's Law (https://www.hyrumslaw.com/). With a huge body of legacy code rel= ying on mbstring, almost any observable behavior change runs a significant = risk of breaking someone's code. And when this happens, they will not hesit= ate to vent their rage on the hapless maintainers. > > > > Notwithstanding the rage of the users, about a year ago, I did remove t= he mblen_table-based code in one place where benchmarks clearly showed it w= as acting as a pessimization. I don't remember which mbstring function was = affected and would need to check the commit log to confirm. > > Hi Alex, > > Thank you very much for sharing this background context. > > > > > Personally, I think the real issue here is not the inconsistency betwee= n mbstring functions which are based on the mblen_tables and those which ar= e not. I think a lot of mbstring operations should not be used on invalid s= trings at all, and that for such operations, mbstring would do well to thro= w an exception if it receives invalid input. (Like mb_strpos; how do you de= fine the "position of a UTF-8 substring" when the parent string is not UTF-= 8 at all?) But that would be a huge BC break. > > > > My biggest concern is that this quirk can cause security issues in > user code. I came across this in the first place when discovering an > exploitable security vulnerability in an application. From my point of > view, this is not only about inconsistent behavior but also violates > the documentation for specific functions like mb_strstr. I agree that > a lot of mbstring operations should not be used on invalid strings, > and an exception seems to be an appropriate answer despite the huge BC > impact. I think it is only a security issue when people accidentally think mb_* functions should be used if it is available. I've seen people do mb_strlen() on binary data, for example, not realizing the differences between mb_strlen and strlen. Or using mb_* functions and then passing them off to cryptographic functions. Robert Landers Software Engineer Utrecht NL