Subject: Re: [PHP-DEV] Inconsistency mbstring functions
From: internals@lists.php.net ("Stefan Schiller via internals")
Reply-To: Stefan Schiller
Date: Mon, 4 Dec 2023 13:51:30 +0100
To: Alex
Cc: "G. P. B.", internals@lists.php.net
Newsgroups: php.internals
Xref: news.php.net php.internals:121920

On Sat, Dec 2, 2023 at 6:13 AM Alex wrote:
>
> Dear Stefan, and Dear Gina,
>
> Thanks for the message. Yes, Stefan has rediscovered an interesting quirk of the mbstring library. I have been aware of this for a long time, and other mbstring developers have too.
> It dates back to the origin of the library; actually, even before the origin of mbstring, since mbstring was based on another library called libmbfl, and this behavior originates from libmbfl.
>
> Pull your chair up around the fire and let me tell you the tale of libmbfl. Once upon a time, there was a text-processing library called libmbfl. libmbfl was based on a collection of text-decoding routines (which converted bytes to codepoints) and text-encoding routines (which converted codepoints to bytes). Each such routine was structured as a stateful "filter". These filters could be assembled into "chains", whereby the output values generated by one routine would automatically be passed to the next. libmbfl could perform many wonderful text-processing tasks by substituting a different final filter at the end of the chain.
>
> But all was not well. Since libmbfl's filters processed text only one byte or codepoint at a time, and each routine had to save its state before returning, and restore its state upon entry, libmbfl was slow. Slow as a turtle, slow as a snail, slow as whatever-slowly-moving-thing-you-can-think-of. Oh, what was libmbfl to do? A clever plan was hatched: give libmbfl a 256-entry table called a "mblen_table" for each supported text encoding with the property that the byte length of a character can be determined from its first byte. Then, text-processing tasks which were not dependent on the actual content of a string, but only on the number of codepoints, could be performed without ever invoking those wonderful, but painfully slow filters! libmbfl could skip through a string while just examining the first byte of each character. (Of course, this only worked for text encodings with an mblen_table.) For valid strings, the new method worked identically to the previous one. For invalid strings, there were significant differences in behavior, but libmbfl tried to ignore these and bravely pressed on.
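
To make the difference on invalid input concrete, here is a minimal Python sketch of the two counting strategies described above. It is not libmbfl or mbstring code: the table contents, the function names, and the choice to replace each invalid sequence with U+FFFD are all assumptions made for illustration.

```python
# Illustrative sketch only. The table below is a hypothetical UTF-8
# "mblen_table": it maps a first byte to the byte length of the character
# that byte claims to start. The real tables in libmbfl may differ.

def build_utf8_mblen_table():
    table = [1] * 256               # ASCII and bytes that cannot start a character
    for b in range(0xC2, 0xE0):     # lead byte of a 2-byte sequence
        table[b] = 2
    for b in range(0xE0, 0xF0):     # lead byte of a 3-byte sequence
        table[b] = 3
    for b in range(0xF0, 0xF5):     # lead byte of a 4-byte sequence
        table[b] = 4
    return table


MBLEN_TABLE = build_utf8_mblen_table()


def count_by_table(data: bytes) -> int:
    """Count characters by looking only at the first byte of each one."""
    i = 0
    count = 0
    while i < len(data):
        i += MBLEN_TABLE[data[i]]   # trust the lead byte, skip the rest unseen
        count += 1
    return count


def count_by_decoding(data: bytes) -> int:
    """Count codepoints by decoding every byte; invalid sequences become U+FFFD."""
    return len(data.decode("utf-8", errors="replace"))


valid = "héllo".encode("utf-8")
invalid = b"\xf0ab"                 # 4-byte lead byte followed by two ASCII bytes

print(count_by_table(valid), count_by_decoding(valid))
print(count_by_table(invalid), count_by_decoding(invalid))
```

On the valid string both functions agree. On the invalid one, the table-based count treats the two ASCII bytes as if they were continuation bytes and skips over them, while the decoding count sees them as separate characters. Two functions that disagree about where a character ends will also disagree about offsets into the same string.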
>
> The story ends with an ironic twist. Many years later, I became interested in mbstring and reimplemented its internals, replacing the libmbfl code with fresh new code which ran many times faster. The new code was so much faster that in some cases, the mblen_table optimization actually became a pessimization! In other cases, the mblen_table-based code is still faster, but not by a large amount. But now mbstring was haunted by the spectre of Hyrum's Law (https://www.hyrumslaw.com/). With a huge body of legacy code relying on mbstring, almost any observable behavior change runs a significant risk of breaking someone's code. And when this happens, they will not hesitate to vent their rage on the hapless maintainers.
>
> Notwithstanding the rage of the users, about a year ago, I did remove the mblen_table-based code in one place where benchmarks clearly showed it was acting as a pessimization. I don't remember which mbstring function was affected and would need to check the commit log to confirm.

Hi Alex,

Thank you very much for sharing this background context.

>
> Personally, I think the real issue here is not the inconsistency between mbstring functions which are based on the mblen_tables and those which are not. I think a lot of mbstring operations should not be used on invalid strings at all, and that for such operations, mbstring would do well to throw an exception if it receives invalid input. (Like mb_strpos; how do you define the "position of a UTF-8 substring" when the parent string is not UTF-8 at all?) But that would be a huge BC break.
>

My biggest concern is that this quirk can cause security issues in user code. I came across this in the first place when discovering an exploitable security vulnerability in an application. From my point of view, this is not only inconsistent behavior; it also violates the documentation for specific functions like mb_strstr.
I agree that a lot of mbstring operations should not be used on invalid strings, and an exception seems to be an appropriate answer despite the huge BC impact.

Best Regards,
Stefan Schiller