Subject: Re: [PHP-DEV][Discussion] Should All String Functions Become Multi-Byte Safe?
From: lists@ageofdream.com (Nick Lockheart)
To: internals@lists.php.net
Date: Sun, 11 Aug 2024 17:36:19 -0400

> Some background and history, for those not familiar...
>
> After PHP 5.2, there was a huge effort to move PHP to using Unicode
> internally.  It was to be released as PHP 6.  Unfortunately, it ran
> into a whole host of problems, among them:
>
> 1. It tried to use UTF-16 internally, as there were good libraries
> for it but it was much much slower than was acceptable.
> 2. It required rewriting basically everything.
> 3. Trying to support two string variants at the same time (because
> binary strings are still very useful) in almost the same syntax
> turned out be, um, kinda hard.
>
> After a number of years of work, it was eventually concluded that it
> was a dead end.  So the non-Unicode-related bits of what would have
> been PHP 6 got renamed to PHP 5.3 and released to much fanfare,
> kicking off the PHP Renaissance Era.
>
> When PHP 5.6+1 was released, there was a vote to decide if it should
> be called 6 or 7.  7 won, mainly on the grounds that a number of very
> stupid book publishers had released "PHP 6" books in anticipation of
> PHP 6's release that were now completely useless and misleading.  So
> we skipped 6 entirely, and PHP 6-compatibility is a running joke
> among those who have been around a while.
>
> Fortunately, the vast majority of single-byte strings are ASCII, and
> ASCII is, by design, a strict subset of UTF-8, so in practice the
> lack of native UTF-8 strings rarely causes an issue.
>
> Trying to introduce Unicode strings to the language now as a native
> type would... probably break just as much if not more.  If anything
> it's probably harder today than it was in 2008, because the engine
> and existing code to not-break has grown considerably.
>
> A much better approach would be something like this RFC from Derick a
> few years ago:
>
> https://wiki.php.net/rfc/unicode_text_processing
>
> If you need something today, then Symfony has a user-space
> approximation of it:
>
> https://symfony.com/doc/current/string.html
>
> --Larry Garfield

I think that when people think of "strings", they think of
human-readable text.

I wasn't suggesting that Unicode strings be a native type, but rather
that functions that have "string" in the name should be UTF-8 safe.

There are a lot of pitfalls here, and I don't think the documentation
clearly calls out which functions are OK to use with UTF-8 and which
ones may cause surprises.

The compatibility between ASCII and UTF-8 for basic Latin characters is
both a curse and a blessing. An application may work fine in testing,
but then break when a user submits an emoji.

It seems like it would be good to have a set of functions, each for an
intended use case, that behave in accordance with their intended usage.
For example: math and number functions for calculations, string
functions for human-readable text (which are UTF-8 safe), and byte
functions for binary processing (which are binary safe).

Right now, using a function for a given use case requires knowing the
function's internals, when developers should be able to rely on the
name to know that it will work for that use case.

For many functions, the manual doesn't specify whether they are safe
for multi-byte characters or not.

`ltrim` doesn't mention multi-byte:

https://www.php.net/manual/en/function.ltrim.php

The `trim` page doesn't mention it either, except for a
user-contributed note at the bottom:

"Note that trim() is not aware of Unicode points that represent
whitespace (e.g., in the General Punctuation block), except, of course,
for the ones mentioned in this page. There is no Unicode-specific trim
function in PHP at the time of writing (July 2023), but you can try
some examples of trims using multibyte strings posted on the comments
for the mbstring extension:
https://www.php.net/manual/en/ref.mbstring.php".

So what I would propose is:

(1) All string functions should state in the official manual page
whether they are safe for UTF-8 or not.

(2) Functions intended for working with text should be made UTF-8
safe.

(3) Functions intended for processing binary data should be added if
necessary, and should be named with something like "binary" or "byte".
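
To make the kind of surprise described above concrete, here is a
minimal sketch. It assumes the mbstring extension is loaded, PHP 7.0 or
later (for the "\u{...}" escapes), and that the file is saved as UTF-8.
The sample strings and the preg_replace() workaround are only
illustrations, not something the manual or any RFC prescribes:

```php
<?php
// Sketch only: assumes ext-mbstring is loaded and this file is UTF-8.

$accented = "café";   // "é" takes two bytes in UTF-8
$emoji    = "hi 😀";  // the emoji takes four bytes in UTF-8

// strlen() counts bytes; mb_strlen() counts code points.
var_dump(strlen($accented));             // int(5)
var_dump(mb_strlen($accented, 'UTF-8')); // int(4)
var_dump(strlen($emoji));                // int(7)
var_dump(mb_strlen($emoji, 'UTF-8'));    // int(4)

// substr() works on byte offsets, so it can cut a character in half
// and leave a string that is no longer valid UTF-8.
$cut = substr($accented, 0, 4);          // "caf" plus the first byte of "é"
var_dump(mb_check_encoding($cut, 'UTF-8'));   // bool(false)

// mb_substr() works on code-point offsets and keeps the string intact.
$kept = mb_substr($accented, 0, 4, 'UTF-8');
var_dump($kept === $accented);                // bool(true)

// trim() only strips its default ASCII set, so U+2003 EM SPACE survives.
$padded = "\u{2003}text\u{2003}";
var_dump(trim($padded) === $padded);          // bool(true)

// One possible userland workaround: PCRE in UTF-8 mode with the
// Unicode "separator" property (\p{Z}) plus ASCII whitespace (\s).
$trimmed = preg_replace('/^[\p{Z}\s]+|[\p{Z}\s]+$/u', '', $padded);
var_dump($trimmed);                           // string(4) "text"
```

With plain ASCII input every pair of calls above would agree, which is
why this class of bug tends to go unnoticed until real user input
arrives.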