Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:124875
Message-ID: <c00a7b2b80e83877ef94ebdf6fb9d5c9a8f977ce.camel@ageofdream.com>
Subject: Re: [PHP-DEV][Discussion] Should All String Functions Become
 Multi-Byte Safe?
To: internals@lists.php.net
Date: Sun, 11 Aug 2024 17:36:19 -0400
In-Reply-To: <410a8188-06bf-439f-bdab-c47e73d1db70@app.fastmail.com>
References: <8a60a5d76bf3bbdda821160c6141b45914a33b98.camel@ageofdream.com>
	 <410a8188-06bf-439f-bdab-c47e73d1db70@app.fastmail.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
User-Agent: Evolution 3.46.4-2 
Precedence: bulk
MIME-Version: 1.0
From: lists@ageofdream.com (Nick Lockheart)

>
> Some background and history, for those not familiar...
>
> After PHP 5.2, there was a huge effort to move PHP to using Unicode
> internally. It was to be released as PHP 6. Unfortunately, it ran
> into a whole host of problems, among them:
>
> 1. It tried to use UTF-16 internally, as there were good libraries
> for it but it was much much slower than was acceptable.
> 2. It required rewriting basically everything.
> 3. Trying to support two string variants at the same time (because
> binary strings are still very useful) in almost the same syntax
> turned out to be, um, kinda hard.
>
> After a number of years of work, it was eventually concluded that it
> was a dead end. So the non-Unicode-related bits of what would have
> been PHP 6 got renamed to PHP 5.3 and released to much fanfare,
> kicking off the PHP Renaissance Era.
>
> When PHP 5.6+1 was released, there was a vote to decide if it should
> be called 6 or 7. 7 won, mainly on the grounds that a number of very
> stupid book publishers had released "PHP 6" books in anticipation of
> PHP 6's release that were now completely useless and misleading. So
> we skipped 6 entirely, and PHP 6-compatibility is a running joke
> among those who have been around a while.
>
> Fortunately, the vast majority of single-byte strings are ASCII, and
> ASCII is, by design, a strict subset of UTF-8, so in practice the
> lack of native UTF-8 strings rarely causes an issue.
>
> Trying to introduce Unicode strings to the language now as a native
> type would... probably break just as much if not more. If anything
> it's probably harder today than it was in 2008, because the engine
> and existing code to not-break has grown considerably.
>
> A much better approach would be something like this RFC from Derick a
> few years ago:
>
> https://wiki.php.net/rfc/unicode_text_processing
>
> If you need something today, then Symfony has a user-space
> approximation of it:
>
> https://symfony.com/doc/current/string.html
>
> --Larry Garfield


I think that when people think of "strings", they think of
human-readable text.

I wasn't suggesting that Unicode strings be a native type, but rather
that functions that have "string" in the name should be UTF-8 safe.

There are a lot of pitfalls here, and I don't think the documentation
clearly calls out which functions are OK to use with UTF-8 and which
ones may cause unexpected surprises.
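
A minimal sketch of the kind of surprise meant here, assuming PHP 8.2 or
later and that the mbstring extension is loaded. The variable names are
only illustrative:

```php
<?php
// Sketch only. Assumes PHP 8.2+ (where strtoupper() is ASCII-only
// regardless of locale) and that the mbstring extension is loaded.

$word = "caf\u{E9}";   // "café"; U+00E9 is two bytes in UTF-8 (C3 A9)

// strtoupper() leaves the non-ASCII bytes untouched.
$upperBytes = strtoupper($word);              // "CAF" followed by U+00E9
$upperText  = mb_strtoupper($word, 'UTF-8');  // "CAF" followed by U+00C9

// strrev() reverses bytes, so the two bytes of U+00E9 end up swapped
// and the result is no longer valid UTF-8.
$reversed      = strrev($word);
$reversedValid = mb_check_encoding($reversed, 'UTF-8');

// str_pad() pads to a byte length: $word is already 5 bytes, so padding
// to 6 adds a single "-" and the result has 5 characters, not 6.
$padded      = str_pad($word, 6, '-');
$paddedChars = mb_strlen($padded, 'UTF-8');
```

None of these calls raises a warning; the output is just quietly
different from what someone thinking in characters would expect.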

The compatibility between ASCII and UTF-8 for Latin characters is both
a curse and a blessing. An application may work fine in testing, but
then break when a user submits an emoji.
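
To make that concrete, here is a rough sketch (assuming the mbstring
extension is loaded) of code that behaves identically for ASCII input
and then diverges once an emoji shows up:

```php
<?php
// Sketch only. Assumes the mbstring extension is loaded.

$ascii   = "Hello";
$message = "Hi \u{1F600}";   // U+1F600 is 4 bytes in UTF-8

// For ASCII input, byte count and character count agree.
$asciiBytes = strlen($ascii);
$asciiChars = mb_strlen($ascii, 'UTF-8');

// With the emoji they diverge: 3 + 4 bytes, but 3 + 1 characters.
$messageBytes = strlen($message);
$messageChars = mb_strlen($message, 'UTF-8');

// substr() counts bytes, so a cut at 5 lands in the middle of the emoji
// and leaves an invalid UTF-8 sequence behind.
$byteCut      = substr($message, 0, 5);
$byteCutValid = mb_check_encoding($byteCut, 'UTF-8');

// mb_substr() counts characters, so the emoji survives intact.
$charCut      = mb_substr($message, 0, 4, 'UTF-8');
$charCutValid = mb_check_encoding($charCut, 'UTF-8');
```

A test suite that only ever feeds in strings like `$ascii` would pass
with either family of functions.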

It seems like it would be good to have a set of functions, each for an
intended use case, that behave in accordance with their intended usage.

For example:

Math and number functions for calculations; string functions for
human-readable text (which are UTF-8 safe), and byte functions for
binary processing that are binary safe.

Right now, using a function for a given use case requires knowing its
internals, whereas developers should be able to rely on the name to
know that it will work for that use case.

For many functions, the manual doesn't specify whether they are safe
for multi-byte characters.

`ltrim` doesn't mention multi-byte:

https://www.php.net/manual/en/function.ltrim.php

The `trim` page doesn't mention it either, except there is a user
contributed note at the bottom: "Note that trim() is not aware of
Unicode points that represent whitespace (e.g., in the General
Punctuation block), except, of course, for the ones mentioned in this
page. There is no Unicode-specific trim function in PHP at the time of
writing (July 2023), but you can try some examples of trims using
multibyte strings posted on the comments for the mbstring extension:
https://www.php.net/manual/en/ref.mbstring.php".
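
The behaviour that note describes can be sketched as follows. The
`unicode_trim()` helper is hypothetical, not an existing PHP function;
it is just one possible user-space workaround built on `preg_replace()`
with the `/u` modifier:

```php
<?php
// Sketch only. unicode_trim() is a hypothetical helper, not part of PHP.

// U+2003 EM SPACE is in the General Punctuation block and is encoded
// as three bytes in UTF-8 (E2 80 83).
$input = "\u{2003}hello\u{2003}";

// trim() only strips its default ASCII set (space, \n, \r, \t, \v, NUL),
// so the em spaces are left in place.
$trimmed = trim($input);

function unicode_trim(string $text): string
{
    // \p{Z} matches Unicode separator characters; \s is included so
    // that tabs and newlines are stripped as well.
    $result = preg_replace('/^[\p{Z}\s]+|[\p{Z}\s]+\z/u', '', $text);

    // preg_replace() returns null on failure (for example, when the
    // input is not valid UTF-8); fall back to the original string.
    return $result ?? $text;
}

$unicodeTrimmed = unicode_trim($input);
```

With this input, `$trimmed` is still identical to `$input`, while
`$unicodeTrimmed` is just "hello".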


So what I would propose is:

(1) All string functions should state in the official manual page
whether they are safe for UTF-8 or not.

(2) Functions intended for working with text should be made UTF-8 safe.

(3) Functions intended for processing binary data should be added if
necessary, and their names should include something like "binary" or
"byte".