Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:119173
Date: Fri, 16 Dec 2022 13:55:02 +0000 (GMT)
To: Rowan Tommins <rowan.collins@gmail.com>
cc: internals@lists.php.net
In-Reply-To: <f1ad71e1-dadd-f194-7eb9-68a746792c08@gmail.com>
Message-ID: <alpine.DEB.2.23.453.2212161329110.462551@singlemalt.home.derickrethans.nl>
References: <alpine.DEB.2.23.453.2212151531360.462551@singlemalt.home.derickrethans.nl> <f1ad71e1-dadd-f194-7eb9-68a746792c08@gmail.com>
User-Agent: Alpine 2.23 (DEB 453 2020-06-18)
MIME-Version: 1.0
Content-Type: multipart/mixed; BOUNDARY="8323329-1410336210-1671198036=:462551"
Subject: Re: [PHP-DEV] [RFC] Unicode Text Processing
From: derick@php.net (Derick Rethans)

--8323329-1410336210-1671198036=:462551
Content-Type: text/plain; CHARSET=UTF-8
Content-Transfer-Encoding: QUOTED-PRINTABLE

On Thu, 15 Dec 2022, Rowan Tommins wrote:

> On 15/12/2022 15:34, Derick Rethans wrote:
> > I have just published an initial draft of the "Unicode Text Processing"
> > RFC, a proposal to have performant unicode text processing always
> > available to PHP users, by introducing a new "Text" class.
> >
> > You can find it at:
> > https://wiki.php.net/rfc/unicode_text_processing
> >
> > I'm looking forwards to hearing your opinions, additions, and
> > suggestions - the RFC specifically asks for these in places.
>
>
> As others have said already, thank you for taking a stab at this important
> topic. I agree that it would be a really useful feature for the language, but
> it's also a really difficult one to get right. Here are my initial thoughts...
>
> # Design Process
>
> Rather than designing the whole class "on paper", I think this really needs to
> be built as a prototype, where we can build up documentation and tests, plug
> variations into some real life scenarios, and have separate discussions about
> different details. If we limit ourselves initially to features already exposed
> by ext/intl (I think everything proposed so far is?), a prototype doesn't even
> need to be an extension, it can be in pure PHP. Then once the design is
> finalised, you have a ready-made polyfill for older PHP versions, and a set of
> tests for the native version :)

I do not want a polyfill; those already exist for intl and friends. I
had no intention of designing everything up front, though, and it is
likely that I have missed useful methods. This is not going to be right
in the first implementation.

> # UTF-8 on the outside, UTF-16 on the inside
>
> I know this will be a very common combination, but it feels odd that an
> application which actually wanted to work with UTF-16 would need to perform
> round-trips through UTF-8 just to use this class. It should at least be
> possible to specify the encoding on input and output.

I disagree. Users should not care what is used in the implementation.
It's only UTF-16 because that is what ICU's APIs use. I do not want the
complexity of having different input and output encodings. Perhaps 15
years ago that was useful to have, but right now, everything should be
UTF-8 on the interface layer, that is, if you care about
internationalisation.
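
As a rough illustration of that boundary, here is a sketch in Python
for brevity. It is not the proposed implementation: the `Text` wrapper,
its method names, and the choice of little-endian UTF-16 storage are all
assumptions made for the example. The caller only ever passes in and
gets back UTF-8; the UTF-16 storage stays private.

```python
class Text:
    """Hypothetical wrapper: UTF-8 at the interface, UTF-16 inside."""

    def __init__(self, utf8_bytes: bytes) -> None:
        # Decode the UTF-8 input and keep UTF-16 code units internally.
        # Little-endian without a BOM is an assumption for this sketch.
        self._utf16 = utf8_bytes.decode("utf-8").encode("utf-16-le")

    def code_units(self) -> int:
        # Each UTF-16 code unit is two bytes; characters outside the
        # Basic Multilingual Plane take two code units (a surrogate pair).
        return len(self._utf16) // 2

    def to_utf8(self) -> bytes:
        # Convert back on the way out, so callers never see UTF-16.
        return self._utf16.decode("utf-16-le").encode("utf-8")


sample = "na\u00efve \U0001F600".encode("utf-8")
text = Text(sample)
round_tripped = text.to_utf8()
```

The round trip is lossless for any valid UTF-8 input, which is why the
internal encoding can remain an implementation detail.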

> # Internationalisation
>=20
> Having locale and collation as state on the object, rather than
> parameters on relevant methods, feels like muddling responsibilities.
> It makes it hard to reason about what exactly some of the methods will
> do: Can I trust that this object will give me a sensible result from
> compareWith, or has it been assigned a collation somewhere else? What
> exactly will be the definition of "replace" or "contains" for this
> pair of objects?

A locale/collator is an inherent property of Text (we're dealing with
Text here, not strings). I do need to tidy up the wording about what
locales and collations are, as I've so far used the two terms somewhat
interchangeably.

> How users will work with these also needs careful thought - your first listed
> design goal is "keep it simple", but under locales and Internationalisation is
> the worrying sentence "This will require extensive documentation".

That sentence is meant to say that the *format of the locale/collator
name* needs extensive documentation.

> One function that I would really like to see, for instance, is a
> grapheme-aware version of mb_strcut, to solve tasks like: "encode this
> abstract Unicode string as UTF-16BE, truncated to at most 200 bytes,
> without breaking apart any grapheme clusters".

For that to work, you need a method that directly returns UTF-8
strings, and not UTF-16. In the RFC, the current subString() uses int
$length to mean grapheme clusters. Adding another method to do
something else is of course possible. I'll think about it (and have
noted it in "Open Issues").

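To make the request concrete, here is a minimal sketch of such a
byte-limited cut, written in Python for brevity. The helper names
`clusters` and `cut_to_bytes` are hypothetical, and the cluster
detection is a deliberate simplification (a base code point plus the
combining marks that follow it), not the full Unicode segmentation rules
that ICU implements.

```python
import unicodedata
from typing import List


def clusters(text: str) -> List[str]:
    """Approximate grapheme clusters.

    Simplification: a cluster is one code point plus any combining marks
    directly after it. Real segmentation (UAX #29) also covers ZWJ emoji
    sequences, Hangul syllables, regional indicators, and more.
    """
    out: List[str] = []
    for ch in text:
        if out and unicodedata.combining(ch) != 0:
            out[-1] += ch  # attach the mark to the preceding cluster
        else:
            out.append(ch)
    return out


def cut_to_bytes(text: str, encoding: str, max_bytes: int) -> bytes:
    """Encode text, keeping only whole clusters that fit in max_bytes."""
    result = b""
    for cluster in clusters(text):
        encoded = cluster.encode(encoding)
        if len(result) + len(encoded) > max_bytes:
            break  # the next cluster would overflow the byte budget
        result += encoded
    return result


# "e" followed by U+0301 (combining acute accent), then "a".
sample = "e\u0301a"
truncated = cut_to_bytes(sample, "utf-16-be", 5)
```

With "utf-16-be" and a limit of 200 this is the task quoted above. The
point is that the limit is counted in bytes of the output encoding,
whereas subString() counts grapheme clusters.
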
cheers,
Derick

-- 
https://derickrethans.nl | https://xdebug.org | https://dram.io

Author of Xdebug. Like it? Consider supporting me: https://xdebug.org/support
Host of PHP Internals News: https://phpinternals.news

mastodon: @derickr@phpc.social @xdebug@phpc.social
twitter: @derickr and @xdebug
--8323329-1410336210-1671198036=:462551--