Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:119161
Message-ID: <f1ad71e1-dadd-f194-7eb9-68a746792c08@gmail.com>
Date: Thu, 15 Dec 2022 22:20:15 +0000
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
 Thunderbird/102.5.1
To: internals@lists.php.net
References: <alpine.DEB.2.23.453.2212151531360.462551@singlemalt.home.derickrethans.nl>
Content-Language: en-GB
In-Reply-To: <alpine.DEB.2.23.453.2212151531360.462551@singlemalt.home.derickrethans.nl>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Subject: Re: [PHP-DEV] [RFC] Unicode Text Processing
From: rowan.collins@gmail.com (Rowan Tommins)

On 15/12/2022 15:34, Derick Rethans wrote:
> I have just published an initial draft of the "Unicode Text Processing"
> RFC, a proposal to have performant unicode text processing always
> available to PHP users, by introducing a new "Text" class.
>
> You can find it at:
> https://wiki.php.net/rfc/unicode_text_processing
>
> I'm looking forwards to hearing your opinions, additions, and
> suggestions — the RFC specifically asks for these in places.


As others have said already, thank you for taking a stab at this 
important topic. I agree that it would be a really useful feature for 
the language, but it's also a really difficult one to get right. Here 
are my initial thoughts...

# Design Process

Rather than designing the whole class "on paper", I think this really 
needs to be built as a prototype, where we can build up documentation 
and tests, plug variations into some real life scenarios, and have 
separate discussions about different details. If we limit ourselves 
initially to features already exposed by ext/intl (which I think covers 
everything proposed so far), a prototype doesn't even need to be an 
extension; it can be written in pure PHP. Then once the design is 
finalised, you have a ready-made polyfill for older PHP versions, and a 
set of tests for the native version :)

We might also want to do some general investigation of what other 
languages and frameworks provide, and which decisions have proven good 
or bad in practice.

# Lossy Transforms

Automatic normalisation and stripping of BOMs seem useful, but they 
immediately rule out use of this class for anything where you want to 
get back what you put in. For instance, if an ORM used Text instances 
for strings in data models, it would generate extra Update queries on 
the database even when the string wasn't otherwise changed. I think it 
would be better to make these transforms easy but explicit.
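
To make the round-trip problem concrete, here is a minimal sketch using 
the existing Normalizer class from ext/intl. It assumes the implicit 
normalisation would be NFC, which is my assumption for illustration; the 
"dirty check" is a stand-in for whatever an ORM actually does.

```php
<?php
// Requires ext/intl for the Normalizer class.

// "é" stored as "e" followed by U+0301 COMBINING ACUTE ACCENT.
$stored = "e\u{0301}";

// What an implicitly normalising wrapper would hold instead (assuming NFC):
// the single precomposed code point U+00E9.
$normalised = Normalizer::normalize($stored, Normalizer::FORM_C);

// A naive byte-for-byte dirty check now reports a change,
// even though nobody edited the text.
$isDirty = ($normalised !== $stored);

var_dump(strlen($stored), strlen($normalised), $isDirty);
// int(3), int(2), bool(true)
```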

# UTF-8 on the outside, UTF-16 on the inside

I know this will be a very common combination, but it feels odd that an 
application which actually wanted to work with UTF-16 would need to 
perform round-trips through UTF-8 just to use this class. It should at 
least be possible to specify the encoding on input and output.

Ruby takes an interesting approach where strings are tagged with their 
current binary encoding, and only converted to another form if actually 
required. If your input layer says "$name = new Text($_GET['name'], 
'Windows-1252');" and your output layer says "echo 
$name->asBytes('Windows-1252');" the overhead of converting to UTF-16 
can be skipped entirely, unless something in between says "$name = 
$name->wordsToUpper()". This also removes another source of lossy 
transformation, since some encoding conversions aren't perfectly 
reversible (e.g. the source encoding has more than one byte sequence 
mapped to the same Unicode code point).
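
A rough pure-PHP sketch of that tagging idea, just to show the shape. 
The class name LazyText and its methods are hypothetical and not taken 
from the RFC, and for simplicity it converts to UTF-8 (via ext/mbstring) 
rather than UTF-16 when an operation needs Unicode semantics.

```php
<?php
// Hypothetical sketch; requires ext/mbstring and PHP 8.0+.
final class LazyText
{
    public function __construct(
        private string $bytes,
        private string $encoding = 'UTF-8',
    ) {}

    // Return the text in the requested encoding, converting only if needed.
    public function asBytes(string $encoding = 'UTF-8'): string
    {
        if (strcasecmp($encoding, $this->encoding) === 0) {
            return $this->bytes; // same encoding: original bytes, untouched
        }
        return mb_convert_encoding($this->bytes, $encoding, $this->encoding);
    }

    // An operation that needs Unicode semantics pays for the conversion.
    public function toUpper(): self
    {
        return new self(mb_strtoupper($this->asBytes('UTF-8'), 'UTF-8'), 'UTF-8');
    }
}

$name = new LazyText("Ren\xE9e", 'Windows-1252'); // 0xE9 is "é" in Windows-1252
$same = $name->asBytes('Windows-1252');           // no conversion happens here
$utf8 = $name->asBytes('UTF-8');                  // converted on demand
$upper = $name->toUpper()->asBytes('Windows-1252');
```

Matching encoding names with a case-insensitive string comparison is a 
simplification; a real implementation would need to handle aliases.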

# Internationalisation

Having locale and collation as state on the object, rather than 
parameters on relevant methods, feels like muddling responsibilities. It 
makes it hard to reason about what exactly some of the methods will do: 
Can I trust that this object will give me a sensible result from 
compareWith, or has it been assigned a collation somewhere else? What 
exactly will be the definition of "replace" or "contains" for this pair 
of objects?
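
For comparison, this is roughly what "collation as a parameter" looks 
like with the existing Collator class in ext/intl. The helper function 
compareText is hypothetical, and the expected results assume the ICU 
build ships standard Swedish and German collation data.

```php
<?php
// Requires ext/intl for the Collator class.

// Hypothetical helper: the collation is visible at the call site
// instead of being hidden state on either operand.
function compareText(string $a, string $b, Collator $collator): int
{
    return $collator->compare($a, $b);
}

$swedish = new Collator('sv_SE');
$german  = new Collator('de_DE');

$aUmlaut = "\u{00E4}"; // "ä"

// In Swedish, "ä" is a separate letter sorted after "z";
// in German it sorts alongside "a", so before "z".
$sv = compareText($aUmlaut, 'z', $swedish); // greater than 0
$de = compareText($aUmlaut, 'z', $german);  // less than 0
```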

How users will work with these also needs careful thought - your first 
listed design goal is "keep it simple", but under locales and 
Internationalisation is the worrying sentence "This will require 
extensive documentation". This is one of those places where "doing it 
right" is really hard to combine with "making it easy", because language 
is inherently complex, but users will expect a simple answer to "how do 
I make it case-insensitive?"
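
As an illustration of why that question has no single answer, here is 
how it looks today with Collator strengths in ext/intl. The 'en' locale 
and the sample words are just my choices for the example.

```php
<?php
// Requires ext/intl for the Collator class.
$accented = "r\u{00E9}sum\u{00E9}"; // "résumé"

// Default (tertiary) strength: case differences matter.
$default = new Collator('en');
$caseDiffers = $default->compare('resume', 'RESUME') !== 0;

// Secondary strength: case is ignored, accents still matter.
$secondary = new Collator('en');
$secondary->setStrength(Collator::SECONDARY);
$caseIgnored   = $secondary->compare('resume', 'RESUME') === 0;
$accentMatters = $secondary->compare('resume', $accented) !== 0;

// Primary strength: both case and accents are ignored.
$primary = new Collator('en');
$primary->setStrength(Collator::PRIMARY);
$accentIgnored = $primary->compare('resume', $accented) === 0;
```

So "case-insensitive" already forces a choice between at least two 
strengths, before locale-specific rules come into it.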

# Allowing other abstractions

I 100% approve of your use of grapheme clusters, rather than code 
points, as the primary unit; so many implementations get that wrong. 
However, when interacting with other systems, reasoning about bytes (or 
sometimes even code points) is essential.

One function that I would really like to see, for instance, is a 
grapheme-aware version of mb_strcut, to solve tasks like: "encode this 
abstract Unicode string as UTF-16BE, truncated to at most 200 bytes, 
without breaking apart any grapheme clusters".
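
A minimal sketch of that task using functions that exist today 
(grapheme_extract from ext/intl and mb_convert_encoding from 
ext/mbstring). The function name graphemeCut is hypothetical, it assumes 
UTF-8 input, and it is written for clarity rather than speed.

```php
<?php
// Hypothetical helper; requires ext/intl and ext/mbstring.
function graphemeCut(string $utf8, string $encoding, int $maxBytes): string
{
    $result = '';
    $offset = 0;
    $length = strlen($utf8);

    while ($offset < $length) {
        // Extract exactly one grapheme cluster starting at byte $offset;
        // $next receives the byte offset of the following cluster.
        $cluster = grapheme_extract($utf8, 1, GRAPHEME_EXTR_COUNT, $offset, $next);
        if ($cluster === false || $cluster === '') {
            break;
        }
        // Encode the whole cluster, and keep it only if it fits.
        $encoded = mb_convert_encoding($cluster, $encoding, 'UTF-8');
        if (strlen($result) + strlen($encoded) > $maxBytes) {
            break;
        }
        $result .= $encoded;
        $offset = $next;
    }
    return $result;
}

// "a", then "e" + U+0301 COMBINING ACUTE ACCENT (one cluster), then "b".
$text = "ae\u{0301}b";

// In UTF-16BE the clusters take 2, 4 and 2 bytes. With a 4-byte limit only
// "a" fits; the accented cluster is dropped whole rather than split.
$cut4 = graphemeCut($text, 'UTF-16BE', 4);
$cut6 = graphemeCut($text, 'UTF-16BE', 6);
```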


Thanks again for getting the ball rolling, and I look forward to helping 
iterate the design.

Regards,

-- 
Rowan Tommins
[IMSoP]