Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:119172
Date: Fri, 16 Dec 2022 13:54:55 +0000 (GMT)
To: Tim Starling <tstarling@wikimedia.org>
cc: PHP Developers Mailing List <internals@lists.php.net>
In-Reply-To: <9876abdf-a7af-0e53-ee3e-236021d74377@wikimedia.org>
Message-ID: <alpine.DEB.2.23.453.2212161341010.462551@singlemalt.home.derickrethans.nl>
References: <alpine.DEB.2.23.453.2212151531360.462551@singlemalt.home.derickrethans.nl> <9876abdf-a7af-0e53-ee3e-236021d74377@wikimedia.org>
User-Agent: Alpine 2.23 (DEB 453 2020-06-18)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Subject: Re: [PHP-DEV] [RFC] Unicode Text Processing
From: derick@php.net (Derick Rethans)

On Fri, 16 Dec 2022, Tim Starling wrote:

> On 16/12/22 02:34, Derick Rethans wrote:
> > 
> > I have just published an initial draft of the "Unicode Text 
> > Processing" RFC, a proposal to have performant unicode text 
> > processing always available to PHP users, by introducing a new 
> > "Text" class.
> 
> Using "collator" and "locale" interchangeably seems imprecise. If the 
> input is an ICU locale string, then I think you should just call it 
> locale. Then the user will be armed with the correct terminology when 
> they go looking for more information in the ICU manual. In ICU, case 
> conversion and BreakIterator need a locale, not a collator.

Yeah, the terms are currently used interchangeably (sort of). I will 
update that. Although I really would not suggest that users look at the 
ICU manual, as it's really hard to find things in it :-)

> I'm concerned about the time order of using grapheme offsets. For 
> example, is subString() O(N) in $offset?

Yes. It would have to scan the Text.
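
For illustration, here is a rough sketch of why that is linear. It is 
Python, not PHP, and it is not the RFC's implementation: the function 
names are hypothetical, and a "grapheme" is simplified to a base code 
point plus any following combining marks (real segmentation follows 
UAX #29 and has more rules). Without a stored index of boundaries, 
finding grapheme number $offset means walking every code point before 
it:

```python
import unicodedata


def grapheme_offset_to_index(text: str, offset: int) -> int:
    """Return the code point index at which grapheme number `offset` starts.

    Simplified model (an assumption for this sketch): a new grapheme
    starts at every code point that is not a combining mark.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    graphemes_seen = 0
    # Linear scan: every code point before the target is inspected.
    for index, ch in enumerate(text):
        starts_grapheme = index == 0 or unicodedata.combining(ch) == 0
        if starts_grapheme:
            if graphemes_seen == offset:
                return index
            graphemes_seen += 1
    if graphemes_seen == offset:
        # An offset equal to the grapheme count points just past the end.
        return len(text)
    raise IndexError("grapheme offset out of range")


def sub_string(text: str, offset: int, length: int) -> str:
    """Hypothetical stand-in for subString(): slice by grapheme offsets."""
    start = grapheme_offset_to_index(text, offset)
    end = start + grapheme_offset_to_index(text[start:], length)
    return text[start:end]
```

Each call scans from the start of the text, so a loop that calls 
sub_string() with ever-increasing offsets does quadratic work in total.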

> I'm probably not the target audience for this class, since I'm 
> generally looking for maximum flexibility, not minimum complexity. As 
> such, I'd like intl to have better documentation and more features. 
> The RFC has a family of locale-aware case conversion functions which 
> do not exist in intl. This was raised as an issue during the 
> discussion on my ASCII case conversion RFC. It would be great if intl 
> could get those functions too.

AFAIK Intl can do all of these things, but yes, its documentation is 
"sparse". However, improving that is out of scope for this RFC.

> I think you should consider making this Text class a part of the intl 
> extension. You're adding a class which is similar to the classes in 
> that extension. In terms of data, it's like IntlChar, except it's for 
> strings not characters. Its constructor takes an ICU locale string, 
> just like IntlBreakIterator or MessageFormatter.

I did consider that, and rejected the idea. Intl, although powerful, 
does not have an approachable API. It is also not installed or available 
by default, and I am not suggesting we change that. That then means that 
it doesn't fit the design goals here (having it always available).

cheers,
Derick

-- 
https://derickrethans.nl | https://xdebug.org | https://dram.io

Author of Xdebug. Like it? Consider supporting me: https://xdebug.org/support
Host of PHP Internals News: https://phpinternals.news

mastodon: @derickr@phpc.social @xdebug@phpc.social
twitter: @derickr and @xdebug