Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:119193
Date: Wed, 21 Dec 2022 11:48:35 +0000 (GMT)
To: Rowan Tommins <rowan.collins@gmail.com>
cc: internals@lists.php.net
In-Reply-To: <CALKiJKr2OyfkLbF6ugDTqztpZED-NgqnkPJPgUoqLc5SBmkCJg@mail.gmail.com>
Message-ID: <alpine.DEB.2.23.453.2212211121590.462551@singlemalt.home.derickrethans.nl>
References: <alpine.DEB.2.23.453.2212151531360.462551@singlemalt.home.derickrethans.nl> <f1ad71e1-dadd-f194-7eb9-68a746792c08@gmail.com> <alpine.DEB.2.23.453.2212161329110.462551@singlemalt.home.derickrethans.nl>
 <CALKiJKr2OyfkLbF6ugDTqztpZED-NgqnkPJPgUoqLc5SBmkCJg@mail.gmail.com>
User-Agent: Alpine 2.23 (DEB 453 2020-06-18)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Subject: Re: [PHP-DEV] [RFC] Unicode Text Processing
From: derick@php.net (Derick Rethans)

On Fri, 16 Dec 2022, Rowan Tommins wrote:

> On 16 December 2022 13:55:02 GMT, Derick Rethans <derick@php.net> wrote:
> >I do not want a polyfill. These already exist for intl and friends.
>
> I think you misunderstood what I meant by "polyfill"; I meant in the 
> sense that once the real implementation gets included in, say PHP 8.3, 
> users needing to support, say, PHP 8.0, will have a drop-in 
> implementation with exactly the same interface.

I know what a polyfill is, and I still don't want to see this.

> Anyway, that was just an aside; my main point is that a single-page 
> RFC, and a single mailing list thread, are probably not sufficient to 
> iterate on this design. A prototype, or even just a repo with stubs 
> for the methods, would give us better ways to track all the different 
> details and ideas.

I will certainly be prototyping some of this, but not before the general 
idea has been reasonably accepted.

> >I disagree. Users should not care what is used in the implementation. 
> >It's only UTF-16 because that is what ICU's API uses. I do not want 
> >the complexity of having different in/ex encodings. Perhaps 15 years 
> >ago that was useful to have, but right now, everything should be 
> >UTF-8 on the interface layer, that is, if you care about 
> >internationalisation.
> 
> UTF-8 should definitely be the default, but I disagree that all other 
> encodings can simply be ignored, and that users should be punished for 
> using them with extra CPU time spent converting to UTF-8 and back 
> again. All it would need is an optional argument on a couple of 
> methods to specify that you want some other encoding.

I know what it would entail, but I am rejecting it regardless. "Just an 
optional argument on a couple of methods" increases the complexity.

> >A locale/collator is an inherent property of Text (we're dealing with 
> >Text here, not strings).
> 
> Is it though? It makes some sense to say "this is a Turkish Text, so
> treat 'i' specially whenever upper-casing". But is there such a thing
> as a "case insensitive piece of text"?

The locale is inherent; the collator not so much. The collator as set on 
a Text object is therefore more of a default. The ''replaceText'' 
method and the ''Finding Text in Text'' methods all have a way to 
override this default collation.

I have updated the language in the RFC to be more precise.

> 
> If locale is an "inherent property", does it make sense to discard it
> when joining Texts together? At the moment, Text::join([$a,
> $b])->toUpper() can give a different result from
> Text::join([$a->toUpper(), $b->toUpper()]). An implementation that
> truly treated locale as inherent would have to track segments within a
> larger Text, subject to separate locales. (Similar to how HTML allows
> a lang attribute on individual elements.)
> 
> For comparisons, I don't see the value at all - if I'm sorting a list
> of Texts, the sort order is a property of the sort operation, not of
> the individual items. If I have a French Text, a Spanish Text, and an
> English Text, there's no meaningful way to use all three sort orders
> at once, and no particular reason to choose one over the others. In
> the current proposal, using compareWith in a usort callback without
> specifying the collation would result in unstable results, because
> it's not symmetrical - $a->compareWith($b) can use a different
> collation than $b->compareWith($a).

That sounds like an argument for having a sort() method where you can 
override the collator. I would, however, expect that most people would 
not set a default collation other than "standard" on Text objects. And 
if something more clever needs to be done, the collation can be 
overridden in all methods.
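
As a rough illustration of the asymmetry described in the quoted text, 
and of how an explicit override avoids it, here is a minimal sketch. The 
SketchText class, its constructor flag, and this compareWith() signature 
are all hypothetical stand-ins (a boolean case-sensitivity flag plays the 
role of a collator); none of this is the RFC's actual API.

```php
<?php
// Hypothetical stand-in for the RFC's Text class; NOT the proposed API.
// Each object carries its own default comparison rule, and compareWith()
// uses the rule of the object it is called on unless one is passed in.
final class SketchText
{
    public function __construct(
        private string $value,
        private bool $caseInsensitive = false, // stand-in for a default collator
    ) {}

    public function compareWith(SketchText $other, ?bool $caseInsensitive = null): int
    {
        // An explicit argument overrides the object's own default.
        $ci = $caseInsensitive ?? $this->caseInsensitive;
        $result = $ci
            ? strcasecmp($this->value, $other->value)
            : strcmp($this->value, $other->value);

        return $result <=> 0; // normalise to -1, 0, or 1
    }
}

$a = new SketchText('apple', caseInsensitive: true);
$b = new SketchText('APPLE'); // case-sensitive default

// Per-object defaults: each call uses a different rule.
$ab = $a->compareWith($b); // $a's rule (case-insensitive)
$ba = $b->compareWith($a); // $b's rule (case-sensitive)

// One rule chosen by the caller, as a sort operation would do.
$abFixed = $a->compareWith($b, caseInsensitive: false);
$baFixed = $b->compareWith($a, caseInsensitive: false);
```

With the per-object defaults, $ab is 0 while $ba is -1, so a usort() 
callback built on them does not describe a consistent ordering. Once the 
caller supplies a single rule, $abFixed and $baFixed are 1 and -1, and 
the comparison is symmetric again.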

> >> the worrying sentence "This will require extensive documentation".
> >
> >This phrase is meant to mean that the *format of the locale/collator 
> >name* needs extensive documentation.
> 
> 
> I know, and I think that's a bad sign - why are we exposing this 
> complexity to users in a class that otherwise holds their hand at 
> every step of the way? I think the parameters should always be a 
> user-friendly collation/locale object, with the ICU strings an 
> optional way for experts to create such an object.

Yes, and that is why the RFC includes a ''TextCollator'' object that 
does precisely that.

cheers,
Derick

-- 
https://derickrethans.nl | https://xdebug.org | https://dram.io

Author of Xdebug. Like it? Consider supporting me: https://xdebug.org/support
Host of PHP Internals News: https://phpinternals.news

mastodon: @derickr@phpc.social @xdebug@phpc.social
twitter: @derickr and @xdebug