Newsgroups: php.internals
Date: Fri, 16 Dec 2022 15:59:11 +0000
Message-ID: <CALKiJKr2OyfkLbF6ugDTqztpZED-NgqnkPJPgUoqLc5SBmkCJg@mail.gmail.com>
Cc: internals@lists.php.net
Content-Type: text/plain; charset="UTF-8"
Subject: Re: [PHP-DEV] [RFC] Unicode Text Processing
From: rowan.collins@gmail.com (Rowan Tommins)

On 16 December 2022 13:55:02 GMT, Derick Rethans <derick@php.net> wrote:
>I do not want a polyfill. These already exist for intl and friends.


I think you misunderstood what I meant by "polyfill": I meant that
once the real implementation is included in, say, PHP 8.3, users who
still need to support, say, PHP 8.0 will have a drop-in
implementation with exactly the same interface.

Anyway, that was just an aside; my main point is that a single-page
RFC, and a single mailing list thread, are probably not sufficient to
iterate on this design. A prototype, or even just a repo with stubs
for the methods, would give us better ways to track all the different
details and ideas.



>I disagree. Users should not care what is used in the implementation.
>It's only UTF-16 because that is what ICU's API uses. I do not want the
>complexity of having different in/ex encodings. Perhaps 15 years ago
>that was useful to have, but right now, everything should be UTF-8 on
>the interface layer, that is, if you care about internationalisation.


UTF-8 should definitely be the default, but I disagree that all other
encodings can simply be ignored, and that users should be punished for
using them with extra CPU time spent converting to UTF-8 and back
again. All it would need is an optional argument on a couple of
methods to specify that you want some other encoding.



>A locale/collator is an inherent property of Text (we're dealing with
>Text here, not strings).


Is it though? It makes some sense to say "this is a Turkish Text, so
treat 'i' specially whenever upper-casing". But is there such a thing
as a "case insensitive piece of text"?

If locale is an "inherent property", does it make sense to discard it
when joining Texts together? At the moment, Text::join([$a,
$b])->toUpper() can give a different result from
Text::join([$a->toUpper(), $b->toUpper()]). An implementation that
truly treated locale as inherent would have to track segments within a
larger Text, subject to separate locales. (Similar to how HTML allows
a lang attribute on individual elements.)
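
A rough sketch of that difference, using a toy stand-in class rather
than the RFC's implementation. The class body, the single Turkish
casing rule, and the assumption that join() falls back to a default
"en" locale are all made up for illustration:

```php
<?php
// Hypothetical stand-in for the proposed Text class (illustration only).
final class Text
{
    public function __construct(
        public string $value,
        public string $locale = 'en',
    ) {}

    /** Joins the pieces; the result gets the default locale (assumption). */
    public static function join(array $pieces): self
    {
        return new self(implode('', array_map(fn (self $t) => $t->value, $pieces)));
    }

    public function toUpper(): self
    {
        $value = $this->value;
        if ($this->locale === 'tr') {
            // Toy Turkish rule: dotted "i" upper-cases to dotted "İ".
            $value = str_replace('i', 'İ', $value);
        }
        // Upper-case the remaining ASCII letters.
        return new self(strtoupper($value), $this->locale);
    }
}

$a = new Text('izmir ', 'tr');
$b = new Text('london', 'en');

// Locale is discarded by join(), so the Turkish rule never applies.
$joinThenUpper = Text::join([$a, $b])->toUpper()->value;               // "IZMIR LONDON"
// Each piece is upper-cased under its own locale before joining.
$upperThenJoin = Text::join([$a->toUpper(), $b->toUpper()])->value;    // "İZMİR LONDON"
```

The two results differ only because the locale was dropped at the
join, which is the behaviour I'm questioning.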

For comparisons, I don't see the value at all - if I'm sorting a list
of Texts, the sort order is a property of the sort operation, not of
the individual items. If I have a French Text, a Spanish Text, and an
English Text, there's no meaningful way to use all three sort orders
at once, and no particular reason to choose one over the others. In
the current proposal, using compareWith in a usort callback without
specifying the collation would give inconsistent results, because the
comparison isn't symmetrical: $a->compareWith($b) can use a different
collation than $b->compareWith($a).
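
A minimal sketch of the asymmetry, assuming compareWith defaults to
the collation of the object it is called on. CollatedText and its toy
sort keys are hypothetical; they only mimic the fact that German
sorts "ä" with "a" while Swedish sorts it after "z":

```php
<?php
// Hypothetical stand-in for the proposed Text class, with toy collations.
final class CollatedText
{
    public function __construct(
        public string $value,
        public string $collation,
    ) {}

    /** Toy sort key: "de" files "ä" under "a"; "sv" places it after "z". */
    private static function sortKey(string $value, string $collation): string
    {
        $rules = [
            'de' => ['ä' => 'a'],
            'sv' => ['ä' => 'z~'],
        ];
        return isset($rules[$collation]) ? strtr($value, $rules[$collation]) : $value;
    }

    /** Uses the caller's own collation unless one is passed explicitly. */
    public function compareWith(self $other, ?string $collation = null): int
    {
        $collation ??= $this->collation;
        return strcmp(
            self::sortKey($this->value, $collation),
            self::sortKey($other->value, $collation)
        ) <=> 0;
    }
}

$a = new CollatedText('ä', 'de');
$z = new CollatedText('z', 'sv');

$aFirst = $a->compareWith($z);        // -1: under "de", "ä" sorts before "z"
$zFirst = $z->compareWith($a);        // -1: under "sv", "z" sorts before "ä"

// Passing the collation explicitly makes the two calls agree.
$aThenZ = $a->compareWith($z, 'sv');  //  1
$zThenA = $z->compareWith($a, 'sv');  // -1
```

Both default calls return -1, so each item claims to sort before the
other, and a usort callback built on them has no consistent answer.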


>> the worrying sentence "This will require extensive documentation".
>
>This phrase is meant to mean that the *format of the locale/collator
>name* needs extensive documentation.


I know, and I think that's a bad sign - why are we exposing this
complexity to users in a class that otherwise holds their hand at
every step of the way? I think the parameters should always be a
user-friendly collation/locale object, with the ICU strings an
optional way for experts to create such an object.
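
Something along these lines, as a very rough sketch. The Collation
class and its method names are invented here and are not part of the
RFC; the "ks" (strength) and "kn" (numeric ordering) keys are the
Unicode locale extension keys, and how the RFC would actually spell
them is an assumption:

```php
<?php
// Hypothetical value object; none of these names come from the RFC.
final class Collation
{
    private function __construct(private string $icuString) {}

    /** Friendly constructor: callers never see the ICU syntax. */
    public static function forLanguage(
        string $language,
        bool $ignoreCase = false,
        bool $numericOrder = false,
    ): self {
        $keywords = [];
        if ($numericOrder) {
            $keywords[] = 'kn-true';    // sort digit runs by numeric value
        }
        if ($ignoreCase) {
            $keywords[] = 'ks-level2';  // secondary strength: case ignored
        }
        $suffix = $keywords ? '-u-' . implode('-', $keywords) : '';
        return new self($language . $suffix);
    }

    /** Expert escape hatch: pass a raw locale/collator string through. */
    public static function fromIcuString(string $icuString): self
    {
        return new self($icuString);
    }

    public function toIcuString(): string
    {
        return $this->icuString;
    }
}

$simple = Collation::forLanguage('de', ignoreCase: true);
$expert = Collation::fromIcuString('de-u-kn-true-ks-level2');
$same   = Collation::forLanguage('de', ignoreCase: true, numericOrder: true);
```

Methods on Text would then accept a Collation, and only
fromIcuString() would need the "extensive documentation".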

Regards,
-- 
Rowan Tommins
[IMSoP]