Newsgroups: php.internals
Xref: news.php.net php.internals:119193
Date: Wed, 21 Dec 2022 11:48:35 +0000 (GMT)
From: derick@php.net (Derick Rethans)
To: Rowan Tommins
cc: internals@lists.php.net
Subject: Re: [PHP-DEV] [RFC] Unicode Text Processing

On Fri, 16 Dec 2022, Rowan Tommins wrote:

> On 16 December 2022 13:55:02 GMT, Derick Rethans wrote:
> >I do not want a polyfill. These already exist for intl and friends.
>
> I think you misunderstood what I meant by "polyfill"; I meant in the
> sense that once the real implementation gets included in, say PHP 8.3,
> users needing to support, say, PHP 8.0, will have a drop-in
> implementation with exactly the same interface.

I know what a polyfill is, and I still don't want to see this.
> Anyway, that was just an aside; my main point is that a single-page
> RFC, and a single mailing list thread, are probably not sufficient to
> iterate on this design. A prototype, or even just a repo with stubs
> for the methods, would give us better ways to track all the different
> details and ideas.

I will certainly be prototyping some of this, but not before the general
idea has been reasonably accepted.

> >I disagree. Users should not care what is used in the implementation.
> >It's only UTF-16 because that is what ICU's API uses. I do not want
> >the complexity of having different in/ex encodings. Perhaps 15 years
> >ago that was useful to have, but right now, everything should be
> >UTF-8 on the interface layer, that is, if you care about
> >internationalisation.
>
> UTF-8 should definitely be the default, but I disagree that all other
> encodings can simply be ignored, and that users should be punished for
> using them with extra CPU time spent converting to UTF-8 and back
> again. All it would need is an optional argument on a couple of
> methods to specify that you want some other encoding.

I know what it would entail, but I am rejecting it regardless. "Just an
optional argument on a couple of methods" increases the complexity.

> >A locale/collator is an inherent property of Text (we're dealing with
> >Text here, not strings).
>
> Is it though? It makes some sense to say "this is a Turkish Text, so
> treat 'i' specially whenever upper-casing". But is there such a thing
> as a "case insensitive piece of text"?

The locale is inherent, the collator not so much. The collator as set on
a Text object is therefore more of a default. The ''replaceText''
method, and the ''Finding Text in Text'' methods all have a way to
override this default collation. I have updated the language in the RFC
to be more precise.

>
> If locale is an "inherent property", does it make sense to discard it
> when joining Texts together?
> At the moment, Text::join([$a, $b])->toUpper() can give a different
> result from Text::join([$a->toUpper(), $b->toUpper()]). An
> implementation that truly treated locale as inherent would have to
> track segments within a larger Text, subject to separate locales.
> (Similar to how HTML allows a lang attribute on individual elements.)
>
> For comparisons, I don't see the value at all - if I'm sorting a list
> of Texts, the sort order is a property of the sort operation, not of
> the individual items. If I have a French Text, a Spanish Text, and an
> English Text, there's no meaningful way to use all three sort orders
> at once, and no particular reason to choose one over the others. In
> the current proposal, using compareWith in a usort callback without
> specifying the collation would result in unstable results, because
> it's not symmetrical - $a->compareWith($b) can use a different
> collation than $b->compareWith($a).

That sounds like an argument for having a sort() method where you can
override the collator. I would, however, expect that most people would
not set a default collation other than "standard" on Text objects. And
if something more clever needs to be done, the collation can be
overridden in all methods.

> >> the worrying sentence "This will require extensive documentation".
> >
> >This phrase is meant to mean that the *format of the locale/collator
> >name* needs extensive documentation.
>
>
> I know, and I think that's a bad sign - why are we exposing this
> complexity to users in a class that otherwise holds their hand at
> every step of the way? I think the parameters should always be a
> user-friendly collation/locale object, with the ICU strings an
> optional way for experts to create such an object.

Yes, and that is why the RFC includes a ''TextCollator'' object that
does precisely that.

cheers,
Derick

--
https://derickrethans.nl | https://xdebug.org | https://dram.io

Author of Xdebug. Like it?
Consider supporting me: https://xdebug.org/support
Host of PHP Internals News: https://phpinternals.news

mastodon: @derickr@phpc.social @xdebug@phpc.social
twitter: @derickr and @xdebug