Newsgroups: php.internals
Xref: news.php.net php.internals:119172
Date: Fri, 16 Dec 2022 13:54:55 +0000 (GMT)
From: derick@php.net (Derick Rethans)
To: Tim Starling
Cc: PHP Developers Mailing List
In-Reply-To: <9876abdf-a7af-0e53-ee3e-236021d74377@wikimedia.org>
Subject: Re: [PHP-DEV] [RFC] Unicode Text Processing

On Fri, 16 Dec 2022, Tim Starling wrote:

> On 16/12/22 02:34, Derick Rethans wrote:
> >
> > I have just published an initial draft of the "Unicode Text
> > Processing" RFC, a proposal to have performant unicode text
> > processing always available to PHP users, by introducing a new
> > "Text" class.
>
> Using "collator" and "locale" interchangeably seems imprecise.
> If the input is an ICU locale string, then I think you should just
> call it locale. Then the user will be armed with the correct
> terminology when they go looking for more information in the ICU
> manual. In ICU, case conversion and BreakIterator need a locale, not
> a collator.

Yeah, the terms are currently used interchangeably (sort of). I will
update that. Although I really would not suggest that users look at
the ICU manual, as it's really hard to find things in it :-)

> I'm concerned about the time order of using grapheme offsets. For
> example, is subString() O(N) in $offset?

Yes. It would have to scan the Text.

> I'm probably not the target audience for this class, since I'm
> generally looking for maximum flexibility, not minimum complexity. As
> such, I'd like intl to have better documentation and more features.
> The RFC has a family of locale-aware case conversion functions which
> do not exist in intl. This was raised as an issue during the
> discussion on my ASCII case conversion RFC. It would be great if intl
> could get those functions too.

AFAIK Intl can do all of these things, but yes, its documentation is
"sparse". However, that's not in scope of this RFC.

> I think you should consider making this Text class a part of the intl
> extension. You're adding a class which is similar to the classes in
> that extension. In terms of data, it's like IntlChar, except it's for
> strings not characters. Its constructor takes an ICU locale string,
> just like IntlBreakIterator or MessageFormatter.

I did consider that, and rejected that idea. Intl, although powerful,
does not have an approachable API. It is also not installed or
available by default, and I am not suggesting we do that. That then
means that it doesn't fit the design goals here (having it always
available).

cheers,
Derick

-- 
https://derickrethans.nl | https://xdebug.org | https://dram.io
Author of Xdebug. Like it?
Consider supporting me: https://xdebug.org/support
Host of PHP Internals News: https://phpinternals.news
mastodon: @derickr@phpc.social @xdebug@phpc.social
twitter: @derickr and @xdebug