Newsgroups: php.internals
Date: Fri, 16 Dec 2022 13:55:02 +0000 (GMT)
From: derick@php.net (Derick Rethans)
To: Rowan Tommins
Cc: internals@lists.php.net
Subject: Re: [PHP-DEV] [RFC] Unicode Text Processing

On Thu, 15 Dec 2022, Rowan Tommins wrote:

> On 15/12/2022 15:34, Derick Rethans wrote:
> > I have just published an initial draft of the "Unicode Text Processing"
> > RFC, a proposal to have performant unicode text processing always
> > available to PHP users, by introducing a new "Text" class.
> >
> > You can find it at:
> > https://wiki.php.net/rfc/unicode_text_processing
> >
> > I'm looking forwards to hearing your opinions, additions, and
> > suggestions - the RFC specifically asks for these in places.
>
>
> As others have said already, thank you for taking a stab at this important
> topic. I agree that it would be a really useful feature for the language, but
> it's also a really difficult one to get right. Here are my initial thoughts...
>
> # Design Process
>
> Rather than designing the whole class "on paper", I think this really needs to
> be built as a prototype, where we can build up documentation and tests, plug
> variations into some real life scenarios, and have separate discussions about
> different details. If we limit ourselves initially to features already exposed
> by ext/intl (I think everything proposed so far is?), a prototype doesn't even
> need to be an extension, it can be in pure PHP. Then once the design is
> finalised, you have a ready-made polyfill for older PHP versions, and a set of
> tests for the native version :)

I do not want a polyfill. These already exist for intl and friends. I
had no intention of designing everything up front though, and it is
likely that I have missed useful methods. This is not going to be right
in a single implementation.

> # UTF-8 on the outside, UTF-16 on the inside
>
> I know this will be a very common combination, but it feels odd that an
> application which actually wanted to work with UTF-16 would need to perform
> round-trips through UTF-8 just to use this class. It should at least be
> possible to specify the encoding on input and output.

I disagree. Users should not care what is used in the implementation.
It's only UTF-16 because that is what ICU's APIs use. I do not want the
complexity of having different in/ex encodings.
Perhaps 15 years ago
that was useful to have, but right now, everything should be UTF-8 on
the interface layer, that is, if you care about internationalisation.

> # Internationalisation
>
> Having locale and collation as state on the object, rather than
> parameters on relevant methods, feels like muddling responsibilities.
> It makes it hard to reason about what exactly some of the methods will
> do: Can I trust that this object will give me a sensible result from
> compareWith, or has it been assigned a collation somewhere else? What
> exactly will be the definition of "replace" or "contains" for this
> pair of objects?

A locale/collator is an inherent property of Text (we're dealing with
Text here, not strings). I do need to tidy up the wording about what
locales and collations are, as I've so far used them somewhat
interchangeably.

> How users will work with these also needs careful thought - your first listed
> design goal is "keep it simple", but under locales and Internationalisation is
> the worrying sentence "This will require extensive documentation".

That phrase means that the *format of the locale/collator name* needs
extensive documentation.

> One function that I would really like to see, for instance, is a
> grapheme-aware version of mb_strcut, to solve tasks like: "encode this
> abstract Unicode string as UTF-16BE, truncated to at most 200 bytes,
> without breaking apart any grapheme clusters".

For that to work, you need a method that directly returns UTF-8
strings, and not UTF-16. In the RFC, the current subString() uses int
$length to mean grapheme clusters. Adding another method to do
something else is, of course, possible. I'll think about it (and have
noted it in "Open Issues").

cheers,
Derick

-- 
https://derickrethans.nl | https://xdebug.org | https://dram.io
Author of Xdebug. Like it?
Consider supporting me: https://xdebug.org/support
Host of PHP Internals News: https://phpinternals.news
mastodon: @derickr@phpc.social @xdebug@phpc.social
twitter: @derickr and @xdebug