Newsgroups: php.internals
From: larry@garfieldtech.com ("Larry Garfield")
To: "php internals"
Date: Wed, 21 Dec 2022 20:27:22 -0600
Subject: Re: [PHP-DEV] [RFC] Unicode Text Processing

On Thu, Dec 15, 2022, at 9:34 AM, Derick Rethans wrote:
> Hi,
>
> I have just published an initial draft of the "Unicode Text Processing"
> RFC, a proposal to have performant unicode text processing always
> available to PHP users, by introducing a new "Text" class.
>
> You can find it at:
> https://wiki.php.net/rfc/unicode_text_processing
>
> I'm looking forwards to hearing your opinions, additions, and
> suggestions — the RFC specifically asks for these in places.
>
> cheers,
> Derick
>
>
> --
> https://derickrethans.nl | https://xdebug.org | https://dram.io
>
> Author of Xdebug. Like it? Consider supporting me: https://xdebug.org/support
> Host of PHP Internals News: https://phpinternals.news
>
> mastodon: @derickr@phpc.social @xdebug@phpc.social
> twitter: @derickr and @xdebug

Derick, thank you for tackling this. It's a decidedly not-simple problem space and I'm glad someone like you is looking into it.

I'm overall in favor of the RFC, though I have some comments/pushback.

First, here's my notes just as I'm reading through it:

Re Text::create, which the RFC suggests can be aliased as a function for easier use: Are you sure? Symfony has stand-alone wrapper functions. You cannot alias a static method directly to a function. cf: https://3v4l.org/V4kP2 (It would be really nice if that worked! But it doesn't seem to.)

Text::concat: What happens if different Text objects passed to that have different collations? Which gets used, the first, last, "best fit"...?

Text->wrap: " If $cutLongWords is set, no Text element will be larger than $maxWidth."
Can you include an example here? I have to mentally noodle through what this means, which means it could use an example.

Text->getPositionOfFirstOccurrence() et al: These should not return false on not-found. That is an anti-pattern. That PHP's existing libraries do that is a bug, not a feature. It has caused no end of bugs. null is the correct thing to return here, especially with the new null-handling syntax we have now in PHP 8. Do not use "false" as a not-found return, ever. Another option to consider is if some of them should return an empty Text object. (That may not be the best answer, but it's one worth considering.)

Also, all of those names are very long. :-(

Text->returnFromFirstOccurence(): I much prefer startingWith(). It's half the length and just as descriptive, if not more so. It also implies, to me, that the $search will be included in the result, whereas startingAt(), for whatever reason, doesn't seem like it does.

Text->contains(): The header is missing the defined return type.

Comparing Text Objects: Oh, for being able to overload those operators. This would be a great use case for it. :-(

The examples in case-conversion are hard to follow, because the font of code samples is not that different from normal text. Could you perhaps multi-line them, to make it clearer where the "in" text ends and the "out" text starts?

How do toTitle() and wordsToUpper() differ? They sound like the same thing... (Please note the difference in the RFC.)

Why two methods for length? And why confuse it with "character" when the text has been very consistent about using grapheme to this point?

getCodePointCount(): I... don't understand how this is different from length, so I don't see why we'd use it. If it's kept, please include a better explanation of what it is or why I'd care.

getWordCount(): The example uses getWordIterator as a property, when I think it's supposed to be a method. Also, it's not syntax highlighted.

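To make the false-versus-null point about getPositionOfFirstOccurrence() concrete, here is a minimal sketch that uses only existing core string functions. findPosition() is a hypothetical helper made up for illustration; it is not in the RFC and only wraps strpos() so that not-found becomes null.

```php
<?php
declare(strict_types=1);

// Hypothetical helper, not part of the RFC: behaves like strpos(),
// but returns null instead of false when $needle is not found.
function findPosition(string $haystack, string $needle): ?int
{
    $pos = strpos($haystack, $needle);
    return $pos === false ? null : $pos;
}

// The classic bug with a false return: a match at offset 0 is falsy,
// so a loose truthiness check reads "found at the start" as "not found".
$looseCheckSaysFound = (bool) strpos('abc', 'a');

// With null, not-found composes with the null coalescing operator,
// and a legitimate offset of 0 survives untouched.
$found   = findPosition('abc', 'a') ?? -1;
$missing = findPosition('abc', 'z') ?? -1;
```

Here $looseCheckSaysFound ends up false even though 'a' is present, while the null-returning variant distinguishes offset 0 from not-found without needing a strict === false comparison.
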
"The return of the iterators are effected by the text's locale." - affec= ted, not effected. getCharacterIterator(): Again, dropping in the word character here. Cal= ling it getGraphemeIterator() would be terrible, of course. :-) This fe= els like an older part of the text that wasn't updated when most of it s= tarted using grapheme. Perhaps skip the explanation here and move a cen= tral definition of "character" to the start of the RFC? (Which could be= "means the same as grapheme in this case, NOT the same as byte.") getWordIterator(): It's not clear to me if this includes whitespace as i= ts own Text objects. Would the string "Mr. Smith Goes to Washington" be= a word iterator of ["Mr.", "Smith", "Goes", "to", "Washington"]? Or ["= Mr", ".", "Smith", "Goes", "to", "Washington"]? Or ["Mr", ".", " ", "Sm= ith", " ", "Goes", " ", "to", " ", "Washington"]? I'm not clear which i= s the intent here. (Feel free to steal this example for the RFC.) getLineIterator(): I do not understand this description at all. From th= e name, I'd expect it to break the string at newline characters. The de= scription seems like it's something completely different I do not unders= tand. getTitleIterator(): What's a title, in this context? Transliteration section: The formatting here seems wonky and confusing. = Please clean up. ----- Second, there is a PHP-FIG Working Group on translation. It's mostly id= le at the moment, as we're waiting on the MessageFormat working group at= the W3C to stabilize their next version so we can just steal it. I don= 't know that there's any direct overlap between this RFC and that WG, bu= t I'm mentioning it for transparency, and to encourage people to think a= bout how they could both be developed to play nicely together, whatever = that means. Third, is there some way to say "this string, but in some other collatio= n?" It looks like the only way to do that is via Text::create($txt, 'ne= w-collation') / new Text($txt, 'new-collation'). 
A ->withCollation('new-collation') method would be very helpful, especially as so many methods rely on the collation for things like case insensitivity. That way, we could do $txt->withCollation('case-insensitive-english')->split(',') (or similar).

Fourth, that brings me to my biggest concern. "The format of this locale/collation name needs extensive documentation." - This line scares the ever-loving crap out of me. :-) We know from experience that complex formatting strings are trivially screwed up, mistyped, or otherwise gotten wrong. Especially when they're not self-evident. ("ks" means "case-insensitive"? I would never have guessed that in a million years.) The links provided to the Unicode sites don't really illuminate anything for me.

This to me sounds like it cries out for either a builder object, an enumeration (or multiple), or some combination of those. It sounds like the TextCoallator class is maybe trying to be that, but it's still under-described, especially for anyone who hasn't already used intl. It also looks like it's just producing a string, rather than, what I'd consider probably more ergonomic, using the object directly like DateTimeZone.

I'm not sure what the best answer here is since I don't know the problem space well enough, but I do know that "here's a string, GL" is insufficient. We should noodle on how to make that more ergonomic so that casual developers don't get it horribly subtly wrong. (Because you know they/we/I will on this topic.) That may also include extra utility methods like ->withCaseInsensitiveCollation() (which changes only the case sensitivity marker but leaves the rest of the collation alone), or something like that. Again, I'm not sure what the best design here is.

Everything I just said also applies to the "$transliterationString", which is mentioned in passing but no description is provided. I have no clue what the syntax for that even is.

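As a rough illustration of the builder-object-plus-enumeration idea above, the sketch below shows one possible shape. Every name in it (CaseSensitivity, Collation, withCaseSensitivity) is hypothetical and made up for this example; none of it comes from the RFC or from intl, and it deliberately does not try to produce a real locale/collation string. It assumes PHP 8.1 or later for enums and readonly properties.

```php
<?php
declare(strict_types=1);

// Everything below is hypothetical; none of it is in the RFC.
enum CaseSensitivity
{
    case Sensitive;
    case Insensitive;
}

// An immutable value object standing in for the formatting string.
final class Collation
{
    public function __construct(
        public readonly string $language,
        public readonly CaseSensitivity $caseSensitivity = CaseSensitivity::Sensitive,
    ) {}

    // Returns a copy that differs only in case sensitivity and
    // leaves the rest of the collation alone.
    public function withCaseSensitivity(CaseSensitivity $caseSensitivity): self
    {
        return new self($this->language, $caseSensitivity);
    }
}

$english = new Collation('en');
$loose   = $english->withCaseSensitivity(CaseSensitivity::Insensitive);
```

A Text object could then accept such an object directly, much as date functions accept a DateTimeZone, so that a mistyped option becomes a type error rather than a silently wrong string. Whether that is the right design for the RFC is an open question.
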
And of course it would be off brand for me to not note the issues with a fixed set of methods on an object like this, rather than pipe-friendly functions that are innately more extensible. I know, I know, we don't have pipes yet, but I have to mention it anyway or people would be worried about me. :-)

--Larry Garfield