Newsgroups: php.internals
Xref: news.php.net php.internals:119168
Date: Fri, 16 Dec 2022 13:25:39 +0000 (GMT)
From: derick@php.net (Derick Rethans)
To: Paul Crovella
Cc: internals@lists.php.net
Subject: Re: [PHP-DEV] Re: [RFC] Unicode Text Processing

On Thu, 15 Dec 2022, Paul Crovella wrote:

> On 12/15/2022 7:34 AM, Derick Rethans wrote:
> > https://wiki.php.net/rfc/unicode_text_processing
>
> A few quick thoughts:
>
> > The constructor will also convert the given text to Unicode Canonical Form.
>
> By this do you mean Normalization Form C (NFC)?
> "Unicode Canonical Form" isn't a phrase I'm familiar with.

Yes. I've seen both phrases used, so I'll add NFC in brackets.

> Assuming so, are modified texts (e.g. via join, replaceText, reverse)
> re-normalized?

Yes, although I do not expect that to change anything, as normalisation
usually happens *in* a grapheme, and not between them. I suspect there
might be some Indian languages where that is proven wrong though.

> > The constructor will also strip out a BOM (Byte-Order-Mark)
> > character, if present.
>
> This is also known as ZWNBSP (Zero Width No-Break Space). Will only a
> leading instance be stripped? If so, how can someone search for it (or
> a substring beginning with it) given that:
>
> > If an argument to any of the methods is listed as string|Text, passing in a
> > string value will have the same semantics as replacing the passed value with
> > new Text($string).
>
> and all the search methods take `string|Text $search`.

I hadn't realised this is now used for both use cases. I've just read [1]:

"If the BOM character appears in the middle of a data stream, Unicode
says it should be interpreted as a "zero-width non-breaking space"
(inhibits line-breaking between word-glyphs). In Unicode 3.2, this usage
is deprecated in favor of the "Word Joiner" character, U+2060.[1] This
allows U+FEFF to be used only as a BOM."

This indicates that this might not be a problem. Would you have a better
suggestion?

> ---
>
> Why is this being introduced directly into PHP core rather than first an
> extension where it's easier to shake out the interface and behavior?

It will be developed as an extension inside the ext/ branch, pretty much
like ext/standard or ext/date; but if it is not in core, very few people
will use it, defeating the whole point of the effort.
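
To make the two technical points above concrete (re-normalisation after a
join, and stripping of a leading BOM), here is a minimal sketch in Python
rather than PHP. It is not the RFC's implementation: `make_text` and `join`
are hypothetical stand-ins for `new Text($string)` and the proposed join
method, and the assumption that only a single leading U+FEFF is stripped is
exactly the open question raised in the thread.

```python
import unicodedata

BOM = "\ufeff"  # U+FEFF: BOM at the start, ZWNBSP anywhere else


def make_text(s: str) -> str:
    """Hypothetical stand-in for `new Text($string)`.

    Assumption: strips one leading BOM only, then normalises to NFC.
    """
    if s.startswith(BOM):
        s = s[1:]
    return unicodedata.normalize("NFC", s)


def join(parts) -> str:
    """Hypothetical join: concatenate the parts, then re-normalise to NFC."""
    return unicodedata.normalize("NFC", "".join(parts))


# Each piece is already in NFC on its own.
left = make_text("e")
right = make_text("\u0301")  # COMBINING ACUTE ACCENT

# Plain concatenation leaves two code points, which is not NFC,
# because the pieces merge into a single grapheme at the boundary.
naive = left + right

# Re-normalising after the join composes them into U+00E9.
joined = join([left, right])

# A leading BOM is dropped, an interior U+FEFF is kept.
leading = make_text(BOM + "abc")
interior = make_text("a" + BOM + "b")

# A search needle that is just U+FEFF becomes the empty string,
# which is the searching concern raised in the quoted question.
needle = make_text(BOM)
```

In this sketch, normalisation does change the result when a join creates a new
grapheme at the boundary, and a needle consisting solely of U+FEFF is reduced
to an empty string once it is passed through the same constructor semantics.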

cheers,
Derick

[1] https://en.wikipedia.org/wiki/Byte_order_mark#Usage

-- 
https://derickrethans.nl | https://xdebug.org | https://dram.io

Author of Xdebug. Like it? Consider supporting me: https://xdebug.org/support
Host of PHP Internals News: https://phpinternals.news

mastodon: @derickr@phpc.social @xdebug@phpc.social
twitter: @derickr and @xdebug