Newsgroups: php.internals
Xref: news.php.net php.internals:119154
To: internals@lists.php.net
Date: Thu, 15 Dec 2022 09:15:19 -0800
Subject: Re: [RFC] Unicode Text Processing
From: paul.crovella@gmail.com (Paul Crovella)

On 12/15/2022 7:34 AM, Derick Rethans wrote:
> https://wiki.php.net/rfc/unicode_text_processing

A few quick thoughts:

> The constructor will also convert the given text to Unicode Canonical Form.

By this do you mean Normalization Form C (NFC)? "Unicode Canonical Form" isn't a phrase I'm familiar with. Assuming so, are modified texts (e.g. via join, replaceText, reverse) re-normalized?

---

> The constructor will also strip out a BOM (Byte-Order-Mark) character, if present. This is also known as ZWNBSP (Zero Width No-Break Space).

Will only a leading instance be stripped? If so, how can someone search for it (or a substring beginning with it) given that:

> If an argument to any of the methods is listed as string|Text, passing in a string value will have the same semantics as replacing the passed value with new Text($string).

and all the search methods take `string|Text $search`?

---

Why is this being introduced directly into PHP core rather than first as an extension, where it's easier to shake out the interface and behavior?
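
To make the normalization and BOM questions above concrete, here is a rough sketch. It is written in Python with the standard `unicodedata` module purely as a neutral stand-in and does not use the RFC's `Text` class. `strip_leading_bom` is a hypothetical helper that assumes the constructor removes only a leading U+FEFF, which is exactly the point I'm asking about.

```python
import unicodedata


def nfc(s: str) -> str:
    """Return s in Normalization Form C."""
    return unicodedata.normalize("NFC", s)


def is_nfc(s: str) -> bool:
    """True if s is already in NFC."""
    return nfc(s) == s


# 1) NFC is not closed under concatenation.
left = "e"        # U+0065, already NFC
right = "\u0301"  # U+0301 COMBINING ACUTE ACCENT; alone it is also NFC
joined = left + right

# Each piece is NFC, but the joined result is not: NFC would compose
# the pair into the single code point U+00E9. So a join() that does not
# re-normalize can hand back text that is no longer in NFC.
pieces_are_nfc = is_nfc(left) and is_nfc(right)  # True
joined_is_nfc = is_nfc(joined)                   # False

# 2) Stripping a leading BOM from a search argument changes the search.
BOM = "\ufeff"


def strip_leading_bom(s: str) -> str:
    # Hypothetical: assumes only one leading U+FEFF is removed.
    return s[1:] if s.startswith(BOM) else s


haystack = "a" + BOM + "b"
needle = BOM + "b"

intended_pos = haystack.find(needle)                    # 1
actual_pos = haystack.find(strip_leading_bom(needle))   # 2, matches "b" only
```

If search arguments go through the same constructor, the second case finds a different substring than the one asked for, and a lone U+FEFF needle becomes the empty string.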