Newsgroups: php.internals
Xref: news.php.net php.internals:119161
Date: Thu, 15 Dec 2022 22:20:15 +0000
To: internals@lists.php.net
Subject: Re: [PHP-DEV] [RFC] Unicode Text Processing
From: rowan.collins@gmail.com (Rowan Tommins)

On 15/12/2022 15:34, Derick Rethans wrote:
> I have just published an initial draft of the "Unicode Text Processing"
> RFC, a proposal to have performant unicode text processing always
> available to PHP users, by introducing a new "Text" class.
>
> You can find it at:
> https://wiki.php.net/rfc/unicode_text_processing
>
> I'm looking forwards to hearing your opinions, additions, and
> suggestions — the RFC specifically asks for these in places.

As others have said already, thank you for taking a stab at this
important topic. I agree that it would be a really useful feature for
the language, but it's also a really difficult one to get right.

Here are my initial thoughts...


# Design Process

Rather than designing the whole class "on paper", I think this really
needs to be built as a prototype, where we can build up documentation
and tests, plug variations into some real life scenarios, and have
separate discussions about different details.

If we limit ourselves initially to features already exposed by ext/intl
(I think everything proposed so far is?), a prototype doesn't even need
to be an extension, it can be in pure PHP.
Then once the design is finalised, you have a ready-made polyfill for
older PHP versions, and a set of tests for the native version :)

We might also want to do some general investigation of what other
languages and frameworks provide, and which decisions have proven good
or bad in practice.


# Lossy Transforms

Automatic normalisation and stripping of BOMs seem useful, but they
immediately rule out use of this class for anything where you want to
get back what you put in. For instance, if an ORM used Text instances
for strings in data models, it would generate extra UPDATE queries on
the database whenever normalisation altered the stored bytes, even
though the application never changed the string. I think it would be
better to make this easy but explicit.


# UTF-8 on the outside, UTF-16 on the inside

I know this will be a very common combination, but it feels odd that an
application which actually wanted to work with UTF-16 would need to
perform round-trips through UTF-8 just to use this class. It should at
least be possible to specify the encoding on input and output.

Ruby takes an interesting approach where strings are tagged with their
current binary encoding, and only converted to another form if actually
required. If your input layer says
"$name = new Text($_GET['name'], 'Windows-1252');" and your output layer
says "echo $name->asBytes('Windows-1252');", the overhead of converting
to UTF-16 can be skipped entirely, unless something in between says
"$name = $name->wordsToUpper()".

This also removes another source of lossy transformation, since some
encoding conversions aren't perfectly reversible (e.g. the source
encoding has more than one byte sequence mapped to the same Unicode
code point).


# Internationalisation

Having locale and collation as state on the object, rather than
parameters on relevant methods, feels like muddling responsibilities.
It makes it hard to reason about what exactly some of the methods will
do: Can I trust that this object will give me a sensible result from
compareWith, or has it been assigned a collation somewhere else? What
exactly will be the definition of "replace" or "contains" for this pair
of objects?

How users will work with these also needs careful thought - your first
listed design goal is "keep it simple", but under locales and
Internationalisation is the worrying sentence "This will require
extensive documentation". This is one of those places where "doing it
right" is really hard to combine with "making it easy", because
language is inherently complex, but users will expect a simple answer
to "how do I make it case-insensitive?"


# Allowing other abstractions

I 100% approve of your use of grapheme clusters, rather than code
points, as the primary unit; so many implementations get that wrong.
However, when interacting with other systems, reasoning about bytes (or
sometimes even code points) is essential.

One function that I would really like to see, for instance, is a
grapheme-aware version of mb_strcut, to solve tasks like: "encode this
abstract Unicode string as UTF-16BE, truncated to at most 200 bytes,
without breaking apart any grapheme clusters".


Thanks again for getting the ball rolling, and I look forward to
helping iterate the design.

Regards,

--
Rowan Tommins
[IMSoP]
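
P.S. To make the "grapheme-aware mb_strcut" request a bit more concrete, here is a rough sketch of the behaviour I have in mind, written in pure PHP on top of ext/intl and mbstring. The function name cut_graphemes_to_bytes() and its parameters are made up for illustration; nothing like it is in the RFC or in either extension today. It assumes valid UTF-8 input and a stateless target encoding, since each cluster is converted on its own.

```php
<?php
// Illustrative sketch only: cut_graphemes_to_bytes() is a hypothetical
// name, not part of the RFC, ext/intl or mbstring.
// Assumes ext/intl (grapheme_extract) and ext/mbstring are loaded.

function cut_graphemes_to_bytes(
    string $utf8,
    int $maxBytes,
    string $encoding = 'UTF-16BE'
): string {
    $out = '';
    $offset = 0;
    $length = strlen($utf8);

    while ($offset < $length) {
        // Pull out exactly one grapheme cluster starting at $offset;
        // $next receives the byte offset where the following one starts.
        $next = 0;
        $cluster = grapheme_extract($utf8, 1, GRAPHEME_EXTR_COUNT, $offset, $next);
        if ($cluster === false || $cluster === '' || $next <= $offset) {
            break; // invalid input or no progress: stop instead of looping
        }

        // Re-encode just this cluster and check whether it still fits.
        $bytes = mb_convert_encoding($cluster, $encoding, 'UTF-8');
        if (strlen($out) + strlen($bytes) > $maxBytes) {
            break; // adding it would exceed the limit, so cut before it
        }

        $out .= $bytes;
        $offset = $next;
    }

    return $out;
}

// "a", then "e" + U+0301 (one cluster, 4 bytes in UTF-16BE), then "b".
echo bin2hex(cut_graphemes_to_bytes("ae\u{301}b", 5)), "\n"; // 0061
echo bin2hex(cut_graphemes_to_bytes("ae\u{301}b", 6)), "\n"; // 006100650301
```

With a 5-byte limit the accented cluster is dropped whole rather than split after the bare "e", which is the difference from a plain byte-oriented cut. A native version could presumably avoid the per-cluster conversion, but the observable result should be the same.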