Newsgroups: php.internals
Message-ID: <9876abdf-a7af-0e53-ee3e-236021d74377@wikimedia.org>
Date: Fri, 16 Dec 2022 14:21:03 +1100
To: Derick Rethans, PHP Developers Mailing List
Subject: Re: [PHP-DEV] [RFC] Unicode Text Processing
From: tstarling@wikimedia.org (Tim Starling)

On 16/12/22 02:34, Derick Rethans wrote:
> Hi,
>
> I have just published an initial draft of the "Unicode Text Processing"
> RFC, a proposal to have performant unicode text processing always
> available to PHP users, by introducing a new "Text" class.

Using "collator" and "locale" interchangeably seems imprecise. If the input is an ICU locale string, then I think you should just call it locale. Then the user will be armed with the correct terminology when they go looking for more information in the ICU manual. In ICU, case conversion and BreakIterator need a locale, not a collator.

I'm concerned about the time order of using grapheme offsets. For example, is subString() O(N) in $offset? If the idea is to be easy to use and performant, you don't want to have subtle algorithmic complexity traps.

I'm probably not the target audience for this class, since I'm generally looking for maximum flexibility, not minimum complexity. As such, I'd like intl to have better documentation and more features.

The RFC has a family of locale-aware case conversion functions which do not exist in intl. This was raised as an issue during the discussion on my ASCII case conversion RFC. It would be great if intl could get those functions too.

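To make the grapheme offset concern concrete, here is a rough sketch of why such a lookup tends to be linear in the offset. It is written in Python only to stay independent of any PHP API, and the names cluster_starts() and sub_string() are hypothetical, not taken from the RFC or from intl. It also assumes a deliberately simplified boundary rule (a new cluster starts at any code point that is not a combining mark); real segmentation follows many more rules, but the cost shape is the same: without a precomputed index, cluster number N can only be found by scanning from the start of the string.

```python
import unicodedata


def cluster_starts(text):
    """Yield the code point index where each (simplified) cluster begins."""
    for i, ch in enumerate(text):
        # ASSUMPTION: a cluster starts at any code point that is not a
        # combining mark. Real grapheme segmentation has many more rules.
        if i == 0 or unicodedata.combining(ch) == 0:
            yield i


def sub_string(text, offset, length):
    """Return `length` clusters starting at cluster number `offset`.

    There is no way to jump straight to cluster `offset`: the loop has to
    walk every cluster before it, so the work grows with `offset`.
    """
    start = end = len(text)
    for n, idx in enumerate(cluster_starts(text)):
        if n == offset:
            start = idx
        if n == offset + length:
            end = idx
            break
    return text[start:end]
```

Calling something like sub_string() in a loop with an increasing offset would therefore be quadratic overall, which is the kind of trap I mean.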
I think you should consider making this Text class a part of the intl extension. You're adding a class which is similar to the classes in that extension. In terms of data, it's like IntlChar, except it's for strings not characters. Its constructor takes an ICU locale string, just like IntlBreakIterator or MessageFormatter.

I can understand if you don't want to follow all the existing conventions of the intl extension. But if that is the rationale for the RFC, I'd like to see a discussion of the specific usability problems with the intl extension.

-- Tim Starling