Newsgroups: php.internals
Xref: news.php.net php.internals:119179
Date: Fri, 16 Dec 2022 15:59:11 +0000
Cc: internals@lists.php.net
Subject: Re: [PHP-DEV] [RFC] Unicode Text Processing
From: rowan.collins@gmail.com (Rowan Tommins)

On 16 December 2022 13:55:02 GMT, Derick Rethans wrote:
>I do not want a polyfill. These already exist for intl and friends.

I think you misunderstood what I meant by "polyfill"; I meant it in the sense that once the real implementation is included in, say, PHP 8.3, users needing to support, say, PHP 8.0 will have a drop-in implementation with exactly the same interface.

Anyway, that was just an aside; my main point is that a single-page RFC and a single mailing list thread are probably not sufficient to iterate on this design. A prototype, or even just a repo with stubs for the methods, would give us better ways to track all the different details and ideas.

>I disagree.
>Users should not care what is used in the implementation.
>It's only UTF-16 because that is what ICU's API uses. I do not want the
>complexity of having different in/ex encodings. Perhaps 15 years ago
>that was useful to have, but right now, everything should be UTF-8 on
>the interface layer, that is, if you care about internationalisation.

UTF-8 should definitely be the default, but I disagree that all other encodings can simply be ignored, and that users should be punished for using them with extra CPU time spent converting to UTF-8 and back again. All it would need is an optional argument on a couple of methods to specify that you want some other encoding.

>A locale/collator is an inherent property of Text (we're dealing with
>Text here, not strings).

Is it though? It makes some sense to say "this is a Turkish Text, so treat 'i' specially whenever upper-casing". But is there such a thing as a "case insensitive piece of text"?

If locale is an "inherent property", does it make sense to discard it when joining Texts together? At the moment, Text::join([$a, $b])->toUpper() can give a different result from Text::join([$a->toUpper(), $b->toUpper()]). An implementation that truly treated locale as inherent would have to track segments within a larger Text, each subject to its own locale. (Similar to how HTML allows a lang attribute on individual elements.)

For comparisons, I don't see the value at all: if I'm sorting a list of Texts, the sort order is a property of the sort operation, not of the individual items. If I have a French Text, a Spanish Text, and an English Text, there's no meaningful way to use all three sort orders at once, and no particular reason to choose one over the others.

In the current proposal, using compareWith in a usort callback without specifying the collation would produce inconsistent results, because the comparison is not symmetrical: $a->compareWith($b) can use a different collation than $b->compareWith($a).
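To make the asymmetry concrete, here is a minimal sketch. The Text class below is a hypothetical stand-in, not the RFC's implementation: the constructor signature and the 'binary' and 'caseless' collation names are assumptions for illustration, and strcmp/strcasecmp stand in for real ICU collators. The only behaviour taken from the proposal is that compareWith uses the receiver's collation when none is given.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for the proposed Text class (illustration only).
final class Text
{
    public function __construct(
        private string $value,
        private string $collation // assumed names: 'binary' or 'caseless'
    ) {}

    // Compares using the collation attached to the receiver ($this),
    // ignoring whatever collation $other carries.
    public function compareWith(Text $other): int
    {
        return match ($this->collation) {
            'caseless' => strcasecmp($this->value, $other->value) <=> 0,
            default    => strcmp($this->value, $other->value) <=> 0,
        };
    }
}

$a = new Text('apple', 'binary');
$b = new Text('Banana', 'caseless');

$ab = $a->compareWith($b); // binary: 'a' (0x61) sorts after 'B' (0x42), so 1
$ba = $b->compareWith($a); // caseless: 'banana' sorts after 'apple', so 1

var_dump($ab, $ba); // int(1) int(1)
```

Both calls return 1, so each Text claims to sort after the other. A well-behaved comparison would need $ab to be the negation of $ba; a usort callback built on this one has no consistent order to converge on, and the outcome would depend on which argument the sort happens to pass first.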
>> the worrying sentence "This will require extensive documentation".
>
>This phrase is meant to mean that the *format of the locale/collator
>name* needs extensive documentation.

I know, and I think that's a bad sign: why are we exposing this complexity to users in a class that otherwise holds their hand at every step of the way? I think the parameters should always be a user-friendly collation/locale object, with the ICU strings an optional way for experts to create such an object.

Regards,

--
Rowan Tommins
[IMSoP]