Newsgroups: php.internals
Date: Fri, 16 Dec 2022 11:21:46 +0000
In-Reply-To: <9876abdf-a7af-0e53-ee3e-236021d74377@wikimedia.org>
To: PHP Developers Mailing List
Subject: Re: [PHP-DEV] [RFC] Unicode Text Processing
From: rowan.collins@gmail.com (Rowan Tommins)

On Fri, 16 Dec 2022 at 03:21, Tim Starling wrote:

> I'm concerned about the time order of using grapheme offsets. For
> example, is subString() O(N) in $offset? If the idea is to be easy to
> use and performant, you don't want to have subtle algorithmic
> complexity traps.

This is a good point; it's certainly true of existing functions, like grapheme_strlen(), and indeed mb_strlen(), which has to iterate variable-width code points.

Perhaps we could take advantage of having a stateful object and internally optimise this in some way, such as caching a partial lookup table of grapheme offsets to byte offsets. For instance, the table might look like this:

10: 22
20: 50
30: 70
35: 82; LAST

Then $string->subString(23, 20) would:

* take a pointer to byte 50, the cached offset of grapheme 20 (the nearest entry at or before 23)
* pass it to the ICU grapheme iterator to skip over 3 graphemes; let's say that takes us to byte 58
* since 23 + 20 > 35, include the rest of the string
* construct the new object's offset table without examining the string:

7: 12 (grapheme 30 - 23; byte 70 - 58)
12: 24; LAST (grapheme 35 - 23; byte 82 - 58)

Whether this complexity would pay off in real-world scenarios, I don't know, but if people started using this for all the text in an application, I can see longer strings becoming a more common use case.

Regards,
-- 
Rowan Tommins
[IMSoP]
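
The offset-table arithmetic in the example above can be sketched roughly as follows. This is only an illustration in Python, not a proposed implementation: the function names nearest_checkpoint and rebase_table are hypothetical, the table is modelled as a sorted list of (grapheme, byte) pairs, and the byte offset 58 is simply assumed as the result of the ICU iterator skipping 3 graphemes. Only the "rest of the string" case from the example is covered.

```python
from bisect import bisect_right

def nearest_checkpoint(table, grapheme_index):
    """Return the last cached (grapheme, byte) pair at or before grapheme_index.

    Falls back to (0, 0), the start of the string, if no entry qualifies.
    """
    keys = [g for g, _ in table]
    i = bisect_right(keys, grapheme_index)
    return table[i - 1] if i else (0, 0)

def rebase_table(table, start_grapheme, start_byte):
    """Derive the table for a substring that runs to the end of the string.

    Entries after the start are shifted by the start offsets; the string
    itself is never examined.
    """
    return [(g - start_grapheme, b - start_byte)
            for g, b in table if g > start_grapheme]

# Cached table from the example; the last entry marks the end of the string.
table = [(10, 22), (20, 50), (30, 70), (35, 82)]

checkpoint = nearest_checkpoint(table, 23)   # grapheme 20 at byte 50
graphemes_to_skip = 23 - checkpoint[0]       # 3, handed to the grapheme iterator
start_byte = 58                              # assumed iterator result, as in the example
new_table = rebase_table(table, 23, start_byte)
```

With these inputs new_table holds the two entries worked out by hand above, 7: 12 and 12: 24. The cost of the lookup is a binary search over the table plus iterating at most the gap between two cached entries, rather than iterating from the start of the string.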