Newsgroups: php.internals
From: colinodell@gmail.com ("Colin O'Dell")
Date: Fri, 2 Oct 2020 09:01:29 -0400
To: "Christoph M. Becker"
Cc: Thomas Landauer, PHP Internals
Subject: Re: [PHP-DEV] Re: Suggestion: Make all PCRE functions return *character* offsets, rather than *byte* offsets if the modifier `u` (PCRE_UTF8) is given

The ability to receive the "character" offset would be extremely useful
to the league/commonmark project. This project is a Markdown parser that
conforms to the CommonMark spec, which defines all behavior with regard
to Unicode code points:

On Fri, Oct 2, 2020 at 8:25 AM Christoph M. Becker wrote:

> While it is trivial to get the next code point using index+1, this is
> not necessarily the next character, as perceived by a human. Using
> mb_substr(), you may even break "characters", e.g. >.
>

In my particular use case, this is entirely acceptable per the spec
linked to above.

Because the CommonMark spec is "character"-centric, we need to keep
track of character positions within strings when parsing forwards, while
still allowing regular expressions to be matched against UTF-8 strings.
As Thomas noted, using PREG_OFFSET_CAPTURE provides us with the byte
offset, not the "character" offset. We therefore must do additional work
to calculate the latter from the former:

    $offset = \mb_strlen(\substr($subject, 0, $matches[0][1]), 'UTF-8');

This code is executed frequently, so it performs worse than if
preg_match() could simply return the offsets we need.

Would I be correct in assuming that preg_match() already has some
knowledge or awareness of code points / "characters" when matching
against UTF-8 strings and capturing offsets? If so, I think it would be
very beneficial to provide that information to userland to avoid
unnecessary recalculation.

I'd therefore like to propose a third alternative option: a new flag
like PREG_OFFSET_CODEPOINT. When used in combination with
PREG_OFFSET_CAPTURE, it would return the offset position in terms of
"characters", not bytes. This could also be used to interpret any
$offset argument as "characters" instead of bytes.

The reason I prefer this option is that it doesn't break BC and is
entirely opt-in. If a developer wants this behavior and understands the
implications, they can use it. Nobody else is affected.

On Fri, Oct 2, 2020 at 8:25 AM Christoph M. Becker wrote:

> If mbstring functions are used to find some offset, they always have to
> traverse the string from the beginning, even if you are just interested
> in the last code point of a long string. If you have byte offsets, that
> code point can be accessed directly. Of course, that may not suit any
> possible scenario, but I still don't think that the PCRE functions
> should deal with code point offset instead of byte offsets.
>

I'll admit that I don't have the best understanding of how PCRE works
under the hood, but I do believe that because it offers some
functionality for working with code points, having it also work with
code-point-based offsets seems like a natural extension. And while it
may not be the optimal or most common way of working with strings, I do
believe there are some valid use cases for it. If placing this within
PCRE violates some principles of the library, then I'd be okay with
placing similar functionality elsewhere.

-- 
Colin O'Dell
colinodell@gmail.com
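For readers following along, the byte-to-"character" conversion described above can be sketched as a small userland helper. This is only an illustration under stated assumptions: the function name pregMatchCodepointOffsets() is hypothetical (it is not part of PHP or league/commonmark), the mbstring extension is assumed to be loaded, and the source file is assumed to be saved as UTF-8. It uses only the existing PREG_OFFSET_CAPTURE flag; the proposed PREG_OFFSET_CODEPOINT flag does not exist.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: run preg_match() with PREG_OFFSET_CAPTURE and
 * convert every reported byte offset into a code point offset.
 *
 * Returns null when the pattern does not match (or on a PCRE error).
 */
function pregMatchCodepointOffsets(string $pattern, string $subject): ?array
{
    if (\preg_match($pattern, $subject, $matches, PREG_OFFSET_CAPTURE) !== 1) {
        return null;
    }

    $result = [];
    foreach ($matches as $key => [$text, $byteOffset]) {
        // Unmatched groups are reported with offset -1; pass that through
        // unchanged instead of feeding it to substr().
        $result[$key] = [
            $text,
            $byteOffset < 0
                ? -1
                // Count the code points in the bytes that precede the match.
                // This rescans the prefix on every call, which is the
                // overhead the message above is concerned about.
                : \mb_strlen(\substr($subject, 0, $byteOffset), 'UTF-8'),
        ];
    }

    return $result;
}

// 'ä', 'ö' and 'ü' take two bytes each in UTF-8, so 'bbb' starts at
// byte offset 7 but at code point offset 4.
$m = pregMatchCodepointOffsets('/b+/u', 'äöü bbb');
```

Because each call counts code points from the start of the subject, a parser that matches repeatedly at increasing positions pays for the same prefix again and again; an opt-in flag inside preg_match() would be one way to avoid that repeated work.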