Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:111985
MIME-Version: 1.0
References: <cd1a914c-aeb2-9c73-147c-bb49a78318a5@landauer.at> <2c54d906-35c7-6ed6-ac47-649b7bcde2a2@gmx.de>
In-Reply-To: <2c54d906-35c7-6ed6-ac47-649b7bcde2a2@gmx.de>
Date: Fri, 2 Oct 2020 09:01:29 -0400
Message-ID: <CAJaRsPtHSJodGFbu870gOZVdXZVSHX9e0jc7GTF2gswm+OO6vw@mail.gmail.com>
To: "Christoph M. Becker" <cmbecker69@gmx.de>
Cc: Thomas Landauer <thomas@landauer.at>, PHP Internals <internals@lists.php.net>
Content-Type: multipart/alternative; boundary="000000000000dbe58205b0afbbce"
Subject: Re: [PHP-DEV] Re: Suggestion: Make all PCRE functions return
 *character* offsets,rather than *byte* offsets if the modifier `u`
 (PCRE_UTF8) is given
From: colinodell@gmail.com ("Colin O'Dell")

--000000000000dbe58205b0afbbce
Content-Type: text/plain; charset="UTF-8"

The ability to receive the "character" offset would be extremely useful to
the league/commonmark project.  This project is a Markdown parser that
conforms to the CommonMark spec, which defines all behavior in terms of
Unicode code points: <https://spec.commonmark.org/0.29/#character>

On Fri, Oct 2, 2020 at 8:25 AM Christoph M. Becker <cmbecker69@gmx.de>
wrote:

> While it is trivial to get the next code point using index+1, this is
> not necessarily the next character, as perceived by a human.  Using
> mb_substr(), you may even break "characters", e.g.
> <https://3v4l.org/5geOr>.
>

In my particular use case, this is entirely acceptable per the spec linked
to above.
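
To make the distinction concrete, here is a minimal sketch of a code point
versus a "character" as perceived by a human.  The sample string is my own
illustration (not the one behind the 3v4l link) and assumes the mbstring
extension is loaded:

```php
<?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT is displayed as one "é",
// but it consists of two code points.
$s = "e\u{0301}";

$bytes      = strlen($s);                    // 3 bytes (1 + 2)
$codePoints = mb_strlen($s, 'UTF-8');        // 2 code points
$first      = mb_substr($s, 0, 1, 'UTF-8');  // "e" without its accent

var_dump($bytes, $codePoints, $first === 'e');
```

CommonMark only cares about the code point count here, which is why the
split is acceptable for my purposes.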

Because the CommonMark spec is "character"-centric, we need to keep track
of character positions within strings while parsing forwards, while also
allowing regular expressions to be matched against UTF-8 strings.  As
Thomas noted, PREG_OFFSET_CAPTURE provides us with the byte offset, not the
"character" offset.  We therefore must do additional work to calculate the
latter from the former:

            $offset = \mb_strlen(
                \substr($subject, 0, $matches[0][1]),
                'UTF-8'
            );

This code is frequently executed and therefore leads to worse performance
than if preg_match() could simply return the offsets we need.
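
As a self-contained illustration of that conversion, here is a small
sketch.  The subject string and pattern are made up for the example; only
the offset calculation mirrors what we do in the parser:

```php
<?php
// "héllo wörld": the "é" takes two bytes in UTF-8, so byte and code point
// offsets diverge for everything after it.
$subject = "h\u{e9}llo w\u{f6}rld";

preg_match('/w\S+/u', $subject, $matches, PREG_OFFSET_CAPTURE);

// Byte offset as reported by PCRE.
$byteOffset = $matches[0][1];

// Code point offset: count the code points in the prefix before the match.
// This walks the prefix on every call, which is the cost described above.
$charOffset = mb_strlen(substr($subject, 0, $byteOffset), 'UTF-8');

printf("byte offset: %d, code point offset: %d\n", $byteOffset, $charOffset);
```

Here the match starts at byte 7 but at code point 6, because of the
two-byte "é" earlier in the string.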

Would I be correct in assuming that preg_match() already has some knowledge
or awareness about codepoints / "characters" when matching against UTF-8
strings and capturing offsets?  If so, I think it would be very beneficial
to provide that information to userland to avoid unnecessary
re-calculations.

I'd therefore like to propose a third option: a new flag like
PREG_OFFSET_CODEPOINT.  When used in combination with PREG_OFFSET_CAPTURE,
it would return the offset position in terms of "characters", not bytes.
The same flag could also cause any $offset argument to be interpreted as
"characters" instead of bytes.

The reason I prefer this option is that it doesn't break BC and is entirely
opt-in.  If a developer wants this behavior and understands the
implications, they can use it.  Nobody else is affected.
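
To show the semantics I have in mind, here is a rough userland sketch.
The function name preg_match_codepoints() is hypothetical, and since the
proposed flag does not exist today, the helper emulates it with mbstring
and pays the re-calculation cost the flag is meant to avoid:

```php
<?php
/**
 * Hypothetical helper: like preg_match() with PREG_OFFSET_CAPTURE, but the
 * start offset and all reported offsets are in code points, not bytes.
 * Assumes $subject is valid UTF-8 and $charOffset is non-negative.
 */
function preg_match_codepoints(string $pattern, string $subject, ?array &$matches = null, int $charOffset = 0)
{
    // Translate the code point start offset into the byte offset PCRE expects.
    $byteStart = strlen(mb_substr($subject, 0, $charOffset, 'UTF-8'));

    $result = preg_match($pattern, $subject, $matches, PREG_OFFSET_CAPTURE, $byteStart);
    if ($result !== 1) {
        return $result;
    }

    foreach ($matches as $key => $match) {
        // Unmatched groups are reported with offset -1; leave those alone.
        if ($match[1] >= 0) {
            $matches[$key][1] = mb_strlen(substr($subject, 0, $match[1]), 'UTF-8');
        }
    }

    return $result;
}

// "naïve café crème": start searching at code point 7, i.e. after the "c"
// of "café", so the next match for /c\S+/u is "crème".
$subject = "na\u{ef}ve caf\u{e9} cr\u{e8}me";
preg_match_codepoints('/c\S+/u', $subject, $m, 7);
var_dump($m[0]);
```

In this example the reported offset is 11 (code points), whereas plain
preg_match() would report 13 (bytes).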

On Fri, Oct 2, 2020 at 8:25 AM Christoph M. Becker <cmbecker69@gmx.de>
wrote:

> If mbstring functions are used to find some offset, they always have to
> traverse the string from the beginning, even if you are just interested
> in the last code point of a long string.  If you have byte offsets, that
> code point can be accessed directly.  Of course, that may not suit any
> possible scenario, but I still don't think that the PCRE functions
> should deal with code point offset instead of byte offsets.
>

I'll admit that I don't have the best understanding of how PCRE works
under the hood, but because it already offers some functionality for
working with codepoints, having it also work with codepoint-based offsets
seems like a natural extension.  And while it may not be the most efficient
or common way of working with strings, I do believe there are some valid
use cases for it.  If placing this within PCRE violates some principles of
the library, then I'd be okay with placing similar functionality elsewhere.
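
For completeness, the direct access that byte offsets allow (Christoph's
point about the last code point of a long string) could look roughly like
this.  The helper name is hypothetical and the sketch assumes valid UTF-8
input:

```php
<?php
// Hypothetical helper: return the last code point of a valid UTF-8 string
// by stepping backwards over continuation bytes (bit pattern 10xxxxxx),
// without scanning the string from the beginning.
function last_code_point(string $s): string
{
    $i = strlen($s) - 1;
    while ($i > 0 && (ord($s[$i]) & 0xC0) === 0x80) {
        $i--;
    }

    return $i < 0 ? '' : substr($s, $i);
}

$s = "price: 10\u{20ac}";   // ends with the three-byte euro sign
var_dump(last_code_point($s) === mb_substr($s, -1, 1, 'UTF-8'));
```

That is cheap precisely because it never needs a code point offset, so I
see the two approaches as complementary rather than competing.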


-- 
Colin O'Dell
colinodell@gmail.com

--000000000000dbe58205b0afbbce--