Suggestion: Make all PCRE functions return *character* offsets, rather than *byte* offsets if the modifier `u` (PCRE_UTF8) is given

4 years ago by Thomas Landauer

Hi,

this is a follow-up of a bug I opened, and cmb suggested to continue
here: https://bugs.php.net/bug.php?id=80166

Advantages:

1: Easier string manipulation:
If somebody does (as in my case) preg_match_all() with
PREG_OFFSET_CAPTURE, what will they probably use those returned
numbers/offsets for?
My answer: For splitting the string - in some way or the other. Now,
with byte offsets, I can't do such basic things as just +1 to get to
the next character. Or extract exactly 3 characters.

2: Better performance:
This may sound odd, since cmb said the exact opposite ;-) (sequential
access vs. random access). However, if I need character offsets (see 1),
what can I do? I'm forced to use some workaround on top - as e.g.
https://www.php.net/manual/en/function.preg-match-all.php#71572 - which
is certainly way slower than any native implementation.

3: Consistency with users' expectations:
The current behavior is causing confusion and is perceived as
counter-intuitive, see
https://www.php.net/manual/en/function.preg-match-all.php#61426 and
https://stackoverflow.com/questions/1725227/preg-match-and-utf-8-in-php
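
A minimal sketch of the behavior described in point 1. The sample string is made up for illustration, and the mbstring extension is assumed to be loaded:

```php
<?php
// "ä" (U+00E4) takes two bytes in UTF-8, so every offset after it is shifted.
$subject = "\u{00E4}bc";

preg_match_all('/b/u', $subject, $matches, PREG_OFFSET_CAPTURE);

// Reported in bytes, even though the /u modifier is given.
$byteOffset = $matches[0][0][1];                     // 2

// The same position counted in code points.
$charOffset = mb_strpos($subject, 'b', 0, 'UTF-8');  // 1

// Stepping through byte offsets with "+1" can land inside a multi-byte
// sequence: this yields a lone continuation byte, which is not valid UTF-8.
$broken = substr($subject, 1, 1);
```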

So I'm suggesting:

Either do the BC break, and just return character offsets if the modifier
u is given.
Or create new functions for it: mb_preg_match_all() etc.

Cheers,
Thomas

4 years ago by Christoph M. Becker

> this is a follow-up of a bug I opened, and cmb suggested to continue
> here: https://bugs.php.net/bug.php?id=80166

Indeed, thanks!

> Advantages:
>
> 1: Easier string manipulation:
> If somebody does (as in my case) preg_match_all() with
> PREG_OFFSET_CAPTURE, what will they probably use those returned
> numbers/offsets for?
> My answer: For splitting the string - in some way or the other. Now,
> with byte offsets, I can't do such basic things as just +1 to get to
> the next character. Or extract exactly 3 characters.

The term "character" is ambiguous wrt. Unicode. The mbstring functions
work on Unicode code points, so it's probably better to use that term
instead.

While it is trivial to get the next code point using index+1, this is
not necessarily the next character, as perceived by a human. Using
mb_substr(), you may even break "characters", e.g. https://3v4l.org/5geOr.
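
A small sketch of that effect. This is a made-up example, not a copy of the linked 3v4l snippet, and it assumes the mbstring extension is loaded:

```php
<?php
// "e" followed by U+0301 (combining acute accent) is displayed as one "é",
// but it consists of two code points.
$s = "e\u{0301}x";

$codePoints = mb_strlen($s, 'UTF-8');        // 3, although a reader sees 2 characters

// Taking "one character" by code point cuts the accent off its base letter.
$first = mb_substr($s, 0, 1, 'UTF-8');       // "e" without the accent
```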

> 2: Better performance:
> This may sound odd, since cmb said the exact opposite ;-) (sequential
> access vs. random access). However, if I need character offsets (see 1),
> what can I do? I'm forced to use some workaround on top - as e.g.
> https://www.php.net/manual/en/function.preg-match-all.php#71572 - which
> is certainly way slower than any native implementation.

If mbstring functions are used to find some offset, they always have to
traverse the string from the beginning, even if you are just interested
in the last code point of a long string. If you have byte offsets, that
code point can be accessed directly. Of course, that may not suit any
possible scenario, but I still don't think that the PCRE functions
should deal with code point offset instead of byte offsets.
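
To illustrate the difference, here is a rough sketch with a made-up subject string (mbstring assumed to be loaded). The byte offset can be passed straight to substr(), whereas a code point offset makes mb_substr() walk the string from the start:

```php
<?php
// 1000 two-byte code points, followed by the one we are interested in.
$subject = str_repeat("\u{00E4}", 1000) . "\u{03A9}";

preg_match('/\x{3A9}/u', $subject, $m, PREG_OFFSET_CAPTURE);
$byteOffset = $m[0][1];                       // 2000

// Byte offset: substr() jumps directly to the position.
$direct = substr($subject, $byteOffset);

// Code point offset: it first has to be derived by counting from the
// beginning, and mb_substr() then counts from the beginning once more.
$charOffset = mb_strlen(substr($subject, 0, $byteOffset), 'UTF-8');  // 1000
$scanned    = mb_substr($subject, $charOffset, null, 'UTF-8');
```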

Regards,
Christoph

4 years ago by Colin O'Dell

The ability to receive the "character" offset would be extremely useful to
the league/commonmark project. This project is a Markdown parser that
conforms to the CommonMark spec, which defines all behavior with regard to
Unicode code points: https://spec.commonmark.org/0.29/#character

On Fri, Oct 2, 2020 at 8:25 AM Christoph M. Becker cmbecker69@gmx.de
wrote:

> While it is trivial to get the next code point using index+1, this is
> not necessarily the next character, as perceived by a human. Using
> mb_substr(), you may even break "characters", e.g. https://3v4l.org/5geOr.

In my particular use case, this is entirely acceptable per the spec linked
to above.

Because the CommonMark spec is "character"-centric, we need to keep track
of character positions within strings when parsing forwards, while also
allowing regular expressions to be matched against UTF-8 strings. As
Thomas noted, using PREG_OFFSET_CAPTURE provides us with the byte offset,
not the "character" offset. We therefore must do additional work to
calculate the latter from the former:

        $offset = \mb_strlen(\substr($subject, 0, $matches[0][1]), 'UTF-8');

This code is frequently executed and therefore leads to worse performance
than if preg_match() could simply return the offsets we need.
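
For a parser that only moves forwards, one possible userland mitigation is to count only the bytes since the previous conversion instead of starting at byte 0 each time. The class below is a hypothetical sketch, not code from league/commonmark. It assumes PHP 8, the mbstring extension, and byte offsets that fall on code point boundaries (as the offsets reported for /u matches do):

```php
<?php
/**
 * Hypothetical helper: converts byte offsets into code point offsets and
 * remembers the last position, so ascending offsets are cheap to convert.
 */
final class ForwardOffsetConverter
{
    private int $lastByte = 0;
    private int $lastChar = 0;

    public function __construct(private string $subject)
    {
    }

    public function toCodePoints(int $byteOffset): int
    {
        if ($byteOffset < $this->lastByte) {
            // Moving backwards: fall back to counting from the beginning.
            $this->lastByte = 0;
            $this->lastChar = 0;
        }

        // Count only the code points between the last position and this one.
        $slice = substr($this->subject, $this->lastByte, $byteOffset - $this->lastByte);
        $this->lastChar += mb_strlen($slice, 'UTF-8');
        $this->lastByte = $byteOffset;

        return $this->lastChar;
    }
}

$converter = new ForwardOffsetConverter("\u{00E4}b\u{00E4}c");
$first  = $converter->toCodePoints(2);   // byte 2 ("b") is code point 1
$second = $converter->toCodePoints(5);   // byte 5 ("c") is code point 3
```

This still costs a pass over the string in total, so it only reduces the repeated work; it does not replace a native implementation.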

Would I be correct in assuming that preg_match() already has some knowledge
or awareness about codepoints / "characters" when matching against UTF-8
strings and capturing offsets? If so, I think it would be very beneficial
to provide that information to userland to avoid unnecessary
re-calculations.

I'd therefore like to propose a third alternative option: a new flag like
PREG_OFFSET_CODEPOINT. When used in combination with PREG_OFFSET_CAPTURE,
it would return the offset position in terms of "characters", not bytes.
This could also be used to interpret any $offset argument as "characters"
instead of bytes.
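
To make the proposed semantics concrete, here is a userland emulation. PREG_OFFSET_CODEPOINT does not exist in PHP, and the function name preg_match_codepoints() is purely hypothetical. The sketch assumes PHP 8 and the mbstring extension:

```php
<?php
/**
 * Hypothetical emulation of the proposed flag; not part of PHP.
 * Both the starting offset and the captured offsets are code points.
 */
function preg_match_codepoints(
    string $pattern,
    string $subject,
    ?array &$matches = null,
    int $charOffset = 0
): int|false {
    // Translate the starting position from code points to bytes.
    $byteStart = strlen(mb_substr($subject, 0, $charOffset, 'UTF-8'));

    $result = preg_match($pattern, $subject, $matches, PREG_OFFSET_CAPTURE, $byteStart);
    if ($result !== 1) {
        return $result;
    }

    foreach ($matches as $key => $pair) {
        // Unmatched groups are reported with offset -1; leave those alone.
        if ($pair[1] >= 0) {
            $matches[$key][1] = mb_strlen(substr($subject, 0, $pair[1]), 'UTF-8');
        }
    }

    return $result;
}

$subject = "\u{00E4}b\u{00E4}c";

// Plain PREG_OFFSET_CAPTURE would report byte offset 5 for "c".
preg_match_codepoints('/c/u', $subject, $m);
$offsetOfC = $m[0][1];       // 3

// Start searching at code point 1, which skips the first "ä".
preg_match_codepoints('/\x{E4}/u', $subject, $m2, 1);
$offsetOfSecondA = $m2[0][1]; // 2
```

The conversion here is exactly the extra work described above; a native flag would presumably avoid it, but that is an assumption about a possible implementation.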

The reason I prefer this option is that it doesn't break BC and is entirely
opt-in. If a developer wants this behavior and understands the
implications they can use it. Nobody else is affected otherwise.

On Fri, Oct 2, 2020 at 8:25 AM Christoph M. Becker cmbecker69@gmx.de
wrote:

> If mbstring functions are used to find some offset, they always have to
> traverse the string from the beginning, even if you are just interested
> in the last code point of a long string. If you have byte offsets, that
> code point can be accessed directly. Of course, that may not suit any
> possible scenario, but I still don't think that the PCRE functions
> should deal with code point offset instead of byte offsets.

I'll admit that I don't have the best understanding of how PCRE works
under-the-hood, but I do believe that because it offers some functionality
for working with codepoints, having it also work with codepoint-based
offsets seems like a natural extension. And while it may not be the most
optimal or common way of working with strings, I do believe there are some
valid use cases for it. If placing this within PCRE violates some
principles of the library then I'd be okay placing similar functionality
elsewhere.

--
Colin O'Dell
colinodell@gmail.com

4 years ago by Claude Pache

Hi,

Working with UTF-8-encoded strings does not imply working with mbstring functions or with code point counts. Personally, I work with standard string functions, plus Grapheme functions (https://www.php.net/manual/en/ref.intl.grapheme.php) when I need to split my string between “characters” (which for me means “grapheme clusters”, not “code points”, so that mbstring functions are useless for me). In particular, PREG_OFFSET_CAPTURE always does what I need, even when using the /u flag.
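
As a rough illustration of that workflow (the sample string is made up), the byte offsets from PREG_OFFSET_CAPTURE can be fed directly into the standard byte-based string functions:

```php
<?php
$subject = "na\u{00EF}ve caf\u{00E9}";   // "naïve café"

preg_match('/\s/u', $subject, $m, PREG_OFFSET_CAPTURE);
$at = $m[0][1];                          // byte offset of the space: 6

// substr() and strlen() count bytes as well, so the pieces line up and
// no multi-byte sequence is cut in half.
$before = substr($subject, 0, $at);                  // "naïve"
$after  = substr($subject, $at + strlen($m[0][0]));  // "café"
```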

If this is a feature that you want to implement, I suggest adding a flag PREG_UTF8_CODEPOINT_OFFSET_CAPTURE.

—Claude