Offset-only results from preg_match

5 years ago by C. Scott Ananian — view source

unread

I'm floating an idea for an RFC here.

I'm working on the wikimedia/remex-html library for high-performance
PHP-native HTML5 parsing. When creating a high-performance lexer, it is
worthwhile to try to reduce the number of string copies made. You can
generally perform matches using offsets into your master source string.
However, preg_match* will copy a substring for the entire matched region
($matches[0]) as well as for all captured patterns ($matches[1...n]).
These substring copies can get expensive if the matched region/captured
patterns are very large.

It would be helpful if PHP's preg_match* functions offered a flag, say
PREG_LENGTH_CAPTURE, which returned the numeric length instead of the
matched/captured string. In combination,
PREG_OFFSET_CAPTURE|PREG_LENGTH_CAPTURE would return the numeric length in
element 0 and the numeric offset in element 1, and avoid the need to copy
the matched substring unnecessarily. This would allow greatly reducing the
number of substring copies made during lexing.

Thoughts?
--scott

ps. more ambitious would be to introduce a new "substring" type, which
would share the allocation of a parent string with its own offset and
length fields. That would probably be as invasive as the ZVAL_INTERNED_STR
type, though -- a much much bigger project.

pps. while I'm wishing -- preg_replace would benefit from some way to pass
options, so that (for example) you could pass PREG_OFFSET_CAPTURE and get
the offset for each replacement match. Knowing the offset of the match
allows you to do (for example) error reporting from the callback function.

--
(http://cscott.net)

5 years ago by Markus Fischer — view source

unread

ps. more ambitious would be to introduce a new "substring" type, which
would share the allocation of a parent string with its own offset and
length fields. That would probably be as invasive as the ZVAL_INTERNED_STR
type, though -- a much much bigger project.

That feature would be incredible to have.

Erlang/Elixir use these concept and it's incredible what they can do
with this.
However, the data types there are also immutable so it's whole other league…

Markus

5 years ago by Nikita Popov — view source

unread

On Thu, Mar 14, 2019 at 8:33 PM C. Scott Ananian cananian@wikimedia.org
wrote:

I'm floating an idea for an RFC here.

I'm working on the wikimedia/remex-html library for high-performance
PHP-native HTML5 parsing. When creating a high-performance lexer, it is
worthwhile to try to reduce the number of string copies made. You can
generally perform matches using offsets into your master source string.
However, preg_match* will copy a substring for the entire matched region
($matches[0]) as well as for all captured patterns ($matches[1...n]).
These substring copies can get expensive if the matched region/captured
patterns are very large.

It would be helpful if PHP's preg_match* functions offered a flag, say
PREG_LENGTH_CAPTURE, which returned the numeric length instead of the
matched/captured string. In combination,
PREG_OFFSET_CAPTURE|PREG_LENGTH_CAPTURE would return the numeric length in
element 0 and the numeric offset in element 1, and avoid the need to copy
the matched substring unnecessarily. This would allow greatly reducing the
number of substring copies made during lexing.

Generally sounds reasonable to me. Do you maybe have a sample input and
regular expression where you suspect this is a particularly large problem,
so we can test how much of a difference this makes?

Thoughts?
--scott

ps. more ambitious would be to introduce a new "substring" type, which
would share the allocation of a parent string with its own offset and
length fields. That would probably be as invasive as the ZVAL_INTERNED_STR
type, though -- a much much bigger project.

pps. while I'm wishing -- preg_replace would benefit from some way to pass
options, so that (for example) you could pass PREG_OFFSET_CAPTURE and get
the offset for each replacement match. Knowing the offset of the match
allows you to do (for example) error reporting from the callback function.

I've implemented this bit in https://github.com/php/php-src/pull/3958.

Nikita

5 years ago by Nikita Popov — view source

unread

On Thu, Mar 14, 2019 at 8:33 PM C. Scott Ananian cananian@wikimedia.org
wrote:

I'm floating an idea for an RFC here.

I'm working on the wikimedia/remex-html library for high-performance
PHP-native HTML5 parsing. When creating a high-performance lexer, it is
worthwhile to try to reduce the number of string copies made. You can
generally perform matches using offsets into your master source string.
However, preg_match* will copy a substring for the entire matched region
($matches[0]) as well as for all captured patterns ($matches[1...n]).
These substring copies can get expensive if the matched region/captured
patterns are very large.

It would be helpful if PHP's preg_match* functions offered a flag, say
PREG_LENGTH_CAPTURE, which returned the numeric length instead of the
matched/captured string. In combination,
PREG_OFFSET_CAPTURE|PREG_LENGTH_CAPTURE would return the numeric length in
element 0 and the numeric offset in element 1, and avoid the need to copy
the matched substring unnecessarily. This would allow greatly reducing
the
number of substring copies made during lexing.

Generally sounds reasonable to me. Do you maybe have a sample input and
regular expression where you suspect this is a particularly large problem,
so we can test how much of a difference this makes?

After thinking about this some more, while this may be a minor performance
improvement, it still does more work than necessary. In particular the use
of OFFSET_CAPTURE (which would be pretty much required here) needs one new
two-element array for each subpattern. If the captured strings are short,
this is where the main cost is going to be.

I'm wondering if we shouldn't consider a new object oriented API for PCRE
which can return a match object where subpattern positions and contents can
be queried via method calls, so you only pay for the parts that you do
access.

Nikita

5 years ago by Christoph M. Becker — view source

unread

I'm wondering if we shouldn't consider a new object oriented API for PCRE
which can return a match object where subpattern positions and contents can
be queried via method calls, so you only pay for the parts that you do
access.

+1

--
Christoph M. Becker

5 years ago by Markus Fischer — view source

unread

I'm wondering if we shouldn't consider a new object oriented API for PCRE
which can return a match object where subpattern positions and contents can
be queried via method calls, so you only pay for the parts that you do
access.

Or also a literal syntax would be very very nice :-)

Markus

5 years ago by C. Scott Ananian — view source

unread

After thinking about this some more, while this may be a minor performance
improvement, it still does more work than necessary. In particular the use
of OFFSET_CAPTURE (which would be pretty much required here) needs one new
two-element array for each subpattern. If the captured strings are short,
this is where the main cost is going to be.

The primary use of this feature is when the captured strings are long, as
that's when we most want to avoid copying a substring.

I'm wondering if we shouldn't consider a new object oriented API for PCRE
which can return a match object where subpattern positions and contents can
be queried via method calls, so you only pay for the parts that you do
access.

Seems like this is letting the perfect be the enemy of the good. The
LENGTH_CAPTURE significantly reduces allocation for long match strings, and
it allocates the same two-element arrays that OFFSET_CAPTURE would -- it
just stores an integer where there would otherwise be an expensive
substring. Furthermore, since the array structure is left mostly alone, it
would be not-too-hard to support earlier-PHP versions, with something like:

$hasLengthCapture = defined('PREG_LENGTH_CAPTURE') ? PREG_LENGTH_CAPTURE :
0;
$r = preg_match($pat, $sub, $m, PREG_OFFSET_CAPTURE | $hasLengthCapture);
$matchOneLength = $hasLengthCapture ? $m[1][0] : strlen($m[1][0]);
$matchOneOffset = $m[1][1];

If you introduce a whole new OO accessor object, it starts becoming very
hard to write backward-compatible code.
--scott

5 years ago by C. Scott Ananian — view source

unread

On Thu, Mar 14, 2019 at 8:33 PM C. Scott Ananian cananian@wikimedia.org
wrote:

I'm floating an idea for an RFC here.

I'm working on the wikimedia/remex-html library for high-performance
PHP-native HTML5 parsing. When creating a high-performance lexer, it is
worthwhile to try to reduce the number of string copies made. You can
generally perform matches using offsets into your master source string.
However, preg_match* will copy a substring for the entire matched region
($matches[0]) as well as for all captured patterns ($matches[1...n]).
These substring copies can get expensive if the matched region/captured
patterns are very large.

It would be helpful if PHP's preg_match* functions offered a flag, say
PREG_LENGTH_CAPTURE, which returned the numeric length instead of the
matched/captured string. In combination,
PREG_OFFSET_CAPTURE|PREG_LENGTH_CAPTURE would return the numeric length in
element 0 and the numeric offset in element 1, and avoid the need to copy
the matched substring unnecessarily. This would allow greatly reducing
the
number of substring copies made during lexing.

Generally sounds reasonable to me. Do you maybe have a sample input and
regular expression where you suspect this is a particularly large problem,
so we can test how much of a difference this makes?

I'm going to work on emulating this today by changing as many of the
captures in remex-html to zero-length captures at the start/end of the
region; that should give me a reasonable idea of performance gain.

pps. while I'm wishing -- preg_replace would benefit from some way to pass

options, so that (for example) you could pass PREG_OFFSET_CAPTURE and get
the offset for each replacement match. Knowing the offset of the match
allows you to do (for example) error reporting from the callback function.

I've implemented this bit in https://github.com/php/php-src/pull/3958.

I notice that this has been merged already. Looks great, especially
breaking out the creation of the matches array into a reusable function.
That would make future additions easier/more consistent.
--scott

--
(http://cscott.net)

5 years ago by Nikita Popov — view source

unread

On Wed, Mar 20, 2019 at 4:35 PM C. Scott Ananian cananian@wikimedia.org
wrote:

On Tue, Mar 19, 2019 at 10:58 AM Nikita Popov nikita.ppv@gmail.com
wrote:

After thinking about this some more, while this may be a minor
performance improvement, it still does more work than necessary. In
particular the use of OFFSET_CAPTURE (which would be pretty much required
here) needs one new two-element array for each subpattern. If the captured
strings are short, this is where the main cost is going to be.

The primary use of this feature is when the captured strings are long,
as that's when we most want to avoid copying a substring.

I'm wondering if we shouldn't consider a new object oriented API for PCRE
which can return a match object where subpattern positions and contents can
be queried via method calls, so you only pay for the parts that you do
access.

Seems like this is letting the perfect be the enemy of the good. The
LENGTH_CAPTURE significantly reduces allocation for long match strings, and
it allocates the same two-element arrays that OFFSET_CAPTURE would -- it
just stores an integer where there would otherwise be an expensive
substring. Furthermore, since the array structure is left mostly alone, it
would be not-too-hard to support earlier-PHP versions, with something like:

$hasLengthCapture = defined('PREG_LENGTH_CAPTURE') ? PREG_LENGTH_CAPTURE :
0;
$r = preg_match($pat, $sub, $m, PREG_OFFSET_CAPTURE | $hasLengthCapture);
$matchOneLength = $hasLengthCapture ? $m[1][0] : strlen($m[1][0]);
$matchOneOffset = $m[1][1];

If you introduce a whole new OO accessor object, it starts becoming very
hard to write backward-compatible code.
--scott

Fair enough. I've created https://github.com/php/php-src/pull/3971 to
implement this feature. It would be good to have some confirmation that
this is really a significant performance improvement before we land it
though.

Nikita

5 years ago by C. Scott Ananian — view source

unread

Quick status update. I tried to prototype this in pure PHP in the
wikimedia/remex-html library using (?= .. ) around each regexp and ()...()
around each captured expression (replacing the capture parens) to
effectively bypass the string copy and return a bunch of zero-length
strings. That didn't succeed in speeding up remex-html on my pet benchmark
because (1) the (?= ... ) appears to deoptimize the regexp match, and (2)
it turns out there's a substantial cost to each capture (presumably all
those two-element arrays which Nikita flagged before as a future issue) and
so doubling the total number of captures by using () () instead of (....)
slowed the match down.

So bad news: my benchmarking shortcut didn't work. Potential good news: I
guess that underlines why this feature is necessary and can't just be
emulated.

I'm going to try this benchmark again tomorrow but by rebuilding PHP from
source using Nikita's proposed patch so that I can actually get an
apples-to-apples comparison.
--scott

On Wed, Mar 20, 2019 at 4:35 PM C. Scott Ananian cananian@wikimedia.org
wrote:

On Tue, Mar 19, 2019 at 10:58 AM Nikita Popov nikita.ppv@gmail.com
wrote:

After thinking about this some more, while this may be a minor
performance improvement, it still does more work than necessary. In
particular the use of OFFSET_CAPTURE (which would be pretty much required
here) needs one new two-element array for each subpattern. If the captured
strings are short, this is where the main cost is going to be.

The primary use of this feature is when the captured strings are long,
as that's when we most want to avoid copying a substring.

I'm wondering if we shouldn't consider a new object oriented API for
PCRE which can return a match object where subpattern positions and
contents can be queried via method calls, so you only pay for the parts
that you do access.

Seems like this is letting the perfect be the enemy of the good. The
LENGTH_CAPTURE significantly reduces allocation for long match strings, and
it allocates the same two-element arrays that OFFSET_CAPTURE would -- it
just stores an integer where there would otherwise be an expensive
substring. Furthermore, since the array structure is left mostly alone, it
would be not-too-hard to support earlier-PHP versions, with something like:

$hasLengthCapture = defined('PREG_LENGTH_CAPTURE') ? PREG_LENGTH_CAPTURE
: 0;
$r = preg_match($pat, $sub, $m, PREG_OFFSET_CAPTURE | $hasLengthCapture);
$matchOneLength = $hasLengthCapture ? $m[1][0] : strlen($m[1][0]);
$matchOneOffset = $m[1][1];

If you introduce a whole new OO accessor object, it starts becoming very
hard to write backward-compatible code.
--scott

Fair enough. I've created https://github.com/php/php-src/pull/3971 to
implement this feature. It would be good to have some confirmation that
this is really a significant performance improvement before we land it
though.

Nikita

--
(http://cscott.net)

5 years ago by C. Scott Ananian — view source

unread

ps. Just to put some numbers to it, using psysh on $html100 which
contains the (Parsoid format) HTML for the [[en:Barack Obama]] article on
Wikipedia.

>>> strlen($html100)
=> 2592386
>>> timeit -n1000 preg_match_all( '/(b)/', $html100, $m,
 `PREG_OFFSET_CAPTURE` );
=> 22062
Command took 0.008648 seconds on average (0.008236 median; 8.648343 total)
to complete.
>>> timeit -n1000 preg_match_all( '/b()/', $html100, $m,
 `PREG_OFFSET_CAPTURE` );
=> 22062
Command took 0.008438 seconds on average (0.008127 median; 8.437881 total)
to complete.
>>> timeit -n1000 preg_match_all( '/b()()/', $html100, $m,
 `PREG_OFFSET_CAPTURE` );
=> 22062
Command took 0.012069 seconds on average (0.011589 median; 12.069407 total)
to complete.
>>> timeit -n1000 preg_match_all( '/(?=(b))/', $html100, $m,
 `PREG_OFFSET_CAPTURE` );
=> 22062
Command took 0.012134 seconds on average (0.011483 median; 12.134265 total)
to complete.
>>> timeit -n1000 preg_match_all( '/(?=()b())/', $html100, $m,
 `PREG_OFFSET_CAPTURE` );
=> 22062
Command took 0.016513 seconds on average (0.016039 median; 16.513011 total)
to complete.

So this isn't a good way to determine the cost of the string copy in the
$matches array. (The string copy is really trivial in this particular case
anyway.)
--scott

On Thu, Mar 21, 2019 at 5:16 PM C. Scott Ananian cananian@wikimedia.org
wrote:

Quick status update. I tried to prototype this in pure PHP in the
wikimedia/remex-html library using (?= .. ) around each regexp and ()...()
around each captured expression (replacing the capture parens) to
effectively bypass the string copy and return a bunch of zero-length
strings. That didn't succeed in speeding up remex-html on my pet benchmark
because (1) the (?= ... ) appears to deoptimize the regexp match, and (2)
it turns out there's a substantial cost to each capture (presumably all
those two-element arrays which Nikita flagged before as a future issue) and
so doubling the total number of captures by using () () instead of (....)
slowed the match down.

So bad news: my benchmarking shortcut didn't work. Potential good news: I
guess that underlines why this feature is necessary and can't just be
emulated.

I'm going to try this benchmark again tomorrow but by rebuilding PHP from
source using Nikita's proposed patch so that I can actually get an
apples-to-apples comparison.
--scott

On Wed, Mar 20, 2019 at 4:35 PM C. Scott Ananian cananian@wikimedia.org
wrote:

On Tue, Mar 19, 2019 at 10:58 AM Nikita Popov nikita.ppv@gmail.com
wrote:

After thinking about this some more, while this may be a minor
performance improvement, it still does more work than necessary. In
particular the use of OFFSET_CAPTURE (which would be pretty much required
here) needs one new two-element array for each subpattern. If the captured
strings are short, this is where the main cost is going to be.

The primary use of this feature is when the captured strings are long,
as that's when we most want to avoid copying a substring.

I'm wondering if we shouldn't consider a new object oriented API for
PCRE which can return a match object where subpattern positions and
contents can be queried via method calls, so you only pay for the parts
that you do access.

Seems like this is letting the perfect be the enemy of the good. The
LENGTH_CAPTURE significantly reduces allocation for long match strings, and
it allocates the same two-element arrays that OFFSET_CAPTURE would -- it
just stores an integer where there would otherwise be an expensive
substring. Furthermore, since the array structure is left mostly alone, it
would be not-too-hard to support earlier-PHP versions, with something like:

$hasLengthCapture = defined('PREG_LENGTH_CAPTURE') ? PREG_LENGTH_CAPTURE
: 0;
$r = preg_match($pat, $sub, $m, PREG_OFFSET_CAPTURE | $hasLengthCapture);
$matchOneLength = $hasLengthCapture ? $m[1][0] : strlen($m[1][0]);
$matchOneOffset = $m[1][1];

If you introduce a whole new OO accessor object, it starts becoming very
hard to write backward-compatible code.
--scott

Fair enough. I've created https://github.com/php/php-src/pull/3971 to
implement this feature. It would be good to have some confirmation that
this is really a significant performance improvement before we land it
though.

Nikita

--
(http://cscott.net)

--
(http://cscott.net)

5 years ago by C. Scott Ananian — view source

unread

So...

In microbenchmarks you can clearly see the improvement:

>>> timeit -n500 preg_match_all('/(.{65535})/s', $html100, $m,
PREG_OFFSET_CAPTURE);
=> 39
Command took 0.001709 seconds on average (0.001654 median; 0.854503 total)
to complete.
>>> timeit -n500 preg_match_all('/(.{65535})/s', $html100, $m,
PREG_OFFSET_CAPTURE|PREG_LENGTH_CAPTURE);
=> 39
Command took 0.000991 seconds on average (0.000965 median; 0.495287 total)
to complete.

$html100 is my usual 2,592,386 byte HTML of [[en:Barack Obama]].

But unless you're matching 64k strings like this, it's hard to see a
practical improvement. In my remex-html html parsing benchmark, although
LENGTH_CAPTURE doesn't make things slower, it doesn't make a significant
performance improvement. I built php statically and ran it through
cachegrind to try to figure out why, and found:

2,567,559,670 cycles: total spent executing the tokenizer benchmark
(including reading the file from disk)
1,018,845,290 cycles in zif_preg_match. Optimizing regexps is important
for tokenizers! Of these, we spend
575,478,637 doing the actual match (preg_pcre_match_impl) and
435,162,131 getting the regexp from the cache (!)
(pcre_get_compiled_regex_cache)

This is for 128,794 total regexp matches performed by the tokenizer on this
input.

Of those cycles getting the regex from cache, only 24,116,319 are spent on
cache misses where we are actually compiling regexps (sum of
php_pcre2_jit_compile and php_pcre2_compile).
Instead, 406,007,383 cycles are spent in zend_hash_find(). That's 40% of
the total time spent executing preg_match.

The LENGTH_CAPTURE optimization does reduce the total time spent in
populate_subpat_array from 160,951,690 cycles to 140,042,331 cycles in the
remex-html tokenizer on this benchmark, but that difference is overwhelmed
by (for example) the time spent in zend_hash_find().

The slowdown in zend_hash_find() appears to be due to
https://bugs.php.net/bug.php?id=63180 which disabled interned keys in the
pcre_cache table. Because of this, even if the regexs are interned, we
must pay for a complete zend_string_equal_content() on each match, which
takes time proportional to the length of the regex. This can be quite
large -- for example, for the HTML5 character reference regex in
remex-html, which contains every valid entity name and is 26,137 bytes
long, and we need to do a zend_string_equal_content() on the 26,137 byte
regexp for every ~5 byte entity in the parsed HTML.
--scott

On Wed, Mar 20, 2019 at 4:35 PM C. Scott Ananian cananian@wikimedia.org
wrote:

On Tue, Mar 19, 2019 at 10:58 AM Nikita Popov nikita.ppv@gmail.com
wrote:

After thinking about this some more, while this may be a minor
performance improvement, it still does more work than necessary. In
particular the use of OFFSET_CAPTURE (which would be pretty much required
here) needs one new two-element array for each subpattern. If the captured
strings are short, this is where the main cost is going to be.

The primary use of this feature is when the captured strings are long,
as that's when we most want to avoid copying a substring.

I'm wondering if we shouldn't consider a new object oriented API for
PCRE which can return a match object where subpattern positions and
contents can be queried via method calls, so you only pay for the parts
that you do access.

Seems like this is letting the perfect be the enemy of the good. The
LENGTH_CAPTURE significantly reduces allocation for long match strings, and
it allocates the same two-element arrays that OFFSET_CAPTURE would -- it
just stores an integer where there would otherwise be an expensive
substring. Furthermore, since the array structure is left mostly alone, it
would be not-too-hard to support earlier-PHP versions, with something like:

$hasLengthCapture = defined('PREG_LENGTH_CAPTURE') ? PREG_LENGTH_CAPTURE
: 0;
$r = preg_match($pat, $sub, $m, PREG_OFFSET_CAPTURE | $hasLengthCapture);
$matchOneLength = $hasLengthCapture ? $m[1][0] : strlen($m[1][0]);
$matchOneOffset = $m[1][1];

If you introduce a whole new OO accessor object, it starts becoming very
hard to write backward-compatible code.
--scott

Fair enough. I've created https://github.com/php/php-src/pull/3971 to
implement this feature. It would be good to have some confirmation that
this is really a significant performance improvement before we land it
though.

Nikita

--
(http://cscott.net)

5 years ago by Nikita Popov — view source

unread

On Sat, Mar 23, 2019 at 6:32 AM C. Scott Ananian cananian@wikimedia.org
wrote:

So...

In microbenchmarks you can clearly see the improvement:
>>> timeit -n500 preg_match_all('/(.{65535})/s', $html100, $m,
PREG_OFFSET_CAPTURE);
=> 39
Command took 0.001709 seconds on average (0.001654 median; 0.854503 total)
to complete.
>>> timeit -n500 preg_match_all('/(.{65535})/s', $html100, $m,
PREG_OFFSET_CAPTURE|PREG_LENGTH_CAPTURE);
=> 39
Command took 0.000991 seconds on average (0.000965 median; 0.495287 total)
to complete.
$html100 is my usual 2,592,386 byte HTML of [[en:Barack Obama]].

But unless you're matching 64k strings like this, it's hard to see a
practical improvement. In my remex-html html parsing benchmark, although
LENGTH_CAPTURE doesn't make things slower, it doesn't make a significant
performance improvement. I built php statically and ran it through
cachegrind to try to figure out why, and found:

2,567,559,670 cycles: total spent executing the tokenizer benchmark
(including reading the file from disk)
1,018,845,290 cycles in zif_preg_match. Optimizing regexps is important
for tokenizers! Of these, we spend
575,478,637 doing the actual match (preg_pcre_match_impl) and
435,162,131 getting the regexp from the cache (!)
(pcre_get_compiled_regex_cache)

This is for 128,794 total regexp matches performed by the tokenizer on
this input.

Of those cycles getting the regex from cache, only 24,116,319 are spent on
cache misses where we are actually compiling regexps (sum of
php_pcre2_jit_compile and php_pcre2_compile).
Instead, 406,007,383 cycles are spent in zend_hash_find(). That's 40% of
the total time spent executing preg_match.

The LENGTH_CAPTURE optimization does reduce the total time spent in
populate_subpat_array from 160,951,690 cycles to 140,042,331 cycles in the
remex-html tokenizer on this benchmark, but that difference is overwhelmed
by (for example) the time spent in zend_hash_find().

The slowdown in zend_hash_find() appears to be due to
https://bugs.php.net/bug.php?id=63180 which disabled interned keys in the
pcre_cache table. Because of this, even if the regexs are interned, we
must pay for a complete zend_string_equal_content() on each match, which
takes time proportional to the length of the regex. This can be quite
large -- for example, for the HTML5 character reference regex in
remex-html, which contains every valid entity name and is 26,137 bytes
long, and we need to do a zend_string_equal_content() on the 26,137 byte
regexp for every ~5 byte entity in the parsed HTML.
--scott

Thanks for testing! That's an interesting result. We should be able to do
something about this. There are basically three cases:

CLI (presumably what you're testing). Strings are interned per-request,
but there is only one request.
Server w/o opcache. Strings are interned per-request and there may be
multiple requests.
Server with opcache. Strings are interned permanently in opcache.

Case 3 should already be fast, because permanent interned strings are
allowed into the regex cache. We can optimize case 1 by simply allowing
arbitrary cache keys and discarding the cache in RSHUTDOWN -- it will not
be needed anymore anyway. Case 2 would remain slow, but it's slow anyway...

Nikita

On Wed, Mar 20, 2019 at 4:35 PM C. Scott Ananian cananian@wikimedia.org
wrote:

On Tue, Mar 19, 2019 at 10:58 AM Nikita Popov nikita.ppv@gmail.com
wrote:

After thinking about this some more, while this may be a minor
performance improvement, it still does more work than necessary. In
particular the use of OFFSET_CAPTURE (which would be pretty much required
here) needs one new two-element array for each subpattern. If the captured
strings are short, this is where the main cost is going to be.

The primary use of this feature is when the captured strings are long,
as that's when we most want to avoid copying a substring.

I'm wondering if we shouldn't consider a new object oriented API for
PCRE which can return a match object where subpattern positions and
contents can be queried via method calls, so you only pay for the parts
that you do access.

Seems like this is letting the perfect be the enemy of the good. The
LENGTH_CAPTURE significantly reduces allocation for long match strings, and
it allocates the same two-element arrays that OFFSET_CAPTURE would -- it
just stores an integer where there would otherwise be an expensive
substring. Furthermore, since the array structure is left mostly alone, it
would be not-too-hard to support earlier-PHP versions, with something like:

$hasLengthCapture = defined('PREG_LENGTH_CAPTURE') ? PREG_LENGTH_CAPTURE
: 0;
$r = preg_match($pat, $sub, $m, PREG_OFFSET_CAPTURE | $hasLengthCapture);
$matchOneLength = $hasLengthCapture ? $m[1][0] : strlen($m[1][0]);
$matchOneOffset = $m[1][1];

If you introduce a whole new OO accessor object, it starts becoming very
hard to write backward-compatible code.
--scott

Fair enough. I've created https://github.com/php/php-src/pull/3971 to
implement this feature. It would be good to have some confirmation that
this is really a significant performance improvement before we land it
though.

Nikita

--
(http://cscott.net)

5 years ago by C. Scott Ananian — view source

unread

Yup, testing via CLI but Wikimedia will (eventually) be running PHP 7.x
with opcache ( https://phabricator.wikimedia.org/T176370 /
https://phabricator.wikimedia.org/T211964 ). It would be nice to fix the
CLI to behave more like the server wrt interned strings. It certainly
would make benchmarking easier/more representative, but we also have a
bunch of testing/CI stuff running in CLI mode so speedups there are
definitely useful, even if they don't "speed up Wikipedia" per se.

Ignoring the zend_hash_find costs for the moment, php_pcre_match_impl costs
492,077,935 cycles in this benchmark, of which only 211,626,095 are spent
doing the actual php_pcre2_jit_match. 140,000,069 cycles are spent in
populate_subpat_array and 108,742,309 are spent in the zval_ptr_dtor call
on the penultimate line -- basically freeing the $matches array from the
previous call to pcre_match (and this is with PREG_LENGTH_CAPTURE). So
there's still (in theory) a >2x speedup in preg matching available by
tuning how the subpat_array works and making it less costly to
allocate/free $matches.
--scott

On Sat, Mar 23, 2019 at 6:32 AM C. Scott Ananian cananian@wikimedia.org
wrote:
So...

In microbenchmarks you can clearly see the improvement:
>>> timeit -n500 preg_match_all('/(.{65535})/s', $html100, $m,
PREG_OFFSET_CAPTURE);
=> 39
Command took 0.001709 seconds on average (0.001654 median; 0.854503
total) to complete.
>>> timeit -n500 preg_match_all('/(.{65535})/s', $html100, $m,
PREG_OFFSET_CAPTURE|PREG_LENGTH_CAPTURE);
=> 39
Command took 0.000991 seconds on average (0.000965 median; 0.495287
total) to complete.
$html100 is my usual 2,592,386 byte HTML of [[en:Barack Obama]].

But unless you're matching 64k strings like this, it's hard to see a
practical improvement. In my remex-html html parsing benchmark, although
LENGTH_CAPTURE doesn't make things slower, it doesn't make a significant
performance improvement. I built php statically and ran it through
cachegrind to try to figure out why, and found:

2,567,559,670 cycles: total spent executing the tokenizer benchmark
(including reading the file from disk)
1,018,845,290 cycles in zif_preg_match. Optimizing regexps is important
for tokenizers! Of these, we spend
575,478,637 doing the actual match (preg_pcre_match_impl) and
435,162,131 getting the regexp from the cache (!)
(pcre_get_compiled_regex_cache)

This is for 128,794 total regexp matches performed by the tokenizer on
this input.

Of those cycles getting the regex from cache, only 24,116,319 are spent
on cache misses where we are actually compiling regexps (sum of
php_pcre2_jit_compile and php_pcre2_compile).
Instead, 406,007,383 cycles are spent in zend_hash_find(). That's 40% of
the total time spent executing preg_match.

The LENGTH_CAPTURE optimization does reduce the total time spent in
populate_subpat_array from 160,951,690 cycles to 140,042,331 cycles in the
remex-html tokenizer on this benchmark, but that difference is overwhelmed
by (for example) the time spent in zend_hash_find().

The slowdown in zend_hash_find() appears to be due to
https://bugs.php.net/bug.php?id=63180 which disabled interned keys in
the pcre_cache table. Because of this, even if the regexs are interned, we
must pay for a complete zend_string_equal_content() on each match, which
takes time proportional to the length of the regex. This can be quite
large -- for example, for the HTML5 character reference regex in
remex-html, which contains every valid entity name and is 26,137 bytes
long, and we need to do a zend_string_equal_content() on the 26,137 byte
regexp for every ~5 byte entity in the parsed HTML.
--scott
Thanks for testing! That's an interesting result. We should be able to do
something about this. There are basically three cases:

CLI (presumably what you're testing). Strings are interned per-request,
but there is only one request.

Server w/o opcache. Strings are interned per-request and there may be
multiple requests.

Server with opcache. Strings are interned permanently in opcache.

Case 3 should already be fast, because permanent interned strings are
allowed into the regex cache. We can optimize case 1 by simply allowing
arbitrary cache keys and discarding the cache in RSHUTDOWN -- it will not
be needed anymore anyway. Case 2 would remain slow, but it's slow anyway...

Nikita

On Thu, Mar 21, 2019 at 7:35 AM Nikita Popov nikita.ppv@gmail.com
wrote:

On Wed, Mar 20, 2019 at 4:35 PM C. Scott Ananian cananian@wikimedia.org
wrote:

On Tue, Mar 19, 2019 at 10:58 AM Nikita Popov nikita.ppv@gmail.com
wrote:

After thinking about this some more, while this may be a minor
performance improvement, it still does more work than necessary. In
particular the use of OFFSET_CAPTURE (which would be pretty much required
here) needs one new two-element array for each subpattern. If the captured
strings are short, this is where the main cost is going to be.

The primary use of this feature is when the captured strings are
long, as that's when we most want to avoid copying a substring.

I'm wondering if we shouldn't consider a new object oriented API for
PCRE which can return a match object where subpattern positions and
contents can be queried via method calls, so you only pay for the parts
that you do access.

Seems like this is letting the perfect be the enemy of the good. The
LENGTH_CAPTURE significantly reduces allocation for long match strings, and
it allocates the same two-element arrays that OFFSET_CAPTURE would -- it
just stores an integer where there would otherwise be an expensive
substring. Furthermore, since the array structure is left mostly alone, it
would be not-too-hard to support earlier-PHP versions, with something like:

$hasLengthCapture = defined('PREG_LENGTH_CAPTURE') ?
PREG_LENGTH_CAPTURE : 0;
$r = preg_match($pat, $sub, $m, PREG_OFFSET_CAPTURE |
$hasLengthCapture);
$matchOneLength = $hasLengthCapture ? $m[1][0] : strlen($m[1][0]);
$matchOneOffset = $m[1][1];

If you introduce a whole new OO accessor object, it starts becoming
very hard to write backward-compatible code.
--scott

Fair enough. I've created https://github.com/php/php-src/pull/3971 to
implement this feature. It would be good to have some confirmation that
this is really a significant performance improvement before we land it
though.

Nikita

--
(http://cscott.net)

--
(http://cscott.net)

5 years ago by Nikita Popov — view source

unread

On Sat, Mar 23, 2019 at 4:07 PM C. Scott Ananian cananian@wikimedia.org
wrote:

Yup, testing via CLI but Wikimedia will (eventually) be running PHP 7.x
with opcache ( https://phabricator.wikimedia.org/T176370 /
https://phabricator.wikimedia.org/T211964 ). It would be nice to fix the
CLI to behave more like the server wrt interned strings. It certainly
would make benchmarking easier/more representative, but we also have a
bunch of testing/CI stuff running in CLI mode so speedups there are
definitely useful, even if they don't "speed up Wikipedia" per se.

Ignoring the zend_hash_find costs for the moment, php_pcre_match_impl
costs 492,077,935 cycles in this benchmark, of which only 211,626,095 are
spent doing the actual php_pcre2_jit_match. 140,000,069 cycles are spent
in populate_subpat_array and 108,742,309 are spent in the zval_ptr_dtor
call on the penultimate line -- basically freeing the $matches array from
the previous call to pcre_match (and this is with PREG_LENGTH_CAPTURE). So
there's still (in theory) a >2x speedup in preg matching available by
tuning how the subpat_array works and making it less costly to
allocate/free $matches.
--scott

The aforementioned optimization to the cache lookup on CLI is now
implemented via https://github.com/php/php-src/pull/3985.

There is more that can be improved here, but it falls more into the realm
of micro-optimization... For example, we can specialize the creation of the
offset pairs to reduce the overhead (
https://github.com/php/php-src/pull/3990), but this still remains the most
expensive part of the subpat population...

Nikita

On Sat, Mar 23, 2019 at 6:32 AM C. Scott Ananian cananian@wikimedia.org
wrote:
So...

In microbenchmarks you can clearly see the improvement:
>>> timeit -n500 preg_match_all('/(.{65535})/s', $html100, $m,
PREG_OFFSET_CAPTURE);
=> 39
Command took 0.001709 seconds on average (0.001654 median; 0.854503
total) to complete.
>>> timeit -n500 preg_match_all('/(.{65535})/s', $html100, $m,
PREG_OFFSET_CAPTURE|PREG_LENGTH_CAPTURE);
=> 39
Command took 0.000991 seconds on average (0.000965 median; 0.495287
total) to complete.
$html100 is my usual 2,592,386 byte HTML of [[en:Barack Obama]].

But unless you're matching 64k strings like this, it's hard to see a
practical improvement. In my remex-html html parsing benchmark, although
LENGTH_CAPTURE doesn't make things slower, it doesn't make a significant
performance improvement. I built php statically and ran it through
cachegrind to try to figure out why, and found:

2,567,559,670 cycles: total spent executing the tokenizer benchmark
(including reading the file from disk)
1,018,845,290 cycles in zif_preg_match. Optimizing regexps is important
for tokenizers! Of these, we spend
575,478,637 doing the actual match (preg_pcre_match_impl) and
435,162,131 getting the regexp from the cache (!)
(pcre_get_compiled_regex_cache)

This is for 128,794 total regexp matches performed by the tokenizer on
this input.

Of those cycles getting the regex from cache, only 24,116,319 are spent
on cache misses where we are actually compiling regexps (sum of
php_pcre2_jit_compile and php_pcre2_compile).
Instead, 406,007,383 cycles are spent in zend_hash_find(). That's 40%
of the total time spent executing preg_match.

The LENGTH_CAPTURE optimization does reduce the total time spent in
populate_subpat_array from 160,951,690 cycles to 140,042,331 cycles in the
remex-html tokenizer on this benchmark, but that difference is overwhelmed
by (for example) the time spent in zend_hash_find().

The slowdown in zend_hash_find() appears to be due to
https://bugs.php.net/bug.php?id=63180 which disabled interned keys in
the pcre_cache table. Because of this, even if the regexs are interned, we
must pay for a complete zend_string_equal_content() on each match, which
takes time proportional to the length of the regex. This can be quite
large -- for example, for the HTML5 character reference regex in
remex-html, which contains every valid entity name and is 26,137 bytes
long, and we need to do a zend_string_equal_content() on the 26,137 byte
regexp for every ~5 byte entity in the parsed HTML.
--scott
Thanks for testing! That's an interesting result. We should be able to do
something about this. There are basically three cases:

CLI (presumably what you're testing). Strings are interned
per-request, but there is only one request.

Server w/o opcache. Strings are interned per-request and there may be
multiple requests.

Server with opcache. Strings are interned permanently in opcache.

Case 3 should already be fast, because permanent interned strings are
allowed into the regex cache. We can optimize case 1 by simply allowing
arbitrary cache keys and discarding the cache in RSHUTDOWN -- it will not
be needed anymore anyway. Case 2 would remain slow, but it's slow anyway...

Nikita

On Thu, Mar 21, 2019 at 7:35 AM Nikita Popov nikita.ppv@gmail.com
wrote:

On Wed, Mar 20, 2019 at 4:35 PM C. Scott Ananian <
cananian@wikimedia.org> wrote:

On Tue, Mar 19, 2019 at 10:58 AM Nikita Popov nikita.ppv@gmail.com
wrote:

After thinking about this some more, while this may be a minor
performance improvement, it still does more work than necessary. In
particular the use of OFFSET_CAPTURE (which would be pretty much required
here) needs one new two-element array for each subpattern. If the captured
strings are short, this is where the main cost is going to be.

The primary use of this feature is when the captured strings are
long, as that's when we most want to avoid copying a substring.

I'm wondering if we shouldn't consider a new object oriented API for
PCRE which can return a match object where subpattern positions and
contents can be queried via method calls, so you only pay for the parts
that you do access.

Seems like this is letting the perfect be the enemy of the good. The
LENGTH_CAPTURE significantly reduces allocation for long match strings, and
it allocates the same two-element arrays that OFFSET_CAPTURE would -- it
just stores an integer where there would otherwise be an expensive
substring. Furthermore, since the array structure is left mostly alone, it
would be not-too-hard to support earlier-PHP versions, with something like:

$hasLengthCapture = defined('PREG_LENGTH_CAPTURE') ?
PREG_LENGTH_CAPTURE : 0;
$r = preg_match($pat, $sub, $m, PREG_OFFSET_CAPTURE |
$hasLengthCapture);
$matchOneLength = $hasLengthCapture ? $m[1][0] : strlen($m[1][0]);
$matchOneOffset = $m[1][1];

If you introduce a whole new OO accessor object, it starts becoming
very hard to write backward-compatible code.
--scott

Fair enough. I've created https://github.com/php/php-src/pull/3971 to
implement this feature. It would be good to have some confirmation that
this is really a significant performance improvement before we land it
though.

Nikita

--
(http://cscott.net)
--
(http://cscott.net)

5 years ago by C. Scott Ananian — view source

unread

Continuing this saga: I'm still having performance problems on character
entity expansion. Here's the baseline code:
https://github.com/wikimedia/remex-html/blob/master/RemexHtml/Tokenizer/Tokenizer.php#L881
Of note: the regular expression is quite large -- around 26kB -- because it
needs to include the complete table of all HTML5 entities, which it gets
from a separate file of tables, HTMLData.php.

Recapping briefly: we established before that it is very important that
large regex strings been interned, otherwise pcre_get_compiled_regex_cache
needs to do a full zend_string_equal_content() on every call to
preg_match*, and since the strings will match, that costs a complete
traversal of the 26kB regexp string.

If I inline the char ref table directly into the regexp as a single huge
literal string, that string is interned and (with Nikita's recent fixes for
the CLI) things are ok.

But that's bad for code maintainability; it violates Do Not Repeat Yourself
and now it's much harder to see what the character reference regexp is
doing because it's got this huge 26k table embedded in the middle of it.

PHP will let me initialize the string as:

const CHAR_REF_REGEXP = ' ... ' . HTMLData::NAMED_ENTITY_REGEX . "...";

that is, it recognizes this as a compile-time constant -- but it doesn't
actually intern the resulting string. The code in
zend_declare_class_constant_ex interns most constant strings, but in this
case because there is a reference to another constant, the Z_TYPE_P(value)
== IS_STRING check in zend_declare_class_constant_ex fails (the actual type
is IS_CONSTANT_AST) presumably because we don't want to autoload HTMLData
too soon. (But this also seems to happen even if I use
self::NAMED_ENTITY_REGEX here, which wouldn't trigger the autoloader.)

I think the proper fix is to intern the string lazily when it is finally
evaluated, in ZEND_FETCH_CLASS_CONSTANT_SPEC_CONST_CONST_HANDLER around the
point where we check Z_TYPE_P(value) == IS_CONSTANT_AST -- probably by
tweaking zval_update_constant_ex to intern any string result?

Thought I'd run that thought process by folks before I go further.
--scott

On Sat, Mar 23, 2019 at 4:07 PM C. Scott Ananian cananian@wikimedia.org
wrote:

Yup, testing via CLI but Wikimedia will (eventually) be running PHP 7.x
with opcache ( https://phabricator.wikimedia.org/T176370 /
https://phabricator.wikimedia.org/T211964 ). It would be nice to fix
the CLI to behave more like the server wrt interned strings. It certainly
would make benchmarking easier/more representative, but we also have a
bunch of testing/CI stuff running in CLI mode so speedups there are
definitely useful, even if they don't "speed up Wikipedia" per se.

Ignoring the zend_hash_find costs for the moment, php_pcre_match_impl
costs 492,077,935 cycles in this benchmark, of which only 211,626,095 are
spent doing the actual php_pcre2_jit_match. 140,000,069 cycles are spent
in populate_subpat_array and 108,742,309 are spent in the zval_ptr_dtor
call on the penultimate line -- basically freeing the $matches array from
the previous call to pcre_match (and this is with PREG_LENGTH_CAPTURE). So
there's still (in theory) a >2x speedup in preg matching available by
tuning how the subpat_array works and making it less costly to
allocate/free $matches.
--scott

The aforementioned optimization to the cache lookup on CLI is now
implemented via https://github.com/php/php-src/pull/3985.

There is more that can be improved here, but it falls more into the realm
of micro-optimization... For example, we can specialize the creation of the
offset pairs to reduce the overhead (
https://github.com/php/php-src/pull/3990), but this still remains the
most expensive part of the subpat population...

Nikita
On Sat, Mar 23, 2019 at 6:13 AM Nikita Popov nikita.ppv@gmail.com
wrote:
On Sat, Mar 23, 2019 at 6:32 AM C. Scott Ananian cananian@wikimedia.org
wrote:
So...

In microbenchmarks you can clearly see the improvement:
>>> timeit -n500 preg_match_all('/(.{65535})/s', $html100, $m,
PREG_OFFSET_CAPTURE);
=> 39
Command took 0.001709 seconds on average (0.001654 median; 0.854503
total) to complete.
>>> timeit -n500 preg_match_all('/(.{65535})/s', $html100, $m,
PREG_OFFSET_CAPTURE|PREG_LENGTH_CAPTURE);
=> 39
Command took 0.000991 seconds on average (0.000965 median; 0.495287
total) to complete.
$html100 is my usual 2,592,386 byte HTML of [[en:Barack Obama]].

But unless you're matching 64k strings like this, it's hard to see a
practical improvement. In my remex-html html parsing benchmark, although
LENGTH_CAPTURE doesn't make things slower, it doesn't make a significant
performance improvement. I built php statically and ran it through
cachegrind to try to figure out why, and found:

2,567,559,670 cycles: total spent executing the tokenizer benchmark
(including reading the file from disk)
1,018,845,290 cycles in zif_preg_match. Optimizing regexps is
important for tokenizers! Of these, we spend
575,478,637 doing the actual match (preg_pcre_match_impl) and
435,162,131 getting the regexp from the cache (!)
(pcre_get_compiled_regex_cache)

This is for 128,794 total regexp matches performed by the tokenizer on
this input.

Of those cycles getting the regex from cache, only 24,116,319 are spent
on cache misses where we are actually compiling regexps (sum of
php_pcre2_jit_compile and php_pcre2_compile).
Instead, 406,007,383 cycles are spent in zend_hash_find(). That's 40%
of the total time spent executing preg_match.

The LENGTH_CAPTURE optimization does reduce the total time spent in
populate_subpat_array from 160,951,690 cycles to 140,042,331 cycles in the
remex-html tokenizer on this benchmark, but that difference is overwhelmed
by (for example) the time spent in zend_hash_find().

The slowdown in zend_hash_find() appears to be due to
https://bugs.php.net/bug.php?id=63180 which disabled interned keys in
the pcre_cache table. Because of this, even if the regexs are interned, we
must pay for a complete zend_string_equal_content() on each match, which
takes time proportional to the length of the regex. This can be quite
large -- for example, for the HTML5 character reference regex in
remex-html, which contains every valid entity name and is 26,137 bytes
long, and we need to do a zend_string_equal_content() on the 26,137 byte
regexp for every ~5 byte entity in the parsed HTML.
--scott
Thanks for testing! That's an interesting result. We should be able to
do something about this. There are basically three cases:

CLI (presumably what you're testing). Strings are interned
per-request, but there is only one request.

Server w/o opcache. Strings are interned per-request and there may be
multiple requests.

Server with opcache. Strings are interned permanently in opcache.

Case 3 should already be fast, because permanent interned strings are
allowed into the regex cache. We can optimize case 1 by simply allowing
arbitrary cache keys and discarding the cache in RSHUTDOWN -- it will not
be needed anymore anyway. Case 2 would remain slow, but it's slow anyway...

Nikita

On Thu, Mar 21, 2019 at 7:35 AM Nikita Popov nikita.ppv@gmail.com
wrote:

On Wed, Mar 20, 2019 at 4:35 PM C. Scott Ananian <
cananian@wikimedia.org> wrote:

On Tue, Mar 19, 2019 at 10:58 AM Nikita Popov nikita.ppv@gmail.com
wrote:

After thinking about this some more, while this may be a minor
performance improvement, it still does more work than necessary. In
particular the use of OFFSET_CAPTURE (which would be pretty much required
here) needs one new two-element array for each subpattern. If the captured
strings are short, this is where the main cost is going to be.

The primary use of this feature is when the captured strings are
long, as that's when we most want to avoid copying a substring.

I'm wondering if we shouldn't consider a new object oriented API for
PCRE which can return a match object where subpattern positions and
contents can be queried via method calls, so you only pay for the parts
that you do access.

Seems like this is letting the perfect be the enemy of the good. The
LENGTH_CAPTURE significantly reduces allocation for long match strings, and
it allocates the same two-element arrays that OFFSET_CAPTURE would -- it
just stores an integer where there would otherwise be an expensive
substring. Furthermore, since the array structure is left mostly alone, it
would be not-too-hard to support earlier-PHP versions, with something like:

$hasLengthCapture = defined('PREG_LENGTH_CAPTURE') ?
PREG_LENGTH_CAPTURE : 0;
$r = preg_match($pat, $sub, $m, PREG_OFFSET_CAPTURE |
$hasLengthCapture);
$matchOneLength = $hasLengthCapture ? $m[1][0] : strlen($m[1][0]);
$matchOneOffset = $m[1][1];

If you introduce a whole new OO accessor object, it starts becoming
very hard to write backward-compatible code.
--scott

Fair enough. I've created https://github.com/php/php-src/pull/3971 to
implement this feature. It would be good to have some confirmation that
this is really a significant performance improvement before we land it
though.

Nikita

--
(http://cscott.net)
--
(http://cscott.net)

--
(http://cscott.net)

5 years ago by C. Scott Ananian — view source

unread

On Wed, Mar 27, 2019 at 2:30 PM C. Scott Ananian cananian@wikimedia.org
wrote:

Continuing this saga: I'm still having performance problems on character
entity expansion. Here's the baseline code:
https://github.com/wikimedia/remex-html/blob/master/RemexHtml/Tokenizer/Tokenizer.php#L881
Of note: the regular expression is quite large -- around 26kB -- because
it needs to include the complete table of all HTML5 entities, which it gets
from a separate file of tables, HTMLData.php.

Recapping briefly: we established before that it is very important that
large regex strings been interned, otherwise pcre_get_compiled_regex_cache
needs to do a full zend_string_equal_content() on every call to
preg_match*, and since the strings will match, that costs a complete
traversal of the 26kB regexp string.

If I inline the char ref table directly into the regexp as a single huge
literal string, that string is interned and (with Nikita's recent fixes for
the CLI) things are ok.

But that's bad for code maintainability; it violates Do Not Repeat
Yourself and now it's much harder to see what the character reference
regexp is doing because it's got this huge 26k table embedded in the middle
of it.

PHP will let me initialize the string as:

const CHAR_REF_REGEXP = ' ... ' . HTMLData::NAMED_ENTITY_REGEX . "...";

that is, it recognizes this as a compile-time constant -- but it doesn't
actually intern the resulting string. The code in
zend_declare_class_constant_ex interns most constant strings, but in this
case because there is a reference to another constant, the Z_TYPE_P(value)
== IS_STRING check in zend_declare_class_constant_ex fails (the actual type
is IS_CONSTANT_AST) presumably because we don't want to autoload HTMLData
too soon. (But this also seems to happen even if I use
self::NAMED_ENTITY_REGEX here, which wouldn't trigger the autoloader.)

I think the proper fix is to intern the string lazily when it is finally
evaluated, in ZEND_FETCH_CLASS_CONSTANT_SPEC_CONST_CONST_HANDLER around the
point where we check Z_TYPE_P(value) == IS_CONSTANT_AST -- probably by
tweaking zval_update_constant_ex to intern any string result?

I've created https://github.com/php/php-src/pull/3994 implementing this
fix, and confirmed that it is sufficient to get my large regexp interned
when it is rewritten as a class constant referencing
HTMLData::NAMED_ENTITY_REGEX.
--scott

5 years ago by C. Scott Ananian — view source

unread

On Wed, Mar 27, 2019 at 5:41 PM C. Scott Ananian cananian@wikimedia.org
wrote:

I've created https://github.com/php/php-src/pull/3994 implementing this
fix, and confirmed that it is sufficient to get my large regexp interned
when it is rewritten as a class constant referencing
HTMLData::NAMED_ENTITY_REGEX.

After some discussion on PR #3994, it was determined that splitting the
pcre cache to allow per-request caching would be a better solution. I'm
closing #3994 and I've created https://github.com/php/php-src/pull/4004
with an initial implementation of the split PCRE cache.
--scott

--
(http://cscott.net)