Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:104885
MIME-Version: 1.0
References: <CAK5kH3x-P+eDXtf0wsm-=sffMq6y_6QGAdavikMEOGu=R1ddeA@mail.gmail.com>
 <CAK5kH3zGox7e9o3NhQav2hutk3hZoKAeXPkH7dNK9+aSDmbemw@mail.gmail.com>
 <CAF+90c89oR5OhC6ihAiSrQZtPfCEUStevKp8jg00HVE1gcpKDg@mail.gmail.com>
 <CAF+90c_0+eDvxuNi1NyMN=K39MtaNH7qXV=WrH5D0CB943VJng@mail.gmail.com>
 <CAK5kH3xk=xNR2+5McWVaOfo9_MRS=JydTpraSB1DpQXNjkvGUQ@mail.gmail.com>
 <CAF+90c_fvgMo+Et_RDK_P-fudVQO+ZGen=F5gGxvhRsy2ntkrQ@mail.gmail.com> <CAK5kH3yJuh-_vkpiBNjXyr-+E_zA+bFy=cZvEVcWSb3R4qLp7g@mail.gmail.com>
In-Reply-To: <CAK5kH3yJuh-_vkpiBNjXyr-+E_zA+bFy=cZvEVcWSb3R4qLp7g@mail.gmail.com>
Date: Sat, 23 Mar 2019 11:13:11 +0100
Message-ID: <CAF+90c_O_FLw9Ms7s34+Lqv5CvMyNi8qWW9NKn9gnh99ash+Aw@mail.gmail.com>
To: "C. Scott Ananian" <cananian@wikimedia.org>
Cc: PHP internals <internals@lists.php.net>
Content-Type: multipart/alternative; boundary="000000000000ffec3c0584c0385e"
Subject: Re: [PHP-DEV] Offset-only results from preg_match
From: nikita.ppv@gmail.com (Nikita Popov)

--000000000000ffec3c0584c0385e
Content-Type: text/plain; charset="UTF-8"

On Sat, Mar 23, 2019 at 6:32 AM C. Scott Ananian <cananian@wikimedia.org>
wrote:

> So...
>
> In microbenchmarks you can clearly see the improvement:
> ```
> >>> timeit -n500 preg_match_all('/(.{65535})/s', $html100, $m,
> PREG_OFFSET_CAPTURE);
> => 39
> Command took 0.001709 seconds on average (0.001654 median; 0.854503 total)
> to complete.
> >>> timeit -n500 preg_match_all('/(.{65535})/s', $html100, $m,
> PREG_OFFSET_CAPTURE|PREG_LENGTH_CAPTURE);
> => 39
> Command took 0.000991 seconds on average (0.000965 median; 0.495287 total)
> to complete.
> ```
> $html100 is my usual 2,592,386 byte HTML of [[en:Barack Obama]].
>
> But unless you're matching 64k strings like this, it's hard to see a
> practical improvement.  In my remex-html html parsing benchmark, although
> LENGTH_CAPTURE doesn't make things slower, it doesn't make a significant
> performance improvement.  I built php statically and ran it through
> cachegrind to try to figure out why, and found:
>
> 2,567,559,670 cycles: total spent executing the tokenizer benchmark
> (including reading the file from disk)
> 1,018,845,290 cycles in zif_preg_match.  Optimizing regexps is important
> for tokenizers! Of these, we spend
>   575,478,637 doing the actual match (preg_pcre_match_impl) and
>   435,162,131 getting the regexp from the cache (!)
> (pcre_get_compiled_regex_cache)
>
> This is for 128,794 total regexp matches performed by the tokenizer on
> this input.
>
> Of those cycles getting the regex from cache, only 24,116,319 are spent on
> cache misses where we are actually compiling regexps (sum of
> php_pcre2_jit_compile and php_pcre2_compile).
> Instead, 406,007,383 cycles are spent in zend_hash_find().  That's 40% of
> the total time spent executing preg_match.
>
> The LENGTH_CAPTURE optimization does reduce the total time spent in
> populate_subpat_array from 160,951,690 cycles to 140,042,331 cycles in the
> remex-html tokenizer on this benchmark, but that difference is overwhelmed
> by (for example) the time spent in zend_hash_find().
>
> The slowdown in zend_hash_find() appears to be due to
> https://bugs.php.net/bug.php?id=63180 which disabled interned keys in the
> pcre_cache table.  Because of this, even if the regexs are interned, we
> must pay for a complete zend_string_equal_content() on each match, which
> takes time proportional to the length of the regex.  This can be quite
> large -- for example, for the HTML5 character reference regex in
> remex-html, which contains every valid entity name and is 26,137 bytes
> long, and we need to do a zend_string_equal_content() on the 26,137 byte
> regexp for every ~5 byte entity in the parsed HTML.
>   --scott
>

Thanks for testing! That's an interesting result. We should be able to do
something about this. There are basically three cases:

1. CLI (presumably what you're testing). Strings are interned per-request,
but there is only one request.
2. Server w/o opcache. Strings are interned per-request and there may be
multiple requests.
3. Server with opcache. Strings are interned permanently in opcache.

Case 3 should already be fast, because permanent interned strings are
allowed into the regex cache. We can optimize case 1 by simply allowing
arbitrary cache keys and discarding the cache in RSHUTDOWN -- it will not
be needed anymore anyway. Case 2 would remain slow, but it's slow anyway...

Nikita


> On Thu, Mar 21, 2019 at 7:35 AM Nikita Popov <nikita.ppv@gmail.com> wrote:
>
>> On Wed, Mar 20, 2019 at 4:35 PM C. Scott Ananian <cananian@wikimedia.org>
>> wrote:
>>
>>> On Tue, Mar 19, 2019 at 10:58 AM Nikita Popov <nikita.ppv@gmail.com>
>>> wrote:
>>>
>>>> After thinking about this some more, while this may be a minor
>>>> performance improvement, it still does more work than necessary. In
>>>> particular the use of OFFSET_CAPTURE (which would be pretty much required
>>>> here) needs one new two-element array for each subpattern. If the captured
>>>> strings are short, this is where the main cost is going to be.
>>>>
>>>
>>> The primary use of this feature is when the captured strings are *long*,
>>> as that's when we most want to avoid copying a substring.
>>>
>>>
>>>> I'm wondering if we shouldn't consider a new object oriented API for
>>>> PCRE which can return a match object where subpattern positions and
>>>> contents can be queried via method calls, so you only pay for the parts
>>>> that you do access.
>>>>
>>>
>>> Seems like this is letting the perfect be the enemy of the good.  The
>>> LENGTH_CAPTURE significantly reduces allocation for long match strings, and
>>> it allocates the same two-element arrays that OFFSET_CAPTURE would -- it
>>> just stores an integer where there would otherwise be an expensive
>>> substring.  Furthermore, since the array structure is left mostly alone, it
>>> would be not-too-hard to support earlier-PHP versions, with something like:
>>>
>>> $hasLengthCapture = defined('PREG_LENGTH_CAPTURE') ? PREG_LENGTH_CAPTURE
>>> : 0;
>>> $r = preg_match($pat, $sub, $m, PREG_OFFSET_CAPTURE | $hasLengthCapture);
>>> $matchOneLength = $hasLengthCapture ? $m[1][0] : strlen($m[1][0]);
>>> $matchOneOffset = $m[1][1];
>>>
>>> If you introduce a whole new OO accessor object, it starts becoming very
>>> hard to write backward-compatible code.
>>>  --scott
>>>
>>
>> Fair enough. I've created https://github.com/php/php-src/pull/3971 to
>> implement this feature. It would be good to have some confirmation that
>> this is really a significant performance improvement before we land it
>> though.
>>
>> Nikita
>>
>
>
> --
> (http://cscott.net)
>

--000000000000ffec3c0584c0385e--