Newsgroups: php.internals
Xref: news.php.net php.internals:104957
Date:
Tue, 26 Mar 2019 10:30:40 +0100
To: "C. Scott Ananian"
Cc: PHP internals
Subject: Re: [PHP-DEV] Offset-only results from preg_match
From: nikita.ppv@gmail.com (Nikita Popov)

On Sat, Mar 23, 2019 at 4:07 PM C. Scott Ananian wrote:

> Yup, testing via CLI but Wikimedia will (eventually) be running PHP 7.x
> with opcache (https://phabricator.wikimedia.org/T176370 /
> https://phabricator.wikimedia.org/T211964). It would be nice to fix the
> CLI to behave more like the server wrt interned strings. It certainly
> would make benchmarking easier/more representative, but we also have a
> bunch of testing/CI stuff running in CLI mode, so speedups there are
> definitely useful, even if they don't "speed up Wikipedia" per se.
>
> Ignoring the zend_hash_find costs for the moment, php_pcre_match_impl
> costs 492,077,935 cycles in this benchmark, of which only 211,626,095
> are spent doing the actual php_pcre2_jit_match. 140,000,069 cycles are
> spent in populate_subpat_array and 108,742,309 are spent in the
> zval_ptr_dtor call on the penultimate line -- basically freeing the
> $matches array from the previous call to pcre_match (and this is with
> PREG_LENGTH_CAPTURE). So there's still (in theory) a >2x speedup in
> preg matching available by tuning how the subpat_array works and making
> it less costly to allocate/free $matches.
>
> --scott

The aforementioned optimization to the cache lookup on CLI is now
implemented via https://github.com/php/php-src/pull/3985. There is more
that can be improved here, but it falls more into the realm of
micro-optimization... For example, we can specialize the creation of the
offset pairs to reduce the overhead
(https://github.com/php/php-src/pull/3990), but this still remains the
most expensive part of the subpat population...
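[Editor's note: a minimal sketch of the flag semantics under discussion.
PREG_LENGTH_CAPTURE only exists on builds carrying the patch referenced in
this thread, hence the defined() guard; the pattern and subject are
illustrative.]

```php
<?php
$subject = 'hello world';

// Plain OFFSET_CAPTURE: each capture is [substring, offset], so the
// engine allocates a copy of every captured substring.
preg_match('/(world)/', $subject, $m, PREG_OFFSET_CAPTURE);
// $m[1] === ['world', 6]

// With LENGTH_CAPTURE added, each capture becomes [length, offset];
// the substring copy is replaced by a plain integer.
if (defined('PREG_LENGTH_CAPTURE')) {
    preg_match('/(world)/', $subject, $m,
               PREG_OFFSET_CAPTURE | PREG_LENGTH_CAPTURE);
    // $m[1] === [5, 6]
}
```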
Nikita

> On Sat, Mar 23, 2019 at 6:13 AM Nikita Popov wrote:
>
>> On Sat, Mar 23, 2019 at 6:32 AM C. Scott Ananian wrote:
>>
>>> So...
>>>
>>> In microbenchmarks you can clearly see the improvement:
>>> ```
>>> timeit -n500 preg_match_all('/(.{65535})/s', $html100, $m,
>>>     PREG_OFFSET_CAPTURE);
>>> => 39
>>> Command took 0.001709 seconds on average (0.001654 median; 0.854503
>>> total) to complete.
>>>
>>> timeit -n500 preg_match_all('/(.{65535})/s', $html100, $m,
>>>     PREG_OFFSET_CAPTURE|PREG_LENGTH_CAPTURE);
>>> => 39
>>> Command took 0.000991 seconds on average (0.000965 median; 0.495287
>>> total) to complete.
>>> ```
>>> $html100 is my usual 2,592,386 byte HTML of [[en:Barack Obama]].
>>>
>>> But unless you're matching 64k strings like this, it's hard to see a
>>> practical improvement. In my remex-html HTML parsing benchmark,
>>> although LENGTH_CAPTURE doesn't make things slower, it doesn't make a
>>> significant performance improvement. I built php statically and ran it
>>> through cachegrind to try to figure out why, and found:
>>>
>>> 2,567,559,670 cycles: total spent executing the tokenizer benchmark
>>> (including reading the file from disk)
>>> 1,018,845,290 cycles in zif_preg_match. Optimizing regexps is
>>> important for tokenizers! Of these, we spend
>>> 575,478,637 doing the actual match (php_pcre_match_impl) and
>>> 435,162,131 getting the regexp from the cache (!)
>>> (pcre_get_compiled_regex_cache)
>>>
>>> This is for 128,794 total regexp matches performed by the tokenizer on
>>> this input.
>>>
>>> Of those cycles getting the regex from cache, only 24,116,319 are
>>> spent on cache misses where we are actually compiling regexps (sum of
>>> php_pcre2_jit_compile and php_pcre2_compile).
>>> Instead, 406,007,383 cycles are spent in zend_hash_find(). That's 40%
>>> of the total time spent executing preg_match.
>>>
>>> The LENGTH_CAPTURE optimization does reduce the total time spent in
>>> populate_subpat_array from 160,951,690 cycles to 140,042,331 cycles in
>>> the remex-html tokenizer on this benchmark, but that difference is
>>> overwhelmed by (for example) the time spent in zend_hash_find().
>>>
>>> The slowdown in zend_hash_find() appears to be due to
>>> https://bugs.php.net/bug.php?id=63180, which disabled interned keys in
>>> the pcre_cache table. Because of this, even if the regexes are
>>> interned, we must pay for a complete zend_string_equal_content() on
>>> each match, which takes time proportional to the length of the regex.
>>> This can be quite large -- for example, the HTML5 character reference
>>> regex in remex-html contains every valid entity name and is 26,137
>>> bytes long, and we need to do a zend_string_equal_content() on that
>>> 26,137 byte regexp for every ~5 byte entity in the parsed HTML.
>>>
>>> --scott
>>
>> Thanks for testing! That's an interesting result. We should be able to
>> do something about this. There are basically three cases:
>>
>> 1. CLI (presumably what you're testing). Strings are interned
>>    per-request, but there is only one request.
>> 2. Server w/o opcache. Strings are interned per-request and there may
>>    be multiple requests.
>> 3. Server with opcache. Strings are interned permanently in opcache.
>>
>> Case 3 should already be fast, because permanent interned strings are
>> allowed into the regex cache. We can optimize case 1 by simply allowing
>> arbitrary cache keys and discarding the cache in RSHUTDOWN -- it will
>> not be needed anymore anyway. Case 2 would remain slow, but it's slow
>> anyway...
>>
>> Nikita
>>
>>> On Thu, Mar 21, 2019 at 7:35 AM Nikita Popov wrote:
>>>
>>>> On Wed, Mar 20, 2019 at 4:35 PM C. Scott Ananian
>>>> <cananian@wikimedia.org> wrote:
>>>>
>>>>> On Tue, Mar 19, 2019 at 10:58 AM Nikita Popov wrote:
>>>>>
>>>>>> After thinking about this some more, while this may be a minor
>>>>>> performance improvement, it still does more work than necessary.
>>>>>> In particular the use of OFFSET_CAPTURE (which would be pretty
>>>>>> much required here) needs one new two-element array for each
>>>>>> subpattern. If the captured strings are short, this is where the
>>>>>> main cost is going to be.
>>>>>
>>>>> The primary use of this feature is when the captured strings are
>>>>> *long*, as that's when we most want to avoid copying a substring.
>>>>>
>>>>>> I'm wondering if we shouldn't consider a new object-oriented API
>>>>>> for PCRE which can return a match object where subpattern
>>>>>> positions and contents can be queried via method calls, so you
>>>>>> only pay for the parts that you do access.
>>>>>
>>>>> Seems like this is letting the perfect be the enemy of the good.
>>>>> The LENGTH_CAPTURE significantly reduces allocation for long match
>>>>> strings, and it allocates the same two-element arrays that
>>>>> OFFSET_CAPTURE would -- it just stores an integer where there would
>>>>> otherwise be an expensive substring. Furthermore, since the array
>>>>> structure is left mostly alone, it would be not-too-hard to support
>>>>> earlier PHP versions, with something like:
>>>>>
>>>>> $hasLengthCapture = defined('PREG_LENGTH_CAPTURE') ?
>>>>>     PREG_LENGTH_CAPTURE : 0;
>>>>> $r = preg_match($pat, $sub, $m,
>>>>>     PREG_OFFSET_CAPTURE | $hasLengthCapture);
>>>>> $matchOneLength = $hasLengthCapture ? $m[1][0] : strlen($m[1][0]);
>>>>> $matchOneOffset = $m[1][1];
>>>>>
>>>>> If you introduce a whole new OO accessor object, it starts becoming
>>>>> very hard to write backward-compatible code.
>>>>>
>>>>> --scott
>>>>
>>>> Fair enough. I've created https://github.com/php/php-src/pull/3971
>>>> to implement this feature.
>>>> It would be good to have some confirmation that this is really a
>>>> significant performance improvement before we land it though.
>>>>
>>>> Nikita
>>>
>>> --
>>> (http://cscott.net)
>
> --
> (http://cscott.net)
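[Editor's note: the "avoid copying a substring" use case discussed in this
thread can be approximated on any PHP version by capturing offsets and
slicing lazily. A rough sketch; the pattern, subject, and variable names
are illustrative, and note that preg_match itself still copies each
captured substring here -- removing that per-match copy is exactly what
LENGTH_CAPTURE is for.]

```php
<?php
// Illustrative tokenizer-style loop: record [offset, length] pairs and
// only materialize the substrings that are actually consumed later.
$subject = str_repeat('<p>x</p>', 4);
$tokens  = [];
$pos     = 0;

while (preg_match('~<p>(.*?)</p>~s', $subject, $m,
                  PREG_OFFSET_CAPTURE, $pos)) {
    [$text, $off] = $m[1];                        // [substring, offset]
    $tokens[]     = [$off, strlen($text)];
    $pos          = $m[0][1] + strlen($m[0][0]);  // resume after the match
}

// Slice lazily, only when a token's text is needed:
$first = substr($subject, $tokens[0][0], $tokens[0][1]);
// $first === 'x', count($tokens) === 4
```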