Newsgroups: php.internals Path: news.php.net Xref: news.php.net php.internals:104899 Return-Path: Delivered-To: mailing list internals@lists.php.net Received: (qmail 52987 invoked from network); 23 Mar 2019 18:14:23 -0000 Received: from unknown (HELO mail-oi1-f178.google.com) (209.85.167.178) by pb1.pair.com with SMTP; 23 Mar 2019 18:14:23 -0000 Received: by mail-oi1-f178.google.com with SMTP id i21so3863481oib.11 for ; Sat, 23 Mar 2019 08:07:03 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=wikimedia.org; s=google; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=gxOJKtbK0ACAaw/a8JiCk0eRsTF1gsAj6e5OVoBOCi4=; b=O+zCccIJKadDdt9BQS4IgH+sy249g5jnjajSyAzUNELliTho2JXBaeqbCyv54XJZO1 YYO/hS44Sk4qwKxe6nWiUUQXPITT0VrOzdveV5VS2tRwy+cuK76LQ92WzN/woeH5bSTG cmRulQQoW8tBHJhdPUo/PDOntH/+ClT4hmqRw= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=gxOJKtbK0ACAaw/a8JiCk0eRsTF1gsAj6e5OVoBOCi4=; b=S+Uj/YyiMxSQ3v5uRJ8POd5uJwOhf1img7JJ9EMXPmCLvo+hzaKjz/4BjepxxcI0Uj Xt9ZlTinTtITDrQWg4H4UZ1Tx4imlYbcMiUkGi2tlGODnH7OwSfmjsh5DGIBaVojnrDT 5tPgzyP9Rt/kGiyWpQMjDeq20MFZ+RyxKYYBAZ+MHyZp15u/oNrYhjjAqhT7nURwpZbb lFgdhmIn+cmKBq+Cyv04r7a33PY98yo2T93Fr2g0zNkvhDV22523sBPydVPGQlKloC7c eRohRSi94+vap9XyHbSXh7Eekxo0f5bgpgtA2Y18xiSnOW/ib6TZnamSWtrnrJf8LaoG Cd3w== X-Gm-Message-State: APjAAAVk8PpBMV7UaMuJ+RByxyBPyWtt226PNoKax+bv8+Q4kkaT84tP RX9NuShYHvaVYGMoP/wQOycfbtEA90T7UJoYfj81iw== X-Google-Smtp-Source: APXvYqyUF0S86TuIQ+ahiIIc+r2HsWE2FffOvK32egaFRLESUxlp8YNHUazE2qdbZIfAXRa/u43ksBL707Jlkd/AESE= X-Received: by 2002:aca:4c88:: with SMTP id z130mr5772202oia.170.1553353623145; Sat, 23 Mar 2019 08:07:03 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: Date: Sat, 23 Mar 2019 11:06:51 -0400 Message-ID: To: Nikita Popov Cc: PHP internals Content-Type: multipart/alternative; boundary="000000000000eb24ff0584c45262" Subject: Re: [PHP-DEV] Offset-only results from preg_match From: cananian@wikimedia.org ("C. Scott Ananian") --000000000000eb24ff0584c45262 Content-Type: text/plain; charset="UTF-8" Yup, testing via CLI but Wikimedia will (eventually) be running PHP 7.x with opcache ( https://phabricator.wikimedia.org/T176370 / https://phabricator.wikimedia.org/T211964 ). It would be nice to fix the CLI to behave more like the server wrt interned strings. It certainly would make benchmarking easier/more representative, but we also have a bunch of testing/CI stuff running in CLI mode so speedups there are definitely useful, even if they don't "speed up Wikipedia" per se. Ignoring the zend_hash_find costs for the moment, php_pcre_match_impl costs 492,077,935 cycles in this benchmark, of which only 211,626,095 are spent doing the actual php_pcre2_jit_match. 140,000,069 cycles are spent in populate_subpat_array and 108,742,309 are spent in the zval_ptr_dtor call on the penultimate line -- basically freeing the $matches array from the previous call to pcre_match (and this is with PREG_LENGTH_CAPTURE). So there's still (in theory) a >2x speedup in preg matching available by tuning how the subpat_array works and making it less costly to allocate/free $matches. --scott On Sat, Mar 23, 2019 at 6:13 AM Nikita Popov wrote: > On Sat, Mar 23, 2019 at 6:32 AM C. Scott Ananian > wrote: > >> So... >> >> In microbenchmarks you can clearly see the improvement: >> ``` >> >>> timeit -n500 preg_match_all('/(.{65535})/s', $html100, $m, >> PREG_OFFSET_CAPTURE); >> => 39 >> Command took 0.001709 seconds on average (0.001654 median; 0.854503 >> total) to complete. >> >>> timeit -n500 preg_match_all('/(.{65535})/s', $html100, $m, >> PREG_OFFSET_CAPTURE|PREG_LENGTH_CAPTURE); >> => 39 >> Command took 0.000991 seconds on average (0.000965 median; 0.495287 >> total) to complete. >> ``` >> $html100 is my usual 2,592,386 byte HTML of [[en:Barack Obama]]. >> >> But unless you're matching 64k strings like this, it's hard to see a >> practical improvement. In my remex-html html parsing benchmark, although >> LENGTH_CAPTURE doesn't make things slower, it doesn't make a significant >> performance improvement. I built php statically and ran it through >> cachegrind to try to figure out why, and found: >> >> 2,567,559,670 cycles: total spent executing the tokenizer benchmark >> (including reading the file from disk) >> 1,018,845,290 cycles in zif_preg_match. Optimizing regexps is important >> for tokenizers! Of these, we spend >> 575,478,637 doing the actual match (preg_pcre_match_impl) and >> 435,162,131 getting the regexp from the cache (!) >> (pcre_get_compiled_regex_cache) >> >> This is for 128,794 total regexp matches performed by the tokenizer on >> this input. >> >> Of those cycles getting the regex from cache, only 24,116,319 are spent >> on cache misses where we are actually compiling regexps (sum of >> php_pcre2_jit_compile and php_pcre2_compile). >> Instead, 406,007,383 cycles are spent in zend_hash_find(). That's 40% of >> the total time spent executing preg_match. >> >> The LENGTH_CAPTURE optimization does reduce the total time spent in >> populate_subpat_array from 160,951,690 cycles to 140,042,331 cycles in the >> remex-html tokenizer on this benchmark, but that difference is overwhelmed >> by (for example) the time spent in zend_hash_find(). >> >> The slowdown in zend_hash_find() appears to be due to >> https://bugs.php.net/bug.php?id=63180 which disabled interned keys in >> the pcre_cache table. Because of this, even if the regexs are interned, we >> must pay for a complete zend_string_equal_content() on each match, which >> takes time proportional to the length of the regex. This can be quite >> large -- for example, for the HTML5 character reference regex in >> remex-html, which contains every valid entity name and is 26,137 bytes >> long, and we need to do a zend_string_equal_content() on the 26,137 byte >> regexp for every ~5 byte entity in the parsed HTML. >> --scott >> > > Thanks for testing! That's an interesting result. We should be able to do > something about this. There are basically three cases: > > 1. CLI (presumably what you're testing). Strings are interned per-request, > but there is only one request. > 2. Server w/o opcache. Strings are interned per-request and there may be > multiple requests. > 3. Server with opcache. Strings are interned permanently in opcache. > > Case 3 should already be fast, because permanent interned strings are > allowed into the regex cache. We can optimize case 1 by simply allowing > arbitrary cache keys and discarding the cache in RSHUTDOWN -- it will not > be needed anymore anyway. Case 2 would remain slow, but it's slow anyway... > > Nikita > > >> On Thu, Mar 21, 2019 at 7:35 AM Nikita Popov >> wrote: >> >>> On Wed, Mar 20, 2019 at 4:35 PM C. Scott Ananian >>> wrote: >>> >>>> On Tue, Mar 19, 2019 at 10:58 AM Nikita Popov >>>> wrote: >>>> >>>>> After thinking about this some more, while this may be a minor >>>>> performance improvement, it still does more work than necessary. In >>>>> particular the use of OFFSET_CAPTURE (which would be pretty much required >>>>> here) needs one new two-element array for each subpattern. If the captured >>>>> strings are short, this is where the main cost is going to be. >>>>> >>>> >>>> The primary use of this feature is when the captured strings are >>>> *long*, as that's when we most want to avoid copying a substring. >>>> >>>> >>>>> I'm wondering if we shouldn't consider a new object oriented API for >>>>> PCRE which can return a match object where subpattern positions and >>>>> contents can be queried via method calls, so you only pay for the parts >>>>> that you do access. >>>>> >>>> >>>> Seems like this is letting the perfect be the enemy of the good. The >>>> LENGTH_CAPTURE significantly reduces allocation for long match strings, and >>>> it allocates the same two-element arrays that OFFSET_CAPTURE would -- it >>>> just stores an integer where there would otherwise be an expensive >>>> substring. Furthermore, since the array structure is left mostly alone, it >>>> would be not-too-hard to support earlier-PHP versions, with something like: >>>> >>>> $hasLengthCapture = defined('PREG_LENGTH_CAPTURE') ? >>>> PREG_LENGTH_CAPTURE : 0; >>>> $r = preg_match($pat, $sub, $m, PREG_OFFSET_CAPTURE | >>>> $hasLengthCapture); >>>> $matchOneLength = $hasLengthCapture ? $m[1][0] : strlen($m[1][0]); >>>> $matchOneOffset = $m[1][1]; >>>> >>>> If you introduce a whole new OO accessor object, it starts becoming >>>> very hard to write backward-compatible code. >>>> --scott >>>> >>> >>> Fair enough. I've created https://github.com/php/php-src/pull/3971 to >>> implement this feature. It would be good to have some confirmation that >>> this is really a significant performance improvement before we land it >>> though. >>> >>> Nikita >>> >> >> >> -- >> (http://cscott.net) >> > -- (http://cscott.net) --000000000000eb24ff0584c45262--