Newsgroups: php.internals Path: news.php.net Xref: news.php.net php.internals:104885 Return-Path: Delivered-To: mailing list internals@lists.php.net Received: (qmail 86047 invoked from network); 23 Mar 2019 13:20:52 -0000 Received: from unknown (HELO mail-it1-f175.google.com) (209.85.166.175) by pb1.pair.com with SMTP; 23 Mar 2019 13:20:52 -0000 Received: by mail-it1-f175.google.com with SMTP id e24so7194384itl.1 for ; Sat, 23 Mar 2019 03:13:28 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=vpSQcnIX5vCHqE6wAGlsPeVv+00khbaJ2yZeSa4sLTU=; b=GWmnTi7dOjV0xpQHq33G7+ZgXBFOAi/fgjGb9ifu1/YYxk6OCgA3J54Wtjgr3/0evJ WV2n+E52aOB2BlcHiNj8fc1np+mAKBIkX6AGdOxktjpqkxtqRviKT/BFFOYilZX4rcm3 S6D3AqFUXA5+Guj3Otaxf/4XRfdAQIHhKQRL/U1vl7LIG0s36NKOORpFl5L8K7GNLIbL Jm3QJDLR4aSt/0Qn8/mN8Mc9dPjF4+hU3ZT2zivGbuTwIqWkEc1ZLfr39lwpjc+vL0WR JSy0KuhqK/1m39RbTA3fB0AJlD5wOkFUCq8xzTaexKluDsiZti6G1jPHWqngZS7JejCe iFQg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=vpSQcnIX5vCHqE6wAGlsPeVv+00khbaJ2yZeSa4sLTU=; b=Hk/+NPS0iGXtqiIcpHytjLh13ZC08cidxvMmkv+qv9MCQnh8XqlPeVBeicg/MKNqTw L5U1lfPo7oh9FMBQ64fvsSTHngdPtRGvFVPsm7btlXuCg09+BvDRsxvg2NLhxxLnasiN YajxuxQEFITusdvEJwD/r9NJq/ReakFEGJGoJ3d+e+WfsO/qD/5h1SHc0BmD3Zz2X5bI WhGpRe1eVMq05MaFvEPgrmegQG6mEIgPeCFNyTLEHJmIUUc+d3FXjeQExTSH/h4BuOhT gp4u0rlUf0RBp4ddoygvO/LHxKxydEup9Bo1iEGb5kXBE8dHA715oFUtgzfMhs+e4d8w xTgg== X-Gm-Message-State: APjAAAVqvMHx5ZJ0cdG+4fElc/O5kSP36J6hK48N8Cs7J75q9qWf4RTw R1rpfQEMwyEPAL2E/BqAcg23bAMBFX1I07mLU3zY0xdt X-Google-Smtp-Source: APXvYqz4yT2aVQbE6a/yqkYTxGTdsvE2qeoRqabxfWTmPq+CPhkF4UkIKzJUt9A3375cmbwGtSBVc5iCoM0wEVrqHTI= X-Received: by 2002:a24:78ca:: with SMTP id p193mr1508242itc.27.1553336008434; Sat, 23 Mar 2019 03:13:28 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: Date: Sat, 23 Mar 2019 11:13:11 +0100 Message-ID: To: "C. Scott Ananian" Cc: PHP internals Content-Type: multipart/alternative; boundary="000000000000ffec3c0584c0385e" Subject: Re: [PHP-DEV] Offset-only results from preg_match From: nikita.ppv@gmail.com (Nikita Popov) --000000000000ffec3c0584c0385e Content-Type: text/plain; charset="UTF-8" On Sat, Mar 23, 2019 at 6:32 AM C. Scott Ananian wrote: > So... > > In microbenchmarks you can clearly see the improvement: > ``` > >>> timeit -n500 preg_match_all('/(.{65535})/s', $html100, $m, > PREG_OFFSET_CAPTURE); > => 39 > Command took 0.001709 seconds on average (0.001654 median; 0.854503 total) > to complete. > >>> timeit -n500 preg_match_all('/(.{65535})/s', $html100, $m, > PREG_OFFSET_CAPTURE|PREG_LENGTH_CAPTURE); > => 39 > Command took 0.000991 seconds on average (0.000965 median; 0.495287 total) > to complete. > ``` > $html100 is my usual 2,592,386 byte HTML of [[en:Barack Obama]]. > > But unless you're matching 64k strings like this, it's hard to see a > practical improvement. In my remex-html html parsing benchmark, although > LENGTH_CAPTURE doesn't make things slower, it doesn't make a significant > performance improvement. I built php statically and ran it through > cachegrind to try to figure out why, and found: > > 2,567,559,670 cycles: total spent executing the tokenizer benchmark > (including reading the file from disk) > 1,018,845,290 cycles in zif_preg_match. Optimizing regexps is important > for tokenizers! Of these, we spend > 575,478,637 doing the actual match (preg_pcre_match_impl) and > 435,162,131 getting the regexp from the cache (!) > (pcre_get_compiled_regex_cache) > > This is for 128,794 total regexp matches performed by the tokenizer on > this input. > > Of those cycles getting the regex from cache, only 24,116,319 are spent on > cache misses where we are actually compiling regexps (sum of > php_pcre2_jit_compile and php_pcre2_compile). > Instead, 406,007,383 cycles are spent in zend_hash_find(). That's 40% of > the total time spent executing preg_match. > > The LENGTH_CAPTURE optimization does reduce the total time spent in > populate_subpat_array from 160,951,690 cycles to 140,042,331 cycles in the > remex-html tokenizer on this benchmark, but that difference is overwhelmed > by (for example) the time spent in zend_hash_find(). > > The slowdown in zend_hash_find() appears to be due to > https://bugs.php.net/bug.php?id=63180 which disabled interned keys in the > pcre_cache table. Because of this, even if the regexs are interned, we > must pay for a complete zend_string_equal_content() on each match, which > takes time proportional to the length of the regex. This can be quite > large -- for example, for the HTML5 character reference regex in > remex-html, which contains every valid entity name and is 26,137 bytes > long, and we need to do a zend_string_equal_content() on the 26,137 byte > regexp for every ~5 byte entity in the parsed HTML. > --scott > Thanks for testing! That's an interesting result. We should be able to do something about this. There are basically three cases: 1. CLI (presumably what you're testing). Strings are interned per-request, but there is only one request. 2. Server w/o opcache. Strings are interned per-request and there may be multiple requests. 3. Server with opcache. Strings are interned permanently in opcache. Case 3 should already be fast, because permanent interned strings are allowed into the regex cache. We can optimize case 1 by simply allowing arbitrary cache keys and discarding the cache in RSHUTDOWN -- it will not be needed anymore anyway. Case 2 would remain slow, but it's slow anyway... Nikita > On Thu, Mar 21, 2019 at 7:35 AM Nikita Popov wrote: > >> On Wed, Mar 20, 2019 at 4:35 PM C. Scott Ananian >> wrote: >> >>> On Tue, Mar 19, 2019 at 10:58 AM Nikita Popov >>> wrote: >>> >>>> After thinking about this some more, while this may be a minor >>>> performance improvement, it still does more work than necessary. In >>>> particular the use of OFFSET_CAPTURE (which would be pretty much required >>>> here) needs one new two-element array for each subpattern. If the captured >>>> strings are short, this is where the main cost is going to be. >>>> >>> >>> The primary use of this feature is when the captured strings are *long*, >>> as that's when we most want to avoid copying a substring. >>> >>> >>>> I'm wondering if we shouldn't consider a new object oriented API for >>>> PCRE which can return a match object where subpattern positions and >>>> contents can be queried via method calls, so you only pay for the parts >>>> that you do access. >>>> >>> >>> Seems like this is letting the perfect be the enemy of the good. The >>> LENGTH_CAPTURE significantly reduces allocation for long match strings, and >>> it allocates the same two-element arrays that OFFSET_CAPTURE would -- it >>> just stores an integer where there would otherwise be an expensive >>> substring. Furthermore, since the array structure is left mostly alone, it >>> would be not-too-hard to support earlier-PHP versions, with something like: >>> >>> $hasLengthCapture = defined('PREG_LENGTH_CAPTURE') ? PREG_LENGTH_CAPTURE >>> : 0; >>> $r = preg_match($pat, $sub, $m, PREG_OFFSET_CAPTURE | $hasLengthCapture); >>> $matchOneLength = $hasLengthCapture ? $m[1][0] : strlen($m[1][0]); >>> $matchOneOffset = $m[1][1]; >>> >>> If you introduce a whole new OO accessor object, it starts becoming very >>> hard to write backward-compatible code. >>> --scott >>> >> >> Fair enough. I've created https://github.com/php/php-src/pull/3971 to >> implement this feature. It would be good to have some confirmation that >> this is really a significant performance improvement before we land it >> though. >> >> Nikita >> > > > -- > (http://cscott.net) > --000000000000ffec3c0584c0385e--