Newsgroups: php.internals Path: news.php.net Xref: news.php.net php.internals:104882 Return-Path: Delivered-To: mailing list internals@lists.php.net Received: (qmail 36013 invoked from network); 23 Mar 2019 08:39:27 -0000 Received: from unknown (HELO mail-ot1-f53.google.com) (209.85.210.53) by pb1.pair.com with SMTP; 23 Mar 2019 08:39:27 -0000 Received: by mail-ot1-f53.google.com with SMTP id o74so3810798ota.3 for ; Fri, 22 Mar 2019 22:32:01 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=wikimedia.org; s=google; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=8HBe/ifitJN0UD7ESEqvQcaArmZBczTEcE+wvFtp43s=; b=pNWglFCEiVSbWSC+aa5nSjiFTe89SJFFWo+5TG0JJN5MeVdwH0DUbqob0D+2+2RzN4 HFENCYpIQ0+gMziEUWCdlSxsZ9uWEJX7qgfAuGzrTY9dZArjv4QEBdPeR2V56CcTRrYm NrVlPMH95MjlvAeeROlOqy+0Ef6/3rQjKShko= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=8HBe/ifitJN0UD7ESEqvQcaArmZBczTEcE+wvFtp43s=; b=AABSck3S3ZMbfIWLWibf950NNx9LzjJV9ptwEKoI4LRJiYCkTHf+rVb/Me7V6eBF4X KULdivig0HLdAsFpag7lc26cZta4ycrmisVkYECh/uUwIlHbxOdAcb7PFBzv1mGTE2fo kPE16geDomDNKc/zyWf96qHrghnnNOx/YmMKCCt4HYFv9U/KveACkrW2PrWxKWYsv/Jk sL+H6fzplqovDrve9bS/PQuoiGPtn19JWJ8rkRQa/r3jcV4MJo6hwGic2gyvWfiE5dAJ zM/hztpGPojkD0bMHs02uVToOKgeBfSkZ/2+TX0nW5/5EDIHdWSxHg+MPvTWM0KoVqIK eN+g== X-Gm-Message-State: APjAAAWKR7FSVkK2EjwVdm9u1ofeFOJWnLnfVIceHH75MZGFlHGOmhFk 6wGTapdato0jkI3ic58IInhVh7Ko9ivLkngbbY3wCA== X-Google-Smtp-Source: APXvYqzqG+v5r6R3n4tpy7JmCMaegof1OJONncsAV9W4frJJibIT9BS4DpdHeZiSICmKdpP+LwyGIftL9OnhxoCNlN8= X-Received: by 2002:a9d:604c:: with SMTP id v12mr3856161otj.247.1553319120545; Fri, 22 Mar 2019 22:32:00 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: Date: Sat, 23 Mar 2019 01:31:47 -0400 Message-ID: To: Nikita Popov Cc: PHP internals Content-Type: multipart/alternative; boundary="000000000000673cb00584bc4a5c" Subject: Re: [PHP-DEV] Offset-only results from preg_match From: cananian@wikimedia.org ("C. Scott Ananian") --000000000000673cb00584bc4a5c Content-Type: text/plain; charset="UTF-8" So... In microbenchmarks you can clearly see the improvement: ``` >>> timeit -n500 preg_match_all('/(.{65535})/s', $html100, $m, PREG_OFFSET_CAPTURE); => 39 Command took 0.001709 seconds on average (0.001654 median; 0.854503 total) to complete. >>> timeit -n500 preg_match_all('/(.{65535})/s', $html100, $m, PREG_OFFSET_CAPTURE|PREG_LENGTH_CAPTURE); => 39 Command took 0.000991 seconds on average (0.000965 median; 0.495287 total) to complete. ``` $html100 is my usual 2,592,386 byte HTML of [[en:Barack Obama]]. But unless you're matching 64k strings like this, it's hard to see a practical improvement. In my remex-html html parsing benchmark, although LENGTH_CAPTURE doesn't make things slower, it doesn't make a significant performance improvement. I built php statically and ran it through cachegrind to try to figure out why, and found: 2,567,559,670 cycles: total spent executing the tokenizer benchmark (including reading the file from disk) 1,018,845,290 cycles in zif_preg_match. Optimizing regexps is important for tokenizers! Of these, we spend 575,478,637 doing the actual match (preg_pcre_match_impl) and 435,162,131 getting the regexp from the cache (!) (pcre_get_compiled_regex_cache) This is for 128,794 total regexp matches performed by the tokenizer on this input. Of those cycles getting the regex from cache, only 24,116,319 are spent on cache misses where we are actually compiling regexps (sum of php_pcre2_jit_compile and php_pcre2_compile). Instead, 406,007,383 cycles are spent in zend_hash_find(). That's 40% of the total time spent executing preg_match. The LENGTH_CAPTURE optimization does reduce the total time spent in populate_subpat_array from 160,951,690 cycles to 140,042,331 cycles in the remex-html tokenizer on this benchmark, but that difference is overwhelmed by (for example) the time spent in zend_hash_find(). The slowdown in zend_hash_find() appears to be due to https://bugs.php.net/bug.php?id=63180 which disabled interned keys in the pcre_cache table. Because of this, even if the regexs are interned, we must pay for a complete zend_string_equal_content() on each match, which takes time proportional to the length of the regex. This can be quite large -- for example, for the HTML5 character reference regex in remex-html, which contains every valid entity name and is 26,137 bytes long, and we need to do a zend_string_equal_content() on the 26,137 byte regexp for every ~5 byte entity in the parsed HTML. --scott On Thu, Mar 21, 2019 at 7:35 AM Nikita Popov wrote: > On Wed, Mar 20, 2019 at 4:35 PM C. Scott Ananian > wrote: > >> On Tue, Mar 19, 2019 at 10:58 AM Nikita Popov >> wrote: >> >>> After thinking about this some more, while this may be a minor >>> performance improvement, it still does more work than necessary. In >>> particular the use of OFFSET_CAPTURE (which would be pretty much required >>> here) needs one new two-element array for each subpattern. If the captured >>> strings are short, this is where the main cost is going to be. >>> >> >> The primary use of this feature is when the captured strings are *long*, >> as that's when we most want to avoid copying a substring. >> >> >>> I'm wondering if we shouldn't consider a new object oriented API for >>> PCRE which can return a match object where subpattern positions and >>> contents can be queried via method calls, so you only pay for the parts >>> that you do access. >>> >> >> Seems like this is letting the perfect be the enemy of the good. The >> LENGTH_CAPTURE significantly reduces allocation for long match strings, and >> it allocates the same two-element arrays that OFFSET_CAPTURE would -- it >> just stores an integer where there would otherwise be an expensive >> substring. Furthermore, since the array structure is left mostly alone, it >> would be not-too-hard to support earlier-PHP versions, with something like: >> >> $hasLengthCapture = defined('PREG_LENGTH_CAPTURE') ? PREG_LENGTH_CAPTURE >> : 0; >> $r = preg_match($pat, $sub, $m, PREG_OFFSET_CAPTURE | $hasLengthCapture); >> $matchOneLength = $hasLengthCapture ? $m[1][0] : strlen($m[1][0]); >> $matchOneOffset = $m[1][1]; >> >> If you introduce a whole new OO accessor object, it starts becoming very >> hard to write backward-compatible code. >> --scott >> > > Fair enough. I've created https://github.com/php/php-src/pull/3971 to > implement this feature. It would be good to have some confirmation that > this is really a significant performance improvement before we land it > though. > > Nikita > -- (http://cscott.net) --000000000000673cb00584bc4a5c--