Newsgroups: php.internals
Xref: news.php.net php.internals:104864
Date: Thu, 21 Mar 2019 17:21:39 -0400
To: Nikita Popov
Cc: PHP internals
Subject: Re:
[PHP-DEV] Offset-only results from preg_match
From: cananian@wikimedia.org ("C. Scott Ananian")

ps. Just to put some numbers to it, using `psysh` on $html100 which contains
the (Parsoid format) HTML for the [[en:Barack Obama]] article on Wikipedia.

```
>>> strlen($html100)
=> 2592386
>>> timeit -n1000 preg_match_all( '/(b)/', $html100, $m, PREG_OFFSET_CAPTURE );
=> 22062
Command took 0.008648 seconds on average (0.008236 median; 8.648343 total) to complete.
>>> timeit -n1000 preg_match_all( '/b()/', $html100, $m, PREG_OFFSET_CAPTURE );
=> 22062
Command took 0.008438 seconds on average (0.008127 median; 8.437881 total) to complete.
>>> timeit -n1000 preg_match_all( '/b()()/', $html100, $m, PREG_OFFSET_CAPTURE );
=> 22062
Command took 0.012069 seconds on average (0.011589 median; 12.069407 total) to complete.
>>> timeit -n1000 preg_match_all( '/(?=(b))/', $html100, $m, PREG_OFFSET_CAPTURE );
=> 22062
Command took 0.012134 seconds on average (0.011483 median; 12.134265 total) to complete.
>>> timeit -n1000 preg_match_all( '/(?=()b())/', $html100, $m, PREG_OFFSET_CAPTURE );
=> 22062
Command took 0.016513 seconds on average (0.016039 median; 16.513011 total) to complete.
```

So this isn't a good way to determine the cost of the string copy in the
$matches array. (The string copy is really trivial in this particular case
anyway.)
 --scott

On Thu, Mar 21, 2019 at 5:16 PM C. Scott Ananian wrote:

> Quick status update. I tried to prototype this in pure PHP in the
> wikimedia/remex-html library using (?= .. ) around each regexp and ()...()
> around each captured expression (replacing the capture parens) to
> effectively bypass the string copy and return a bunch of zero-length
> strings. That didn't succeed in speeding up remex-html on my pet benchmark
> because (1) the (?= ...
) appears to deoptimize the regexp match, and (2)
> it turns out there's a substantial cost to each capture (presumably all
> those two-element arrays which Nikita flagged before as a future issue) and
> so doubling the total number of captures by using () () instead of (....)
> slowed the match down.
>
> So bad news: my benchmarking shortcut didn't work. Potential good news: I
> guess that underlines why this feature is necessary and can't just be
> emulated.
>
> I'm going to try this benchmark again tomorrow but by rebuilding PHP from
> source using Nikita's proposed patch so that I can actually get an
> apples-to-apples comparison.
>  --scott
>
> On Thu, Mar 21, 2019 at 7:35 AM Nikita Popov wrote:
>
>> On Wed, Mar 20, 2019 at 4:35 PM C. Scott Ananian
>> wrote:
>>
>>> On Tue, Mar 19, 2019 at 10:58 AM Nikita Popov
>>> wrote:
>>>
>>>> After thinking about this some more, while this may be a minor
>>>> performance improvement, it still does more work than necessary. In
>>>> particular the use of OFFSET_CAPTURE (which would be pretty much required
>>>> here) needs one new two-element array for each subpattern. If the captured
>>>> strings are short, this is where the main cost is going to be.
>>>>
>>>
>>> The primary use of this feature is when the captured strings are *long*,
>>> as that's when we most want to avoid copying a substring.
>>>
>>>
>>>> I'm wondering if we shouldn't consider a new object oriented API for
>>>> PCRE which can return a match object where subpattern positions and
>>>> contents can be queried via method calls, so you only pay for the parts
>>>> that you do access.
>>>>
>>>
>>> Seems like this is letting the perfect be the enemy of the good. The
>>> LENGTH_CAPTURE significantly reduces allocation for long match strings, and
>>> it allocates the same two-element arrays that OFFSET_CAPTURE would -- it
>>> just stores an integer where there would otherwise be an expensive
>>> substring.
Furthermore, since the array structure is left mostly alone, it
>>> would be not-too-hard to support earlier-PHP versions, with something like:
>>>
>>> $hasLengthCapture = defined('PREG_LENGTH_CAPTURE') ? PREG_LENGTH_CAPTURE
>>> : 0;
>>> $r = preg_match($pat, $sub, $m, PREG_OFFSET_CAPTURE | $hasLengthCapture);
>>> $matchOneLength = $hasLengthCapture ? $m[1][0] : strlen($m[1][0]);
>>> $matchOneOffset = $m[1][1];
>>>
>>> If you introduce a whole new OO accessor object, it starts becoming very
>>> hard to write backward-compatible code.
>>>  --scott
>>>
>>
>> Fair enough. I've created https://github.com/php/php-src/pull/3971 to
>> implement this feature. It would be good to have some confirmation that
>> this is really a significant performance improvement before we land it
>> though.
>>
>> Nikita
>>
>
>
> --
> (http://cscott.net)
>

--
(http://cscott.net)
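
For reference, a minimal sketch of the ()...() emulation described in the
quoted status update, using only the existing PREG_OFFSET_CAPTURE flag. The
subject string, the pattern, and the variable names are made up for
illustration; this is not code from remex-html or from the proposed patch.

```php
<?php
// Sketch only: emulate an offset-only capture by replacing (X) with ()X(),
// so each group captures an empty string and only its offset is used.
// $subject and the pattern are hypothetical examples.
$subject = 'aaa<b>bold</b>zzz';

// Group 1 marks where the content starts, group 2 where it ends.
preg_match('/<b>()[^<]*()<\/b>/', $subject, $m, PREG_OFFSET_CAPTURE);

$start  = $m[1][1];          // byte offset where the content begins
$length = $m[2][1] - $start; // content length derived from the two offsets

// Copy the substring only if and when it is actually needed.
$content = substr($subject, $start, $length);
```

Note that this doubles the number of capture groups, which is the cost the
timings earlier in the thread point at ('/b()()/' was slower than '/b()/').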