Newsgroups: php.internals
Xref: news.php.net php.internals:104863
Date: Thu, 21 Mar 2019 17:16:37 -0400
To: Nikita Popov
Cc: PHP internals
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="000000000000a130a30584a14181"
Subject: Re:
 [PHP-DEV] Offset-only results from preg_match
From: cananian@wikimedia.org ("C. Scott Ananian")

--000000000000a130a30584a14181
Content-Type: text/plain; charset="UTF-8"

Quick status update. I tried to prototype this in pure PHP in the
wikimedia/remex-html library using (?= .. ) around each regexp and
()...() around each captured expression (replacing the capture parens) to
effectively bypass the string copy and return a bunch of zero-length
strings.

That didn't succeed in speeding up remex-html on my pet benchmark because
(1) the (?= ... ) appears to deoptimize the regexp match, and (2) it turns
out there's a substantial cost to each capture (presumably all those
two-element arrays which Nikita flagged before as a future issue) and so
doubling the total number of captures by using () () instead of (....)
slowed the match down.

So bad news: my benchmarking shortcut didn't work. Potential good news: I
guess that underlines why this feature is necessary and can't just be
emulated.

I'm going to try this benchmark again tomorrow but by rebuilding PHP from
source using Nikita's proposed patch so that I can actually get an
apples-to-apples comparison.
 --scott

On Thu, Mar 21, 2019 at 7:35 AM Nikita Popov wrote:

> On Wed, Mar 20, 2019 at 4:35 PM C. Scott Ananian wrote:
>
>> On Tue, Mar 19, 2019 at 10:58 AM Nikita Popov wrote:
>>
>>> After thinking about this some more, while this may be a minor
>>> performance improvement, it still does more work than necessary. In
>>> particular the use of OFFSET_CAPTURE (which would be pretty much required
>>> here) needs one new two-element array for each subpattern. If the captured
>>> strings are short, this is where the main cost is going to be.
>>>
>>
>> The primary use of this feature is when the captured strings are *long*,
>> as that's when we most want to avoid copying a substring.
>>
>>
>>> I'm wondering if we shouldn't consider a new object oriented API for
>>> PCRE which can return a match object where subpattern positions and
>>> contents can be queried via method calls, so you only pay for the parts
>>> that you do access.
>>>
>>
>> Seems like this is letting the perfect be the enemy of the good. The
>> LENGTH_CAPTURE significantly reduces allocation for long match strings, and
>> it allocates the same two-element arrays that OFFSET_CAPTURE would -- it
>> just stores an integer where there would otherwise be an expensive
>> substring. Furthermore, since the array structure is left mostly alone, it
>> would be not-too-hard to support earlier-PHP versions, with something like:
>>
>> $hasLengthCapture = defined('PREG_LENGTH_CAPTURE') ? PREG_LENGTH_CAPTURE : 0;
>> $r = preg_match($pat, $sub, $m, PREG_OFFSET_CAPTURE | $hasLengthCapture);
>> $matchOneLength = $hasLengthCapture ? $m[1][0] : strlen($m[1][0]);
>> $matchOneOffset = $m[1][1];
>>
>> If you introduce a whole new OO accessor object, it starts becoming very
>> hard to write backward-compatible code.
>> --scott
>>
>
> Fair enough. I've created https://github.com/php/php-src/pull/3971 to
> implement this feature. It would be good to have some confirmation that
> this is really a significant performance improvement before we land it
> though.
>
> Nikita
>

-- 
(http://cscott.net)

--000000000000a130a30584a14181--
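
For readers following the thread, the pure-PHP emulation described in the
status update can be sketched roughly as below. This is only an illustration
under stated assumptions, not the code used in remex-html: the function name
captureSpan and the sample pattern are hypothetical. An ordinary pattern such
as /<(\w+)>/ is wrapped in a lookahead, and its capture (\w+) is replaced by
()\w+(), so that PREG_OFFSET_CAPTURE returns only empty strings plus the
offsets that bracket the captured text.

```php
<?php
// Hypothetical helper (not from the thread): returns the [start, end) byte
// offsets of the tag name in the first "<name>" found, without copying it.
function captureSpan(string $subject, int $offset = 0): ?array
{
    // Rewritten form of /<(\w+)>/ : the whole pattern sits inside (?= ... ),
    // and the capture (\w+) becomes two empty groups around \w+.
    $pattern = '/(?=<()\w+()>)/';

    if (preg_match($pattern, $subject, $m, PREG_OFFSET_CAPTURE, $offset) !== 1) {
        return null; // no match (or a PCRE error)
    }

    // $m[1] and $m[2] are ['', offset] pairs: zero-length strings whose
    // offsets mark where the original capture would begin and end.
    return [$m[1][1], $m[2][1]];
}

$span = captureSpan('a <div> b');
// The substring is only materialized if and when the caller asks for it.
$name = $span === null ? null : substr('a <div> b', $span[0], $span[1] - $span[0]);
```

Note that this doubles the number of capture groups, and the message above
reports that it did not speed up remex-html on the benchmark in question.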