Newsgroups: php.internals
From: nikita.ppv@gmail.com (Nikita Popov)
Date: Tue, 19 Mar 2019 15:58:24 +0100
Subject: Re: [PHP-DEV] Offset-only results from preg_match
To: "C. Scott Ananian"
Cc: PHP internals

On Mon, Mar 18, 2019 at 2:43 PM Nikita Popov wrote:

> On Thu, Mar 14, 2019 at 8:33 PM C. Scott Ananian wrote:
>
>> I'm floating an idea for an RFC here.
>>
>> I'm working on the wikimedia/remex-html library for high-performance
>> PHP-native HTML5 parsing. When creating a high-performance lexer, it is
>> worthwhile to try to reduce the number of string copies made. You can
>> generally perform matches using offsets into your master source string.
>> However, preg_match* will copy a substring for the entire matched region
>> ($matches[0]) as well as for all captured patterns ($matches[1...n]).
>> These substring copies can get expensive if the matched region/captured
>> patterns are very large.
>>
>> It would be helpful if PHP's preg_match* functions offered a flag, say
>> PREG_LENGTH_CAPTURE, which returned the numeric length instead of the
>> matched/captured string. In combination,
>> PREG_OFFSET_CAPTURE|PREG_LENGTH_CAPTURE would return the numeric length in
>> element 0 and the numeric offset in element 1, and avoid the need to copy
>> the matched substring unnecessarily. This would allow greatly reducing
>> the number of substring copies made during lexing.
>>
>
> Generally sounds reasonable to me. Do you maybe have a sample input and
> regular expression where you suspect this is a particularly large problem,
> so we can test how much of a difference this makes?
>

After thinking about this some more: while this may be a minor performance
improvement, it still does more work than necessary. In particular, the use
of PREG_OFFSET_CAPTURE (which would be pretty much required here) needs one
new two-element array for each subpattern.
If the captured strings are short, this is where the main cost is going
to be.

I'm wondering if we shouldn't consider a new object-oriented API for PCRE
which can return a match object where subpattern positions and contents can
be queried via method calls, so you only pay for the parts that you do
access.

Nikita
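
To make the two points above concrete, here is a rough sketch. The first part shows what preg_match() returns today with PREG_OFFSET_CAPTURE: one [string, offset] pair per group. The second part only illustrates the shape a match object could take. LazyMatch, its methods and lazy_match() are hypothetical names made up for this example and are not part of PHP or ext/pcre. Being userland code built on preg_match(), it still pays for the arrays and string copies internally, so it models the interface rather than the saving.

```php
<?php
// Today: PREG_OFFSET_CAPTURE yields a two-element array
// [matched string, byte offset] for the whole match and for every group.
$subject = 'key=value';
preg_match('/(\w+)=(\w+)/', $subject, $m, PREG_OFFSET_CAPTURE);
// $m is [['key=value', 0], ['key', 0], ['value', 4]]

// Hypothetical sketch only: a match object that stores [offset, length]
// spans and copies a substring only when text() is called.
final class LazyMatch
{
    /** @var string */
    private $subject;
    /** @var array<int|string, array{0: int, 1: int}> */
    private $spans;

    public function __construct(string $subject, array $spans)
    {
        $this->subject = $subject;
        $this->spans = $spans;
    }

    public function offset($group = 0): int
    {
        return $this->spans[$group][0];
    }

    public function length($group = 0): int
    {
        return $this->spans[$group][1];
    }

    public function text($group = 0): string
    {
        // The only place where a substring is actually created.
        return substr($this->subject, $this->spans[$group][0], $this->spans[$group][1]);
    }
}

// Hypothetical helper: builds a LazyMatch on top of the existing API.
// A native implementation would read the spans directly from PCRE instead.
function lazy_match(string $pattern, string $subject, int $start = 0): ?LazyMatch
{
    if (preg_match($pattern, $subject, $m, PREG_OFFSET_CAPTURE, $start) !== 1) {
        return null;
    }
    $spans = [];
    foreach ($m as $group => [$text, $offset]) {
        $spans[$group] = [$offset, strlen($text)];
    }
    return new LazyMatch($subject, $spans);
}

$match = lazy_match('/(\w+)=(\w+)/', $subject);
```

With something along these lines, a lexer that only needs positions would call $match->offset(2) and $match->length(2) and never touch text().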