From: nikita.ppv@gmail.com (Nikita Popov)
Date: Mon, 18 Mar 2019 14:43:52 +0100
Subject: Re: [PHP-DEV] Offset-only results from preg_match
To: "C. Scott Ananian"
Cc: PHP internals
Newsgroups: php.internals

On Thu, Mar 14, 2019 at 8:33 PM C. Scott Ananian wrote:

> I'm floating an idea for an RFC here.
>
> I'm working on the wikimedia/remex-html library for high-performance
> PHP-native HTML5 parsing. When creating a high-performance lexer, it is
> worthwhile to try to reduce the number of string copies made. You can
> generally perform matches using offsets into your master source string.
> However, preg_match* will copy a substring for the entire matched region
> ($matches[0]) as well as for all captured patterns ($matches[1...n]).
> These substring copies can get expensive if the matched region/captured
> patterns are very large.
>
> It would be helpful if PHP's preg_match* functions offered a flag, say
> PREG_LENGTH_CAPTURE, which returned the numeric length instead of the
> matched/captured string. In combination,
> PREG_OFFSET_CAPTURE|PREG_LENGTH_CAPTURE would return the numeric length in
> element 0 and the numeric offset in element 1, and avoid the need to copy
> the matched substring unnecessarily. This would allow greatly reducing the
> number of substring copies made during lexing.
>

Generally sounds reasonable to me. Do you maybe have a sample input and
regular expression where you suspect this is a particularly large problem,
so we can test how much of a difference this makes?

> Thoughts?
> --scott
>
> ps. more ambitious would be to introduce a new "substring" type, which
> would share the allocation of a parent string with its own offset and
> length fields. That would probably be as invasive as the ZVAL_INTERNED_STR
> type, though -- a much much bigger project.
>
> pps.
> while I'm wishing -- preg_replace would benefit from some way to pass
> options, so that (for example) you could pass PREG_OFFSET_CAPTURE and get
> the offset for each replacement match. Knowing the offset of the match
> allows you to do (for example) error reporting from the callback function.
>

I've implemented this bit in https://github.com/php/php-src/pull/3958.

Nikita
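
For readers following the thread, here is a rough sketch of the two things discussed above, using only flags that exist today. The helper names matchSpans() and reportOffsets() are made up for illustration, and the proposed PREG_LENGTH_CAPTURE flag is hypothetical and not used. The first helper shows the current workaround: PREG_OFFSET_CAPTURE still copies each matched substring, and strlen() merely recovers the length afterwards, which is the copy the proposal wants to avoid. The second helper assumes a PHP build where preg_replace_callback accepts a trailing flags argument, as in the linked pull request.

```php
<?php
// Sketch only. matchSpans() and reportOffsets() are hypothetical helpers;
// PREG_LENGTH_CAPTURE from the proposal does not exist and is not used.

/**
 * Match $pattern against $source starting at byte offset $pos and return
 * [offset, length] pairs for the whole match and every capture group.
 * Returns null when there is no match.
 */
function matchSpans(string $pattern, string $source, int $pos): ?array
{
    if (preg_match($pattern, $source, $m, PREG_OFFSET_CAPTURE, $pos) !== 1) {
        return null;
    }
    $spans = [];
    foreach ($m as $key => [$text, $offset]) {
        // preg_match has already copied $text at this point;
        // strlen() only recovers the length from that copy.
        $spans[$key] = [$offset, strlen($text)];
    }
    return $spans;
}

/**
 * Collect [offset, text] for every entity-like token in $source.
 * Assumption: preg_replace_callback takes a flags argument in sixth position.
 */
function reportOffsets(string $source): array
{
    $seen = [];
    preg_replace_callback(
        '/&[a-z]+;/',
        function (array $m) use (&$seen) {
            // With PREG_OFFSET_CAPTURE each entry is [text, byte offset].
            [$text, $offset] = $m[0];
            $seen[] = [$offset, $text];
            return $text; // leave the subject unchanged
        },
        $source,
        -1,
        $count,
        PREG_OFFSET_CAPTURE
    );
    return $seen;
}

// Usage: \G anchors the pattern at the requested start offset.
$spans = matchSpans('/\G<([a-z]+)>/', '<p>hello</p>', 0);
// $spans is [0 => [0, 3], 1 => [1, 1]]
```

A lexer built this way can keep working with offsets into the master string, but it still pays for the substring copies inside preg_match, so it would be a reasonable baseline when measuring how much an offset-only flag saves.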