Newsgroups: php.internals
Date: Wed, 20 Mar 2019 11:39:30 -0400
To: Nikita Popov
Cc: PHP internals
Subject: Re: [PHP-DEV] Offset-only results from preg_match
From: cananian@wikimedia.org ("C. Scott Ananian")

On Mon, Mar 18, 2019 at 9:44 AM Nikita Popov wrote:

> On Thu, Mar 14, 2019 at 8:33 PM C. Scott Ananian wrote:
>
>> I'm floating an idea for an RFC here.
>>
>> I'm working on the wikimedia/remex-html library for high-performance
>> PHP-native HTML5 parsing. When creating a high-performance lexer, it is
>> worthwhile to try to reduce the number of string copies made. You can
>> generally perform matches using offsets into your master source string.
>> However, preg_match* will copy a substring for the entire matched region
>> ($matches[0]) as well as for all captured patterns ($matches[1...n]).
>> These substring copies can get expensive if the matched region/captured
>> patterns are very large.
>>
>> It would be helpful if PHP's preg_match* functions offered a flag, say
>> PREG_LENGTH_CAPTURE, which returned the numeric length instead of the
>> matched/captured string. In combination,
>> PREG_OFFSET_CAPTURE|PREG_LENGTH_CAPTURE would return the numeric length in
>> element 0 and the numeric offset in element 1, and avoid the need to copy
>> the matched substring unnecessarily. This would allow greatly reducing the
>> number of substring copies made during lexing.
>
> Generally sounds reasonable to me. Do you maybe have a sample input and
> regular expression where you suspect this is a particularly large problem,
> so we can test how much of a difference this makes?

I'm going to work on emulating this today by changing as many of the
captures in remex-html as possible to zero-length captures at the start/end
of the region; that should give me a reasonable idea of the performance gain.

>> pps. while I'm wishing -- preg_replace would benefit from some way to pass
>> options, so that (for example) you could pass PREG_OFFSET_CAPTURE and get
>> the offset for each replacement match.
>> Knowing the offset of the match allows you to do (for example) error
>> reporting from the callback function.
>
> I've implemented this bit in https://github.com/php/php-src/pull/3958.

I notice that this has been merged already. Looks great, especially breaking
out the creation of the matches array into a reusable function. That would
make future additions easier/more consistent.

--scott

--
(http://cscott.net)
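
For anyone following along, the emulation mentioned above (zero-length captures at the start and end of a region) could look roughly like the sketch below. It is only an illustration under stated assumptions: matchCommentOffsets() is a hypothetical helper, not something from remex-html, and the HTML-comment pattern is a simplified stand-in for a real lexer rule. Wrapping the pattern in a lookahead keeps $m[0] empty as well, so no large substring should need to be copied.

```php
<?php
// Hypothetical helper for illustration only; not part of remex-html or PHP.
// Returns the offset and length of the body of the first HTML comment at or
// after $pos, without capturing the body itself as a string.
function matchCommentOffsets(string $src, int $pos = 0): ?array
{
    // The outer lookahead makes the overall match zero-length, so $m[0] is ''.
    // The two empty groups () record positions only: group 1 sits just after
    // '<!--' and group 2 just before '-->'.
    $re = '/(?=<!--().*?()-->)/s';
    if (preg_match($re, $src, $m, PREG_OFFSET_CAPTURE, $pos) !== 1) {
        return null;
    }
    $start = $m[1][1]; // offset where the comment body begins
    $end   = $m[2][1]; // offset where the comment body ends
    return ['offset' => $start, 'length' => $end - $start];
}

// Usage: the caller decides if and when to materialise the substring.
$src = 'a<!-- hi -->b';
$r = matchCommentOffsets($src);
if ($r !== null) {
    $body = substr($src, $r['offset'], $r['length']);
}
```

With a PREG_LENGTH_CAPTURE-style flag as proposed, the same information would come back directly from an ordinary capture group, without the extra empty groups or the lookahead.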