Newsgroups: php.internals
Date: Thu, 14 Mar 2019 15:33:13 -0400
From: cananian@wikimedia.org ("C. Scott Ananian")
To: internals@lists.php.net
Subject: Offset-only results from preg_match

I'm floating an idea for an RFC here.

I'm working on the wikimedia/remex-html library for high-performance PHP-native HTML5 parsing. When creating a high-performance lexer, it is worthwhile to try to reduce the number of string copies made. You can generally perform matches using offsets into your master source string. However, preg_match* will copy a substring for the entire matched region ($matches[0]) as well as for all captured patterns ($matches[1...n]). These substring copies can get expensive if the matched region/captured patterns are very large.

It would be helpful if PHP's preg_match* functions offered a flag, say PREG_LENGTH_CAPTURE, which returned the numeric length instead of the matched/captured string. In combination, PREG_OFFSET_CAPTURE|PREG_LENGTH_CAPTURE would return the numeric length in element 0 and the numeric offset in element 1, and avoid the need to copy the matched substring unnecessarily. This would greatly reduce the number of substring copies made during lexing.

Thoughts?
 --scott

ps. more ambitious would be to introduce a new "substring" type, which would share the allocation of a parent string with its own offset and length fields. That would probably be as invasive as the ZVAL_INTERNED_STR type, though -- a much, much bigger project.

pps. while I'm wishing -- preg_replace would benefit from some way to pass options, so that (for example) you could pass PREG_OFFSET_CAPTURE and get the offset for each replacement match. Knowing the offset of the match allows you to do (for example) error reporting from the callback function.

--
(http://cscott.net)
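
To make the proposed result shape concrete, here is a minimal userland sketch. PREG_LENGTH_CAPTURE is only an idea floated in this message and does not exist in PHP, so the sketch uses the existing PREG_OFFSET_CAPTURE flag and converts each captured string to its length afterwards. The helper name match_spans and the sample input are made up for illustration. Note that this emulation still pays for the substring copies; it only shows what a lexer would receive if the flag existed.

```php
<?php
// Sketch only: emulates the [length, offset] pairs that the proposed
// (non-existent) PREG_OFFSET_CAPTURE|PREG_LENGTH_CAPTURE combination would return.

/**
 * Hypothetical helper: match $pattern against $subject starting at $offset.
 * Returns one [length, offset] pair per group, or null when nothing matches.
 */
function match_spans(string $pattern, string $subject, int $offset = 0): ?array
{
    $matches = [];
    if (preg_match($pattern, $subject, $matches, PREG_OFFSET_CAPTURE, $offset) !== 1) {
        return null;
    }
    $spans = [];
    foreach ($matches as $key => [$text, $start]) {
        // preg_match has already copied $text here; the proposed flag would
        // skip that copy and hand back the length directly.
        $spans[$key] = [strlen($text), $start];
    }
    return $spans;
}

$source = '<p>Hello, world</p>';

// The A modifier anchors the match at the given offset, as a lexer would.
$spans = match_spans('/<(\w+)>/A', $source, 0);

// Advance the lexer position using only numbers: offset + length of group 0.
$next = $spans[0][1] + $spans[0][0];

// Match the text run that follows the tag, again without keeping any substring.
$text = match_spans('/[^<]+/A', $source, $next);
```

With numeric spans like these, the lexer only needs to call substr() on the master string for the tokens it actually keeps, which is the saving the flag is meant to provide.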