Newsgroups: php.internals
Xref: news.php.net php.internals:104972
Date: Wed, 27 Mar 2019 17:41:42 -0400
To: Nikita Popov
Cc: PHP internals
Subject: Re:
[PHP-DEV] Offset-only results from preg_match
From: cananian@wikimedia.org ("C. Scott Ananian")

On Wed, Mar 27, 2019 at 2:30 PM C. Scott Ananian wrote:

> Continuing this saga: I'm still having performance problems on character
> entity expansion. Here's the baseline code:
> https://github.com/wikimedia/remex-html/blob/master/RemexHtml/Tokenizer/Tokenizer.php#L881
> Of note: the regular expression is quite large -- around 26kB -- because
> it needs to include the complete table of all HTML5 entities, which it gets
> from a separate file of tables, HTMLData.php.
>
> Recapping briefly: we established before that it is very important that
> large regex strings be interned; otherwise pcre_get_compiled_regex_cache
> needs to do a full zend_string_equal_content() on every call to
> preg_match*, and since the strings will match, that costs a complete
> traversal of the 26kB regexp string.
>
> If I inline the char ref table directly into the regexp as a single huge
> literal string, that string is interned and (with Nikita's recent fixes for
> the CLI) things are ok.
>
> But that's bad for code maintainability: it violates Don't Repeat
> Yourself, and it's now much harder to see what the character reference
> regexp is doing because it has this huge 26k table embedded in the middle
> of it.
>
> PHP will let me initialize the string as:
>
> const CHAR_REF_REGEXP = ' ... ' . HTMLData::NAMED_ENTITY_REGEX . "...";
>
> that is, it recognizes this as a compile-time constant -- but it doesn't
> actually intern the resulting string. The code in
> zend_declare_class_constant_ex interns *most* constant strings, but in this
> case, because there is a reference to another constant, the
> Z_TYPE_P(value) == IS_STRING check in zend_declare_class_constant_ex fails
> (the actual type is IS_CONSTANT_AST), presumably because we don't want to
> autoload HTMLData too soon.
> (But this also seems to happen even if I use
> self::NAMED_ENTITY_REGEX here, which wouldn't trigger the autoloader.)
>
> I *think* the proper fix is to intern the string lazily when it is finally
> evaluated, in ZEND_FETCH_CLASS_CONSTANT_SPEC_CONST_CONST_HANDLER around the
> point where we check Z_TYPE_P(value) == IS_CONSTANT_AST -- probably by
> tweaking zval_update_constant_ex to intern any string result?
>

I've created https://github.com/php/php-src/pull/3994 implementing this
fix, and confirmed that it is sufficient to get my large regexp interned
when it is rewritten as a class constant referencing
HTMLData::NAMED_ENTITY_REGEX.
  --scott
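
For readers following the thread, here is a minimal sketch of the constant layout discussed above: a class constant built by concatenating string literals with a constant from another class, which is the form the message reports as evaluated lazily (IS_CONSTANT_AST). Everything named here is an assumption for illustration only: the classes HtmlDataSketch and TokenizerSketch, the helper matchCharRef(), the four-entry entity table, and the surrounding pattern are hypothetical stand-ins, not the actual RemexHtml code or its roughly 26kB regexp.

```php
<?php
// Hypothetical stand-in for HTMLData; the real table covers all HTML5
// entities and is far larger.
final class HtmlDataSketch {
    const NAMED_ENTITY_REGEX = 'amp;|lt;|gt;|quot;';
}

final class TokenizerSketch {
    // A constant expression that references another class's constant.
    // The pattern itself is an illustrative assumption: named entity in
    // group 1, decimal numeric reference in group 2, anchored (A flag)
    // at the offset passed to preg_match().
    const CHAR_REF_REGEXP =
        '~&(?:(' . HtmlDataSketch::NAMED_ENTITY_REGEX . ')|#([0-9]+);)~A';

    // Hypothetical helper: try to match a character reference starting
    // exactly at byte offset $pos; returns the match array or null.
    public static function matchCharRef(string $text, int $pos): ?array {
        if (preg_match(self::CHAR_REF_REGEXP, $text, $m, 0, $pos) !== 1) {
            return null;
        }
        return $m;
    }
}

// Usage: the same constant is handed to preg_match() on every call, which
// is why the cost of the regex cache lookup on the pattern string matters.
$m = TokenizerSketch::matchCharRef('a &amp; b', 2);
echo $m === null ? "no match\n" : "matched {$m[0]}\n";
```

The sketch only shows the source-level shape of the constant; whether the resulting string ends up interned depends on the engine behaviour described in the message and on the proposed change, not on anything in this code.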