Newsgroups: php.internals
From: internals@gpb.moe ("Gina P. Banyard")
Date: Mon, 28 Jul 2025 10:33:53 +0000
To: Niels Dossche
Cc: PHP internals
Subject: Re: [PHP-DEV] pcre extended character class support
In-Reply-To: <7f892c24-8eef-4238-844a-5dde7f5a536d@gmail.com>

On Friday, 25 July 2025 at 23:20, Niels Dossche wrote:
> Hi internals
>
> On PHP 8.5-dev, we ship with pcre2lib 10.45.
>
> This includes a new opt-in feature called "PCRE2_ALT_EXTENDED_CLASS".
> It enables the use of complex character set operations in accordance with UTS#18 (Unicode Technical Standard 18).
> This means it becomes possible to nest character sets, perform set operations on them, etc.
> One example of such a set operation is a set subtraction, e.g. the regex "[\p{L}--[QW]]" means "Unicode letters other than Q and W".
> Or a more realistic example (inspired by [1]): the regex "[\p{N}--[0-9]]" matches all non-ASCII unicode numbers.
> You can also do ORs, ANDs, etc.
>
> The reason this is opt-in in pcre2lib is that the interpretation of existing regexes may change.
> This standard is being adopted in other languages too, also opt-in, for example in JavaScript [1].
> To expose this functionality in PHP, we also have to make it opt-in via a modifier.
>
> In JavaScript, this is enabled via the /v modifier at the end of the regex [1].
> This does the same thing as the /u modifier, but extends it with this UTS#18 standard.
> We also already have /u in PHP that enables UTF-8 unicode mode. So we could do the same as JavaScript and add a /v modifier that extends /u and also enables PCRE2_ALT_EXTENDED_CLASS. Technically, you don't need unicode processing for enabling PCRE2_ALT_EXTENDED_CLASS, but as it comes from a unicode standard (and at least JavaScript does this too), it may make sense to enable them both.
>
> The actual patch is trivial:
> ```diff
> diff --git a/ext/pcre/php_pcre.c b/ext/pcre/php_pcre.c
> index 8e0fb2cce5f..4a4727545ad 100644
> --- a/ext/pcre/php_pcre.c
> +++ b/ext/pcre/php_pcre.c
> @@ -718,6 +718,9 @@ PHPAPI pcre_cache_entry* pcre_get_compiled_regex_cache_ex(zend_string *regex, bo
>  			case 'S':	/* Pass. */					break;
>  			case 'X':	/* Pass. */					break;
>  			case 'U':	coptions |= PCRE2_UNGREEDY;		break;
> +#ifdef PCRE2_ALT_EXTENDED_CLASS
> +			case 'v':	coptions |= PCRE2_ALT_EXTENDED_CLASS; ZEND_FALLTHROUGH;
> +#endif
>  			case 'u':	coptions |= PCRE2_UTF;
>  	/* In PCRE, by default, \\d, \\D, \\s, \\S, \\w, and \\W recognize only ASCII
>  	   characters, even in UTF-8 mode. However, this can be changed by setting
> ```
>
> What do we think?
>
> [1] https://github.com/tc39/proposal-regexp-v-flag
>
> Kind regards
> Niels

I'm in favour of this. As it is small and self-contained, I don't think this should require an RFC.

Best regards,

Gina P. Banyard