Date: Fri, 25 Jul 2025 23:44:25 +0100
To: internals@lists.php.net
Subject: Re: [PHP-DEV] pcre extended character class support
In-Reply-To: <7f892c24-8eef-4238-844a-5dde7f5a536d@gmail.com>
References: <7f892c24-8eef-4238-844a-5dde7f5a536d@gmail.com>
Message-ID: <583C02BA-19CC-471C-A8AA-0734B824F01F@php.net>
From: derick@php.net (Derick Rethans)

On 25 July 2025 23:17:43 BST, Niels Dossche wrote:
>Hi internals
>
>On PHP 8.5-dev, we ship with pcre2lib 10.45.
>
>This includes a new opt-in feature called "PCRE2_ALT_EXTENDED_CLASS".
>It enables the use of complex character set operations in accordance with UTS#18 (Unicode Technical Standard 18).
>This means it becomes possible to nest character sets, perform set operations on them, etc.
>One example of such a set operation is a set subtraction, e.g. the regex "[\p{L}--[QW]]" means "Unicode letters other than Q and W".
>Or a more realistic example (inspired by [1]): the regex "[\p{N}--[0-9]]" matches all non-ASCII Unicode numbers.
>You can also do ORs, ANDs, etc.
>
>The reason this is opt-in in pcre2lib is that the interpretation of existing regexes may change.
>This standard is being adopted in other languages too, also opt-in, for example in JavaScript [1].
>To expose this functionality in PHP, we also have to make it opt-in via a modifier.
>
>In JavaScript, this is enabled via the /v modifier at the end of the regex [1].
>This does the same thing as the /u modifier, but extends it with this UTS#18 standard.
>We also already have /u in PHP that enables UTF-8 Unicode mode. So we could do the same as JavaScript and add a /v modifier that extends /u and also enables PCRE2_ALT_EXTENDED_CLASS. Technically, you don't need Unicode processing for enabling PCRE2_ALT_EXTENDED_CLASS, but as it comes from a Unicode standard (and at least JavaScript does this too), it may make sense to enable them both.
>
>The actual patch is trivial:
>```diff
>diff --git a/ext/pcre/php_pcre.c b/ext/pcre/php_pcre.c
>index 8e0fb2cce5f..4a4727545ad 100644
>--- a/ext/pcre/php_pcre.c
>+++ b/ext/pcre/php_pcre.c
>@@ -718,6 +718,9 @@ PHPAPI pcre_cache_entry* pcre_get_compiled_regex_cache_ex(zend_string *regex, bo
>             case 'S': /* Pass. */ break;
>             case 'X': /* Pass. */ break;
>             case 'U': coptions |= PCRE2_UNGREEDY; break;
>+#ifdef PCRE2_ALT_EXTENDED_CLASS
>+            case 'v': coptions |= PCRE2_ALT_EXTENDED_CLASS; ZEND_FALLTHROUGH;
>+#endif
>             case 'u': coptions |= PCRE2_UTF;
>     /* In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII
>        characters, even in UTF-8 mode. However, this can be changed by setting
>
>```
>
>What do we think?
>
>[1] https://github.com/tc39/proposal-regexp-v-flag

Yes, please.

cheers
Derick
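
For anyone skimming the quoted patch, the key detail is the fallthrough: 'v' sets the extended-class option and then continues into the 'u' case. A minimal standalone C sketch of that behaviour follows. The function name `parse_modifiers` and the `OPT_*` bit values are hypothetical placeholders for illustration only; they are not PHP's or PCRE2's real definitions, and unknown modifiers are simply ignored here for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical placeholder bits; NOT the real PCRE2 option values. */
#define OPT_UNGREEDY           (1u << 0)
#define OPT_UTF                (1u << 1)
#define OPT_ALT_EXTENDED_CLASS (1u << 2)

/* Map a modifier string such as "v" or "Uu" to an option bitmask.
 * Unknown modifier characters are ignored in this sketch. */
static uint32_t parse_modifiers(const char *mods)
{
    uint32_t coptions = 0;

    assert(mods != NULL);
    for (; *mods != '\0'; mods++) {
        switch (*mods) {
        case 'U':
            coptions |= OPT_UNGREEDY;
            break;
        case 'v':
            /* 'v' enables extended classes and implies everything 'u' does. */
            coptions |= OPT_ALT_EXTENDED_CLASS;
            /* fall through */
        case 'u':
            coptions |= OPT_UTF;
            break;
        default:
            break;
        }
    }
    return coptions;
}
```

With these placeholder bits, `parse_modifiers("v")` yields both the extended-class and UTF bits, while `parse_modifiers("u")` yields only the UTF bit, which mirrors the "extends /u" semantics proposed above.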