Newsgroups: php.internals
From: devnexen@gmail.com (David CARLIER)
Date: Mon, 28 Jul 2025 18:26:25 +0100
Subject: Re: [PHP-DEV] pcre extended character class support
To: Niels Dossche, PHP internals

+1000 for me.

Cheers.
On Fri, 25 Jul 2025 at 23:20, Niels Dossche <dossche.niels@gmail.com> wrote:
Hi internals

On PHP 8.5-dev, we ship with pcre2lib 10.45.

This includes a new opt-in feature called "PCRE2_ALT_EXTENDED_CLASS".
It enables the use of complex character set operations in accordance with UTS#18 (Unicode Technical Standard 18).
This means it becomes possible to nest character sets, perform set operations on them, etc.
One example of such a set operation is set subtraction, e.g. the regex "[\p{L}--[QW]]" means "Unicode letters other than Q and W".
Or a more realistic example (inspired by [1]): the regex "[\p{Nd}--[0-9]]" matches all non-ASCII Unicode decimal digits.
You can also do ORs, ANDs, etc.

The reason this is opt-in in pcre2lib, is because the interpretation of existing regexes may change.
This standard is being adopted in other languages too, also opt-in, for example in JavaScript [1].
To expose this functionality in PHP, we also have to make it opt-in via a modifier.

In JavaScript, this is enabled via the /v modifier at the end of the regex [1].
This does the same thing as the /u modifier, but extends it with this UTS#18 standard.
We already have /u in PHP, which enables UTF-8 Unicode mode. So we could do the same as JavaScript and add a /v modifier that extends /u and also enables PCRE2_ALT_EXTENDED_CLASS. Technically, you don't need Unicode processing to enable PCRE2_ALT_EXTENDED_CLASS, but as it comes from a Unicode standard (and at least JavaScript does this too), it may make sense to enable them both.

The actual patch is trivial:
```diff
diff --git a/ext/pcre/php_pcre.c b/ext/pcre/php_pcre.c
index 8e0fb2cce5f..4a4727545ad 100644
--- a/ext/pcre/php_pcre.c
+++ b/ext/pcre/php_pcre.c
@@ -718,6 +718,9 @@ PHPAPI pcre_cache_entry* pcre_get_compiled_regex_cache_ex(zend_string *regex, bo
                        case 'S':       /* Pass. */                                     break;
                        case 'X':       /* Pass. */                                     break;
                        case 'U':       coptions |= PCRE2_UNGREEDY;             break;
+#ifdef PCRE2_ALT_EXTENDED_CLASS
+                       case 'v':       coptions |= PCRE2_ALT_EXTENDED_CLASS; ZEND_FALLTHROUGH;
+#endif
                        case 'u':       coptions |= PCRE2_UTF;
        /* In  PCRE,  by  default, \d, \D, \s, \S, \w, and \W recognize only ASCII
           characters, even in UTF-8 mode. However, this can be changed by setting

```

What do we think?

[1] https://github.com/tc39/proposal-regexp-v-flag

Kind regards
Niels