From: dossche.niels@gmail.com (Niels Dossche)
To: PHP internals
Subject: [PHP-DEV] pcre extended character class support
Date: Sat, 26 Jul 2025 00:17:43 +0200
Message-ID: <7f892c24-8eef-4238-844a-5dde7f5a536d@gmail.com>

Hi internals

On PHP 8.5-dev, we ship with pcre2lib 10.45. This includes a new opt-in feature called "PCRE2_ALT_EXTENDED_CLASS". It enables the use of complex character set operations in accordance with UTS#18 (Unicode Technical Standard 18). This means it becomes possible to nest character sets, perform set operations on them, etc.

One example of such a set operation is set subtraction, e.g. the regex "[\p{L}--[QW]]" means "Unicode letters other than Q and W". Or a more realistic example (inspired by [1]): the regex "[\p{N}--[0-9]]" matches all non-ASCII Unicode numbers. You can also do ORs, ANDs, etc.

The reason this is opt-in in pcre2lib is that the interpretation of existing regexes may change. This standard is being adopted in other languages too, also opt-in, for example in JavaScript [1].

To expose this functionality in PHP, we also have to make it opt-in via a modifier. In JavaScript, this is enabled via the /v modifier at the end of the regex [1].
This does the same thing as the /u modifier, but extends it with this UTS#18 standard. We also already have /u in PHP, which enables UTF-8 Unicode mode. So we could do the same as JavaScript and add a /v modifier that extends /u and also enables PCRE2_ALT_EXTENDED_CLASS.

Technically, you don't need Unicode processing to enable PCRE2_ALT_EXTENDED_CLASS, but as it comes from a Unicode standard (and JavaScript, at least, does this too), it may make sense to enable them both.

The actual patch is trivial:

```diff
diff --git a/ext/pcre/php_pcre.c b/ext/pcre/php_pcre.c
index 8e0fb2cce5f..4a4727545ad 100644
--- a/ext/pcre/php_pcre.c
+++ b/ext/pcre/php_pcre.c
@@ -718,6 +718,9 @@ PHPAPI pcre_cache_entry* pcre_get_compiled_regex_cache_ex(zend_string *regex, bo
 			case 'S':	/* Pass. */					break;
 			case 'X':	/* Pass. */					break;
 			case 'U':	coptions |= PCRE2_UNGREEDY;		break;
+#ifdef PCRE2_ALT_EXTENDED_CLASS
+			case 'v':	coptions |= PCRE2_ALT_EXTENDED_CLASS; ZEND_FALLTHROUGH;
+#endif
 			case 'u':	coptions |= PCRE2_UTF;
 	/* In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII
 	   characters, even in UTF-8 mode. However, this can be changed by setting
```

What do we think?

[1] https://github.com/tc39/proposal-regexp-v-flag

Kind regards
Niels
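
For readers who want to see the fall-through in isolation, here is a minimal standalone sketch of the idea in the patch: 'v' sets the extended-class bit and then falls into the 'u' case, so /v implies /u. The function name `demo_parse_modifiers` and the `DEMO_*` bit values are hypothetical stand-ins for this illustration; they are not the real PCRE2 constants or the actual code in php_pcre.c.

```c
#include <stdint.h>

/* Hypothetical stand-in bits for illustration only; the real
   values come from pcre2.h and are not reproduced here. */
#define DEMO_UTF                0x1u
#define DEMO_ALT_EXTENDED_CLASS 0x2u
#define DEMO_UNGREEDY           0x4u

/* Map a modifier string such as "v" or "Uu" to option bits.
   Returns 0 on success, -1 on an unknown modifier. */
static int demo_parse_modifiers(const char *mods, uint32_t *coptions)
{
    *coptions = 0;
    for (; *mods != '\0'; mods++) {
        switch (*mods) {
        case 'U':
            *coptions |= DEMO_UNGREEDY;
            break;
        case 'v':
            *coptions |= DEMO_ALT_EXTENDED_CLASS;
            /* fall through: 'v' implies 'u' */
        case 'u':
            *coptions |= DEMO_UTF;
            break;
        default:
            return -1;
        }
    }
    return 0;
}
```

With this sketch, parsing "v" yields both the extended-class and UTF bits, while "u" alone yields only the UTF bit, which mirrors the proposed relationship between the two modifiers.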