Newsgroups: php.internals
Xref: news.php.net php.internals:88620
From: fsb@thefsb.org (Tom Worster)
Date: Thu, 01 Oct 2015 15:19:20 -0400
To: php-internals
Subject: PHP 7.0's Unicode version incoherence (mbstring, intl, pcre)
I think PHP should consistently use a single Unicode version within each release.

Improvements in PHP 7.0, especially IntlChar, allow us to do properly various things that required hackery with preg and mbstring in the past. But both approaches will have to coexist for the foreseeable future, so I think it's desirable to have all three exts based on the same version of Unicode. Weirdness can otherwise arise: for example, one API says a code point is unassigned and another says it is a lowercase letter. (Given the importance of validating strings in many PHP apps, this is relevant.)

So it would be good if a UCD upgrade in any given PHP point release applied to all three: mbstring, intl and pcre.

The status for 7.0 doesn't look too bad:

1. A recent commit updating ext/mbstring/unicode_data.h to Unicode 8.0.0 appears to be in 7.0.0RC4.

2. I think intl is using ICU4C 55.1, which uses Unicode 7.0. ICU 56RC, implementing Unicode 8.0.0, is available, but it seems unlikely 56 will be ready in time for PHP 7.0.

3. The version of PCRE in ext/pcre/pcrelib uses Unicode 7.0 and, as I pointed out last week (seemingly to nobody's interest), is probably never going to upgrade.

4. Tables in ext/standard/html_tables are based on Unicode 3.0, but I doubt they have ever been affected by a Unicode upgrade.

5. I'm not sure where else the UCD is used in PHP and would be interested to find out.

#70475 caused mbstring/unicode_data.h to be regenerated from Unicode 8.0.0. But it is still open, and I commented that if the file were regenerated using Unicode 7.0.0 instead, PHP 7.0 could be consistent. PHP 7.0 using Unicode 7.0.0 throughout is perfectly reasonable, imo.

Do people here agree that PHP should have a *policy* of using a consistent Unicode version? This appears to be easy to accomplish for the moment. Moving to Unicode 8 will be harder.
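To make the "unassigned in one API, lowercase letter in another" weirdness concrete, here is a minimal sketch. It is not PHP and does not touch mbstring, intl or pcre. It uses Python only because its standard library happens to ship two UCD versions side by side (the frozen `unicodedata.ucd_3_2_0` and the current database), which stand in for two extensions built against different Unicode versions. The names `OLD_UCD`, `NEW_UCD`, `is_lowercase_letter` and `report` are made up for illustration, and the sample code point U+A7B5 (LATIN SMALL LETTER BETA) was chosen because it was added in Unicode 8.0.

```python
import unicodedata

# Hypothetical stand-ins for two extensions built against different UCD versions.
OLD_UCD = unicodedata.ucd_3_2_0  # frozen Unicode 3.2.0 database
NEW_UCD = unicodedata            # whichever UCD version this Python build ships


def is_lowercase_letter(ucd, ch):
    """A toy validator: accept only characters in general category Ll."""
    return ucd.category(ch) == "Ll"


def report(code_point):
    """Compare how the two databases classify a single code point."""
    ch = chr(code_point)
    return {
        "old_version": OLD_UCD.unidata_version,
        "old_category": OLD_UCD.category(ch),
        "new_version": NEW_UCD.unidata_version,
        "new_category": NEW_UCD.category(ch),
        "validators_agree": (
            is_lowercase_letter(OLD_UCD, ch) == is_lowercase_letter(NEW_UCD, ch)
        ),
    }


if __name__ == "__main__":
    # U+A7B5 was added in Unicode 8.0, so the 3.2.0 database
    # treats it as unassigned (category Cn).
    for key, value in report(0xA7B5).items():
        print(f"{key}: {value}")
```

On a Python build whose database is Unicode 8.0 or later, the old database reports `Cn` and the new one `Ll`, so the same validation rule accepts the string through one API and rejects it through the other. That is the kind of disagreement a single Unicode version per PHP release would rule out.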
Tom

From: Tom Worster
Date: Thursday, September 24, 2015 at 9:40 AM
To: php-internals
Subject: Unicode regex roadmap

While PCRE2 upgraded to Unicode version 8, PCRE, which is in maintenance mode, will presumably remain on Unicode version 7 indefinitely.

Does PHP have a roadmap for up-to-date regex, either with PCRE2 or some other lib?

Tom