I think PHP should be consistent in using a given Unicode version in each
release.
Improvements in PHP 7.0, especially IntlChar, let us do properly various
things that used to require hackery with preg and mbstring. But both
approaches will have to coexist for the foreseeable future, so I think
it's desirable to have all three exts based on the same version of Unicode.
Weirdness can otherwise arise, for example one API says a code point is
unassigned and another that it is a lowercase letter. (Given the importance
of validating strings in many PHP apps, this is relevant.)
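To make that concrete, here is a minimal sketch of how one might probe a single code point through all three extensions. The helper name probe_code_point() is hypothetical, not an existing PHP function, and U+AB70 (CHEROKEE SMALL LETTER A, added in Unicode 8.0 as a lowercase letter) is just my choice of test case. What it prints depends entirely on which ICU, PCRE and mbstring tables a given build was compiled with, so I'm not showing expected output.

```php
<?php
// Hypothetical helper: ask intl, pcre and mbstring about the same code point
// so that any disagreement between their Unicode tables becomes visible.
function probe_code_point(int $cp): array
{
    $ch = IntlChar::chr($cp); // UTF-8 string for the code point

    return [
        // Unicode version of the ICU library that intl is linked against
        'intl_unicode_version' => implode('.', IntlChar::getUnicodeVersion()),
        // intl: is the code point assigned, and is it a lowercase letter?
        'intl_isdefined'       => IntlChar::isdefined($cp),
        'intl_islower'         => IntlChar::islower($cp),
        // pcre: does it match Ll (lowercase letter) or Cn (unassigned)?
        'pcre_Ll'              => preg_match('/^\p{Ll}$/u', $ch) === 1,
        'pcre_Cn'              => preg_match('/^\p{Cn}$/u', $ch) === 1,
        // mbstring: does its case table know an uppercase mapping?
        'mbstring_has_upper'   => mb_strtoupper($ch, 'UTF-8') !== $ch,
    ];
}

// U+AB70 is assigned (as Ll) only from Unicode 8.0 onwards.
foreach (probe_code_point(0xAB70) as $check => $result) {
    printf("%-22s %s\n", $check, var_export($result, true));
}
```

If the three extensions share a Unicode version, the intl, pcre and mbstring answers should agree with each other; a build that mixes versions may report the code point as unassigned in one place and as a cased letter in another.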
So it would be good if a UCD upgrade in any given PHP point release would
apply to all three: mbstring, intl and pcre.
The status for 7.0 doesn't look too bad:
- A recent commit updating ext/mbstring/unicode_data.h to Unicode 8.0.0
appears to be in 7.0.0RC4.
- I think intl is using ICU4C 55.1 which also uses Unicode 7.0. ICU 56RC
implementing Unicode 8.0.0 is available but it seems unlikely 56 will be
ready in time for PHP 7.0.
- The version of PCRE in ext/pcre/pcrelib uses Unicode 7.0, and, as I
pointed out last week (seemingly to nobody's interest) is probably never
going to upgrade.
- Tables in ext/standard/html_tables are based on Unicode 3.0 but I doubt
they have ever been affected by a Unicode upgrade.
- I'm not sure where else the UCD is used in PHP and would be interested to
find out.
#70475 caused mbstring/unicode_data.h to be regenerated from Unicode 8.0.0.
But it is still open and I commented that if it were regenerated using
Unicode 7.0.0 instead then PHP 7.0 could be consistent. PHP 7.0 using
Unicode 7.0.0 throughout is perfectly reasonable, imo.
Do people here agree that PHP should have a policy of using a consistent
Unicode version?
This appears to be easy to accomplish for the moment. Moving to Unicode 8
will be harder.
Tom
From: Tom Worster fsb@thefsb.org
Date: Thursday, September 24, 2015 at 9:40 AM
To: php-internals internals@lists.php.net
Subject: Unicode regex roadmap
While PCRE2 upgraded to Unicode version 8, PCRE, which is in maintenance
mode, will presumably remain on Unicode version 7 indefinitely.
Does PHP have a roadmap for up-to-date regex, either with PCRE2 or some
other lib?
Tom
On Thursday, 1 October 2015, 15:19:20, Tom Worster wrote:
> Do people here agree that PHP should have a policy of using a consistent
> Unicode version?
I agree with this, seems like a fair request.
Hello,
2015-10-01 21:19 GMT+02:00 Tom Worster fsb@thefsb.org:
> Do people here agree that PHP should have a policy of using a consistent
> Unicode version?
> This appears to be easy to accomplish for the moment. Moving to Unicode 8
> will be harder.
I agree with the policy -> good idea.
But I think staying on Unicode 7 will cause a lot of problems, since
Unicode 8 has the new emoji (colors), which are used more and more.
> Hello,
> 2015-10-01 21:19 GMT+02:00 Tom Worster fsb@thefsb.org:
>> Do people here agree that PHP should have a policy of using a consistent
>> Unicode version?
>> This appears to be easy to accomplish for the moment. Moving to Unicode 8
>> will be harder.
> I agree with the policy -> good idea.
> But I think staying on Unicode 7 will cause a lot of problems, since
> Unicode 8 has the new emoji (colors), which are used more and more.
The new emoji are neat and people can and should use them. They just
can't expect intl or preg to recognize them until some release after 7.0.0.
Using Unicode 8 uniformly involves moving to PCRE2, which doesn't seem
easy. It's a big API change for little functional gain.
I'd like to see PCRE2 implement a bit more of TR18 but I doubt it's a
priority. ICU regex as the next big thing in intl is an interesting
idea. I've no idea how hard it is.
http://unicode.org/reports/tr18/
So Unicode 8 depends on what PHP does about regex going forwards. I
asked about this on Sep 24 but got no response.
http://www.serverphorums.com/read.php?7,1303221
Tom