Good evening,
I’m opening voting on the Unicode Codepoint Escape Syntax RFC. There’s been some discussion in the last two weeks since I introduced the RFC, but there’s nothing left which I feel needs changing. For the character name syntax suggestion (i.e. something like \N{arabic letter alef}), if that’s desired, it could be done later in a different RFC.
Please read through the RFC and cast your vote if you wish to do so:
https://wiki.php.net/rfc/unicode_escape
Voting starts today (2014-12-08) and ends in 10 days’ time (2014-12-18).
Thanks!
Andrea Faulds
http://ajf.me/
I vote 'yes'.
--
Alain Williams
Linux/GNU Consultant - Mail systems, Web sites, Networking, Programmer, IT Lecturer.
+44 (0) 787 668 0256 http://www.phcomp.co.uk/
Parliament Hill Computers Ltd. Registration Information: http://www.phcomp.co.uk/contact.php
#include <std_disclaimer.h>
Hi Alain,
I vote 'yes'.
At the risk of stating the obvious: I don’t see your vote on the page’s voting widget. Please vote there.
Thanks!
Andrea Faulds
http://ajf.me/
I looked ... I now see that I need voting Karma - which I don't have.
I suppose that I might qualify under 'regular participant of internals
discussions' - but that is not for me to decide.
--
Alain Williams
Linux/GNU Consultant - Mail systems, Web sites, Networking, Programmer, IT Lecturer.
+44 (0) 787 668 0256 http://www.phcomp.co.uk/
Parliament Hill Computers Ltd. Registration Information: http://www.phcomp.co.uk/contact.php
#include <std_disclaimer.h>
Thanks for the RFC! Maybe you could add to the documentation that older PHP versions can use json_decode(), which is limited to 4 hex digits per escape:
php -r "echo json_decode('\"man\u0303ana\"');"
php -r "echo json_decode('\"ma\u00F1ana\"');"
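A minimal sketch of that workaround as a script, assuming the json extension is available (PHP 5.2 or later). The escapes sit inside single-quoted PHP strings, so PHP leaves them alone and json_decode() does the decoding. Because a JSON escape takes exactly four hex digits, a code point above U+FFFF has to be written as a UTF-16 surrogate pair; the U+1F44B case is an illustration added here, not something taken from the RFC.

```php
<?php
// Single quotes: PHP passes \uXXXX through untouched; json_decode() decodes it.
$precomposed = json_decode('"ma\u00F1ana"');   // U+00F1, precomposed n with tilde
$combining   = json_decode('"man\u0303ana"');  // "n" followed by U+0303 combining tilde

// Above U+FFFF, JSON needs a surrogate pair: U+1F44B is \uD83D\uDC4B.
$astral = json_decode('"\uD83D\uDC4B"');

// Print the UTF-8 bytes so the result does not depend on the terminal encoding.
echo bin2hex($precomposed), "\n"; // 6d61c3b1616e61
echo bin2hex($combining), "\n";   // 6d616ecc83616e61
echo bin2hex($astral), "\n";      // f09f918b
```

The two spellings of the word render the same but are different byte sequences, which matters if the result is later compared or hashed.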
Regards
Thomas
2014-12-09 0:51 GMT+01:00 Andrea Faulds ajf@ajf.me:
Still leaves unmentioned that there was already an established Unicode
escape syntax. PCRE provides \x{1F520} for codepoints in conjunction with
plain \xFF for byte escapes.
Maybe there should be more elaboration on why PHP itself should go with
the \u{xxxx} ECMAScript representation, thus introducing a syntax disparity
with our most major string handling extension.
Hi!
2014-12-09 0:51 GMT+01:00 Andrea Faulds ajf@ajf.me:
Still leaves unmentioned that there was already an established Unicode
escape syntax. PCRE provides \x{1F520} for codepoints in conjunction with
plain \xFF for byte escapes.
Interesting, I was unaware of that until now, thanks for pointing this out.
Maybe there should be more elaboration on why PHP itself should go with
the \u{xxxx} ECMAScript representation, thus introducing a syntax disparity
with our most major string handling extension.
Well, PCRE does what it does probably because of its name: Perl-Compatible Regular Expressions. Perl has the \x syntax. But PCRE’s syntax comes from what suits Perl, not PHP, so I don’t see why we should necessarily match its behaviour. If we add \x{xxxxx} syntax to PHP’s string literals, then we’ll break existing code which uses double quoted strings for regular expressions.
I think \x{xxxx} is misleading anyway - \xXX is always single-byte/character, yet Unicode code points can’t be represented in PHP strings as single bytes when encoded in UTF-8 (unless they’re below U+0100, of course). If I saw "\x{abcd}" I'd expect it to be the same as "\xab\xcd". Plus, while Perl has \x{xxxx} syntax, Ruby and ECMAScript 6 have the \u{xxxx} syntax, so \u{xxxx} is already more popular. The ‘u’ in \u{xxxx} also makes it more obviously “Unicode”.
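To make the byte-versus-codepoint distinction concrete, here is a small sketch assuming PHP 7.0 or later (where the \u{...} syntax from this RFC is available). It compares the two-byte reading of "abcd" with the UTF-8 encoding of the code point U+ABCD; the variable names are illustrative only.

```php
<?php
// Requires PHP 7.0+ for the \u{...} escape.
$bytes     = "\xab\xcd";  // two raw bytes, 0xAB then 0xCD
$codepoint = "\u{abcd}";  // code point U+ABCD, stored as its UTF-8 encoding

echo bin2hex($bytes), "\n";      // abcd
echo bin2hex($codepoint), "\n";  // eaaf8d
echo strlen($codepoint), "\n";   // 3
```

The \x form always contributes exactly the bytes written, while the \u{...} form contributes however many bytes UTF-8 needs for that code point.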
Thanks!
Andrea Faulds
http://ajf.me/
Tue, 9 Dec 2014 02:44:33 +0000 Andrea Faulds ajf@ajf.me:
Well, PCRE does what it does probably because of its name:
Perl-Compatible Regular Expressions. Perl has the \x syntax. But
PCRE’s syntax comes from what suits Perl, not PHP, so I don’t see why
we should necessarily match its behaviour. If we add \x{xxxxx} syntax
to PHP’s string literals, then we’ll break existing code which uses
double quoted strings for regular expressions.
Actually the opposite seems alarming. For double-quoted strings it'd be
irrelevant whether \u{} or \x{} was previously handled by PHP or left to
PCRE's interpretation. (Having an alternative there is even beneficial.)
However, single-quoted strings are more commonly and habitually used for
regexps. Once \u{} is in regular use, it will end up in regex context
unknowingly or accidentally, and that is where it would trigger PCRE failures.
preg_match('~\u{bad}~umixUs')
Both \u{} and \x{} are used in fringe cases only of course. Consistently
settling on one would still benefit forward compatibility here.
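A hedged sketch of the three cases being discussed, assuming PHP 7.0+ with the bundled PCRE and its default options (under which, as far as I know, \u is rejected at pattern compile time). The subject string and variable names are made up for illustration; the warning from the failing call is suppressed with @ so only the return value is shown.

```php
<?php
// Requires PHP 7.0+ for "\u{...}" in double-quoted strings.
$subject = "\u{1F44B}";

// Single-quoted: PCRE itself interprets \x{...} (the /u modifier is needed).
$viaPcre = preg_match('~\x{1F44B}~u', $subject);

// Double-quoted: PHP expands \u{...} before PCRE ever sees the pattern.
$viaPhp = preg_match("~\u{1F44B}~u", $subject);

// Single-quoted \u{...}: PCRE receives the backslash-u and refuses to compile.
$mistake = @preg_match('~\u{1F44B}~u', $subject);

var_dump($viaPcre, $viaPhp, $mistake); // int(1), int(1), bool(false)
```

The failure in the last call is loud (a warning plus a false return), so the mistake is at least detectable rather than silently matching something else.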
I think \x{xxxx} is misleading anyway - \xXX is always
single-byte/character, yet Unicode code points can’t be represented
in PHP strings as single bytes when encoded in UTF-8 (unless they’re
below U+0100, of course). If I saw "\x{abcd}" I'd expect it to be the
same as "\xab\xcd". Plus, while Perl has \x{xxxx} syntax, Ruby and
ECMAScript 6 have the \u{xxxx} syntax, so \u{xxxx} is already more
popular. The ‘u’ in \u{xxxx} also makes it more obviously “Unicode”.
There's no question really about \u being more common and therefore
recognizable and preferable. Taking the cue from Ruby is appreciated!
Since the RFC rightly discounts the standard \uFFFF due to compatibility
reasons, there's however little visual and semantic distinction between
the {}-embellished variant \u{hhhhh} and a hypothetical \x{hhhhh}.
Not sure why or who would misinterpret \x{abc} as multi-bytes, really.
It's well understood and working for PCRE. The advantage of overloading
\x is a much lessened likelihood to ever encounter a residual "\x{" in
PHP strings.
Whereas "\u" is new, and never had an implicit payload constraint, thus
could run into a preexisting "\u{xxxx}" that was formerly targeted at a
later/distinct context.
Going with the Ruby theme; when piping a string there or receiving one
it's irrelevant who uses which syntax to preinterpret it. It's only
really interesting when exchanging string literals.
But the RFC and the patch don't cover stripcslashes() or addcslashes(),
for instance.
So there's no direct string syntax interoperability earmarked for.
Which is why I brought forward \x{hhhhh} as alternative for within-PHP
consistency at least.
(Not bent on lobbying for x, as \u{…} is visually more pleasing; just
unsure about its scope.)
\u{1F44B}
If ICU is to be adopted as the base for unicode support, then surely
everything else should follow those rules?
\uhhhh and \Uhhhhhhhh are defined along with \x{hhhhhh} so does it make
sense to add something which is not part of ICU?
--
Lester Caine - G8HFL
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk
Er, where does ICU define \uXXXX and \UXXXXXX? I don't understand.
--
Andrea Faulds
http://ajf.me/
http://userguide.icu-project.org/strings/regexp
--
Lester Caine - G8HFL
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk
We aren't using ICU regular expressions, and ICU is merely an implementation detail anyway.
--
Andrea Faulds
http://ajf.me/
Has THAT been agreed on? Surely if we are using ICU fully in PHP7 in place of
the patchwork of current fixes for Unicode, then we don't want to be
breaking things again through odd differences from the core code for Unicode?
I thought the agreement was that there was no resource to create an
alternative from scratch?
--
Lester Caine - G8HFL
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk
I think what Andrea's getting at is that the fact that ICU is in use
under the hood shouldn't be particularly visible to users. If PHP gets
"Unicode support" (whatever that turns out to mean), what the user
should see is PHP's Unicode facilities; only core devs and package
maintainers will need to know that those are implemented using ICU. As
such, there's no automatic need for PHP to do everything the same way as
ICU.
--
Rowan Collins
[IMSoP]
That was the reason for asking ...
What is the point of all these piecemeal patches when the underlying
base has not yet been agreed on? That we are using ICU in things like
the database interfaces for unicode support would point to it being
somewhat useful if those processes produced the same code as the same
actions in PHP. ICU is well established and its API is already in use on
the same platform as PHP is running on ... so can we please treat all of
these 'patches' in the light of a proper debate on the bigger picture.
Forcing something like this through now simply does not make sense, and
while there may be no 'automatic need' for the database interface to
work the same as other parts, it would perhaps be worth a little
consideration?
--
Lester Caine - G8HFL
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk
I see what you mean, but I think in this case, it would make very little
difference what other Unicode pieces are added, since the Unicode escape
syntax will only ever be interpreted by the compiler, and no other
functions will ever see what it looks like. The only exception would be
things like PCRE (not ICU) regexes, where - in a single-quoted string -
a visually similar syntax might exist, but there are already lots of
differences between what backslash-something means in a regex and what
it means in a double-quoted string literal.
--
Rowan Collins
[IMSoP]
I think \x{xxxx} is misleading anyway - \xXX is always
single-byte/character, yet Unicode code points can’t be represented in
PHP strings as single bytes when encoded in UTF-8 (unless they’re
below U+0100, of course).
You mean below U+0080 surely? Only the "first 7 bits" can be represented
as a single byte with UTF-8. U+0080 is for example 0xC2 0x80 in UTF-8.
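The boundary is easy to check. A minimal sketch, assuming PHP 7.0+ so that \u{...} can be used to build the strings; the sample code points around U+0080 and U+0100 are chosen here for illustration.

```php
<?php
// Requires PHP 7.0+ for the \u{...} escape.
$samples = [
    'U+007F' => "\u{7f}",
    'U+0080' => "\u{80}",
    'U+00FF' => "\u{ff}",
    'U+0100' => "\u{100}",
];

// strlen() counts bytes, so it shows how many bytes UTF-8 needs per code point.
foreach ($samples as $label => $s) {
    printf("%s: %d byte(s), %s\n", $label, strlen($s), bin2hex($s));
}
// U+007F: 1 byte(s), 7f
// U+0080: 2 byte(s), c280
// U+00FF: 2 byte(s), c3bf
// U+0100: 2 byte(s), c480
```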
cheers,
Derick
Hi Derick,
Ah, yes, my bad. I was probably getting confused with how all Unicode codepoints below U+0100 match Latin-1.
Thanks!
Andrea Faulds
http://ajf.me/
I was just updating my HHVM patch to match your PHP implementation and
an issue came up. The following code, which is valid in PHP5:
<?php
echo json_decode("\"ma\u00F1ana\"");
Will throw a fatal compiler error as "\u00F1" is an invalid unicode
escape sequence. Since this represents an unnecessary BC break, I'd
like to propose the error handling be modified to match \x, which is
to say: Pass the value through unmodified.
-Sara
I was wondering about that case. Previously, the patch just raised a warning but let it through unmodified. But then old code would be littered with warnings, and I felt it was better just to throw an error. Part of the problem is that I’d rather mistakes in string literals be caught at compile time, to prevent someone accidentally echoing ‘\u00F1’ somewhere.
A possible compromise might be to let ‘\u’ through but not ‘\u{’.
--
Andrea Faulds
http://ajf.me/
A possible compromise might be to let ‘\u’ through but not ‘\u{’.
+1
I can see that some people might have \u (for what reason I do not know), but it
would be more unlikely for \u{ to be found in 'legacy' code.
--
Alain Williams
Linux/GNU Consultant - Mail systems, Web sites, Networking, Programmer, IT Lecturer.
+44 (0) 787 668 0256 http://www.phcomp.co.uk/
Parliament Hill Computers Ltd. Registration Information: http://www.phcomp.co.uk/contact.php
#include <std_disclaimer.h>
I was wondering about that case. Previously, the patch just raised a warning
but let it through unmodified. But then old code would be littered with warnings,
and I felt it was better just to throw an error. Part of the problem is that I’d rather
mistakes in string literals be caught at compile time, to prevent someone
accidentally echoing ‘\u00F1’ somewhere.
Well, consistency would be not raising a warning at all (like with \x
or unknown \ sequences), but I get your concern with helping coders
catch mistakes early.
I don't have the stats on how many projects are using \uXXXX sequences
(with the json_decode() or similar hack to parse them), but I know
FB's codebase is absolutely covered in them.
A possible compromise might be to let ‘\u’ through but not ‘\u{’.
Still don't like it from the inconsistency with existing escape
sequence handlers pov, but it'd cover the biggest set of BC issues, so
I'd be happy with it.
-Sara
I’ve updated the patches for php-src and the specification to implement this, along with their tests, and I’ve also updated the RFC.
Now this won’t error:
"\"\u202e\""
But this still will:
"\u{foobar"
I think this is an acceptable compromise.
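A rough sketch of how the two cases behave, assuming PHP 7.0+ and assuming the malformed escape surfaces as a ParseError (which is my understanding of the merged implementation). The malformed literal is wrapped in eval() only so the script itself still compiles; the variable names are illustrative.

```php
<?php
// Requires PHP 7.0+.

// No brace after \u: the sequence is left as six literal characters.
$literal = "\u202e";
echo strlen($literal), "\n"; // 6

// So the PHP 5 json_decode() idiom keeps working unchanged.
$decoded = json_decode("\"\u202e\"");
var_dump($decoded === "\u{202e}"); // bool(true)

// A brace that does not form a valid escape is rejected at compile time.
$rejected = false;
try {
    eval('return "\u{foobar";');
} catch (ParseError $e) {
    $rejected = true;
}
var_dump($rejected); // bool(true)
```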
Andrea Faulds
http://ajf.me/
Groovy, thanks.
-Sara
The RFC is really a good writeup, very much appreciated.
I've voted no because I'm not entirely convinced the current approach
suits PHP long-term. I absolutely appreciate the hard work going into
it, no doubt, and I'd rather have a solution yesterday than tomorrow.
But I think a proper Unicode implementation in PHP must not rely on
puzzle pieces put together without a clear goal for where we're heading
overall.
I know that the latter has been tried and failed (multiple times) but
that's how I feel about the situation.
- Markus
It's nice to hear someone understands the current problem ... but I'm
sure there must be a few more who can vote who do as well!
--
Lester Caine - G8HFL
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk
Hi,
A more complete and "long term" approach might come from a better/deeper
integration of Unicode into PHP itself -- but that won't happen for PHP
7 (it failed with PHP 6).
That being said, the feature described here is quite small and concise,
simple to use, useful in some situations, and shouldn't break BC too much.
So, after discussing with other members of AFUP, we are on the +1 side.
--
Pascal MARTIN, AFUP - French UG
http://php-internals.afup.org/
Good evening once again,
By 23 votes to 2 against, the Unicode Codepoint Escape Syntax RFC has been accepted! I’ll merge the patch into master shortly, and it’s already in HHVM thanks to Ms Golemon’s wonderful efforts.
Thanks for voting!
Andrea Faulds
http://ajf.me/