[RFC] Unicode Escape Syntax

10 years ago by Andrea Faulds

Here’s a new RFC: https://wiki.php.net/rfc/unicode_escape

My apologies to you all, a small correction: the title of that email should’ve been “[RFC] Unicode Codepoint Escape Syntax” to match the title of the RFC; I left out the “Codepoint”.

Andrea Faulds
http://ajf.me/

10 years ago by Sara Golemon

Here’s a new RFC: https://wiki.php.net/rfc/unicode_escape

I'm okay with producing UTF-8 even though our strings are technically
binary. As you state, UTF-8 is the de-facto encoding, and recognizing
this is pretty reasonable.

You may want to make it a requirement that strings containing \u
escapes are denoted as u"blah blah". We set aside this format
back in the PHP6 days (note that b"blah" is equivalent to "blah" for
binary strings).

On the BMP versus SMP issue of \uXXXX styles, we addressed this in
PHP6 by making \u denote four-hexit BMP codepoints, while \U denoted
six-hexit codepoints, e.g. "\u1234" === "\U001234". I'd rather
follow this style than make \u special and different from hex and
octal notations by using braces.

-Sara

10 years ago by Andrea Faulds

Here’s a new RFC: https://wiki.php.net/rfc/unicode_escape

I'm okay with producing UTF-8 even though our strings are technically
binary. As you state, UTF-8 is the de-facto encoding, and recognizing
this is pretty reasonable.

On that note, it strikes me now that we assume an encoding anyway for all escape sequences. If I’m using EBCDIC or UTF-16, “\n” isn’t going to help me much!

You may want to make it a requirement that strings containing \u
escapes are denoted as: u"blah blah" We set aside this format
back in the PHP6 days (note that b"blah" is equivalent to "blah" for
binary strings).

I’d rather keep u"blah blah" for if/when we add actual Unicode strings.

On the BMP versus SMP issue of \uXXXX styles, we addressed this in
PHP6 by making \u denote 4 hexit BMP codepoints, while \U denoted six
hexit codepoints. e.g. "\u1234" === "\U001234" I'd rather
follow this style than making \u special and different from hex and
octal notations by using braces.

That is something I’d thought about. \U takes 8 hex digits in every other language which has it, though.

I suppose we could do this, it resolves the BMP issue, certainly. Still, I think the brace syntax has its advantages because it’s completely unambiguous and it means we only have one syntax for this, not two different ones (less mental overhead). Plus, it’s worth noting that \u would still be different from \ooo and \xXX anyway, as it’d be fixed-length while octal and hex aren’t.
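To make the length point concrete, here is a small sketch of how the existing hex and octal escapes consume a variable number of digits, next to the brace form. It assumes PHP 7.0 or later, where the \u{...} syntax from this RFC eventually shipped; it illustrates the behaviour being discussed and is not taken from the RFC patch.

```php
<?php
// \x consumes one or two hex digits, so the escape ends where the digits do.
var_dump("\x41" === "A");            // two digits: U+0041
var_dump("\x4G" === "\x04" . "G");   // one digit, then a literal "G"

// Octal consumes one to three digits.
var_dump("\101" === "A");            // three digits
var_dump("\1011" === "A" . "1");     // a fourth digit is a literal "1"

// The brace form has an explicit end, so any number of digits is unambiguous.
var_dump("\u{41}" === "\u{000041}");
```

Each comparison should evaluate to true; the closing brace is what lets leading zeros or a following digit coexist without ambiguity.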

--
Andrea Faulds
http://ajf.me/

10 years ago by Adam Harvey

Here’s a new RFC: https://wiki.php.net/rfc/unicode_escape

I'm okay with producing UTF-8 even though our strings are technically
binary. As you state, UTF-8 is the de-facto encoding, and recognizing
this is pretty reasonable.

I'm also OK with this, although I do wonder if we should be respecting
the user's default_charset setting instead. (Since default_charset
defaults to "UTF-8", in practice this isn't a significant difference
for the average user.)

You may want to make it a requirement that strings containing \u
escapes are denoted as: u"blah blah" We set aside this format
back in the PHP6 days (note that b"blah" is equivalent to "blah" for
binary strings).

It seems to me that the point of \u and \U escapes is to embed Unicode
in potentially non-Unicode strings, so using u"" doesn't feel right.

On the BMP versus SMP issue of \uXXXX styles, we addressed this in
PHP6 by making \u denote 4 hexit BMP codepoints, while \U denoted six
hexit codepoints. e.g. "\u1234" === "\U001234" I'd rather
follow this style than making \u special and different from hex and
octal notations by using braces.

I think I prefer the brace style, personally. Non-BMP codepoints have
become more important since PHP 6 (thanks, emoji), and having \u and
\U be case sensitive when \x isn't seems confusing.

Adam

10 years ago by Andrea Faulds

Here’s a new RFC: https://wiki.php.net/rfc/unicode_escape

I'm okay with producing UTF-8 even though our strings are technically
binary. As you state, UTF-8 is the de-facto encoding, and recognizing
this is pretty reasonable.

I'm also OK with this, although I do wonder if we should be respecting
the user's default_charset setting instead. (Since default_charset
defaults to "UTF-8", in practice this isn't a significant difference
for the average user.)

Ooh, that would be a possibility. That or using whatever encoding the source file is specified to be with declare(), so it matches the encoding of other characters in the string.

This’d add significant complexity to it, though (would we have to require ICU or something? D:), plus the vast majority of Unicode characters will only be supported by Unicode encodings… and of those, only UTF-8 is really in much use here anyway.

You may want to make it a requirement that strings containing \u
escapes are denoted as: u"blah blah" We set aside this format
back in the PHP6 days (note that b"blah" is equivalent to "blah" for
binary strings).

It seems to me that the point of \u and \U escapes is to embed Unicode
in potentially non-Unicode strings, so using u"" doesn't feel right.

I don’t really see where you’re coming from, it also makes just as much sense within Unicode strings. There are plenty of cases (like the U+202E or mañana examples in the RFC) where you’d want a Unicode escape in a Unicode string.

--
Andrea Faulds
http://ajf.me/

10 years ago by Adam Harvey

I'm also OK with this, although I do wonder if we should be respecting
the user's default_charset setting instead. (Since default_charset
defaults to "UTF-8", in practice this isn't a significant difference
for the average user.)

Ooh, that would be a possibility. That or using whatever encoding the source file is specified to be with declare(), so it matches the encoding of other characters in the string.

This’d add significant complexity to it, though (would we have to require ICU or something? D:), plus the vast majority of Unicode characters will only be supported by Unicode encodings… and of those, only UTF-8 is really in much use here anyway.

We would have to require ICU, but that might be worthwhile for PHP 7
anyway. Having at least one i18n API that's guaranteed to be available
would be nice.

You may want to make it a requirement that strings containing \u
escapes are denoted as: u"blah blah" We set aside this format
back in the PHP6 days (note that b"blah" is equivalent to "blah" for
binary strings).

It seems to me that the point of \u and \U escapes is to embed Unicode
in potentially non-Unicode strings, so using u"" doesn't feel right.

I don’t really see where you’re coming from, it also makes just as much sense within Unicode strings. There are plenty of cases (like the U+202E or mañana examples in the RFC) where you’d want a Unicode escape in a Unicode string.

I probably worded that badly — I just mean that I don't think \u and
\U should be limited to only u"" strings, but should work in normal
strings as well. (In other words, I'm agreeing with what's in your
RFC, not with Sara.)

Adam

10 years ago by Sara Golemon

We would have to require ICU, but that might be worthwhile for PHP 7
anyway. Having at least one i18n API that's guaranteed to be available
would be nice.

It's 2014. I think requiring ICU is reasonable at this point.

Orthogonal to this RFC, but I'd be in favor of deprecating all the
non-ICU intl stuff sometime soon.

I probably worded that badly — I just mean that I don't think \u and
\U should be limited to only u"" strings, but should work in normal
strings as well. (In other words, I'm agreeing with what's in your
RFC, not with Sara.)

I don't feel strongly about the u"" requirement, it doesn't make the
world a darker place if we're more permissive.

Plus, it’s worth noting that \u would still be different from \ooo and \xXX anyway,
as it’d be fixed-length while octal and hex aren’t.

And I really wish that weren't true. :p

-Sara

10 years ago by Andrea Faulds

We would have to require ICU, but that might be worthwhile for PHP 7
anyway. Having at least one i18n API that's guaranteed to be available
would be nice.

It's 2014. I think requiring ICU is reasonable at this point.

I also think it would be reasonable to require ICU, especially as it means we could perhaps enable Joe Watkins’s UString by default, assuming it actually makes it into PHP 7.

That said, I don’t think we should go down the route of making \u convert to the current encoding. It doesn’t make much sense, if any, for non-Unicode encodings, and nobody is using UTF-16 or UTF-32. Plus, it’d be inconsistent, given we don’t convert any of the other escape sequences in strings anyway! It would be quite weird if \u{77} converted yet \x77 did not.

--
Andrea Faulds
http://ajf.me/

10 years ago by Alain Williams

Here’s a new RFC: https://wiki.php.net/rfc/unicode_escape

I'm okay with producing UTF-8 even though our strings are technically
binary. As you state, UTF-8 is the de-facto encoding, and recognizing
this is pretty reasonable.

You may want to make it a requirement that strings containing \u
escapes are denoted as: u"blah blah" We set aside this format
back in the PHP6 days (note that b"blah" is equivalent to "blah" for
binary strings).

On the BMP versus SMP issue of \uXXXX styles, we addressed this in
PHP6 by making \u denote 4 hexit BMP codepoints, while \U denoted six
hexit codepoints. e.g. "\u1234" === "\U001234" I'd rather
follow this style than making \u special and different from hex and
octal notations by using braces.

There is a big difference between \u or \U and \x or octal escapes: the number of
characters that follow the escape. \x takes at most 2 and octal at most 3; both are
short and easy to count by eye. \U012345 is quite long and it is not so visually
obvious where it should end.

Ergo: I prefer Andrea's "\u{0123}" as it is going to be more robust against typos.

One other thing that we could do is to allow code points to be named, with \U
(capital 'U') eg:

echo "\U{arabic letter alef}\n";

If you think that it is a bad idea, please update the RFC to say why this is a
bad idea and so why it is not going to happen - for now.

It would be nice since a code point is just a big number without any really obvious
meaning, but a name makes for greater clarity.

However, I suspect that interpreting this might be considerably slower, which
means slower compilation.

Regards

--
Alain Williams
Linux/GNU Consultant - Mail systems, Web sites, Networking, Programmer, IT Lecturer.
+44 (0) 787 668 0256 http://www.phcomp.co.uk/
Parliament Hill Computers Ltd. Registration Information: http://www.phcomp.co.uk/contact.php
#include <std_disclaimer.h>

10 years ago by Andrea Faulds

There is a big difference between \u or \U and \x or octal escapes: the number of
characters that follow the escape. \x takes at most 2 and octal at most 3; both are
short and easy to count by eye. \U012345 is quite long and it is not so visually
obvious where it should end.

Ergo: I prefer Andrea's "\u{0123}" as it is going to be more robust against typos.

Typos are an angle I hadn’t quite considered, but yes, this syntax is better against that. Importantly, it’s a compile error if you produce a broken literal, while if you screwed up the brace-free style you’d probably just get a mangled string.
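A rough sketch of that difference, assuming PHP 7.0 or later (where this syntax eventually shipped and a malformed \u{...} literal raises a ParseError). The helper name tryLiteral is hypothetical, and eval() is used only so that the parse error can be caught and displayed at runtime:

```php
<?php
// Hypothetical helper: compile a double-quoted literal and report the result.
function tryLiteral(string $source): string
{
    try {
        // Builds and evaluates:  return "<source>";
        return bin2hex(eval("return \"$source\";"));
    } catch (ParseError $e) {
        return 'ParseError: ' . $e->getMessage();
    }
}

echo tryLiteral('\u{1F602}'), "\n"; // well-formed: prints the UTF-8 bytes in hex
echo tryLiteral('\u{1F6O2}'), "\n"; // letter O instead of zero: rejected at compile time
echo tryLiteral('\x9'), "\n";       // one-digit hex escape: silently accepted
```

The second literal fails loudly, while the short \x escape compiles to a single control byte without any warning, which is the kind of quiet mangling a brace-free \u style would be prone to.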

One other thing that we could do is to allow code points to be named, with \U
(capital 'U') eg:

echo "\U{arabic letter alef}\n";

Ooh, that’s an interesting idea. I believe Perl actually has this already, although it uses the \N syntax:

http://perldoc.perl.org/perlreref.html#ESCAPE-SEQUENCES

Is something like that what you have in mind?

If you think that it is a bad idea, please update the RFC to say why this is a
bad idea and so why it is not going to happen - for now.

It would be nice since a code point is just a big number without any really obvious
meaning, but a name makes for greater clarity.

However, I suspect that interpreting this might be considerably slower, which
means slower compilation.

I’ll add it to the Future Scope part.

One issue with this, however, is that we’d have to include a Unicode info database from somewhere with the names of the characters. That’d probably mean requiring ICU or something like it, which the current patch doesn’t do.

Andrea Faulds
http://ajf.me/

10 years ago by Alain Williams

echo "\U{arabic letter alef}\n";

Ooh, that’s an interesting idea. I believe Perl actually has this already, although it uses the \N syntax:

http://perldoc.perl.org/perlreref.html#ESCAPE-SEQUENCES

Is something like that what you have in mind?

Exactly.

Confession: it was looking at the perl documentation that led me to suggest it.

--
Alain Williams
Linux/GNU Consultant - Mail systems, Web sites, Networking, Programmer, IT Lecturer.
+44 (0) 787 668 0256 http://www.phcomp.co.uk/
Parliament Hill Computers Ltd. Registration Information: http://www.phcomp.co.uk/contact.php
#include <std_disclaimer.h>

10 years ago by Derick Rethans

Here’s a new RFC: https://wiki.php.net/rfc/unicode_escape

I'm okay with producing UTF-8 even though our strings are technically
binary. As you state, UTF-8 is the de-facto encoding, and recognizing
this is pretty reasonable.

You may want to make it a requirement that strings containing \u
escapes are denoted as: u"blah blah" We set aside this format
back in the PHP6 days (note that b"blah" is equivalent to "blah" for
binary strings).

On the BMP versus SMP issue of \uXXXX styles, we addressed this in
PHP6 by making \u denote 4 hexit BMP codepoints, while \U denoted six
hexit codepoints. e.g. "\u1234" === "\U001234" I'd rather
follow this style than making \u special and different from hex and
octal notations by using braces.

I agree with this fully. No need to reinvent a wheel (that we left
behind on the road)...

cheers,
Derick

10 years ago by Andrea Faulds

On the BMP versus SMP issue of \uXXXX styles, we addressed this in
PHP6 by making \u denote 4 hexit BMP codepoints, while \U denoted six
hexit codepoints. e.g. "\u1234" === "\U001234" I'd rather
follow this style than making \u special and different from hex and
octal notations by using braces.

I agree with this fully. No need to reinvent a wheel (that we left
behind on the road)…

The \u{} style has its advantages, like not being case-sensitive and being clearly delimited, though. Plus, PHP 6 is dead, and non-BMP code points are more important than ever.

--
Andrea Faulds
http://ajf.me/

10 years ago by Yasuo Ohgaki

Hi all,

non-BMP code points are more important than ever.

Yes, they are! We (Japanese) already have a number of them.

\u{code point} has a huge advantage: we do not have to care whether the code
point is in the BMP or not,
i.e. we can do
echo "\u{code point}";
regardless of the code point value.
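As a small illustration (assuming PHP 7.0 or later, where the brace syntax eventually shipped), the same notation covers code points inside and outside the BMP; the three sample code points are chosen here only as examples:

```php
<?php
$samples = [
    'U+3042 (BMP)'             => "\u{3042}",
    'U+20BB7 (outside BMP)'    => "\u{20BB7}",
    'U+1F602 (outside BMP)'    => "\u{1F602}",
];

foreach ($samples as $label => $s) {
    // strlen() counts bytes, so this shows the UTF-8 length of each code point.
    printf("%s: %d bytes, hex %s\n", $label, strlen($s), bin2hex($s));
}
```

The BMP sample occupies three bytes and the other two occupy four, but the source notation is identical in every case.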

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

10 years ago by Sara Golemon

Here’s a new RFC: https://wiki.php.net/rfc/unicode_escape

I've linked a provisional HHVM implementation from that page.
Planning to match whatever PHP7 does, of course, but for the moment
I've added named entity support since it's being discussed.

https://github.com/sgolemon/hhvm/compare/unicode-escape

10 years ago by Ivan Enderlin @ Hoa

On 24/11/2014 23:09, Andrea Faulds wrote:

Good evening,

Here’s a new RFC: https://wiki.php.net/rfc/unicode_escape

It has a rationale section explaining why certain decisions were made, that I’d recommend you read in full.

Excellent RFC, thank you for this proposal.
I would suggest this talk:
https://speakerdeck.com/mathiasbynens/hacking-with-unicode (you might
already know it). It covers interesting concepts and limitations of
current Unicode implementations.
The usage of \u{…} fixes most limitations and I could not agree more
with that notation!

Cheers.

--
Ivan Enderlin
Developer of Hoa
http://hoa-project.net/

PhD. at DISC/Femto-ST (Vesontio) and INRIA (Cassis)
http://disc.univ-fcomte.fr/ and http://www.inria.fr/

Member of HTML and WebApps Working Group of W3C
http://w3.org/

10 years ago by Christoph Becker

Ivan Enderlin @ Hoa wrote:

On 24/11/2014 23:09, Andrea Faulds wrote:

Good evening,

Here’s a new RFC: https://wiki.php.net/rfc/unicode_escape

It has a rationale section explaining why certain decisions were made,
that I’d recommend you read in full.

Excellent RFC, thank you for this proposal.
I would suggest this talk:
https://speakerdeck.com/mathiasbynens/hacking-with-unicode (you might
already know it). It covers interesting concepts and limitations of
current Unicode implementations.
The usage of \u{…} fixes most limitations and I could not agree more
with that notation!

I don't see that the proposed \u{...} notation fixes any limitation.
Its only advantage would be slightly better readability compared with
inserting the desired UTF-8 byte sequences as octal or hexadecimal
escapes. For instance, all of the following would print Ä (in a UTF-8
context):

echo "\u{00C4}";
echo "\xC3\x84";
echo "\303\204";
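For readers who want to see why these three spellings are equivalent, here is a minimal sketch of the UTF-8 encoding step that a \u{...} escape implies. The function utf8Encode is a hypothetical helper written for this illustration, not code from the RFC patch, and it does not reject surrogate code points:

```php
<?php
// Hypothetical helper: encode a single Unicode code point as UTF-8 bytes.
function utf8Encode(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF) {
        throw new InvalidArgumentException('Code point out of range');
    }
    if ($cp < 0x80) {       // 1 byte:  0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {      // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {    // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

var_dump(utf8Encode(0xC4) === "\xC3\x84");            // hex spelling
var_dump(utf8Encode(0xC4) === "\303\204");            // octal spelling
var_dump(utf8Encode(0x1F602) === "\xF0\x9F\x98\x82"); // a four-byte case
```

All three comparisons should be true. With the escape syntax the reader only needs the code point, while the byte-level spellings require doing this arithmetic by hand.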

--
Christoph M. Becker

10 years ago by Dmitry Stogov

Maybe I misunderstood something, but why introduce Unicode escapes if
the PHP engine doesn't support Unicode?

Always converting such escapes into UTF-8 doesn't make any sense
for people who use other encodings for output, databases, etc.

Thanks. Dmitry.

Good evening,

Here’s a new RFC: https://wiki.php.net/rfc/unicode_escape

It has a rationale section explaining why certain decisions were made,
that I’d recommend you read in full.

Thanks!

Andrea Faulds
http://ajf.me/

10 years ago by Andrea Faulds

Maybe I misunderstood something, but why introduce Unicode escapes if the PHP engine doesn't support Unicode?

We don't have Unicode strings which are made of codepoints rather than bytes, sure. But we do usually treat these strings as UTF-8. The idea of doing this in a language without Unicode strings isn't new: C/C++ have the u8"" syntax for making UTF-8 strings.

Always converting such escapes into UTF-8 doesn't make any sense for people who use other encodings for output, databases, etc.

If you're using other encodings, why do you want to use Unicode codepoints? Most Unicode codepoints will not be supported by another character set.

--
Andrea Faulds
http://ajf.me/

10 years ago by Dmitry Stogov

Maybe I misunderstood something, but why introduce Unicode escapes
if the PHP engine doesn't support Unicode?

We don't have Unicode strings which are made of codepoints rather than
bytes, sure. But we do usually treat these strings as UTF-8. The idea of
doing this in a language without Unicode strings isn't new: C/C++ have the
u8"" syntax for making UTF-8 strings.

u8"string" says that the whole string is UTF-8 encoded.
Your Unicode escape proposal assumes UTF-8 only for the escaped codepoint;
the encoding of the rest of the string is still undefined.

Always converting such escapes into UTF-8 doesn't make any
sense for people who use other encodings for output, databases, etc.

If you're using other encodings, why do you want to use Unicode
codepoints? Most Unicode codepoints will not be supported by another
character set.

Agreed, these Unicode escapes are not going to be used for anything except
UTF-8 encoded strings.
I'm not completely against it. It's just an incomplete solution.

echo "\u{1F602}"; // won't output 😂 if the output encoding is not UTF-8

echo "Привет \u{1F602}"; // won't output anything useful if script encoding is not UTF-8

The second problem is present even for European countries that use the
Windows-1250 codepage.

echo "mañana \u{1F602}"; // won't output anything useful if script encoding is not UTF-8

Thanks. Dmitry.


10 years ago by Andrea Faulds

u8"string" says that the whole string is UTF-8 encoded.
Your Unicode escape proposal assumes UTF-8 only for the escaped codepoint; the encoding of the rest of the string is still undefined.

True. There’s an assumption there that you’re using a UTF-8-compatible source file. Actually, for other encodings, do we even guarantee that “\n” produces an ASCII LF just now? It certainly will on most Windows and Unix systems, but since we’re just using C’s ‘\n’ (http://lxr.php.net/xref/PHP_TRUNK/Zend/zend_language_scanner.l#885), it might produce the newline character of some other encoding like EBCDIC in the right environment.

If you're using other encodings, why do you want to use Unicode codepoints? Most Unicode codepoints will not be supported by another character set.

Agreed, these Unicode escapes are not going to be used for anything except UTF-8 encoded strings.
I'm not completely against it. It's just an incomplete solution.

echo "\u{1F602}"; // won't output 😂 if the output encoding is not UTF-8

echo "Привет \u{1F602}"; // won't output anything useful if script encoding is not UTF-8

The second problem is present even for European countries that use the Windows-1250 codepage.

echo "mañana \u{1F602}"; // won't output anything useful if script encoding is not UTF-8

Thanks. Dmitry.

Yeah, that’s unfortunate. Although I don’t think there’s much we can do about it here. We can’t really convert, as most Unicode characters won’t be available in the codepage you’re using.

Even if we did have Unicode strings like the fabled PHP6 would have had, you still have this problem when you’re outputting in non-Unicode encodings.

Although it’s worth noting that mbstring should handle this, since if you have an internal encoding of UTF-8 and an output encoding of, say, Windows-1250, you can use UTF-8 in your strings and it should convert that for you on output. How well this works in practice, however, I have no idea.

Andrea Faulds
http://ajf.me/

10 years ago by Dmitry Stogov

u8"string" says that the whole string is UTF-8 encoded.
Your Unicode escape proposal assumes UTF-8 only for the escaped codepoint;
the encoding of the rest of the string is still undefined.

True. There’s an assumption there that you’re using a UTF-8-compatible
source file. Actually, for other encodings, do we even guarantee that “\n”
produces an ASCII LF just now? It certainly will on most Windows and Unix
systems, but since we’re just using C’s ‘\n’ (
http://lxr.php.net/xref/PHP_TRUNK/Zend/zend_language_scanner.l#885), it
might produce the newline character of some other encoding like EBCDIC in
the right environment.

If you're using other encodings, why do you want to use Unicode
codepoints? Most Unicode codepoints will not be supported by another
character set.

Agreed, these Unicode escapes are not going to be used for anything except
UTF-8 encoded strings.
I'm not completely against it. It's just an incomplete solution.

echo "\u{1F602}"; // won't output 😂 if the output encoding is not UTF-8

echo "Привет \u{1F602}"; // won't output anything useful if script encoding is not UTF-8

The second problem is present even for European countries that use the
Windows-1250 codepage.

echo "mañana \u{1F602}"; // won't output anything useful if script encoding is not UTF-8

Thanks. Dmitry.

Yeah, that’s unfortunate. Although I don’t think there’s much we can do
about it here. We can’t really convert, as most Unicode characters won’t
be available in the codepage you’re using.

If a character is not available in the codepage it's replaced with "?" or
something, but in your case we will get an unexpected UTF-8 sequence.

Even if we did have Unicode strings like the fabled PHP6 would have had,
you still have this problem when you’re outputting in non-Unicode encodings.

Right, but just for output we already have HTML entities:

echo "&#x1F602;"; // HTML entities already work independently of encodings.

I know, it's not completely the same as "\u{1F602}", but "\u{...}" assumes
UTF-8 is used everywhere, and that's not true.

PHP6 was able to use Unicode escapes with any script encoding, because it
converted all the strings into some internal encoding anyway.
If we convert all strings from their own encoding into the same internal
encoding (e.g. UTF-8 or user-defined), then "\u{...}" will really work.

Thanks. Dmitry.

Although it’s worth noting that mbstring should handle this, since if
you have an internal encoding of UTF-8 and an output encoding of, say,
Windows-1250, you can use UTF-8 in your strings and it should convert that
for you on output. How well this works in practice, however, I have no idea.

Andrea Faulds
http://ajf.me/

10 years ago by Alain Williams

I'm not completely against it. It's just an incomplete solution.

echo "\u{1F602}"; // won't output 😂 if the output encoding is not UTF-8

echo "Привет \u{1F602}"; // won't output anything useful if script encoding is not UTF-8

The second problem is present even for European countries that use the
Windows-1250 codepage.

I think that we need to clarify what we are talking about.

What Andrea has proposed is a way of writing string constants. The characters
in these strings will still be 8 bits wide, which means that there needs to be
some way of encoding characters with code points that will not fit in 8 bits.
The only way of avoiding that would be to use, internally, 32-bit characters,
which would be a huge change.

So: we need to have some form of encoding.

As I said at the start, this is ''a way of writing string constants'', i.e. a compile-time action.

With the code below it is likely that at run-time mb_internal_encoding() has
been called before the echo is executed or the 'Content-Type:' header specifies
some encoding.

echo "mañana \u{1F602}"; // won't output anything useful if script encoding is not UTF-8

This is not something that the compiler can guess.

It is even worse if my proposal of \U{arabic letter alef} types is added, how is
that encoded ? UTF-8 or iso-8859-6 or .... ?

So, how do we fix the problem?

1. mb_internal_encoding($new_encoding) finds every string (variable and constant)
   and converts from the previous encoding to the $new_encoding.

   Possible, but horribly slow and would probably break things (e.g. strings that
   contain binary data).

   Not a good idea.

2. Decide that UTF-8 is king.
   That is what I have decided, but I do not have any legacy code to worry about
   -- being a Brit I don't have to worry much.

3. Rely on the programmer to understand encoding and know what the eventual
   output encoding will be and, if it is not UTF-8, write characters using \Xxx or
   use mb_convert_encoding($string, $output_encoding, 'utf-8').
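A minimal sketch of that last option, assuming PHP 7.0 or later with the mbstring extension loaded. ISO-8859-1 is chosen here only as an example target because it contains ñ; the variable names are illustrative:

```php
<?php
// Write the literal with a code point escape; the bytes in memory are UTF-8.
$utf8 = "ma\u{F1}ana";

// Convert once, at the output boundary, to the encoding the client expects.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo bin2hex($utf8), "\n";   // 6d61c3b1616e61  (ñ is two bytes: c3 b1)
echo bin2hex($latin1), "\n"; // 6d61f1616e61    (ñ is one byte: f1)
```

Code points that have no counterpart in the target encoding cannot survive this step; mbstring replaces them with a substitute character, which can be configured with mb_substitute_character().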

If we decide to support non-utf-8 encoding at compile time then we could extend
the syntax a bit to allow the encoding to be specified, eg:

\U{utf-8: arabic letter alef}

\U{iso-8859-6: arabic letter alef}

Ie, allow this to be optionally specified and terminated by ':'. If not
specified then assume utf-8.

--
Alain Williams
Linux/GNU Consultant - Mail systems, Web sites, Networking, Programmer, IT Lecturer.
+44 (0) 787 668 0256 http://www.phcomp.co.uk/
Parliament Hill Computers Ltd. Registration Information: http://www.phcomp.co.uk/contact.php
#include <std_disclaimer.h>

10 years ago by Andrea Faulds

I think that we need to clarify what we are talking about.

What Andrea has proposed is a way of writing string constants. The characters
in these strings will still be 8 bits wide, which means that there needs to be
some way of encoding characters with code points that will not fit in 8 bits.
The only way of avoiding that would be to use, internally, 32-bit characters,
which would be a huge change.

So: we need to have some form of encoding.

As I said at the start, this is ''a way of writing string constants'', i.e. a compile-time action.

With the code below it is likely that at run-time mb_internal_encoding() has
been called before the echo is executed or the 'Content-Type:' header specifies
some encoding.

echo "mañana \u{1F602}"; // won't output anything useful if script encoding is not UTF-8

This is not something that the compiler can guess.

Well, we do already have a compile-time system for declaring encoding, the declare() construct.

It is even worse if my proposal of \U{arabic letter alef} types is added, how is
that encoded ? UTF-8 or iso-8859-6 or .... ?

So, how do we fix the problem ?

mb_internal_encoding($new_encoding) finds every string (variable and constant)
and converts from the previous encoding to the $new_encoding.

Possible, but horribly slow and would prob break things (eg strings that
contain binary data).

Not a good idea.

I also agree this isn’t a good idea.

Decide that UTF-8 is king.
That is what I have decided - but I do not have any legacy code to worry about
-- being a Brit I don't have to worry much.

Rely on the programmer to understand encoding and know what the eventual
output encoding will be and if it is not UTF-8 write characters using \Xxx or
use mb_convert_encoding($string, $output_encoding, 'utf-8').

If we decide to support non-utf-8 encoding at compile time then we could extend
the syntax a bit to allow the encoding to be specified, eg:

\U{utf-8: arabic letter alef}

\U{iso-8859-6: arabic letter alef}

Ie, allow this to be optionally specified and terminated by ':'. If not
specified then assume utf-8.

There are only two sane options:

1. Always UTF-8
2. Whatever source file encoding we’ve specified with declare()

Of those, I’d prefer UTF-8, as nobody’s using UTF-16 or UTF-32.

--
Andrea Faulds
http://ajf.me/

10 years ago by Alain Williams

Well, we do already have a compile-time system for declaring encoding, the declare() construct.

I missed that. Reading the documentation, I confess that I do not really
understand what declare(encoding=xxx) actually does.

http://uk1.php.net/manual/en/control-structures.declare.php#control-structures.declare.encoding

I can see how it would change a \U{arabic letter alef} -- but not what it does today.

--
Alain Williams
Linux/GNU Consultant - Mail systems, Web sites, Networking, Programmer, IT Lecturer.
+44 (0) 787 668 0256 http://www.phcomp.co.uk/
Parliament Hill Computers Ltd. Registration Information: http://www.phcomp.co.uk/contact.php
#include <std_disclaimer.h>

10 years ago by Sara Golemon

If we decide to support non-utf-8 encoding at compile time then we could extend
the syntax a bit to allow the encoding to be specified, eg:
\U{utf-8: arabic letter alef}

\U{iso-8859-6: arabic letter alef}

God, that's such a PHP thing to do. And not in a good way...

I get the argument against this RFC on the grounds that it's a
half-measure, but on the other hand, we all know what happened to our
full-measure solution. And that bit of history is why I actually like
this as a small, simple, and practical step forward. That's a very
PHP thing to do, in a good way.

-Sara

10 years ago by Derick Rethans

Maybe I misunderstood something, but why introduce Unicode escapes
if the PHP engine doesn't support Unicode?

We don't have Unicode strings which are made of codepoints rather than
bytes, sure. But we do usually treat these strings as UTF-8. The idea of
doing this in a language without Unicode strings isn't new: C/C++ have the
u8"" syntax for making UTF-8 strings.

u8"string" says that the whole string is UTF-8 encoded.
Your Unicode escape proposal assumes UTF-8 only for the escaped codepoint;
the encoding of the rest of the string is still undefined.

Always converting such escapes into UTF-8 doesn't make any
sense for people who use other encodings for output, databases, etc.

If you're using other encodings, why do you want to use Unicode
codepoints? Most Unicode codepoints will not be supported by another
character set.

Agreed, these Unicode escapes are not going to be used for anything
except UTF-8 encoded strings. I'm not completely against it. It's just
an incomplete solution.

I think "incomplete" nails it on the head. Without "proper" Unicode
support in the parser, compiler and string function semantics, having
these escape codes doesn't really do a lot for us. I now think it would
fit better just as part of the "UString" idea - although I still need to
write my reservations and recommendations down about that too.

cheers,
Derick

10 years ago by Andrea Faulds — view source

I think "incomplete" nails it on the head. Without "proper" Unicode
support in the parser, compiler and string function semantics, having
these escape codes doesn't really do a lot for us.

How so? Why are they less useful because we don't have "true" Unicode strings? This would work perfectly well and provide an actual benefit for people manipulating Unicode text today. It'd work great with Unicode strings in future if they were ever to happen, too, as we could support u"{202e}".

--
Andrea Faulds
http://ajf.me/

10 years ago by Stanislav Malyshev — view source

Hi!

I'm not completely against it. It's just an incomplete solution.

echo "\u{1F602}"; // won't output 😂 if the output encoding is not UTF-8

You can always use iconv/recode to bring it to every encoding you need
(provided it supports the full Unicode range). I see this as a readability
feature - you can look up 1F602 but it's much harder to understand
what's going on if you have \xF0\x9F\x98\x82 instead. Of course, using
this in non-UTF-8 strings is useless, but my question would be - why
would your code have non-UTF-8 text literals? I mean, even if you output
in another format - why not use the de-facto standard internally? Of course,
there might be legacy reasons - but then one won't use \u.

As an alternative, we may have \u{} which produces utf-8 and another one
which produces current script encoding (and errors out if this code
point is not part of it).

Stas Malyshev
smalyshev@gmail.com
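
The readability point above can be checked directly. This is only a minimal sketch, assuming a PHP build that implements the proposed \u{} escape (PHP 7.0 or later) and has the iconv extension loaded; the variable names are illustrative and not taken from the RFC or its patch.

```php
<?php
// Assumes PHP 7.0+ (where \u{...} is available) and the iconv extension.

// The escape and the hand-written UTF-8 bytes are the same 4-byte string.
$escaped = "\u{1F602}";
$bytes   = "\xF0\x9F\x98\x82";
var_dump($escaped === $bytes);   // bool(true)
var_dump(bin2hex($escaped));     // string(8) "f09f9882"

// Converting a literal to another encoding after the fact.
// U+00C4 exists in ISO-8859-1, so this yields the single byte 0xC4.
$latin1 = iconv('UTF-8', 'ISO-8859-1', "\u{C4}");
var_dump(bin2hex($latin1));      // string(2) "c4"
```

The conversion step only works for codepoints the target character set can represent, which is the limitation raised earlier in the thread.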

10 years ago by Markus Fischer — view source

Good evening,

Here’s a new RFC: https://wiki.php.net/rfc/unicode_escape

I think the choice of \u{xx} is interesting, i.e. using '{' and '}'.

Afaik, one of the current best practices is to use json_decode(), like so:

$ cat test.php
<?php var_dump( json_decode( '"\u00c4"' ) );

$ php test.php
string(2) "Ä"

So your choice definitely will not influence this practice which is
good, IMHO. OTOH I'm not sure if not using '{' and '}' would actually
cause BC problems, but it's of no concern with your RFC.

thanks,

Markus

10 years ago by Andrea Faulds — view source

Good evening,

Here’s a new RFC: https://wiki.php.net/rfc/unicode_escape

I think the choice of \u{xx} is interesting, i.e. using '{' and '}'.

Afaik, one of the current best practices is to use json_decode(), like so:

$ cat test.php
<?php var_dump( json_decode( '"\u00c4"' ) );

$ php test.php
string(2) "Ä"

json_decode is an interesting workaround, but JSON inherits the poor Unicode handling of JavaScript, where it got its syntax. JSON's \uXXXX escapes can only express BMP codepoints directly and, worse, non-BMP characters must be encoded using UTF-16 surrogate pairs.

--
Andrea Faulds
http://ajf.me/
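
To illustrate the surrogate-pair point, here is a small sketch comparing the json_decode() workaround with the proposed escape. It assumes PHP 7.0 or later (where \u{} is available) and the json extension; the variable names are made up for the example.

```php
<?php
// Assumes PHP 7.0+ (where \u{...} is available) and the json extension.

// BMP character: one JSON escape is enough.
// Single quotes keep the backslash-u sequence literal for json_decode().
$bmp = json_decode('"\u00c4"');
var_dump($bmp === "\u{c4}");          // bool(true)

// Non-BMP character: JSON needs the UTF-16 surrogate pair D83D DE02 ...
$viaJson = json_decode('"\ud83d\ude02"');
// ... while the proposed syntax takes the codepoint U+1F602 directly.
$viaEscape = "\u{1F602}";
var_dump($viaJson === $viaEscape);    // bool(true)
var_dump(strlen($viaEscape));         // int(4)
```

Both routes end up with the same four UTF-8 bytes; the difference is that the surrogate pair has to be computed by hand, whereas 1F602 can be looked up as-is.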

One issue with this, however, is that we’d have to include a Unicode info database from somewhere with the names of the characters. That’d probably mean requiring ICU or something like it, which the current patch doesn’t do.

Although it’s worth noting that mbstring should handle this, since if you have an internal encoding of UTF-8 and an output encoding of, say, Windows-1250, you can use UTF-8 in your strings and it should convert that for you on output. How well this works in practice, however, I have no idea.
