-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
The Unicode support design document in README.UNICODE discusses three types of
strings, IS_UNICODE, IS_STRING, and IS_BINARY, and specifies two new casts,
(unicode) and (binary). The spec allows Unicode and string types to be
implicitly concatenated and explicitly cast to one another, while the binary
type is a black hole that requires a conversion function call to get out of.
According to the notes from November I see this has been reduced to just Unicode
and binary types:
http://www.php.net/~derick/meeting-notes.html#different-string-types
I've been prodding some strings from user code to see how they react, and I'm
wondering if they're working as intended or if it's just some side effects of
this merge that haven't been finished yet...
Both the implicit coercions and the explicit casts seem to have vanished, and
behavior is worryingly inconsistent:
With unicode_semantics off:
- (unicode) cast fails on binary strings
- (string) converts things, including Unicode strings, to binary strings
- Binary and Unicode strings can't be concatenated.
- There's no available cast from string literals and variables to Unicode strings.
With unicode_semantics on:
- (unicode) fails on binary strings
- (string) behaves as (unicode), converting things to unicode strings
- Binary and Unicode strings can't be concatenated.
- There is no available cast from Unicode string variables to binary strings.
(For literals you can use b"blah".)
This looks like a pretty painful place to be as far as writing portable
Unicode-friendly code, because there is no way to write Unicode literals that
will reliably work. Even if your in-code literals are all ASCII, you can't mix
them with runtime Unicode strings because it throws a fatal error with
unicode_semantics off.
This is particularly bad if unicode_semantics can't be changed on a per-request
basis; this virtually guarantees that many hosting providers will turn it off
"for compatibility" or "for speed", and individual users won't be able to do a
darn thing about it.
Wrapping every string literal in a conditional call to unicode_decode() sounds
less than ideal; if (unicode) casts worked they would still be pretty ugly too.
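To make the "conditional call" idea concrete, here is a minimal sketch. The helper name u() and the UTF-8 default are hypothetical, and the two-argument unicode_decode($string, $encoding) signature is an assumption based on the current development branch. On a build without that function the literal simply passes through unchanged.

```php
<?php
// Hypothetical helper: the name u() and the UTF-8 default are illustrative only.
function u($literal, $encoding = 'UTF-8')
{
    // unicode_decode() is assumed to exist only on Unicode-enabled builds;
    // everywhere else the literal is returned untouched.
    if (function_exists('unicode_decode')) {
        return unicode_decode($literal, $encoding);
    }
    return $literal;
}

// Every literal in the program would need this wrapper.
$greeting = u("Hello, ") . u("world");
echo $greeting, "\n";
```

A real helper would presumably also have to skip the decode when the literal is already a Unicode string (unicode_semantics on), which is part of why this approach looks so clumsy.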
I would love a pragma setting like the declare(encoding="UTF-8") to say "I'm
going to use Unicode string literals in this file, whatever unicode_semantics
may be." Would there be any interest in supporting a mode like this?
A Python-style modifier like u"blah" could go along with the b"blah" binary
string literal as well, though I'd rather not have to put a sigil on every string...
- -- brion vibber (brion @ pobox.com)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iD8DBQFD8cU+wRnhpk1wk44RAnwKAJ99lNB5C44jvKhqbPzlBnLiUwKLBwCfYYQh
7VGvgqkgRrL+Le6bPxbsD54=
=JRAP
-----END PGP SIGNATURE-----
Hello Brion,
Thank you for your feedback.
First of all, README.UNICODE is a bit out of date, as you probably
noticed. I need to update it once we finalize this conversion/casting
discussion.
Your point about writing portable Unicode-friendly code is well taken.
Rasmus and I have chatted a bit here, and we think we can propose some
changes that may make it easier.
With unicode_semantics=off:
- (unicode) cast converts binary strings to Unicode strings using
runtime_encoding setting
- (string) converts Unicode strings to binary strings using
runtime_encoding again
- Binary and Unicode strings cannot be concatenated. You have to cast
all operands to the same type.
With unicode_semantics=on:
- (unicode) cast converts binary strings to Unicode strings. The issue
here is whether to use script_encoding (in case you do
(unicode)b"blah") or runtime_encoding (in case it's a binary string
that came from elsewhere)
- (string) converts Unicode strings to binary strings using
runtime_encoding setting
- Binary and Unicode strings cannot be concatenated. You have to cast
all operands to the same type.
I think this will make it easier to write code, because you can always
depend on the behavior of the cast operators. The (unicode) and
(string) casts are basically shortcuts for unicode_decode() and
unicode_encode(), respectively, used with the runtime_encoding setting
(excepting the issue I mentioned above).
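As a rough illustration of the cast behavior described above, here is a sketch that runs on a PHP build without any Unicode string type. Everything in it is a hypothetical stand-in: the UnicodeString wrapper class models a Unicode string, to_unicode() and to_binary() model the (unicode) and (string) casts, concat() models the "." operator, and the $runtimeEncoding parameter models the runtime_encoding setting. It also assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical stand-in for a Unicode string; stores the text as UTF-8 internally.
final class UnicodeString
{
    public $utf8;

    public function __construct($utf8)
    {
        $this->utf8 = $utf8;
    }
}

// Models (unicode): decode a binary string using the runtime encoding.
function to_unicode($binary, $runtimeEncoding)
{
    return new UnicodeString(mb_convert_encoding($binary, 'UTF-8', $runtimeEncoding));
}

// Models (string): encode a Unicode string back to bytes in the runtime encoding.
function to_binary(UnicodeString $unicode, $runtimeEncoding)
{
    return mb_convert_encoding($unicode->utf8, $runtimeEncoding, 'UTF-8');
}

// Models the "." operator: mixing the two types is refused, so cast first.
function concat($a, $b)
{
    $aIsUnicode = $a instanceof UnicodeString;
    $bIsUnicode = $b instanceof UnicodeString;
    if ($aIsUnicode !== $bIsUnicode) {
        throw new InvalidArgumentException('cannot concatenate binary and Unicode strings');
    }
    return $aIsUnicode ? new UnicodeString($a->utf8 . $b->utf8) : $a . $b;
}

$latin1 = "caf\xE9";                                   // "café" as ISO-8859-1 bytes
$text   = to_unicode($latin1, 'ISO-8859-1');           // explicit cast to Unicode
$joined = concat($text, to_unicode('!', 'ISO-8859-1'));
echo bin2hex(to_binary($joined, 'UTF-8')), "\n";       // 636166c3a921
```

The point of the sketch is only that both casts depend on a single encoding setting and that concatenation needs both operands cast to the same type first.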
The unicode_semantics switch will not be per-request, due to a variety
of reasons we have covered before.
Your suggestion about treating all string literals as Unicode if an
encoding pragma is used is an interesting one and merits more
discussion I think. Do you think it should affect only literals or also
identifiers?
-Andrei
Your point about writing portable Unicode-friendly code is well taken.
Rasmus and I have chatted a bit here, and we think we can propose some
changes that may make it easier.
Sorry, I can hardly find the thread. Can you give me some hint on the
subject so I can search for it?
With unicode_semantics=off:
- (unicode) cast converts binary strings to Unicode strings using
runtime_encoding setting
- (string) converts Unicode strings to binary strings using
runtime_encoding again
- Binary and Unicode strings cannot be concatenated. You have to cast
all operands to the same type.
With unicode_semantics=on:
- (unicode) cast converts binary strings to Unicode strings. The issue
here is whether to use script_encoding (in case you do (unicode)b"blah")
I don't think it's good to write such code, nor to speed it up by
converting it at compile time.
or runtime_encoding (in case it's a binary string
that came from elsewhere)
- (string) converts Unicode strings to binary strings using
runtime_encoding setting
- Binary and Unicode strings cannot be concatenated. You have to cast
all operands to the same type.
Looks good, but not allowing $binary . $unicode causes a problem
with old code such as this in index.php:
require_once($_SERVER["MY_PROJECT_DIR"] . "/lib.php");
where $_SERVER["MY_PROJECT_DIR"] is imported from the httpd (such as
Apache) via mod_setenv.
One has to modify it to:
require_once($_SERVER["MY_PROJECT_DIR"] . b"/lib.php");
and such code cannot even be parsed under PHP < 6.
Or use "declare" even for a single string:
declare (encoding="binary") {
require_once($_SERVER["MY_PROJECT_DIR"] . "/lib.php");
}
I would love a pragma setting like the declare(encoding="UTF-8") to
say "I'm
going to use Unicode string literals in this file, whatever
unicode_semantics
may be." Would there be any interest in supporting a mode like this?
It should be possible to declare binary too...
unicode_semantics matters much less (and does less harm) if most of the
script files have a declare at the top of the code :)
I wonder what will happen with the following code if there's no
implicit (auto) cast:
function test($a, $b = "me") { return "$a is a friend of $b"; }
$x = u"x";
$y = b"y";
test($x);
test($y);
I see no reason not to allow $binary . $unicode, except for
performance (maybe there was one in the discussion thread). It would be
better to use E_STRICT, a profiler, etc. to tell you that an implicit
cast is occurring, since performance is the only concern.
Andrei Zmievski wrote:
Your point about writing portable Unicode-friendly code is well taken.
Rasmus and I have chatted a bit here, and we think we can propose some
changes that may make it easier.
With unicode_semantics=off:
- (unicode) cast converts binary strings to Unicode strings using
runtime_encoding setting
- (string) converts Unicode strings to binary strings using
runtime_encoding again
Will a program always be able to change the runtime_encoding setting?
Some hosts like to lock off everything and disable ini_set etc. If the host has
hardlocked it at something terrible, can my portable program still declare that
it needs to work with UTF-8?
Which brings to mind; if the input in $_REQUEST etc has been misconverted by a
bad setting, how do I get at the unconverted data to fix it? The (outdated ;)
README says this will be possible but I didn't see any reference to how.
- Binary and Unicode strings cannot be concatenated. You have to cast
all operands to the same type.
I do find the FATAL ERRORS on using the 'wrong' string type a bit odd though;
most other types in PHP will coerce silently (string . int), and the wildly
incompatible ones usually cause mere NOTICE or WARNING-level messages.
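For comparison, this is the ordinary coercion I mean. It is a trivial sketch and the variable name is arbitrary:

```php
<?php
// Ordinary PHP coercion: the integer is converted to a string for ".".
$label = "items: " . 3;
var_dump($label);          // string(8) "items: 3"
```

No cast, no notice, no error; the operand is simply converted.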
Was this change from PHP's regular behavior a conscious decision to make people
think harder about what kind of strings they're using? From the original design
document I got the impression that it was meant to be specific to special
binary-only strings, which would be used relatively rarely (eg for binary file
I/O) while more typical strings would transparently "just work" most of the
time. Now the binary strings have replaced the native strings and the whole
behavior has changed.
(A comparison with other languages; Python is normally very strict about typing
and won't even let you concatenate a string with an integer without an explicit
conversion. But it will let you concatenate a byte string with a Unicode string,
with an automatic coercion to Unicode.)
With unicode_semantics=on:
- (unicode) cast converts binary strings to Unicode strings. The issue
here is whether to use script_encoding (in case you do (unicode)b"blah")
or runtime_encoding (in case it's a binary string that came from elsewhere)
Another thing you might consider is allowing only ASCII character literals in a
b"blah" binary string literal. Escape codes are available...
I think this will make it easier to write code, because you can always
depend on the behavior of the cast operators. The (unicode) and (string)
casts are basically shortcuts for unicode_decode() and unicode_encode(),
respectively, used with the runtime_encoding setting (excepting the issue
I mentioned above).
Reliable casts would indeed be great. :)
The unicode_semantics switch will not be per-request, due to a variety
of reasons we have covered before.
Your suggestion about treating all string literals as Unicode if an
encoding pragma is used is an interesting one and merits more discussion
I think. Do you think it should affect only literals or also identifiers?
Personally I have no use for non-ASCII identifiers.
Anything that needs to get used for referring to identifiers, though, needs to
be able to operate consistently in some fashion...
- array_map("some_function_name", $data);
- $GLOBALS["myConfigVar"] = $newval;
etc
These probably need to either 'just work' when passed the other kind of string,
or have some kind of consistent cast available.
(Life would be a lot simpler if there weren't two different modes, of course. :)
-- brion vibber (brion @ pobox.com)