Newsgroups: php.internals
From: andrei@gravitonic.com (Andrei Zmievski)
To: Brion Vibber
Cc: internals@lists.php.net
Date: Fri, 17 Feb 2006 13:02:50 -0800
Subject: Re: [PHP-DEV] Unicode string literals and casting
In-Reply-To: <43F1C53E.1080607@pobox.com>
References: <43F1C53E.1080607@pobox.com>

Hello Brion,

Thank you for your feedback.

First of all, README.UNICODE is a bit out of date, as you probably
noticed. I need to update it once we finalize this conversion/casting
discussion.

Your point about writing portable Unicode-friendly code is well taken.
Rasmus and I have chatted a bit here, and we think we can propose some
changes that may make it easier.

With unicode_semantics=off:

* (unicode) cast converts binary strings to Unicode strings using the
  runtime_encoding setting
* (string) cast converts Unicode strings to binary strings, again
  using runtime_encoding
* Binary and Unicode strings cannot be concatenated. You have to cast
  all operands to the same type.

With unicode_semantics=on:

* (unicode) cast converts binary strings to Unicode strings. The open
  issue here is whether to use script_encoding (in case you do
  (unicode)b"blah") or runtime_encoding (in case it's a binary string
  that came from elsewhere)
* (string) cast converts Unicode strings to binary strings using the
  runtime_encoding setting
* Binary and Unicode strings cannot be concatenated. You have to cast
  all operands to the same type.

I think this will make it easier to write code, because you can always
depend on the behavior of the cast operators. The (unicode) and
(string) casts are basically shortcuts for unicode_decode() and
unicode_encode(), respectively, used with the runtime_encoding setting
(except for the issue I mentioned above).

The unicode_semantics switch will not be per-request, for a variety of
reasons we have covered before.

Your suggestion about treating all string literals as Unicode if an
encoding pragma is used is an interesting one and merits more
discussion, I think. Do you think it should affect only literals or
also identifiers?

-Andrei

> Both the implicit coercions and the explicit casts seem to have
> vanished, and behavior is worryingly inconsistent:
>
> With unicode_semantics off:
> * (unicode) cast fails on binary strings
> * (string) converts things, including Unicode strings, to binary
>   strings
> * Binary and Unicode strings can't be concatenated.
> * There's no available cast from string literals and variables to
>   Unicode strings.
>
> With unicode_semantics on:
> * (unicode) fails on binary strings
> * (string) behaves as (unicode), converting things to unicode strings
> * Binary and Unicode strings can't be concatenated.
> * There is no available cast from Unicode string variables to binary
>   strings. (For literals you can use b"blah".)
>
> This looks like a pretty painful place to be as far as writing
> portable Unicode-friendly code, because there is no way to write
> Unicode literals that will reliably work. Even if your in-code
> literals are all ASCII, you can't mix them with runtime Unicode
> strings because it throws a fatal error with unicode_semantics off.
>
> This is particularly bad if unicode_semantics can't be changed on a
> per-request basis; this virtually guarantees that many hosting
> providers will turn it off "for compatibility" or "for speed", and
> individual users won't be able to do a darn thing about it.
>
> Wrapping every string literal in a conditional call to
> unicode_decode() sounds less than ideal; if (unicode) casts worked
> they would still be pretty ugly too.
>
> I would *love* a pragma setting like the declare(encoding="UTF-8") to
> say "I'm going to use Unicode string literals in this file, whatever
> unicode_semantics may be." Would there be any interest in supporting
> a mode like this?
>
> A Python-style modifier like u"blah" could go along with the b"blah"
> binary string literal as well, though I'd rather not have to put a
> sigil on every string...
>
> -- brion vibber (brion @ pobox.com)
>
> --
> PHP Internals - PHP Runtime Development Mailing List
> To unsubscribe, visit: http://www.php.net/unsub.php
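
For readers who want to experiment with the cast rules proposed at the top of this message, here is a rough model in Python, whose bytes/str split resembles the binary/Unicode string split discussed in this thread. It is only an analogy, not PHP code and not the engine's implementation. The names RUNTIME_ENCODING, unicode_cast, string_cast and concat are hypothetical stand-ins for the runtime_encoding setting, the (unicode) cast, the (string) cast and the concatenation operator, and UTF-8 is an assumed value for the setting.

```python
# Illustrative model only: Python bytes/str stand in for PHP binary and
# Unicode strings. All names below are hypothetical, not PHP APIs.

RUNTIME_ENCODING = "utf-8"  # assumed value of the runtime_encoding setting


def unicode_cast(value, encoding=RUNTIME_ENCODING):
    """Model of the (unicode) cast: binary string -> Unicode string."""
    if isinstance(value, str):
        return value  # already Unicode, nothing to convert
    return value.decode(encoding)


def string_cast(value, encoding=RUNTIME_ENCODING):
    """Model of the (string) cast: Unicode string -> binary string."""
    if isinstance(value, bytes):
        return value  # already binary, nothing to convert
    return value.encode(encoding)


def concat(left, right):
    """Model of concatenation: operands must share one string type."""
    if type(left) is not type(right):
        raise TypeError("cannot concatenate binary and Unicode strings; "
                        "cast all operands to the same type first")
    return left + right


if __name__ == "__main__":
    binary = b"caf\xc3\xa9"          # UTF-8 bytes from "elsewhere"
    text = unicode_cast(binary)      # explicit cast to Unicode
    print(concat(text, " au lait"))  # both operands are Unicode now
    print(string_cast(text) == binary)
```

The model deliberately ignores the open script_encoding versus runtime_encoding question for (unicode)b"blah"; it always uses the one assumed encoding.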