Newsgroups: php.internals
Message-ID: <43F1C53E.1080607@pobox.com>
Date: Tue, 14 Feb 2006 03:55:42 -0800
To: internals@lists.php.net
Subject: Unicode string literals and casting
From: brion@pobox.com (Brion Vibber)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

The Unicode support design document in README.UNICODE discusses three types of strings, IS_UNICODE, IS_STRING, and IS_BINARY, and specifies two new casts, (unicode) and (binary).
The spec allows Unicode and string types to be implicitly concatenated and explicitly cast to one another, while the binary type is a black hole that requires a conversion function call to get out of.

According to the notes from November I see this has been reduced to just Unicode and binary types:
http://www.php.net/~derick/meeting-notes.html#different-string-types

I've been prodding some strings from user code to see how they react, and I'm wondering if they're working as intended or if it's just some side effects of this merge that haven't been finished yet...

Both the implicit coercions and the explicit casts seem to have vanished, and behavior is worryingly inconsistent:

With unicode_semantics off:
* (unicode) cast fails on binary strings
* (string) converts things, including Unicode strings, to binary strings
* Binary and Unicode strings can't be concatenated.
* There's no available cast from string literals and variables to Unicode strings.

With unicode_semantics on:
* (unicode) fails on binary strings
* (string) behaves as (unicode), converting things to unicode strings
* Binary and Unicode strings can't be concatenated.
* There is no available cast from Unicode string variables to binary strings. (For literals you can use b"blah".)

This looks like a pretty painful place to be as far as writing portable Unicode-friendly code, because there is no way to write Unicode literals that will reliably work. Even if your in-code literals are all ASCII, you can't mix them with runtime Unicode strings because it throws a fatal error with unicode_semantics off.

This is particularly bad if unicode_semantics can't be changed on a per-request basis; this virtually guarantees that many hosting providers will turn it off "for compatibility" or "for speed", and individual users won't be able to do a darn thing about it.
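
For illustration only, here is a rough analogy in Python rather than PHP. Python 3 also keeps its byte string (bytes) and text string (str) types strictly apart, so it shows the same shape of problem: a literal that ends up as the "wrong" type cannot be concatenated with a runtime Unicode string until it is converted explicitly. The helper name to_text and the sample values are hypothetical, and nothing here is PHP API or a claim about how the PHP branch behaves internally.

```python
# Python analogy only: bytes stands in for PHP's binary string type,
# str for the Unicode string type. to_text is a hypothetical helper.

def to_text(value, encoding="utf-8"):
    """Decode bytes to str; pass values that are already str through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

runtime_text = "caf\u00e9"   # a Unicode string produced at run time
literal = b"Order: "         # an ASCII-only literal that ended up binary

# Mixing the two types directly is rejected, even though the literal is ASCII.
try:
    literal + runtime_text
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Wrapping the literal makes the concatenation work whichever type it had.
message = to_text(literal) + runtime_text
same_message = to_text("Order: ") + runtime_text
```

The wrapper gives the same result whether the literal started out as binary or as Unicode, which is the portability property that is missing above. The price is a function call around every literal.
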
Wrapping every string literal in a conditional call to unicode_decode() sounds less than ideal; if (unicode) casts worked they would still be pretty ugly too.

I would *love* a pragma setting like the declare(encoding="UTF-8") to say "I'm going to use Unicode string literals in this file, whatever unicode_semantics may be." Would there be any interest in supporting a mode like this?

A Python-style modifier like u"blah" could go along with the b"blah" binary string literal as well, though I'd rather not have to put a sigil on every string...

- -- brion vibber (brion @ pobox.com)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFD8cU+wRnhpk1wk44RAnwKAJ99lNB5C44jvKhqbPzlBnLiUwKLBwCfYYQh
7VGvgqkgRrL+Le6bPxbsD54=
=JRAP
-----END PGP SIGNATURE-----