From: brion@pobox.com (Brion Vibber)
To: Andrei Zmievski
Cc: internals@lists.php.net
Date: Mon, 20 Feb 2006 11:00:08 -0800
Subject: Re: [PHP-DEV] Unicode string literals and casting

Andrei Zmievski wrote:
> Your point about writing portable Unicode-friendly code is well taken.
> Rasmus and I have chatted a bit here, and we think we can propose some
> changes that may make it easier.
>
> With unicode_semantics=off:
> * (unicode) cast converts binary strings to Unicode strings using
>   runtime_encoding setting
> * (string) converts Unicode strings to binary strings using
>   runtime_encoding again

Will a program always be able to change the runtime_encoding setting?

Some hosts like to lock off everything and disable ini_set etc. If the host has
hardlocked it at something terrible, can my portable program still declare
that it needs to work with UTF-8?

Which brings to mind; if the input in $_REQUEST etc has been misconverted by a
bad setting, how do I get at the unconverted data to fix it? The (outdated ;)
README says this will be possible but I didn't see any reference to how.

> * Binary and Unicode strings cannot be concatenated. You have to cast
>   all operands to the same type.

I do find the FATAL ERRORS on using the 'wrong' string type a bit odd though;
most other types in PHP will coerce silently (string . int), and the wildly
incompatible ones usually cause mere NOTICE or WARNING-level messages.

Was this change from PHP's regular behavior a conscious decision to make people
think harder about what kind of strings they're using? From the original design
document I got the impression that it was meant to be specific to special
binary-only strings, which would be used relatively rarely (eg for binary file
I/O) while more typical strings would transparently "just work" most of the
time. Now the binary strings have replaced the native strings and the whole
behavior has changed.

(A comparison with other languages; Python is normally very strict about typing
and won't even let you concatenate a string with an integer without an explicit
conversion. But it will let you concatenate a byte string with a Unicode string,
with an automatic coercion to Unicode.)
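
For comparison, here is a minimal Python sketch of the "cast all operands to the same type" rule quoted above. It assumes Python 3, where mixing a byte string with a text string raises TypeError instead of coercing automatically, so it illustrates the strict model rather than the coercing one. The helper names concat_as_text and mixing_is_rejected are hypothetical, and UTF-8 is an assumed stand-in for a runtime encoding.

```python
def concat_as_text(left, right, encoding="utf-8"):
    """Concatenate two values, decoding any bytes operand to str first.

    The encoding parameter is an assumed stand-in for a runtime encoding.
    """
    def to_text(value):
        # Explicit conversion: only bytes are decoded, str passes through.
        return value.decode(encoding) if isinstance(value, bytes) else value

    return to_text(left) + to_text(right)


def mixing_is_rejected():
    """Return True if bytes + str is refused without an explicit cast."""
    binary = b"caf\xc3\xa9"   # UTF-8 bytes for "café"
    text = " au lait"
    try:
        binary + text         # no implicit coercion in Python 3
    except TypeError:
        return True
    return False


if __name__ == "__main__":
    print(mixing_is_rejected())
    print(concat_as_text(b"caf\xc3\xa9", " au lait"))
```

The explicit decode step plays the role of the (unicode) cast: the caller picks the encoding once, and every operand is brought to the same type before concatenation.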
> With unicode_semantics=on:
> * (unicode) cast converts binary strings to Unicode strings. The issue
>   here is whether to use script_encoding (in case you do (unicode)b"blah")
>   or runtime_encoding (in case it's a binary string that came from
>   elsewhere)

Another thing you might consider is allowing only ASCII character literals in a
b"blah" binary string literal. Escape codes are available...

> I think this will make it easier to write code, because you can always
> depend on the behavior of the cast operators. The (unicode) and (string)
> casts are basically shortcuts for unicode_encode() and unicode_decode()
> used with runtime_encoding setting (excepting the issue I mentioned above).

Reliable casts would indeed be great. :)

> The unicode_semantics switch will not be per-request, due to a variety
> of reasons we have covered before.
>
> Your suggestion about treating all string literals as Unicode if an
> encoding pragma is used is an interesting one and merits more discussion
> I think. Do you think it should affect only literals or also identifiers?

Personally I have no use for non-ASCII identifiers.

Anything that needs to get used for referring to identifiers, though, needs to
be able to operate consistently in some fashion...

* array_map("some_function_name", $data);
* $GLOBALS["myConfigVar"] = $newval;
etc

These probably need to either 'just work' when passed the other kind of string,
or have some kind of consistent cast available.

(Life would be a lot simpler if there weren't two different modes, of course.
:)

-- brion vibber (brion @ pobox.com)