Newsgroups: php.internals
Message-ID: <46958E6B.1000707@lerdorf.com>
Date: Wed, 11 Jul 2007 19:14:03 -0700
To: ceo@l-i-e.com
CC: Larry Garfield, internals@lists.php.net
In-Reply-To: <2237.24.1.37.132.1184204516.squirrel@www.l-i-e.com>
Subject: Re: [PHP-DEV] What is the use of "unicode.semantics" in PHP 6?
From: rasmus@lerdorf.com (Rasmus Lerdorf)

Richard Lynch wrote:
> On Tue, July 10, 2007 7:06 pm, Larry Garfield wrote:
>> If 90% of the strings in use would work fine if treated as unicode,
>> then it would make sense to just always assume Unicode unless
>> explicitly specified otherwise.
>
> If that 10% includes enough users who have written millions of lines of
> code in a self-consistent manner that voids ALL their work, you may
> want to re-think this 90% number you have chosen...
>
> And of course you need 2 distinct data types for Unicode and strings.
>
> What I don't understand is why you'd lock things down so that:
>
> a) the default "string" is Unicode, breaking XX% of existing applications
>
> b) the end user can't readily change a) in a huge percentage of
> existing install base (read: non-dedicated hosting or mixed-user
> servers with shared httpd.conf settings)
>
>
> I realize it's far too late by now to do anything about it, most
> likely, but why in the world didn't you just choose a new keyword to
> define/declare a string as Unicode?
>
> And did I dream the thread on this way back when where it was stated
> that Unicode was backwards-compatible, so this wouldn't be a problem?
>
> Yet now it seems that UTF-16 is *not* backwards-compatible, and this
> seems like a pretty big problem to me.

Richard, you are rather confused on this Unicode stuff. The fact that
PHP and ICU use UTF-16 internally has absolutely nothing to do with
what is exposed at the scripting level. The only things that will break
in a standard application are those that rely on strings being binary.
Normal text passing back and forth between the browser and the server
will work just fine.
The breakages, apart from various bugs at this early stage, are limited
to places where the code is expecting to see a binary string and PHP
hasn't been able to determine this automatically. And hopefully we can
come up with ways to automatically determine when something should
default to a binary string. But if you write:

  $a = "マニュアル";
  echo $a[1];

and you expect that to spew out the raw byte 0x83, then yes, it will
break because it will result in ニ, which is what it really should do.

And yes, I know a lot of people reading this list don't care much for
other charsets, but people reading an English mailing list are rather
self-selecting.

-Rasmus
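
The difference between the two readings of $a[1] can be sketched in a few lines. This is only an illustration, not how the unicode.semantics switch is implemented: it assumes the source file is saved as UTF-8 and that strings are plain byte sequences, and it emulates the code-point view by splitting on character boundaries with PCRE's /u modifier. The variable names $byte, $chars and $char are made up for the example. In UTF-8, マ is encoded as E3 83 9E, so the byte at offset 1 is 0x83 and 0xe3 is the byte at offset 0.

```php
<?php
// Assumption: this file is saved as UTF-8 and strings are byte sequences
// (the "binary" behaviour the message contrasts with Unicode strings).
$a = "マニュアル";

// Byte-oriented view: offset 1 is the second byte of マ (E3 83 9E).
$byte = bin2hex($a[1]);

// Code-point view, emulated by splitting on UTF-8 character boundaries.
// The /u modifier makes PCRE treat the subject as UTF-8.
$chars = preg_split('//u', $a, -1, PREG_SPLIT_NO_EMPTY);
$char = $chars[1];

echo $byte, "\n"; // 83
echo $char, "\n"; // ニ
```

Code that indexes into a string to pick out individual bytes is the kind of thing that depends on strings being binary; code that just passes text through never notices which view is in effect.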