From: rasmus@lerdorf.com (Rasmus Lerdorf)
Date: Tue, 16 Mar 2010 12:34:56 -0700
Subject: Re: [PHP-DEV] Where are we ACTUALLY on Unicode?
To: dreamcat four
CC: Lester Caine, PHP internals
Newsgroups: php.internals
Message-ID: <4B9FDD60.6000407@lerdorf.com>
In-Reply-To: <99cf22521003161205w22335143lbf531a0f58a60610@mail.gmail.com>

On 03/16/2010 12:05 PM, dreamcat four wrote:
> On Tue, Mar 16, 2010 at 6:32 PM, Rasmus Lerdorf wrote:
>> On 03/16/2010 10:40 AM, dreamcat four wrote:
>>> As for text files on disk, if they are unicode, they are most commonly
>>> utf-8 too. So then, why use utf-16 as internal unicode representation
>>> in Php? It doesn't really make a lot of sense for most regular people
>>> who want to use Php for their web application. Unless they don't
>>> really care how slow its gonna be converting everything, constantly...
>>
>> Well, the obvious original reason is that ICU uses UTF-16 internally and
>> the logic was that we would be going in and out of ICU to do all the
>> various Unicode operations many more times than we would be interfacing
>> with external things like MySQL or files on disk. You generally only
>> read or write a string once from an external source, but you may perform
>> multiple Unicode operations on that same string so avoiding a conversion
>> for each operation seems logical.
>>
>> -Rasmus
>
> Its only logical if you've bothered to profile the conversion calls to
> ICU against the non-ICU conversion calls. Im guessing the way to do
> that, is to have 2 versions of each conversion method. One used by
> ICU, and another used everywhere else. The harder part is to find some
> suitable, real life php programs to test with.

You mean check to see how many actual Unicode operations a standard app
makes? We did talk about that, but there is a bit of a chicken-and-egg
problem here. Because PHP doesn't natively support Unicode, people
write apps in a way that lets them just pass Unicode through PHP and
deal with it elsewhere.
I would expect the profile to change once PHP gets better
support for Unicode.

But yes, some ideas around lazy conversions and other tricks would be
interesting. If your input and output encodings are both utf-8, all
your data sources are utf-8, and you never do any sort of string
manipulation on a particular string, why bother doing the utf-8 to
utf-16 conversion on that string?

-Rasmus
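
A minimal sketch of the lazy-conversion idea floated above, written in Python rather than in PHP's C internals, so it only illustrates the principle. The class name LazyString, its methods, and the conversions counter are all hypothetical and are not part of PHP or ICU. The string keeps its original UTF-8 bytes and builds the UTF-16 form only the first time an operation needs it.

```python
class LazyString:
    """Hypothetical wrapper: keeps UTF-8 input and converts to UTF-16 on demand."""

    def __init__(self, utf8_bytes: bytes):
        self._utf8 = utf8_bytes
        self._utf16 = None      # filled in by the first Unicode operation
        self.conversions = 0    # UTF-8 -> UTF-16 conversions performed (for illustration)

    def _as_utf16(self) -> bytes:
        # Convert once and cache, so repeated operations reuse the same buffer.
        if self._utf16 is None:
            self._utf16 = self._utf8.decode("utf-8").encode("utf-16-le")
            self.conversions += 1
        return self._utf16

    def output(self) -> bytes:
        # Pass-through path: UTF-8 in, UTF-8 out, no conversion at all.
        return self._utf8

    def code_unit_length(self) -> int:
        # Stand-in for any operation that needs the UTF-16 form.
        return len(self._as_utf16()) // 2


s = LazyString("h\u00e9llo".encode("utf-8"))
s.output()              # pure pass-through; s.conversions is still 0
s.code_unit_length()    # first Unicode operation triggers the one conversion
s.code_unit_length()    # served from the cached UTF-16 buffer
```

A string that is only read and written back never pays for a conversion, while a string that is manipulated repeatedly pays once. Whether that trade-off wins in practice is the profiling question raised earlier in the thread.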