Newsgroups: php.internals
Date: Tue, 16 Mar 2010 20:39:18 +0000
Message-ID: <4B9FEC76.9040608@lsces.co.uk>
In-Reply-To: <4B9FDD60.6000407@lerdorf.com>
To: PHP internals
Subject: Re: [PHP-DEV] Where are we ACTUALLY on Unicode?
From: lester@lsces.co.uk (Lester Caine)

Rasmus Lerdorf wrote:
> On 03/16/2010 12:05 PM, dreamcat four wrote:
>> On Tue, Mar 16, 2010 at 6:32 PM, Rasmus Lerdorf wrote:
>>> On 03/16/2010 10:40 AM, dreamcat four wrote:
>>>> As for text files on disk, if they are unicode, they are most commonly
>>>> utf-8 too. So then, why use utf-16 as internal unicode representation
>>>> in PHP? It doesn't really make a lot of sense for most regular people
>>>> who want to use PHP for their web application. Unless they don't
>>>> really care how slow it's gonna be converting everything, constantly...
>>>
>>> Well, the obvious original reason is that ICU uses UTF-16 internally and
>>> the logic was that we would be going in and out of ICU to do all the
>>> various Unicode operations many more times than we would be interfacing
>>> with external things like MySQL or files on disk. You generally only
>>> read or write a string once from an external source, but you may perform
>>> multiple Unicode operations on that same string so avoiding a conversion
>>> for each operation seems logical.
>>>
>>> -Rasmus
>>
>> It's only logical if you've bothered to profile the conversion calls to
>> ICU against the non-ICU conversion calls. I'm guessing the way to do
>> that is to have 2 versions of each conversion method. One used by
>> ICU, and another used everywhere else. The harder part is to find some
>> suitable, real-life PHP programs to test with.
>
> You mean check to see how many actual Unicode operations a standard app
> makes? We did talk about that, but there is a bit of a chicken-and-egg
> problem here.
> Because PHP doesn't natively support Unicode, people
> write apps in a way that lets them just pass Unicode through PHP and
> deal with it elsewhere. I would expect the profile to change once PHP
> gets better support for Unicode.
>
> But yes, some ideas around lazy conversions and other tricks would be
> interesting. If your input and output encoding are both utf-8 and all
> your data sources are utf-8 and you never do any sort of string
> manipulation on a particular string, why bother doing the utf-8 to
> utf-16 conversion on that string?

I think that is what I said originally ;)
When a string is read in, you set an extra flag if it needs special handling;
otherwise you just handle it as a single-byte-per-character string ... and for
the diehards you add a switch to treat everything as it is now :)

-- 
Lester Caine - G8HFL
-----------------------------
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk//
Firebird - http://www.firebirdsql.org/index.php
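
P.S. A rough sketch of the flag idea, in Python rather than engine C just to keep it short. Nothing here is PHP's actual design: the class name LazyString, the needs_unicode flag and the conversions counter are all made up for illustration. The assumption is that input arrives as UTF-8 bytes, the flag is set once at read time, and the UTF-16 copy is only built the first time an operation really needs it.

```python
class LazyString:
    """Hypothetical wrapper: keeps the UTF-8 bytes exactly as read and
    converts to UTF-16 only when an operation actually requires it."""

    conversions = 0  # illustration only: counts UTF-8 -> UTF-16 conversions

    def __init__(self, raw: bytes):
        self.raw = raw
        # The "extra flag": one scan at read time for any non-ASCII byte.
        self.needs_unicode = any(b >= 0x80 for b in raw)
        self._utf16 = None  # built lazily, then cached

    def utf16(self) -> bytes:
        """Return the UTF-16 form, converting at most once per string."""
        if self._utf16 is None:
            LazyString.conversions += 1
            self._utf16 = self.raw.decode("utf-8").encode("utf-16-le")
        return self._utf16

    def length(self) -> int:
        """Length as a UTF-16 based engine would count it (code units)."""
        if not self.needs_unicode:
            # Pure ASCII: one byte per character, no conversion needed.
            return len(self.raw)
        return len(self.utf16()) // 2

    def output(self) -> bytes:
        """Pass-through to a UTF-8 sink never touches the UTF-16 path."""
        return self.raw
```

With this, LazyString(b"hello").length() never converts anything, a string with accented characters converts once on its first length() call, and a string that is only read and written back out via output() is never converted at all. The "switch for the diehards" would amount to forcing needs_unicode to True for every string.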