Newsgroups: php.internals
Message-ID: <4B9FEC76.9040608@lsces.co.uk>
Date: Tue, 16 Mar 2010 20:39:18 +0000
To: PHP internals <internals@lists.php.net>
References: <4B9C9007.1080802@lsces.co.uk> <4B9EC3B2.7070901@zend.com> 	<4B9F4196.9030404@lsces.co.uk> <99cf22521003160448k5028ae61y70e1e61428d13280@mail.gmail.com> 	<99cf22521003161040x4dba08fblb7e088cef16b64a9@mail.gmail.com> 	<4B9FCEA7.50108@lerdorf.com> <99cf22521003161205w22335143lbf531a0f58a60610@mail.gmail.com> <4B9FDD60.6000407@lerdorf.com>
In-Reply-To: <4B9FDD60.6000407@lerdorf.com>
Subject: Re: [PHP-DEV] Where are we ACTUALLY on Unicode?
From: lester@lsces.co.uk (Lester Caine)

Rasmus Lerdorf wrote:
> On 03/16/2010 12:05 PM, dreamcat four wrote:
>> On Tue, Mar 16, 2010 at 6:32 PM, Rasmus Lerdorf<rasmus@lerdorf.com>  wrote:
>>> On 03/16/2010 10:40 AM, dreamcat four wrote:
>>>> As for text files on disk, if they are Unicode, they are most commonly
>>>> UTF-8 too. So then, why use UTF-16 as the internal Unicode representation
>>>> in PHP? It doesn't really make a lot of sense for most regular people
>>>> who want to use PHP for their web application. Unless they don't
>>>> really care how slow it's gonna be converting everything, constantly...
>>>
>>> Well, the obvious original reason is that ICU uses UTF-16 internally and
>>> the logic was that we would be going in and out of ICU to do all the
>>> various Unicode operations many more times than we would be interfacing
>>> with external things like MySQL or files on disk.  You generally only
>>> read or write a string once from an external source, but you may perform
>>> multiple Unicode operations on that same string so avoiding a conversion
>>> for each operation seems logical.
>>>
>>> -Rasmus
>>
>> It's only logical if you've bothered to profile the conversion calls to
>> ICU against the non-ICU conversion calls. I'm guessing the way to do
>> that is to have 2 versions of each conversion method. One used by
>> ICU, and another used everywhere else. The harder part is to find some
>> suitable, real-life PHP programs to test with.
>
> You mean check to see how many actual Unicode operations a standard app
> makes?  We did talk about that, but there is a bit of a chicken-and-egg
> problem here.  Because PHP doesn't natively support Unicode, people
> write apps in a way that lets them just pass Unicode through PHP and
> deal with it elsewhere.  I would expect the profile to change once PHP
> gets better support for Unicode.
>
> But yes, some ideas around lazy conversions and other tricks would be
> interesting.  If your input and output encoding are both UTF-8 and all
> your data sources are UTF-8 and you never do any sort of string
> manipulation on a particular string, why bother doing the UTF-8 to
> UTF-16 conversion on that string?

I think that is what I said originally ;)
When a string is read in, you set an extra flag if it needs special handling;
otherwise you just handle it as a single-byte-per-character string ... and for
the diehards you add a switch to treat everything as it is now :)
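
To make the flag-plus-lazy-conversion idea concrete, here is a minimal sketch in Python (not PHP's C internals, and not anything that exists in the engine). The class name `LazyString`, the `needs_unicode` flag and the `conversions` counter are all hypothetical names invented for illustration. The only assumptions are that input arrives as UTF-8 bytes and that a Unicode-aware operation would want UTF-16, as ICU does.

```python
class LazyString:
    """Hypothetical string wrapper: keeps the bytes as read, converts lazily."""

    def __init__(self, raw: bytes):
        self.raw = raw  # UTF-8 bytes exactly as read from the source
        # The "extra flag": set only if some byte falls outside 7-bit ASCII.
        self.needs_unicode = any(b >= 0x80 for b in raw)
        self._utf16 = None      # cached UTF-16 form, built on first use
        self.conversions = 0    # illustration only: counts real conversions

    def _as_utf16(self) -> bytes:
        # Convert UTF-8 to UTF-16 at most once, and only when asked to.
        if self._utf16 is None:
            self._utf16 = self.raw.decode("utf-8").encode("utf-16-le")
            self.conversions += 1
        return self._utf16

    def length(self) -> int:
        """Length in characters (code points)."""
        if not self.needs_unicode:
            return len(self.raw)  # one byte per character, no conversion
        return len(self._as_utf16().decode("utf-16-le"))

    def output(self) -> bytes:
        # Pass-through: UTF-8 in, UTF-8 out, never touching UTF-16.
        return self.raw


plain = LazyString(b"hello")
accented = LazyString("h\u00e9llo".encode("utf-8"))
print(plain.needs_unicode, plain.length(), plain.conversions)
print(accented.needs_unicode, accented.length(), accented.conversions)
```

A string that is only passed through, or that is pure ASCII, never pays for a conversion; the cost is incurred once, on the first operation that actually needs the Unicode form.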

-- 
Lester Caine - G8HFL