Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:72839
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Received-SPF: error (pb1.pair.com: domain lsces.co.uk from 217.147.176.204 cause and error)
Message-ID: <530F18C6.1000301@lsces.co.uk>
Date: Thu, 27 Feb 2014 10:51:50 +0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0 SeaMonkey/2.24
MIME-Version: 1.0
To: PHP internals <internals@lists.php.net>
References: <CAEZPtU4zzN=Xxs03XbBGEM2UTVe3jXH5TAaXUqjsVkzm3kOOyg@mail.gmail.com>	<CAEZPtU4HDaOhvdk=BtgXhGOyU3rqH4-+4wcXFdbygq+d8Jz8Qw@mail.gmail.com>	<530F0BF8.4040307@lsces.co.uk> <CAEZPtU73J0AMAtFaz=vdttWC_Ach=mkuShmmUjNvsHk5WoBq2Q@mail.gmail.com>
In-Reply-To: <CAEZPtU73J0AMAtFaz=vdttWC_Ach=mkuShmmUjNvsHk5WoBq2Q@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [PHP-DEV] Re: [php6] Unicode support, options?
From: lester@lsces.co.uk (Lester Caine)

Pierre Joye wrote:
>> That what ever is used will need to be both tailored for PHP and transparent
>> as far as ICU is concerned is as you have identified - a given. ICU is still
>> built using 32bit string lengths ( I think? ) which does add to the fun, but
>> I don't see any reason not to be using functions like compareUTF8() and
>> ucasemap_utf8ToLower() from ICU in which case the strings need to be
>> standard ICU UTF-8 strings? I can see the advantage of the 'fast' compare
>> that I have been banging on about elsewhere, which looks for a simple match
>> between two raw strings of bytes. UTF-8 only comes into that when you need
>> to add 'rank'? But much of the core processing CAN simply ignore that as
>> long as the generic calls don't have dead tails which activate it?

> We may use our own functions (or other lib) to covers operations not
> implemented in ICU or too slow because of the conversions. That's why
> investigating in other tools is still a good thing to do.

The bit I'm still missing here is which 'operations not implemented in ICU'
you have in mind?
As soon as conversions are required, speed is always going to be
compromised, but where the platform is already UTF-8 based, which is
increasingly common, all we are looking for is to handle UTF-8 strings
quickly. For the best performance, conversions can simply be avoided. So I'm
currently treating conversion as a secondary problem (probably less important
than case!) and just trying to identify what is missing from ICU's UTF-8
support that needs to be added.
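
As a rough sketch of the 'fast' compare on raw bytes, assuming both strings
are already valid UTF-8: it is written in Python only to keep it short, the
function names are made up for illustration (they are not ICU or PHP API), and
str.casefold() is just a stand-in for whatever case mapping a library such as
ICU would supply.

```python
def raw_equal(a: bytes, b: bytes) -> bool:
    """Exact match on the raw bytes: no decoding, no conversion."""
    return a == b


def raw_compare(a: bytes, b: bytes) -> int:
    """Byte-wise ordering, returning -1, 0 or 1.

    For valid UTF-8 this is the same order as comparing code points,
    so no decoding is needed for a plain binary sort.
    """
    return (a > b) - (a < b)


def caseless_equal(a: bytes, b: bytes) -> bool:
    """Try the fast path first, and only then pay for decoding.

    casefold() is a stand-in for a real case-mapping library (assumption).
    """
    if a == b:
        return True
    return a.decode("utf-8").casefold() == b.decode("utf-8").casefold()


if __name__ == "__main__":
    words = ["zebra", "Ärger", "apple", "日本", "émile"]
    by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
    by_code_point = sorted(words)  # Python str compares by code point
    print(by_bytes == by_code_point)

    upper = "Café".encode("utf-8")
    lower = "café".encode("utf-8")
    print(raw_equal(upper, lower), caseless_equal(upper, lower))
```

The point is only that equality and binary ordering never need to look inside
the bytes; case (and anything like collation) is where a library has to come in.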

It may well be that Windows is a special case that needs its own conversion
layer, but that should not form part of any core upgrade. It is not needed for
many installations.

-- 
Lester Caine - G8HFL
-----------------------------
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk