Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:72837
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Received-SPF: error (pb1.pair.com: domain lsces.co.uk from 217.147.176.204 cause and error)
Message-ID: <530F0BF8.4040307@lsces.co.uk>
Date: Thu, 27 Feb 2014 09:57:12 +0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0 SeaMonkey/2.24
MIME-Version: 1.0
To: PHP internals <internals@lists.php.net>
References: <CAEZPtU4zzN=Xxs03XbBGEM2UTVe3jXH5TAaXUqjsVkzm3kOOyg@mail.gmail.com> <CAEZPtU4HDaOhvdk=BtgXhGOyU3rqH4-+4wcXFdbygq+d8Jz8Qw@mail.gmail.com>
In-Reply-To: <CAEZPtU4HDaOhvdk=BtgXhGOyU3rqH4-+4wcXFdbygq+d8Jz8Qw@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [PHP-DEV] Re: [php6] Unicode support, options?
From: lester@lsces.co.uk (Lester Caine)

Pierre Joye wrote:
> On Thu, Feb 20, 2014 at 6:54 AM, Pierre Joye <pierre.php@gmail.com> wrote:
>
>> * ICU:
>> U_CHARSET_IS_UTF8 allows forcing ICU to use UTF-8 by default. It is an
>> ICU compile-time setting. It is not possible to set it at PHP
>> configure time. It means that users will have to create their own
>> build. Alternatively we can bundle ICU but this will be awkward, a
>> maintenance nightmare for both PHP and the distros.
>>
>> Alternatively UText can be used to create UTF-8 strings. APIs accepting
>> UText allow almost everything we need. However the drawback is that
>> a UTF-8 UText is read-only. Any operation altering its content will
>> require duplication, clones or conversions. That may kill all gains we
>> got from using UTF-8 only.
>>
>> U_CHARSET_IS_UTF8 is very appealing but bundling ICU is actually a
>> show stopper. Asking users to custom build ICU is not an option
>> either. I do not know if the distros will be ready to provide two
>> different builds of ICU either, it may add a lot of issues with all
>> projects using ICU.
>
> Here is a 1st reply from ICU:
>
> http://sourceforge.net/p/icu/mailman/message/32031609/
>
> It sounds like this flag could be a good option for PHP's Unicode support.

Reading between the lines, it would seem that a switch to a UTF-8 base is their
preferred path, but the core code is too ingrained as UTF-16? Since there is
really no alternative to ICU for the heavy lifting, I do see this as the right
starting point. Any 'bells and whistles' should use the ICU UTF-8 style rather
than pulling in yet more variations?

The main problem in all of this is how it dovetails with Windows. The reliance
on 'UTF-16' style WCHAR seems to be the real problem there?
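
To put rough numbers on the UTF-8 versus WCHAR mismatch, here is a minimal
sketch in Python (used purely for illustration; it is neither PHP core nor ICU
code). It counts the code units the same text needs in each encoding. The
helper name code_units is made up for this example.

```python
def code_units(text: str) -> dict:
    """Count code points and code units for the same text."""
    utf8 = text.encode("utf-8")        # 1 byte per code unit
    utf16 = text.encode("utf-16-le")   # 2 bytes per code unit, no BOM
    return {
        "code_points": len(text),
        "utf8_units": len(utf8),
        "utf16_units": len(utf16) // 2,
    }


# "\u00ef" is i with diaeresis, "\U0001F600" is an emoji outside the BMP.
samples = ["PHP", "na\u00efve", "\u65e5\u672c\u8a9e", "\U0001F600"]
for sample in samples:
    print(sample, code_units(sample))
```

Neither encoding is fixed width: the emoji needs four UTF-8 bytes and a
surrogate pair (two units) in UTF-16, so any boundary between a UTF-8 core and
WCHAR-based Windows APIs would presumably need a real conversion rather than a
cast.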

> Btw, I created a sub page for Unicode support:
>
> https://wiki.php.net/ideas/php6/unicode
>
>> Thoughts, comments or ideas?

Like you, Pierre, I'm no Unicode expert, and digging deeper simply highlights
the at times irritating compromises that Unicode contains. Obviously designed
by committee? :(

Currently I'm trying to work out just what is required at the core to support
UTF-8. While it is not a trivial problem, the bulk of the code is designed to
handle strings of variable length, and in its basic form UTF-8 just creates
longer strings? So isn't the next question quite simply 'case'? And how we
handle case insensitivity in the core will determine what core Unicode functions
are required?
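
As a rough illustration of why 'case' is the dividing line, the Python sketch
below contrasts an ASCII-only caseless compare on raw bytes with one that
decodes and applies Unicode case folding. Both function names are hypothetical,
and Python's casefold() merely stands in for whatever ICU call the core might
end up using.

```python
def ascii_caseless_equal(a: bytes, b: bytes) -> bool:
    """Byte-level compare: bytes.lower() only maps ASCII A-Z to a-z,
    so every byte >= 0x80 (all multi-byte UTF-8) is left untouched."""
    return a.lower() == b.lower()


def unicode_caseless_equal(a: bytes, b: bytes) -> bool:
    """Decode as UTF-8 and compare after Unicode case folding.
    Assumption: no normalization step, which full caseless matching
    would also need."""
    return a.decode("utf-8").casefold() == b.decode("utf-8").casefold()


pairs = [
    ("Foo", "FOO"),               # pure ASCII
    ("\u00c9cole", "\u00e9cole"),  # E-acute, upper vs lower
    ("STRASSE", "stra\u00dfe"),    # sharp s folds to "ss"
]
for left, right in pairs:
    lb, rb = left.encode("utf-8"), right.encode("utf-8")
    print(left, right, ascii_caseless_equal(lb, rb), unicode_caseless_equal(lb, rb))
```

The ASCII-only version never needs to know the bytes are UTF-8, which is the
sense in which UTF-8 'just creates longer strings'. Only the second version
requires Unicode tables in the core.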

> I found another C++ library to do the basic UTF-8 operations, easl:
>
> https://code.google.com/p/easl/
>
> It could be a nice one to use in combination with ICU, small and fast
> (1st tests).

C++ ?
Whatever is used will need to be both tailored for PHP and transparent as far
as ICU is concerned; as you have identified, that is a given. ICU is still built
using 32-bit string lengths (I think?), which does add to the fun, but I don't
see any reason not to be using functions like compareUTF8() and
ucasemap_utf8ToLower() from ICU, in which case the strings need to be standard
ICU UTF-8 strings? I can see the advantage of the 'fast' compare that I have
been banging on about elsewhere, which looks for a simple match between two raw
strings of bytes. UTF-8 only comes into that when you need to add 'rank'? But
much of the core processing CAN simply ignore that as long as the generic calls
don't have dead tails which activate it?
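
The split between the 'fast' raw compare and a ranked compare can be sketched
as follows, again in Python for illustration only. raw_equal and
crude_rank_key are hypothetical names, and crude_rank_key is a deliberately
crude stand-in (strip accents, fold case) for a real ICU collator, not a
substitute for one.

```python
import unicodedata


def raw_equal(a: bytes, b: bytes) -> bool:
    """The 'fast' path: a plain byte-for-byte match, no Unicode knowledge."""
    return a == b


def crude_rank_key(text: str) -> str:
    """Stand-in for collation: decompose, drop combining marks, fold case."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


words = ["zebra", "Zebra", "apple", "\u00e9clair"]

# Sorting raw UTF-8 bytes gives the same order as sorting by code point.
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
by_code_point = sorted(words)

# A ranked order needs Unicode-aware processing.
by_rank = sorted(words, key=crude_rank_key)

print(by_bytes)
print(by_rank)

# The raw match is exact: precomposed and decomposed e-acute differ as bytes.
print(raw_equal("\u00e9".encode("utf-8"), "e\u0301".encode("utf-8")))
```

Byte order puts "Zebra" before "apple" and the accented word last, which is
fine for hash keys and exact matches but not for anything a user would call
sorted. That second case is where ICU has to take over.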

Given the complexity of case conversion I can see the possible need for a mirror
string holding a 'lower case' version, which may be a different length, and so
'string' could become a more complex object? But is this the aspect you are
looking for the 'small fast library' to provide? easl would seem only to be
trying to smooth the edges between Windows and other platforms?
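
A minimal sketch of the 'mirror string' idea, in Python for illustration only:
the class name MirroredString and its layout are hypothetical, and Python's
str.lower() stands in for something like ucasemap_utf8ToLower().

```python
class MirroredString:
    """Holds the raw UTF-8 bytes plus a lazily built lower-case mirror."""

    def __init__(self, raw: bytes):
        self.raw = raw
        self._lower = None  # computed on first use, then cached

    @property
    def lower(self) -> bytes:
        if self._lower is None:
            # Assumption: raw is valid UTF-8; invalid input would raise here.
            self._lower = self.raw.decode("utf-8").lower().encode("utf-8")
        return self._lower


kelvin = MirroredString("\u212a".encode("utf-8"))     # KELVIN SIGN
dotted_i = MirroredString("\u0130".encode("utf-8"))   # I WITH DOT ABOVE

# The mirror can be shorter or longer than the original in bytes.
print(len(kelvin.raw), len(kelvin.lower))
print(len(dotted_i.raw), len(dotted_i.lower))
```

The Kelvin sign shrinks from three bytes to a plain one-byte 'k', while the
dotted capital I grows from two bytes to three, so the mirror cannot share the
original buffer or be produced by an in-place conversion.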

-- 
Lester Caine - G8HFL
-----------------------------
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk