Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:72742
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Received-SPF: error (pb1.pair.com: domain lsces.co.uk from 217.147.176.204 cause and error)
Message-ID: <5307548C.20100@lsces.co.uk>
Date: Fri, 21 Feb 2014 13:28:44 +0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0 SeaMonkey/2.24
MIME-Version: 1.0
To: PHP internals <internals@lists.php.net>
References: <CAEZPtU4zzN=Xxs03XbBGEM2UTVe3jXH5TAaXUqjsVkzm3kOOyg@mail.gmail.com>	<53061982.2050901@googlemail.com>	<CAEZPtU5qnhwcq1BCkqG99dXJ8F=p6K8ABALacCiAOvPBQ=huYQ@mail.gmail.com>	<53066DE9.4090809@googlemail.com>	<CAEZPtU6b+aLuma-nxy84BYVZdOOTr7+ZGZhOVvWMNfbZ4RdPNg@mail.gmail.com>	<530740B9.5000509@lsces.co.uk> <CAEZPtU4-yG_FN2uL59XwwpDsc8BrsHdAUcfg-5iWAH1jSmJF6w@mail.gmail.com>
In-Reply-To: <CAEZPtU4-yG_FN2uL59XwwpDsc8BrsHdAUcfg-5iWAH1jSmJF6w@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [PHP-DEV] [php6] Unicode support, options?
From: lester@lsces.co.uk (Lester Caine)

Pierre Joye wrote:
> On Fri, Feb 21, 2014 at 1:04 PM, Lester Caine <lester@lsces.co.uk> wrote:
>> Pierre Joye wrote:
>>>>
>>>> What do you understand by "storage"?
>>>
>>> To have string stored as UTF-8 only, no conversion required for 99% of our
>>> use.
>>
>>
>> I think that the first thing that needs to be agreed on is if there will be
>> support for UTF-8 in the core? As has already been said, in many places this
>> currently just works and so blocking that may be more of a problem now? The
>> question surely is "What is the 1% that needs some extra work?"
>
> I think we pretty much agree already that we need UTF-8 as the base,
> meaning strings are stored in UTF-8. Conversions may be needed for advanced
> usages provided by ICU (or maybe not, I just do not know for sure
> now).
>
>> A light library would be most appropriate for filling the gaps currently
>> created by use of UTF-8 strings in the core? It is not until one starts
>> adding the mbstring level of string processing that a more powerful library
>> is required. Something that simply ensures UTF-8 strings are valid and can
>> carry out comparisons as required?
>
> it is more than only comparison. If only comparison, additions and the
> likes, utf8proc is enough, or librope with some additions.
The only thing putting me off utf8proc is that it only supports Unicode 5.0.0,
and librope does not seem to understand any of the fine detail of the Unicode
standard. What I've been looking for is case conversion, and currently
all I can find to handle that is ICU?
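
To make the 'light' end of that concrete, here is a minimal sketch, in Python
purely for illustration, of the two operations mentioned above: checking that
a byte string is valid UTF-8, and comparing two such strings. The names
is_valid_utf8 and utf8_equal are made up for this example, and it leans on
Python's built-in strict decoder rather than utf8proc, librope or ICU.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8."""
    try:
        # Strict decoding rejects overlong forms, stray continuation
        # bytes, encoded surrogates and truncated sequences.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def utf8_equal(a: bytes, b: bytes) -> bool:
    """Byte-wise equality, refusing input that is not valid UTF-8."""
    if not (is_valid_utf8(a) and is_valid_utf8(b)):
        raise ValueError("input is not valid UTF-8")
    # Assumption: no normalisation is applied, so this is a pure
    # byte comparison of the two encoded strings.
    return a == b
```

Note that a plain byte comparison treats the precomposed and decomposed forms
of the same character as different, which is one of the gaps a fuller library
would have to fill.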

>> The black hole is still 'case sensitivity' and it is perhaps laying down a
>> 'light' set of rules for this which would allow a path forward? As I have
>> indicated, I'd prefer simply dropping case insensitivity, but a compromise
>> might be to retain it where a string length does not change, and a clean
>> reverse transform exists? So a library that provides that comparison as part
>> of the core package?
>
> I do not care much about language support for UTF-8 names for
> methods, functions, variables etc. My take on it is that we should
> stick to ASCII for it and be done with that. But that's only my
> opinion :)
While I have no intention of using more than ASCII myself, I can see the
argument for supporting more user-friendly names for functions and the like. I
see the complaints about our current 'English' names and how they need
improving, while at the same time I am dealing with customer sites where we
provide simple aliases for all text in a local translation. That is easy enough
in a relational database, where you simply select the right set of entries from
a table, but not so easy for PHP ...
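
The compromise rule quoted above (keep case insensitivity only where the
string length does not change and a clean reverse transform exists) could be
sketched roughly as follows. Again this is Python for illustration only:
has_light_case_mapping and light_caseless_equal are hypothetical names,
'length' is assumed to mean UTF-8 byte length, and the case mappings come from
Python's own bundled Unicode tables, not from any of the libraries discussed.

```python
def utf8_len(s: str) -> int:
    """Length of s in bytes once encoded as UTF-8."""
    return len(s.encode("utf-8"))


def has_light_case_mapping(ch: str) -> bool:
    """True if ch changes case without changing length and reverses cleanly."""
    lower, upper = ch.lower(), ch.upper()
    same_length = utf8_len(lower) == utf8_len(upper) == utf8_len(ch)
    # e.g. U+00DF uppercases to "SS", which lowers to "ss", not back to U+00DF
    clean_reverse = lower.upper() == upper and upper.lower() == lower
    return same_length and clean_reverse


def light_fold(s: str) -> str:
    """Lowercase only the characters that pass the 'light' rule."""
    return "".join(c.lower() if has_light_case_mapping(c) else c for c in s)


def light_caseless_equal(a: str, b: str) -> bool:
    """Case-insensitive comparison restricted to the 'light' mappings."""
    return light_fold(a) == light_fold(b)
```

Under these assumptions plain ASCII and accented Latin letters still compare
case-insensitively, while characters such as U+00DF (sharp s) or U+0131
(dotless i) are simply compared as they stand.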

> We may end up writing our own library for the core operations... But I
> would prefer to avoid that as it is really not a trivial task.
Totally agree ... but I don't see a good path yet?
While ICU creates its own complications when using ready-bundled versions, it is
by far the cleanest code for both UTF-8 and, in fact, UTF-32, if one simply
ditches all the UTF-16 mess. I'd much rather start from that code than any of the other
libraries so far identified. In any case I don't see any option for the 
conversion process to and from UTF-8?

-- 
Lester Caine - G8HFL
-----------------------------
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk