Newsgroups: php.internals
From: lester@lsces.co.uk (Lester Caine)
Date: Fri, 21 Feb 2014 13:28:44 +0000
To: PHP internals
Subject: Re: [PHP-DEV] [php6] Unicode support, options?
Message-ID: <5307548C.20100@lsces.co.uk>
References: <53061982.2050901@googlemail.com> <53066DE9.4090809@googlemail.com> <530740B9.5000509@lsces.co.uk>

Pierre Joye wrote:
> On Fri, Feb 21, 2014 at 1:04 PM, Lester Caine wrote:
>> Pierre Joye wrote:
>>>>
>>>> What do you understand by "storage"?
>>>
>>> To have string stored as UTF-8 only, no conversion required for 99% of our
>>> use.
>>
>>
>> I think that the first thing that needs to be agreed on is if there will be
>> support for UTF-8 in the core? As has already been said, in many places this
>> currently just works and so blocking that may be more of a problem now? The
>> question surely is "What is the 1% that needs some extra work?"
>
> I think we pretty much agree already that we need UTF-8 as the base,
> meaning strings are stored in UTF-8. Conversions may be needed for advanced
> usages provided by ICU (or maybe not, I just do not know for sure
> now).
>
>> A light library would be most appropriate for filling the gaps currently
>> created by use of UTF-8 strings in the core? It is not until one starts
>> adding the mbstring level of string processing that a more powerful library
>> is required. Something that simply ensures UTF-8 strings are valid and can
>> carry out comparisons as required?
>
> it is more than only comparison. If only comparison, additions and the
> likes, utf8proc is enough, or librope with some additions.

The only thing putting me off utf8proc is that it only supports Unicode 5.0.0.
librope does not seem to understand any of the fine detail of the Unicode
standards? What I've been looking for is the case switch actions, and currently
all I can find is ICU to handle that?

>> The black hole is still 'case sensitivity' and it is perhaps laying down a
>> 'light' set of rules for this which would allow a path forward? As I have
>> indicated, I'd prefer simply dropping case insensitivity, but a compromise
>> might be to retain it where a string length does not change, and a clean
>> reverse transform exists? So a library that provides that comparison as part
>> of the core package?
>
> I do not care much about languages support for UTF-8 names for
> methods, functions, variables etc. My take on it is that we should
> stick to ASCII for it and be done with that.
> But that's only my opinion :)

While I have no intention of using more than ASCII myself, I can see the
argument for supporting more user-friendly names for functions and the like.
I see the complaints about our current 'English' names and how they need
improving, while at the same time I am dealing with customer sites where we
provide simple aliases for all text in a local translation. That is easy enough
in a relational database, where you simply select the right set of entries from
a table, but not so easy for PHP ...

> We may end up writing our own library for the core operations... But I
> would prefer to avoid that as it is really not a trivial task.

Totally agree ... but I don't see a good path yet? While ICU creates its own
complications, using ready bundled versions, it is by far the cleanest code for
both UTF-8 and actually UTF-32 if one simply ditches all the UTF-16 mess. I'd
much rather start from that code than any of the other libraries so far
identified. In any case I don't see any option for the conversion process to
and from UTF-8?

-- 
Lester Caine - G8HFL
-----------------------------
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk
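
For illustration only, the compromise suggested in the thread (retain case insensitivity only where the string length does not change and a clean reverse transform exists) could be sketched roughly as below. This is a minimal Python sketch, not a proposal for the actual PHP implementation: the names `is_cleanly_foldable`, `restricted_fold` and `restricted_equal` are hypothetical, and it relies on Python's built-in Unicode case tables rather than ICU, utf8proc or librope.

```python
def is_cleanly_foldable(ch: str) -> bool:
    """Return True if `ch` may be case-folded under the restricted rule.

    Assumed rule (hypothetical, taken from the wording in the thread):
    both case forms are single code points, both use the same number of
    UTF-8 bytes as `ch`, and the two forms map back to each other.
    """
    lower, upper = ch.lower(), ch.upper()
    if len(lower) != 1 or len(upper) != 1:
        return False  # e.g. U+00DF (sharp s) uppercases to "SS"
    size = len(ch.encode("utf-8"))
    if len(lower.encode("utf-8")) != size or len(upper.encode("utf-8")) != size:
        return False  # e.g. U+212A KELVIN SIGN (3 bytes) lowercases to "k" (1 byte)
    # Clean reverse transform: lower -> upper -> lower must round-trip.
    return lower.upper() == upper and upper.lower() == lower


def restricted_fold(s: str) -> str:
    """Lowercase only the characters that fold cleanly; leave the rest as-is."""
    return "".join(ch.lower() if is_cleanly_foldable(ch) else ch for ch in s)


def restricted_equal(a: str, b: str) -> bool:
    """Case-insensitive comparison limited to cleanly foldable characters."""
    return restricted_fold(a) == restricted_fold(b)


if __name__ == "__main__":
    print(restricted_equal("Caf\u00e9", "CAF\u00c9"))    # e-acute folds cleanly
    print(restricted_equal("stra\u00dfe", "STRASSE"))    # sharp s is left untouched
```

Under these assumptions the folded string always has the same UTF-8 byte length as the input, so identifiers containing characters such as U+00DF or U+0130 would simply remain case sensitive.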