Newsgroups: php.internals
Message-ID: <530F0BF8.4040307@lsces.co.uk>
Date: Thu, 27 Feb 2014 09:57:12 +0000
To: PHP internals
Subject: Re: [PHP-DEV] Re: [php6] Unicode support, options?
From: lester@lsces.co.uk (Lester Caine)

Pierre Joye wrote:
> On Thu, Feb 20, 2014 at 6:54 AM, Pierre Joye wrote:
>
>> * ICU:
>> U_CHARSET_IS_UTF8 allows to force ICU to use UTF-8 by default.
>> It is an ICU compile-time setting. It is not possible to set it at PHP
>> configure time. It means that users will have to create their own
>> build. Alternatively we can bundle ICU but this will be awkward, a
>> maintenance nightmare for both php and the distros.
>>
>> Alternatively UText can be used to create a UTF-8 string. APIs accepting
>> UText allow almost everything we need. However the counterpart is that
>> a UTF-8 UText is readonly. Any operation altering its content will
>> require duplication, clones or conversions. That may kill all gains we
>> got from using UTF-8 only.
>>
>> The U_CHARSET_IS_UTF8 is very appealing but to bundle ICU is actually
>> a show stopper. Asking users to custom build ICU is not an option
>> either. I do not know if the distros will be ready to provide two
>> different builds of ICU either, it may add a lot of issues with all
>> projects using ICU.
>
> Here is a 1st reply from ICU:
>
> http://sourceforge.net/p/icu/mailman/message/32031609/
>
> It sounds like this flag could be a good option for PHP's Unicode support.

Reading between the lines, it would seem that a switch to a UTF-8 base is their preferred path, but the core code is too engrained as UTF-16? Since there is really no alternative to ICU for the heavy grunt, I do see this as the right starting point. Any 'bells and whistles' should use the ICU UTF-8 style rather than pulling in yet more variations? The main problem in all of this is how it dovetails into Windows? The reliance on 'UTF-16' style WCHAR seems to be the real problem there?

> Btw, I created a sub page for Unicode support:
>
> https://wiki.php.net/ideas/php6/unicode
>
>> Thoughts, comments or ideas?

Like you, Pierre, I'm no Unicode expert, and digging deeper simply reinforces the at times irritating compromises that Unicode contains. Obviously designed by committee?
:(

Currently I'm trying to work out just what is required at the core to support UTF-8, and while it is not a trivial problem, the bulk of the code is designed to handle strings of variable length, and in its basic form UTF-8 just creates longer strings? So isn't the next question quite simply 'case'? And how we handle case insensitivity in the core will determine what core Unicode functions are required?

> I found another C++ library to do the basic UTF-8 operations, easl:
>
> https://code.google.com/p/easl/
>
> It could be a nice one to use in combination with ICU, small and fast
> (1st tests).

C++? That whatever is used will need to be both tailored for PHP and transparent as far as ICU is concerned is, as you have identified, a given. ICU is still built using 32-bit string lengths (I think?) which does add to the fun, but I don't see any reason not to be using functions like compareUTF8() and ucasemap_utf8ToLower() from ICU, in which case the strings need to be standard ICU UTF-8 strings?

I can see the advantage of the 'fast' compare that I have been banging on about elsewhere, which looks for a simple match between two raw strings of bytes. UTF-8 only comes into that when you need to add 'rank'? But much of the core processing CAN simply ignore that as long as the generic calls don't have dead tails which activate it?

Given the complexity of case conversion I can see the possible need for a mirror string holding a 'lower case' version, which may be a different length, and so 'string' could become a more complex object? But is this aspect what you are looking for the 'small fast library' to provide? easl would seem only to be trying to smooth the edges between Windows and other platforms?
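
To make the 'fast compare' and 'mirror string' points a little more concrete, here is a rough sketch. It is written in Python purely for illustration and uses Python's built-in case mapping as a stand-in; it is not ICU and not anything in the PHP core. The names fast_equal(), lower_mirror() and caseless_equal() are made up for this example.

```python
def fast_equal(a: bytes, b: bytes) -> bool:
    """Raw byte match between two UTF-8 strings: no decoding, no Unicode tables."""
    return a == b


def lower_mirror(s: bytes) -> bytes:
    """Build a lower-case 'mirror' of a UTF-8 string.

    Assumption: Python's str.lower() stands in for a real case-mapping
    routine. The result may have a different byte length from the input.
    """
    return s.decode("utf-8").lower().encode("utf-8")


def caseless_equal(a: bytes, b: bytes) -> bool:
    """Try the cheap byte match first; only fall back to Unicode case
    folding when the raw bytes differ."""
    if fast_equal(a, b):
        return True
    return a.decode("utf-8").casefold() == b.decode("utf-8").casefold()


# U+0130 (capital I with dot above) is 2 bytes in UTF-8, but its
# lower-case form is 'i' followed by U+0307, which is 3 bytes.
dotted_i = "\u0130".encode("utf-8")
print(len(dotted_i), len(lower_mirror(dotted_i)))  # 2 3
```

The length change on the last line is the reason a cached lower-case copy cannot simply overwrite the original buffer in place, and why it would have to carry its own length.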

-- 
Lester Caine - G8HFL
-----------------------------
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk