Newsgroups: php.internals
In-Reply-To: <530F0BF8.4040307@lsces.co.uk>
References: <530F0BF8.4040307@lsces.co.uk>
Date: Thu, 27 Feb 2014 11:28:38
+0100
To: Lester Caine
Cc: PHP internals
Subject: Re: [PHP-DEV] Re: [php6] Unicode support, options?
From: pierre.php@gmail.com (Pierre Joye)

On Thu, Feb 27, 2014 at 10:57 AM, Lester Caine wrote:
> Pierre Joye wrote:
>>
>> On Thu, Feb 20, 2014 at 6:54 AM, Pierre Joye wrote:
>>
>>> * ICU:
>>> U_CHARSET_IS_UTF8 allows forcing ICU to use UTF-8 by default. It is an
>>> ICU compile-time setting. It is not possible to set it at PHP
>>> configure time. It means that users will have to create their own
>>> build. Alternatively we can bundle ICU but this will be awkward, a
>>> maintenance nightmare for both php and the distros.
>>>
>>> Alternatively UText can be used to create a UTF-8 string. APIs accepting
>>> UText allow almost everything we need. However the counterpart is that
>>> a UTF-8 UText is readonly. Any operation altering its content will
>>> require duplication, clones or conversions. That may kill all gains we
>>> got from using UTF-8 only.
>>>
>>> The U_CHARSET_IS_UTF8 is very appealing but to bundle ICU is actually
>>> a show stopper. Asking users to custom build ICU is not an option
>>> either. I do not know if the distros will be ready to provide two
>>> different builds of ICU either, it may add a lot of issues with all
>>> projects using ICU.
>>
>>
>> Here is a 1st reply from ICU:
>>
>> http://sourceforge.net/p/icu/mailman/message/32031609/
>>
>> It sounds like this flag could be a good option for PHP's Unicode support.
>
>
> Reading between the lines, it would seem that a switch to UTF-8 base is
> their preferred path, but the core code is too engrained as UTF-16? Since
> there is really no alternative to ICU for the heavy grunt, I do see this as
> the right starting point. Any 'bells and whistles' should use the ICU UTF-8
> style rather than pulling in yet more variations?

There are optimizations when this flag is used. Not all operations are
possible in UTF-8; in those cases a conversion is done first. There is
not much to read between the lines here :)

> The main problem in all of this is how it dovetails into Windows? The
> reliance on 'UTF-16' style WCHAR seems to be the real problem there?

wchar is not UTF-16, nor Unicode. It is something we have to deal with
no matter which road we take. Conversions between UTF-* and wchar will
be required on Windows anyway, for any *W API call.

>> Btw, I created a sub page for Unicode support:
>>
>> https://wiki.php.net/ideas/php6/unicode
>>
>>> Thoughts, comments or ideas?
>
>
> Like you Pierre I'm no Unicode expert, and digging deeper simply reinforces
> the at times irritating compromises that Unicode contains. Obviously
> designed by committee? :(
>
> Currently I'm trying to work out just what is required at the core to
> support UTF-8 and while it is not a trivial problem, the bulk of the code is
> designed to handle strings of variable length and in its basic form UTF-8
> just creates longer strings? So isn't the next question quite simply 'case'?
> And how we handle case insensitivity in the core will determine what core
> Unicode functions are required?

I do not care about case insensitivity yet, nor about Unicode
function/method/constant/etc. names. This is a secondary issue at this
stage.

>> I found another C++ library to do the basic UTF-8 operations, easl:
>>
>> https://code.google.com/p/easl/
>>
>> It could be a nice one to use in combination with ICU, small and fast
>> (1st tests).
>
>
> C++ ?

Yes, with C helpers.

> That whatever is used will need to be both tailored for PHP and transparent
> as far as ICU is concerned is as you have identified - a given. ICU is still
> built using 32bit string lengths ( I think? ) which does add to the fun, but
> I don't see any reason not to be using functions like compareUTF8() and
> ucasemap_utf8ToLower() from ICU in which case the strings need to be
> standard ICU UTF-8 strings? I can see the advantage of the 'fast' compare
> that I have been banging on about elsewhere, which looks for a simple match
> between two raw strings of bytes. UTF-8 only comes into that when you need
> to add 'rank'? But much of the core processing CAN simply ignore that as
> long as the generic calls don't have dead tails which activate it?

We may use our own functions (or another library) to cover operations
that are not implemented in ICU or are too slow because of the
conversions. That is why investigating other tools is still a good
thing to do.

Cheers,
Pierre
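
PS: to make the conversion cost mentioned above concrete, here is a
minimal sketch of what a UTF-8 to UTF-16 pass involves before a wide
API can be called. The function name utf8_to_utf16 and its error
convention are hypothetical, chosen for illustration only; this is not
PHP or ICU code. In practice one would more likely call ICU's
u_strFromUTF8() or, on Windows, MultiByteToWideChar() with CP_UTF8.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a PHP or ICU API): converts a UTF-8 buffer
 * to UTF-16 code units. Returns the number of code units written, or
 * (size_t)-1 on malformed input or when `out` is too small. */
size_t utf8_to_utf16(const unsigned char *in, size_t in_len,
                     uint16_t *out, size_t out_cap)
{
    /* Smallest code point allowed for 0..3 continuation bytes. */
    static const uint32_t min_cp[4] = { 0, 0x80, 0x800, 0x10000 };
    size_t i = 0, o = 0;

    while (i < in_len) {
        uint32_t cp;
        size_t need, k; /* need = number of continuation bytes */
        unsigned char b = in[i++];

        if (b < 0x80)                { cp = b;        need = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; need = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; need = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; need = 3; }
        else return (size_t)-1; /* stray continuation or invalid lead */

        if (in_len - i < need) return (size_t)-1; /* truncated input */
        for (k = 0; k < need; k++) {
            unsigned char c = in[i++];
            if ((c & 0xC0) != 0x80) return (size_t)-1;
            cp = (cp << 6) | (c & 0x3F);
        }

        /* Reject overlong forms, surrogates and values past U+10FFFF. */
        if (cp < min_cp[need] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;

        if (cp < 0x10000) {
            if (o + 1 > out_cap) return (size_t)-1;
            out[o++] = (uint16_t)cp;
        } else {
            /* Outside the BMP: emit a surrogate pair. */
            if (o + 2 > out_cap) return (size_t)-1;
            cp -= 0x10000;
            out[o++] = (uint16_t)(0xD800 | (cp >> 10));
            out[o++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    return o;
}
```

Every byte has to be decoded and validated, and the output needs its
own buffer, which is why doing this around each call can eat the gains
of keeping strings in UTF-8.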