Newsgroups: php.internals
Message-ID: <530F18C6.1000301@lsces.co.uk>
References: <530F0BF8.4040307@lsces.co.uk>
Date: Thu, 27 Feb 2014 10:51:50 +0000
From: lester@lsces.co.uk (Lester Caine)
To: PHP internals
Subject: Re: [PHP-DEV] Re: [php6] Unicode support, options?

Pierre Joye wrote:
>> That what ever is used will need to be both tailored for PHP and
>> transparent as far as ICU is concerned is as you have identified - a
>> given.
>> ICU is still built using 32bit string lengths ( I think? ) which does
>> add to the fun, but I don't see any reason not to be using functions
>> like compareUTF8() and ucasemap_utf8ToLower() from ICU in which case
>> the strings need to be standard ICU UTF-8 strings? I can see the
>> advantage of the 'fast' compare that I have been banging on about
>> elsewhere, which looks for a simple match between two raw strings of
>> bytes. UTF-8 only comes into that when you need to add 'rank'? But much
>> of the core processing CAN simply ignore that as long as the generic
>> calls don't have dead tails which activate it?
>
> We may use our own functions (or other lib) to covers operations not
> implemented in ICU or too slow because of the conversions. That's why
> investigating in other tools is still a good thing to do.

The bit I'm still missing here is 'operations not implemented in ICU'?
As soon as conversions are required, speed is always going to be
compromised, but where the platform is already UTF-8 based, which is
increasingly common, all we are looking for is to handle UTF-8 strings
quickly. For the best performance, conversions can simply be avoided. So
I'm currently treating conversion as a secondary problem - probably less
important than case! - and just trying to identify what is missing from
ICU's UTF-8 support that needs to be added.

It may well be that Windows is a special case that needs its own
conversion layer, but that should not form part of any core upgrade. It
is not needed for many installations.

-- 
Lester Caine - G8HFL
-----------------------------
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk
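
For reference, a minimal sketch of the 'fast' compare described in the quoted text: a plain match between two raw strings of bytes, with no conversion and no Unicode knowledge. The function name raw_utf8_equal and its length-counted arguments are hypothetical, chosen only for illustration; this is not PHP's or ICU's actual API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of PHP or ICU): returns true when the two
 * length-counted byte strings are identical byte for byte. The bytes are
 * never decoded, so it works on UTF-8 input without any conversion step.
 */
static bool raw_utf8_equal(const char *a, size_t a_len,
                           const char *b, size_t b_len)
{
    assert(a != NULL && b != NULL);

    /* Different lengths can never match, so skip the memcmp entirely. */
    if (a_len != b_len) {
        return false;
    }
    return memcmp(a, b, a_len) == 0;
}
```

A match here is exact: "caf\xC3\xA9" (precomposed U+00E9) and "cafe\xCC\x81" (e followed by combining U+0301) render the same but differ as bytes, so the sketch reports them as unequal. Cases like that, along with case folding and ordering (assuming that is what 'rank' refers to), are where a Unicode-aware library such as ICU would have to take over.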