Newsgroups: php.internals
Message-ID: <531F0146.5010701@lsces.co.uk>
Date: Tue, 11 Mar 2014 12:27:50 +0000
To: PHP Developers Mailing List
References: <531EE602.3090207@lsces.co.uk> <531EEE2A.2000602@googlemail.com>
In-Reply-To: <531EEE2A.2000602@googlemail.com>
Subject: Re: [PHP-DEV] Unicode strings?
From: lester@lsces.co.uk (Lester Caine)

Crypto Compress wrote:
>> I'm slowly working through a long list of things relating to unicode strings
>> trying to work out just where the main problems are.
>>
>> The very first problem I hit is ICU's limitation to 32bit string lengths. How
>> does the switch to 64bit string length on 64 bit platforms impinge on this.
>> While I can see the advantage of this particular change, would that also now
>> require our own version of ICU capable of also handling longer strings? This
>> probably falls out in the wash of my next point ...
>
> Where have you found this information? Can you please provide source for this?

This information has already been published in several places on the list and
in the wiki ... http://userguide.icu-project.org/strings/utf-8 for ICU, and the
RFCs here for the 64 bit improvements to PHP ...

>> Currently strings are simply strings? I'm sure we have already had this
>> discussion, and it will be necessary to switch from simple strings to a string
>> object which can handle the intricacies of unicode?
>
> Yes, currently we have so called binary strings (simple bytes, 8 bits).
> No, we should not create an string-object to handle all intricacies of unicode.

How do you provide a holder for the various additional items required for a
Unicode 'object'? While I can see one could get away with calling functions on
a single string object every time, once different versions of the same string
or complex character counts have been calculated, don't they need to be cached
so they can be used again? Or does one maintain each answer in a different
variable?

-- 
Lester Caine - G8HFL
-----------------------------
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk
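
To make the 'holder' question above concrete, here is a rough sketch of what such a thing might look like. It is written in Python purely for brevity and is not a proposal for PHP's internals. The class name UnicodeHolder, its methods and the ICU_MAX_LENGTH constant are all hypothetical names made up for this illustration. The only assumptions carried over from the thread are that the raw string is UTF-8 bytes and that a 32 bit signed length is the limit of interest.

```python
import unicodedata

# Largest value a signed 32-bit length can hold (assumed limit for this sketch).
ICU_MAX_LENGTH = 2**31 - 1


class UnicodeHolder:
    """Hypothetical holder: raw UTF-8 bytes plus lazily cached derived values."""

    def __init__(self, raw: bytes):
        self.raw = raw
        self._cache = {}  # each derived value is computed at most once

    def fits_32bit_length(self) -> bool:
        # True if the byte length is representable as a signed 32-bit integer.
        return len(self.raw) <= ICU_MAX_LENGTH

    def _cached(self, key, compute):
        # Compute on first request, then reuse the stored answer.
        if key not in self._cache:
            self._cache[key] = compute()
        return self._cache[key]

    def code_points(self) -> int:
        # Count code points by skipping UTF-8 continuation bytes (0b10xxxxxx).
        return self._cached(
            "code_points",
            lambda: sum(1 for b in self.raw if (b & 0xC0) != 0x80),
        )

    def nfc(self) -> bytes:
        # NFC-normalised copy of the string, re-encoded as UTF-8.
        return self._cached(
            "nfc",
            lambda: unicodedata.normalize(
                "NFC", self.raw.decode("utf-8")
            ).encode("utf-8"),
        )


# Usage: 'e' followed by a combining acute accent (U+0301).
s = UnicodeHolder("e\u0301".encode("utf-8"))
print(len(s.raw), s.code_points(), len(s.nfc()))
```

The alternative raised in the question, keeping each answer in its own variable, avoids the holder entirely but leaves the caller responsible for keeping those variables in step with the string whenever it changes.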