Newsgroups: php.internals
Message-ID: <532033E1.60602@lsces.co.uk>
Date: Wed, 12 Mar 2014 10:16:01 +0000
To: PHP Developers Mailing List
In-Reply-To: <53202DC5.4010306@googlemail.com>
Subject: Re: [PHP-DEV] Unicode strings?
From: lester@lsces.co.uk (Lester Caine)

Crypto Compress wrote:
>>>> The very first problem I hit is ICU's limitation to 32bit string lengths. How
>>>> does the switch to 64bit string length on 64 bit platforms impinge on this.
>>>> While I can see the advantage of this particular change, would that also now
>>>> require our own version of ICU capable of also handling longer strings? This
>>>> probably falls out in the wash of my next point ...
>>>
>>> Where have you found this information? Can you please provide source for this?
>>
>> This information has been published in several places on the list and in the
>> wiki already ...
>> http://userguide.icu-project.org/strings/utf-8 for the ICU, and the RFC's here
>> for 64 bit improvements to PHP ...
>
> Quote #1: "You can request 64 or 32 bits with the --with-library-bits= option, ..."
> Quote #2: "Strings are represented as UChar * as the base string type."
>
> http://userguide.icu-project.org/icufaq#TOC-How-do-I-get-32--or-64-bit-versions-of-the-ICU-libraries-
>
> String length is platform dependent.

It is not only PHP that has hidden gems of information buried in its
documentation, but ...

"For UTF-8 strings, ICU normally uses (const) char * pointers and int32_t lengths"

The question here is how ICU handles UTF-8 by default, as I think we actually
want to avoid using UChar altogether and use UText instead?

-- 
Lester Caine - G8HFL
-----------------------------
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk
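
To make the length concern concrete, here is a minimal sketch of the kind of
guard a caller would need if PHP string lengths become size_t while an ICU
UTF-8 API still takes an int32_t length. The function name
to_icu_int32_length is hypothetical, not part of PHP or ICU, and the sketch
only covers the narrowing step; what to do with an over-long string (reject
it, process it in chunks, or go through a different API such as UText) is
left open.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (an assumption for illustration, not PHP or ICU API).
 *
 * Narrows a size_t byte length to the int32_t that many ICU UTF-8
 * functions expect. Returns false if the length does not fit, leaving
 * *out untouched.
 *
 * A bool result is used instead of a negative sentinel because ICU
 * conventionally treats a length of -1 as "string is NUL-terminated". */
bool to_icu_int32_length(size_t len, int32_t *out)
{
    if (len > (size_t)INT32_MAX) {
        return false;   /* too long for an int32_t length parameter */
    }
    *out = (int32_t)len;
    return true;
}
```

A caller would check the return value before handing the buffer to ICU, and
fall back to some other strategy when it is false.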