Newsgroups: php.internals
From: cryptocompress@googlemail.com (Crypto Compress)
Date: Wed, 12 Mar 2014 11:27:19 +0100
To: Lester Caine, PHP Developers Mailing List
Subject: Re: [PHP-DEV] Unicode strings?

On 12.03.2014 11:16, Lester Caine wrote:
> Crypto Compress wrote:
>>>>> The very first problem I hit is ICU's limitation to 32bit string
>>>>> lengths. How does the switch to 64bit string length on 64 bit
>>>>> platforms impinge on this. While I can see the advantage of this
>>>>> particular change, would that also now require our own version of
>>>>> ICU capable of also handling longer strings? This probably falls
>>>>> out in the wash of my next point ...
>>>>
>>>> Where have you found this information? Can you please provide
>>>> source for this?
>>>
>>> This information has been published in several places on the list
>>> and in the wiki already ...
>>> http://userguide.icu-project.org/strings/utf-8 for the ICU, and the
>>> RFC's here for 64 bit improvements to PHP ...
>>
>> Quote #1: "You can request 64 or 32 bits with the
>> --with-library-bits= option, ..."
>> Quote #2: "Strings are represented as UChar * as the base string type."
>>
>> http://userguide.icu-project.org/icufaq#TOC-How-do-I-get-32--or-64-bit-versions-of-the-ICU-libraries-
>>
>> String length is platform dependent.
>
> It is not only PHP that has hidden gems of information buried in the
> documentation, but ...
> "For UTF-8 strings, ICU normally uses (const) char * pointers and
> int32_t lengths"
>
> The question here is how UTF-8 default works in ICU as we want to
> actually avoid using UChar altogether using UText instead - I think?
> http://www.icu-project.org/apiref/icu4c/utext_8h.html

int64_t utext_nativeLength (UText *ut)
    Get the length of the text.

Looks like UText is UTF-16.
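
To make the length concern concrete, here is a minimal sketch in plain C of
the check a caller would need before handing a 64-bit string length to an
API that takes int32_t lengths, as the quoted user guide says the UTF-8
functions normally do. It makes no ICU calls, and the helper name
fits_int32_length is hypothetical, not something from ICU or PHP:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of ICU or PHP): returns 1 if a byte
 * length held in a size_t can be passed as an int32_t length without
 * truncation, and 0 otherwise. */
static int fits_int32_length(size_t len)
{
    return len <= (size_t)INT32_MAX;
}
```

A 1024-byte string passes the check, while a string of INT32_MAX + 1 bytes
(2 GiB) does not, so it would have to be rejected or processed in chunks
before going through an int32_t-length interface.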