Newsgroups: php.internals
Date: Wed, 12 Mar 2014 11:33:24 +0100
From: cryptocompress@googlemail.com (Crypto Compress)
To: Lester Caine, PHP Developers Mailing List
Subject: Re: [PHP-DEV] Unicode strings?
Message-ID: <532037F4.6020204@googlemail.com>
In-Reply-To: <53203687.7090405@googlemail.com>

On 12.03.2014 11:27, Crypto Compress wrote:
> On 12.03.2014 11:16, Lester Caine wrote:
>> Crypto Compress wrote:
>>>>>> The very first problem I hit is ICU's limitation to 32bit string
>>>>>> lengths. How does the switch to 64bit string length on 64 bit
>>>>>> platforms impinge on this. While I can see the advantage of this
>>>>>> particular change, would that also now require our own version of
>>>>>> ICU capable of also handling longer strings? This probably falls
>>>>>> out in the wash of my next point ...
>>>>>
>>>>> Where have you found this information? Can you please provide
>>>>> source for this?
>>>>
>>>> This information has been published in several places on the list
>>>> and in the wiki already ...
>>>> http://userguide.icu-project.org/strings/utf-8 for the ICU, and the
>>>> RFC's here for 64 bit improvements to PHP ...
>>>
>>> Quote #1: "You can request 64 or 32 bits with the
>>> --with-library-bits= option, ..."
>>> Quote #2: "Strings are represented as UChar * as the base string type."
>>>
>>> http://userguide.icu-project.org/icufaq#TOC-How-do-I-get-32--or-64-bit-versions-of-the-ICU-libraries-
>>>
>>> String length is platform dependent.
>>
>> It is not only PHP that has hidden gems of information buried in the
>> documentation, but ...
>> "For UTF-8 strings, ICU normally uses (const) char * pointers and
>> int32_t lengths"
>>
>> The question here is how UTF-8 default works in ICU as we want to
>> actually avoid using UChar altogether using UText instead - I think?
>
> http://www.icu-project.org/apiref/icu4c/utext_8h.html
>
> int64_t utext_nativeLength (UText *ut)    Get the length of the text.
>
> Looks like UText is utf-16.

ICU Text Access allows other formats, such as UTF-8 or non-contiguous UTF-16 strings, to be placed in a UText wrapper and then passed to ICU services.
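To make the length question concrete, here is a minimal sketch in plain C of the check a caller would need before handing a 64-bit string length to an API that takes an int32_t length, next to the same check for the int64_t that utext_nativeLength() returns. It deliberately does not call ICU. The helper names length_to_int32, length_to_int64 and demo_length_checks are hypothetical and exist only for illustration; they are not part of ICU or PHP.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of ICU or PHP: store len in *out and return
 * true only if it fits an int32_t length parameter, the type ICU normally
 * uses for (const) char * UTF-8 strings. */
static bool length_to_int32(size_t len, int32_t *out)
{
    if (len > (size_t)INT32_MAX) {
        return false;               /* would overflow an int32_t length */
    }
    *out = (int32_t)len;
    return true;
}

/* Hypothetical helper: the same check for an int64_t length, the type
 * returned by utext_nativeLength(). */
static bool length_to_int64(size_t len, int64_t *out)
{
    if ((uint64_t)len > (uint64_t)INT64_MAX) {
        return false;               /* would overflow an int64_t length */
    }
    *out = (int64_t)len;
    return true;
}

/* Usage sketch: a 2 GiB length fails the 32-bit check but passes the
 * 64-bit one. */
static void demo_length_checks(void)
{
    int32_t n32 = 0;
    int64_t n64 = 0;
    size_t two_gib = (size_t)INT32_MAX + 1u;

    assert(length_to_int32(1024, &n32) && n32 == 1024);
    assert(!length_to_int32(two_gib, &n32));
    assert(length_to_int64(two_gib, &n64) && n64 == INT64_C(2147483648));
}
```

Whether PHP would need such a guard at all depends on which ICU entry points end up being used; this only shows where a 32-bit length stops being enough.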