Newsgroups: php.internals
Message-ID: <53202DC5.4010306@googlemail.com>
Date: Wed, 12 Mar 2014 10:49:57 +0100
From: cryptocompress@googlemail.com (Crypto Compress)
To: Lester Caine, PHP Developers Mailing List
Subject: Re: [PHP-DEV] Unicode strings?
References: <531EE602.3090207@lsces.co.uk> <531EEE2A.2000602@googlemail.com> <531F0146.5010701@lsces.co.uk>
In-Reply-To: <531F0146.5010701@lsces.co.uk>

Hi,

On 11.03.2014 13:27, Lester Caine wrote:
> Crypto Compress wrote:
>>> I'm slowly working through a long list of things relating to unicode strings
>>> trying to work out just where the main problems are.
>>>
>>> The very first problem I hit is ICU's limitation to 32bit string lengths. How
>>> does the switch to 64bit string length on 64 bit platforms impinge on this.
>>> While I can see the advantage of this particular change, would that also now
>>> require our own version of ICU capable of also handling longer strings? This
>>> probably falls out in the wash of my next point ...
>>
>> Where have you found this information? Can you please provide source for this?
>
> This information has been published in several places on the list and in the
> wiki already ...
> http://userguide.icu-project.org/strings/utf-8 for the ICU, and the RFC's here
> for 64 bit improvements to PHP ...

Quote #1: "You can request 64 or 32 bits with the --with-library-bits= option, ..."
Quote #2: "Strings are represented as UChar * as the base string type."
http://userguide.icu-project.org/icufaq#TOC-How-do-I-get-32--or-64-bit-versions-of-the-ICU-libraries-

String length is platform dependent.
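Since Quote #2 names UChar (a 16-bit UTF-16 code unit in ICU) as the base type, it may help to see that "length" also depends on which unit is counted. A minimal sketch, in Python only for brevity; the function name lengths and the sample text are made up for illustration and are not part of ICU or PHP:

```python
def lengths(text: str) -> dict:
    """Return the size of the same text in three different units."""
    return {
        # Number of Unicode code points.
        "code_points": len(text),
        # Number of bytes when stored as UTF-8.
        "utf8_bytes": len(text.encode("utf-8")),
        # Number of 16-bit code units when stored as UTF-16
        # (two bytes per unit, no byte order mark with the -le codec).
        "utf16_units": len(text.encode("utf-16-le")) // 2,
    }


if __name__ == "__main__":
    # Hypothetical sample: German text, a euro sign, and one character
    # outside the Basic Multilingual Plane.
    print(lengths("Grüße €😀"))
```

The three counts differ for the same text, so any length limit has to say which unit it is measured in.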
>
>>> Currently strings are simply strings? I'm sure we have already had this
>>> discussion, and it will be necessary to switch from simple strings to a string
>>> object which can handle the intricacies of unicode?
>>
>> Yes, currently we have so called binary strings (simple bytes, 8 bits).
>> No, we should not create a string-object to handle all intricacies of unicode.
>
> How do you provide a holder for the various additional items required for a
> unicode 'object'? While I can see one would get away with calling functions all
> the time on a single string object, having calculated different versions of the
> same string or complex character counts, they need to be cached so they can be
> used again? Or does one maintain each answer in different variables?
>

I think of this as an "immutable ValueObject". If a string is converted, there is
no reason to cache the old string.

binary
  => convert to utf-8 as de_de.iso-8859-15@euro
  => {"utf-8", "de_DE_EURO", binary}
  => convert to utf-32
  => {"utf-32", "de_DE_EURO", bigger-binary}

What other data is needed in here for the value to be unambiguously Unicode?
Is locale needed at all? Should it be nullable? Case-(in)sensitive?

cryptocompress
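
The conversion chain above can be sketched as a small immutable value object. This is only an illustration under my own assumptions, written in Python for brevity: the class name EncodedString, its fields, and the convert method are hypothetical and not an existing PHP or ICU API, and the locale is carried along as an opaque, nullable label.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EncodedString:
    """Immutable triple of {encoding, locale, binary} (hypothetical names)."""
    encoding: str
    locale: Optional[str]  # nullable: whether it is needed at all is an open question
    data: bytes

    def convert(self, target_encoding: str) -> "EncodedString":
        """Return a NEW value in the target encoding; the old one is untouched."""
        text = self.data.decode(self.encoding)
        return EncodedString(target_encoding, self.locale, text.encode(target_encoding))


if __name__ == "__main__":
    # Plain binary input, known to be ISO-8859-15 (0xA4 is the euro sign there).
    raw = EncodedString("iso-8859-15", "de_DE_EURO", b"10 \xa4")
    as_utf8 = raw.convert("utf-8")
    as_utf32 = as_utf8.convert("utf-32-le")  # -le variant avoids a byte order mark
    print(as_utf8.encoding, len(as_utf8.data))
    print(as_utf32.encoding, len(as_utf32.data))
```

Because every conversion returns a fresh value, nothing has to be cached inside the object; a caller that still needs the old representation simply keeps its reference to it.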