Newsgroups: php.internals
From: j.boggiano@seld.be (Jordi Boggiano)
To: Stan Vassilev
Cc: "William A. Rowe Jr.", internals@lists.php.net
Date: Sun, 14 Mar 2010 15:23:26 +0100
Subject: Re: [PHP-DEV] Where are we ACTUALLY on Unicode?
Message-ID: <4bcbf4711003140723s712c2653xa61e8f6053983553@mail.gmail.com>
In-Reply-To: <13008E62F851429F84B9FE2F3F230286@pc>
References: <4B9C9007.1080802@lsces.co.uk> <4B9C91D7.2050402@rowe-clan.net> <13008E62F851429F84B9FE2F3F230286@pc>

On Sun, Mar 14, 2010 at 12:03 PM, Stan Vassilev wrote:
> UTF8 also takes 4 bytes for representing characters in the higher bit
> planes, as quite a lot of bits are lost for every char in order to describe
> how long the code point is, and when it ends and so on. This means
> memory-wise it may not be of big benefit to Asian countries.

I remember Brian Aker saying that they chose to work internally with UTF-8 for Drizzle. His explanation was that Asian countries have so much English content mixed in that, on average, UTF-8 still had a lower footprint than UTF-16/32 even for them. I do not know where the stats came from, but if there is any truth to it, it is worth considering.

Cheers,
Jordi
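
For anyone who wants to check the footprint argument on their own data, here is a minimal sketch that counts the bytes a string needs in each encoding. It is only an illustration: the function name `encoded_sizes` and the sample strings are hypothetical, and nothing here comes from Drizzle or from the statistics mentioned above.

```python
# Illustrative sketch: compare the byte footprint of a string in
# UTF-8, UTF-16 and UTF-32. The function name and the sample strings
# are hypothetical and chosen only for demonstration.

def encoded_sizes(text: str) -> dict[str, int]:
    """Return the number of bytes needed to store text in each encoding."""
    # The "-le" codecs are used so that no byte order mark is added to the count.
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


samples = {
    "ASCII only": "hello world",
    "CJK in the Basic Multilingual Plane": "日本語",
    "supplementary-plane character": "\U0002000B",
    "CJK inside ASCII markup": '<p class="title">日本語</p>',
}

if __name__ == "__main__":
    for label, text in samples.items():
        print(label, encoded_sizes(text))
```

Most CJK characters sit in the Basic Multilingual Plane, where UTF-8 needs 3 bytes per character and UTF-16 needs 2. The 4-byte UTF-8 case applies to supplementary-plane characters, which also take 4 bytes in UTF-16. Whether UTF-8 comes out smaller overall therefore depends on how much ASCII (markup, English words, digits) is mixed into the text, which is the point of the Drizzle argument.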