Newsgroups: php.internals
Date: Sun, 14 Mar 2010 23:34:24 +0900
To: Jordi Boggiano
Cc: Stan Vassilev, "William A. Rowe Jr.", internals@lists.php.net
Subject: Re: [PHP-DEV] Where are we ACTUALLY on Unicode?
From: mozo@mozo.jp (Moriyoshi Koizumi)

On Sun, Mar 14, 2010 at 11:23 PM, Jordi Boggiano wrote:
> On Sun, Mar 14, 2010 at 12:03 PM, Stan Vassilev wrote:
>> UTF8 also takes 4 bytes for representing characters in the higher bit
>> planes, as quite a lot of bits are lost for every char in order to describe
>> how long the code point is, and when it ends and so on. This means
>> memory-wise it may not be of big benefit to Asian countries.
>
> I remember Brian Aker saying that they chose to work internally with
> UTF-8 for Drizzle. His explanation of it was that Asian countries have
> so much English content mixed in that on average even for them UTF-8
> still had a lower footprint than UTF-16/32. I do not know where the
> stats came from, but if it holds any truth it is worth considering.

This is true, since most text data exchanged over the Internet is
likely to be represented in HTML, where Asian characters and
alphabetic (ASCII) tags are always interleaved.

Moriyoshi
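
To make the footprint argument concrete, here is a minimal sketch in Python that compares encoded sizes. The helper name `encoded_sizes` and both sample strings are made up for illustration; they are not taken from Drizzle or from any measurement mentioned in this thread.

```python
def encoded_sizes(text):
    """Return the byte length of `text` in UTF-8, UTF-16 and UTF-32.

    The "-le" codec variants are used so that no byte order mark
    is added to the counts.
    """
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


# Hypothetical samples: the same Japanese text, bare and wrapped in markup.
plain = "日本語のテキスト"
html = '<p class="note"><a href="/index.html">日本語のテキスト</a></p>'

if __name__ == "__main__":
    for label, sample in (("plain", plain), ("html", html)):
        print(label, encoded_sizes(sample))
```

For the bare Japanese string, UTF-16 is the smaller encoding (16 bytes against 24 for UTF-8), because each of these characters needs 3 bytes in UTF-8 and 2 in UTF-16. Once the same text is wrapped in ASCII markup, UTF-8 comes out ahead (70 bytes against 108), because every tag character costs 1 byte instead of 2. How real content behaves depends on its actual ratio of markup to text.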