Newsgroups: php.internals
Message-ID: <4B9FCD0E.1060405@hristov.com>
Date: Tue, 16 Mar 2010 19:25:18 +0100
From: php@hristov.com (Andrey Hristov)
To: dreamcat four
CC: Lester Caine, PHP internals
Subject: Re: [PHP-DEV] Where are we ACTUALLY on Unicode?

dreamcat four wrote:
> On Tue, Mar 16, 2010 at 11:48 AM, dreamcat four wrote:
>> On Tue, Mar 16, 2010 at 8:30 AM, Lester Caine wrote:
>>> '3' is not a very processor friendly number, so working with 4 even though
>>> wasteful on memory, does make perfect sense. How long is it since we had a
>>> 640k limit on working memory? SERVERS should have a good amount of memory
>>> for caching information anyway. So is UTF-16 the right approach for
>>> processing wide strings? It needs special code to handle everything wider
>>> than 16 bits, but at what gain really? If all core functionality is handled
>>> as 32 bit characters is there that much of an overhead over the additional
>>> processing to get around strings of dissimilar sizes in UTF-16 ?
>> Just to reinforce some of Lester's points above here.
>>
>> 4-byte per character is never slower than 2-bytes per character... it's
>> faster if anything. Bear in mind that 4-byte has been the de facto size
>> for all modern cpu registers / 32-bit microarchitectures since....
>> like... Forever. Give a c compiler 4bytes of data... it'll say: thank
>> you very much, and more of the same please! It keeps em happy ;)
>>
>> Sure UTF-16 can make sense. But only if your external representations
>> are also in UTF-16. So what are the default Unicode settings for MYSQL,
>> POSTGRE, etc? Well, are they always set to UTF-8, or UTF-16?
>>
>
> To answer my own question, I have done some further research.
>
> It seems that both MySQL and Postgre recommend / default to Latin1
> (8-bit ASCII) and 'C' (7-bit ASCII) respectively. So that is to say
> neither set themselves to any unicode standard by default.
>
> In the case of Postgre, the ASCII default is often overridden to UTF-8
> by the distro / os / package managers. From the $LOCALE environment
> variable. So then it's UTF-8.
>
> In the case of MySQL, it may be left as latin1. But most competent web
> developers decide to set it to utf-8. Again, it's not generally
> believed that very many people (by comparison) actively choose
> utf-16. The most common encoding issue people run into is that their
> web application has sent their database utf-8 encoded data. But their
> (usually a MySQL) database still has the factory default encoding
> Latin-1 (8-bit ascii). People who discover this almost always solve
> the problem by converting their databases into utf-8.

MySQL doesn't support UTF-16 in any GA release. UCS-2 can be used though.

> As for text files on disk, if they are unicode, they are most commonly
> utf-8 too. So then, why use utf-16 as internal unicode representation
> in Php? It doesn't really make a lot of sense for most regular people
> who want to use Php for their web application. Unless they don't
> really care how slow it's gonna be converting everything, constantly...
>

Andrey
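
On the UCS-2 vs UTF-16 point, and Lester's "special code for everything wider than 16 bits": a minimal Python sketch of the size trade-off may help. It is only an illustration, not anything from PHP or MySQL. The helper names encoded_sizes and fits_in_ucs2 are made up for this example, and the check assumes well-formed input with no lone surrogates.

```python
def encoded_sizes(text):
    """Return how many bytes `text` occupies in each Unicode encoding form."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # The "-le" variants are used so no byte-order mark is counted.
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


def fits_in_ucs2(text):
    """True if every code point is in the Basic Multilingual Plane.

    UCS-2 stores exactly one 16-bit unit per character, so anything above
    U+FFFF cannot be represented; UTF-16 needs a surrogate pair for it.
    """
    return all(ord(ch) <= 0xFFFF for ch in text)


if __name__ == "__main__":
    # ASCII, Latin-1 range, CJK (BMP), and a character outside the BMP.
    for sample in ["hello", "h\u00e9llo", "\u65e5\u672c\u8a9e", "\U0001D11E"]:
        print(repr(sample), encoded_sizes(sample), fits_in_ucs2(sample))
```

For ASCII-heavy web data UTF-8 is the smallest and UTF-32 the largest, and the last sample is where UCS-2 stops working while UTF-16 switches to two 16-bit units for one character.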