Newsgroups: php.internals
Date: Tue, 16 Mar 2010 17:40:37 +0000
From: dreamcat4@gmail.com (dreamcat four)
To: Lester Caine
Cc: PHP internals
Subject: Re: [PHP-DEV] Where are we ACTUALLY on Unicode?

On Tue, Mar 16, 2010 at 11:48 AM, dreamcat four wrote:
> On Tue, Mar 16, 2010 at 8:30 AM, Lester Caine wrote:
>> '3' is not a very processor friendly number, so working with 4 even though
>> wasteful on memory, does make perfect sense. How long is it since we had a
>> 640k limit on working memory? SERVERS should have a good amount of memory
>> for caching information anyway. SO is UTF-16 the right approach for
>> processing wide strings? It needs special code to handle everything wider
>> than 16 bits, but at what gain really? If all core functionality is handled
>> as 32 bit characters is there that much of an overhead over the additional
>> processing to get around strings of dissimilar sizes in UTF-16 ?
>
> Just to re-enforce some of Lester's points above here.
>
> 4-byte per character is never slower that 2-bytes per character... its
> faster if anything. Bear in mind that 4-byte has been the defacto size
> for all modern cpu registers / 32-bit microarchitectures since....
> like... Forever. Give a c compiler 4bytes of data... it'll say: thank
> you very much, and more of the same please! It keeps em happy ;)
>
> Sure UTF-16 can make sense. But only if your external representations
> are also in UTF-16. So whats the default Unicode settings for MYSQL,
> POSTGRE, etc? Well, are they always set to UTF-8, or UTF-16?
>

To answer my own question, I have done some further research.

It seems that MySQL and PostgreSQL default to latin1 (Latin-1, an 8-bit superset of ASCII) and the 'C' locale (plain 7-bit ASCII) respectively. That is to say, neither sets itself to any Unicode encoding by default.

In the case of PostgreSQL, the ASCII default is often overridden to UTF-8 by the distro / OS / package manager, which picks it up from the locale environment variables (LANG, LC_ALL and friends). So then it's UTF-8. In the case of MySQL, it may be left as latin1, but most competent web developers decide to set it to UTF-8. Again, as far as I can tell, comparatively few people actively choose UTF-16.

The most common encoding issue people run into is that their web application has sent UTF-8 encoded data to a database (usually MySQL) that still has the factory default encoding, Latin-1. People who discover this almost always solve the problem by converting their databases to UTF-8.

As for text files on disk, if they are Unicode, they are most commonly UTF-8 too.

So then, why use UTF-16 as the internal Unicode representation in PHP? It doesn't really make a lot of sense for most regular people who want to use PHP for their web application. Unless they don't really care how slow it's gonna be converting everything, constantly...
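
To make the Latin-1 vs UTF-8 mismatch and the size trade-off concrete, here is a minimal sketch in Python (not PHP, and not anything from the PHP source tree). The function names and the sample string are made up for illustration; the only assumption is that UTF-8 bytes get read back as if they were Latin-1, which is roughly what happens when the application and the database disagree about the encoding.

```python
# Hypothetical helpers for illustration only; the names are not from any library.

def encoded_sizes(s):
    """Return how many bytes s takes in each encoding (no byte order mark)."""
    return {
        "utf-8": len(s.encode("utf-8")),
        "utf-16": len(s.encode("utf-16-le")),
        "utf-32": len(s.encode("utf-32-le")),
    }


def misread_as_latin1(s):
    """Simulate UTF-8 bytes being interpreted as Latin-1 (classic mojibake)."""
    return s.encode("utf-8").decode("latin-1")


def repair_mojibake(garbled):
    """Undo misread_as_latin1, assuming nothing else touched the bytes."""
    return garbled.encode("latin-1").decode("utf-8")


if __name__ == "__main__":
    sample = "café"  # hypothetical sample text
    print(encoded_sizes(sample))      # {'utf-8': 5, 'utf-16': 8, 'utf-32': 16}
    garbled = misread_as_latin1(sample)
    print(garbled)                    # cafÃ©
    print(repair_mojibake(garbled))   # café
```

For mostly-ASCII text UTF-8 is the smallest of the three, and whichever internal form is chosen, every string crossing the boundary to a UTF-8 database or file has to go through an encode/decode step like the ones above.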