Newsgroups: php.internals
From: wrowe@rowe-clan.net ("William A. Rowe Jr.")
Date: Sun, 14 Mar 2010 01:35:51 -0600
To: internals@lists.php.net
Subject: Re: [PHP-DEV] Where are we ACTUALLY on Unicode?
Message-ID: <4B9C91D7.2050402@rowe-clan.net>
In-Reply-To: <4B9C9007.1080802@lsces.co.uk>

If Unicode were the solution, the PHP project was on the right page with 6.0. Sure, there remained work to do, but... How long did it take to realize UTF-16 wasn't the end of the story?
UCS-4 is the minimum to solve this, and we all agree that nobody in the western world is going to spend 32 bits storing a single char, no way, no how.

The UTF-8 solution is probably the right answer... you maintain 95% of char *UTF behavior, and you gain international character representation. The only Unicode OS I can think of offhand is NT, and of course they hit the UCS-4 problem early; they found this out 15+ years ago. Sure, UTF-8 doesn't appear atomic, one Xword per char, but the existing library frameworks already contain most of the string processing that is required. There is no 16-bit network transmission API that I can think of; you are still devolving to UTF-8 for client results.

Accepting -and preferring- UTF-8 as the representation of characters throughout PHP, recognizing UTF-8 for char-length representations, and so forth, would do wonders to move things forward. And 8-bit octet data can be set aside in the same data structures. It is the straightforward answer, which is probably why Linux did not repeat Windows NT's decision and adopted UTF-8 instead.
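
To make the char * point concrete, here is a minimal sketch in C. It is not from any PHP source; the function name utf8_length is hypothetical, and the input is assumed to be valid, NUL-terminated UTF-8. It shows that the byte-oriented library calls keep working unchanged, and that a character-length count needs only one extra test per byte.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the code points in a NUL-terminated string
 * that is assumed to be valid UTF-8. No validation is performed.
 *
 * UTF-8 continuation bytes always match the bit pattern 10xxxxxx, so any
 * byte that does NOT match it begins a new code point. */
size_t utf8_length(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the five-byte string "caf\xC3\xA9" (the word "café"), strlen() still reports the storage size in bytes, while utf8_length() reports four characters. Functions such as strstr() and strcmp() work on UTF-8 as they are, because the encoding of one character never appears inside the encoding of another.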