Newsgroups: php.internals
Message-ID: <13008E62F851429F84B9FE2F3F230286@pc>
To: "William A. Rowe Jr."
References: <4B9C9007.1080802@lsces.co.uk> <4B9C91D7.2050402@rowe-clan.net>
Date: Sun, 14 Mar 2010 13:03:47 +0200
Subject: Re: [PHP-DEV] Where are we ACTUALLY on Unicode?
From: sv_forums@fmethod.com ("Stan Vassilev")

> If Unicode were the solution, the PHP project was on the right page
> with 6.0.
> Sure there remained work to do, but...
>
> How long did it take to realize UTF16 wasn't the end of the story?
> UCS-4 is the minimum to solve this, and we all agree that 32 bits
> aren't storing a single char in the western world, no way, no how.
>
> The UTF-8 solution is probably the right answer... you maintain 95%
> of char *UTF behavior, and you gain international character
> representation. The only Unicode OS I can think of offhand is NT,
> and of course they hit the UCS-4 problem early. They found this out
> 15+ years ago.
>
> Sure it doesn't appear as atomic, one Xword per char, but the
> existing library frameworks contain most of the string processing
> that is required. There is no 16-bit network transmission API that
> I can think of, you are still devolving to UTF-8 for client results.
>
> To move forward with accepting -and preferring- UTF-8 as the
> representation of characters throughout PHP, recognizing UTF-8 for
> char-length representations, and so forth, would do wonders to move
> forwards. And 8-bit octet data can be set aside in the same data
> structures. It is the straightforward answer, which is probably why
> Linux did not repeat Windows NT decision, and adopted utf-8.

Hi,

UTF-8 is good for text that is mostly ASCII with the occasional
international character. It's also generally fine for storing strings
and passing them between apps.

However, it's a really poor in-memory representation of a string,
because a code point takes anywhere between 1 and 4 bytes. A simple
operation like $string[$x] means you have to walk and interpret the
string from the start until you have counted up to the code point you
need.

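
To make the cost of $string[$x] on UTF-8 concrete, here is a minimal
sketch of that walk. It is written in Python rather than PHP so the
byte handling is explicit, and it assumes well-formed UTF-8 input. The
function names utf8_sequence_length and utf8_codepoint_at are made up
for this illustration; they are not PHP or Python APIs.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence occupies, judged by its lead byte."""
    if lead_byte < 0x80:
        return 1  # 0xxxxxxx: ASCII
    if lead_byte >> 5 == 0b110:
        return 2  # 110xxxxx
    if lead_byte >> 4 == 0b1110:
        return 3  # 1110xxxx
    if lead_byte >> 3 == 0b11110:
        return 4  # 11110xxx
    raise ValueError("not a valid UTF-8 lead byte")


def utf8_codepoint_at(data: bytes, index: int) -> str:
    """Return the index-th code point by walking the bytes from the start.

    The loop runs once per preceding code point, so the lookup is O(index)
    instead of the O(1) that a fixed-width encoding would give.
    """
    if index < 0:
        raise IndexError("negative index")
    offset = 0
    for _ in range(index):
        if offset >= len(data):
            raise IndexError("code point index out of range")
        offset += utf8_sequence_length(data[offset])
    if offset >= len(data):
        raise IndexError("code point index out of range")
    length = utf8_sequence_length(data[offset])
    return data[offset:offset + length].decode("utf-8")


# Mixed sample: ASCII, a 2-byte char, three 3-byte chars, one 4-byte char.
text = "na\u00efve \u65e5\u672c\u8a9e \U0001F600".encode("utf-8")
print(len(text))                    # number of bytes, not code points
print(utf8_codepoint_at(text, 6))   # seventh code point, found by walking
```

The byte offset of a code point cannot be computed from its index
alone, which is why every lookup has to start from the beginning
(or from some cached position).
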
UTF-8 also takes 3 bytes for most CJK characters in the BMP and 4 bytes
for characters in the higher planes, because quite a few bits of every
sequence are spent on describing how long the code point is and where
it ends. This means that memory-wise it may not be of much benefit to
Asian countries.

Since the western world, as you put it, wouldn't want to waste 4 bytes
on characters that fit in 1 byte, we could store the encoding of each
string in a single byte that enumerates the encodings supported by PHP
(I believe there are fewer than 255 of them), so the string functions
know how to operate on them and convert between them. This way you use
Unicode only when you need it, which reduces the impact of spending a
full 4 bytes per code point: you can still use 1-byte Latin-1, mix it
freely with Unicode, and still produce UTF-8 output for the web at the
end (the final conversion to UTF-8 from *anything* is cheap).

Another alternative is to do what JavaScript does. JavaScript uses a
2-byte encoding for Unicode (UTF-16), and a code point that doesn't fit
in 2 bytes is encoded in 4 bytes. JavaScript counts such a code point
as 2 chars, although it's technically one code point. It's awkward, but
since PHP is a web language, consistency with JavaScript may even be
beneficial. It also solves the $string[$x] problem, as you no longer
need to walk the string: you just blindly return the 2 bytes at the
address the string points to + 2 * $x.

With this approach, all characters in the BMP report correct offsets in
char index and substr functions, as they fit in 2 bytes. Workarounds
and helper functions can be introduced for handling the 4-byte code
points of the other planes.

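
The JavaScript-style approach can be sketched as follows, again in
Python for illustration only. The string is held as a list of 16-bit
code units, indexing is a plain array lookup, and a code point outside
the BMP shows up as two units. The names utf16_units, unit_at and
codepoint_starting_at are hypothetical; the last one stands in for the
kind of helper function mentioned above.

```python
import struct
from typing import List


def utf16_units(text: str) -> List[int]:
    """Encode text as UTF-16 and return its 16-bit code units."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))


def unit_at(units: List[int], x: int) -> int:
    """O(1) lookup: the equivalent of reading 2 bytes at base + 2 * x."""
    return units[x]


def is_high_surrogate(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit: int) -> bool:
    return 0xDC00 <= unit <= 0xDFFF


def codepoint_starting_at(units: List[int], x: int) -> int:
    """Hypothetical helper: decode the code point whose first unit is at x.

    Combines a surrogate pair into one code point; any other unit
    (including an unpaired surrogate) is returned as-is.
    """
    unit = units[x]
    if (is_high_surrogate(unit) and x + 1 < len(units)
            and is_low_surrogate(units[x + 1])):
        return 0x10000 + ((unit - 0xD800) << 10) + (units[x + 1] - 0xDC00)
    return unit


# 'a', a BMP CJK character, one character outside the BMP, 'b'.
sample = "a\u65e5\U0001F600b"
units = utf16_units(sample)
print(len(units))                             # counts code units, not code points
print(hex(unit_at(units, 2)))                 # first half of the surrogate pair
print(hex(codepoint_starting_at(units, 2)))   # the full code point
```

For text that stays inside the BMP the unit index and the code point
index are the same, so only strings with characters from the other
planes need the helper.
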
It does of course make certain operations harder. For example, a
character range between two 4-byte code points in a regex will produce
unexpected results, because the regex will see these chars:

[2bytes2bytes-2bytes2bytes] i.e.: [a b-c d]

and not this:

[4bytes-4bytes]

Still, a variable-width encoding, UTF-8 or UTF-16, doesn't cut it for
general use for me, as in tests it shows a drastic slowdown when the
script does heavy string processing. I'd rather have Unicode strings
take more RAM while being fast, and use Latin-1 when Latin-1 is what I
need.

Regards,
Stan Vassilev