Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:47261
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Received-SPF: error (pb1.pair.com: domain fmethod.com from 69.16.228.148 cause and error)
Message-ID: <13008E62F851429F84B9FE2F3F230286@pc>
To: "William A. Rowe Jr." <wrowe@rowe-clan.net>,
	<internals@lists.php.net>
References: <4B9C9007.1080802@lsces.co.uk> <4B9C91D7.2050402@rowe-clan.net>
Date: Sun, 14 Mar 2010 13:03:47 +0200
MIME-Version: 1.0
Content-Type: text/plain;
	format=flowed;
	charset="UTF-8";
	reply-type=original
Content-Transfer-Encoding: 7bit
Subject: Re: [PHP-DEV] Where are we ACTUALLY on Unicode?
From: sv_forums@fmethod.com ("Stan Vassilev")

> If Unicode were the solution, the PHP project was on the right page with 6.0.
> Sure there remained work to do, but...
>
> How long did it take to realize UTF16 wasn't the end of the story?  UCS-4 is
> the minimum to solve this, and we all agree that 32 bits aren't storing a single
> char in the western world, no way, no how.
>
> The UTF-8 solution is probably the right answer... you maintain 95% of char *UTF
> behavior, and you gain international character representation.  The only Unicode
> OS I can think of offhand is NT, and of course they hit the UCS-4 problem early.
> They found this out 15+ years ago.
>
> Sure it doesn't appear as atomic, one Xword per char, but the existing library
> frameworks contain most of the string processing that is required.  There is no
> 16-bit network transmission API that I can think of, you are still devolving to
> UTF-8 for client results.
>
> To move forward with accepting -and preferring- UTF-8 as the representation of
> characters throughout PHP, recognizing UTF-8 for char-length representations,
> and so forth, would do wonders to move forwards.  And 8-bit octet data can be
> set aside in the same data structures.  It is the straightforward answer, which
> is probably why Linux did not repeat Windows NT decision, and adopted utf-8.


Hi,

UTF-8 is good for text that is mostly ASCII with the occasional non-ASCII
character. It's also generally fine for storing strings and passing them
between apps.

However, it's a really poor in-memory representation of a string, because a
code point can take anywhere from 1 to 4 bytes. A simple operation like
$string[$x] means you have to walk and interpret the string from the start
until you have counted up to the code point you need.
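
To make the cost concrete, here is a minimal sketch in Python (not PHP
internals code) of what indexing by code point over raw UTF-8 bytes looks
like. The function names utf8_byte_offset and utf8_char_at are made up for
illustration, and the input is assumed to be well-formed UTF-8.

```python
def utf8_byte_offset(data: bytes, index: int) -> int:
    """Return the byte offset where code point number `index` starts.

    Assumes `data` is well-formed UTF-8. The scan is linear: there is no
    way to jump straight to the answer.
    """
    count = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte
        # starts a new code point.
        if byte & 0xC0 != 0x80:
            if count == index:
                return offset
            count += 1
    raise IndexError("code point index out of range")


def utf8_char_at(data: bytes, index: int) -> str:
    """Hypothetical equivalent of $string[$x] over UTF-8 bytes."""
    start = utf8_byte_offset(data, index)
    end = start + 1
    # Swallow the continuation bytes that belong to this code point.
    while end < len(data) and data[end] & 0xC0 == 0x80:
        end += 1
    return data[start:end].decode("utf-8")


# "naive" with i-diaeresis, a space, then "euro" with the euro sign
text = "na\u00efve \u20acuro".encode("utf-8")
print(len(text))                  # 13 bytes for 10 code points
print(utf8_byte_offset(text, 6))  # 7: byte offset and index have diverged
print(utf8_char_at(text, 6) == "\u20ac")  # True
```

Every lookup is O(n) in the length of the string, which is where the
slowdown comes from in loops that index character by character.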

UTF-8 also takes 3 bytes for most CJK characters in the BMP and 4 bytes for
everything in the higher planes, because quite a few bits in every byte are
spent on marking how long the sequence is and where it ends. This means that
memory-wise it may not be of much benefit to Asian countries.
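
As a rough illustration of the sizes involved, the Python snippet below
measures a few sample characters in UTF-8, UTF-16 and UTF-32 using the
standard codecs. The sample characters are my own picks, and encoded_sizes
is a made-up helper name.

```python
# Sample characters, chosen for illustration only.
samples = {
    "A": "ASCII letter",
    "\u00e9": "e with acute accent (Latin-1 range)",
    "\u4e2d": "CJK ideograph in the BMP",
    "\U0001f600": "emoji outside the BMP",
}


def encoded_sizes(ch: str) -> tuple:
    """Return (UTF-8, UTF-16, UTF-32) byte counts for one character."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),  # -le variant: no BOM is prepended
        len(ch.encode("utf-32-le")),
    )


for ch, description in samples.items():
    print("U+%04X %-38s %s" % (ord(ch), description, encoded_sizes(ch)))
# The tuples printed are (1, 2, 4), (2, 2, 4), (3, 2, 4) and (4, 4, 4).
```

For the CJK ideograph, UTF-8 is the largest of the two variable-width
forms: 3 bytes against 2 in UTF-16.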

Since the western world, as you put it, wouldn't want to waste 4 bytes on
characters that fit in 1 byte, we could store the encoding of each string as
a single byte enumerating the encodings supported by PHP (I believe there are
fewer than 255 of them), so the string functions know how to operate on them
and convert between them.

This means you only pay for Unicode when you need it, which reduces the
impact of using a full 4 bytes per code point: you can still use a 1-byte
Latin-1 encoding, freely mix it with Unicode, and still produce UTF-8 output
at the end for the web (the final conversion to UTF-8 from *anything* is
cheap).
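
A very rough sketch of the idea in Python, just to show the shape of it.
Everything here is hypothetical: the Encoding tags, the TaggedString class
and its methods are invented for illustration, only two encodings are
modelled, and none of this reflects how PHP actually stores strings.

```python
from dataclasses import dataclass
from enum import IntEnum


class Encoding(IntEnum):
    """Hypothetical one-byte encoding tags."""
    LATIN1 = 1  # fixed 1 byte per code point
    UCS4 = 2    # fixed 4 bytes per code point


CODECS = {Encoding.LATIN1: "latin-1", Encoding.UCS4: "utf-32-le"}
WIDTHS = {Encoding.LATIN1: 1, Encoding.UCS4: 4}


@dataclass(frozen=True)
class TaggedString:
    encoding: Encoding
    data: bytes

    @classmethod
    def from_text(cls, text: str) -> "TaggedString":
        # Use the narrow encoding when every character fits in it.
        try:
            return cls(Encoding.LATIN1, text.encode("latin-1"))
        except UnicodeEncodeError:
            return cls(Encoding.UCS4, text.encode("utf-32-le"))

    def __len__(self) -> int:
        return len(self.data) // WIDTHS[self.encoding]

    def char_at(self, index: int) -> str:
        # Both encodings are fixed width, so indexing is O(1).
        if not 0 <= index < len(self):
            raise IndexError("string index out of range")
        width = WIDTHS[self.encoding]
        chunk = self.data[index * width:(index + 1) * width]
        return chunk.decode(CODECS[self.encoding])

    def text(self) -> str:
        return self.data.decode(CODECS[self.encoding])

    def concat(self, other: "TaggedString") -> "TaggedString":
        # Mixing encodings: the result widens only if it has to.
        return TaggedString.from_text(self.text() + other.text())

    def to_utf8(self) -> bytes:
        # Final output conversion for the web.
        return self.text().encode("utf-8")


a = TaggedString.from_text("caf\u00e9")     # fits Latin-1: 4 bytes
b = TaggedString.from_text("\u4e2d\u6587")  # needs UCS-4: 8 bytes
c = a.concat(b)
print(a.encoding.name, b.encoding.name, c.encoding.name)
print(len(c), c.char_at(4) == "\u4e2d")
```

A real implementation would convert between encodings directly instead of
going through an intermediate string as concat does here.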

Another alternative is doing what JavaScript does. JavaScript uses 2-byte
code units (UTF-16) for Unicode, and when a code point doesn't fit in 2
bytes, it's encoded in 4 bytes as a surrogate pair. JavaScript counts such a
code point as 2 chars, although it's technically one code point. It's
awkward, but since PHP is a web language, consistency with JavaScript may
even be beneficial. It also solves the $string[$x] problem, as you no longer
need to walk the string: you just blindly report the 2 bytes at the string's
start address + 2 * $x.

With this approach, the char index and substr functions report correct
offsets for all characters in the BMP, as those fit in 2 bytes. Workarounds
and helper functions can be introduced for handling the 4-byte code points
in the other planes.
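
Here is a small Python sketch of JavaScript-style indexing over UTF-16 data,
with one possible helper for the non-BMP case. The names unit_at,
unit_length and code_point_length are invented for this example, and the
input is assumed to be well-formed little-endian UTF-16.

```python
import struct


def unit_length(raw: bytes) -> int:
    """Length in 2-byte code units, which is what JavaScript reports."""
    return len(raw) // 2


def unit_at(raw: bytes, x: int) -> int:
    """O(1) lookup: read the 2 bytes at offset 2 * x, no walking."""
    if not 0 <= x < unit_length(raw):
        raise IndexError("code unit index out of range")
    return struct.unpack_from("<H", raw, 2 * x)[0]


def code_point_length(raw: bytes) -> int:
    """Hypothetical helper: count code points rather than code units.

    Each non-BMP code point contributes exactly one low surrogate
    (0xDC00..0xDFFF), so those units are skipped when counting.
    """
    return sum(
        1
        for i in range(unit_length(raw))
        if not 0xDC00 <= unit_at(raw, i) <= 0xDFFF
    )


# "a", then U+1F600 (outside the BMP), then "b"
raw = "a\U0001f600b".encode("utf-16-le")
print(unit_length(raw))        # 4: the emoji counts as 2 chars
print(code_point_length(raw))  # 3
print(hex(unit_at(raw, 1)))    # 0xd83d, the high surrogate
print(hex(unit_at(raw, 2)))    # 0xde00, the low surrogate
```

Note that the helper is O(n) again, so the walking cost only comes back for
code that explicitly asks for code point semantics.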

Of course, it makes certain operations harder. For example, a regex
character range between two 4-byte code points will produce unexpected
results, because the regex engine will see these chars:

[2bytes2bytes-2bytes2bytes] i.e.:   [a b-c d]

and not this:

[4bytes-4bytes]
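
To show what goes wrong, the Python sketch below parses the inside of a
character class the way an engine working on 2-byte units would. This is a
toy parser written for illustration (to_units and naive_class_items are
made-up names), not the behaviour of any particular regex library.

```python
def to_units(text: str) -> list:
    """Split text into UTF-16 code units (2 bytes each)."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little")
            for i in range(0, len(raw), 2)]


def naive_class_items(units: list) -> list:
    """Parse the contents of [...] one code unit at a time.

    "X-Y" becomes a range between the units on either side of the
    hyphen; anything else is a literal unit.
    """
    items = []
    i = 0
    while i < len(units):
        if i + 2 < len(units) and units[i + 1] == ord("-"):
            items.append(("range", units[i], units[i + 2]))
            i += 3
        else:
            items.append(("literal", units[i]))
            i += 1
    return items


# Intended meaning: one range from U+1F600 to U+1F64F.
items = naive_class_items(to_units("\U0001f600-\U0001f64f"))
for item in items:
    print(item[0], [hex(u) for u in item[1:]])
# literal ['0xd83d']
# range ['0xde00', '0xd83d']
# literal ['0xde4f']
```

The range ends up between the low surrogate of the first character and the
high surrogate of the second, and it even runs backwards (0xDE00 is greater
than 0xD83D).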

Still, a variable-width encoding, whether UTF-8 or UTF-16, doesn't cut it
for general use for me, as in tests it shows a drastic slowdown when the
script needs to do heavy string processing. I'd rather have Unicode strings
take more RAM while being fast, and use Latin-1 when what I need is
Latin-1.

Regards,
Stan Vassilev