Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:78041
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Message-ID: <543D64E5.8000706@gmail.com>
Date: Tue, 14 Oct 2014 19:01:09 +0100
MIME-Version: 1.0
To: internals@lists.php.net
References: <543CE705.7030203@gmail.com> <4575A816-43F4-462D-8150-A2D35516D914@ajf.me>
In-Reply-To: <4575A816-43F4-462D-8150-A2D35516D914@ajf.me>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Subject: Re: [PHP-DEV] Unicode support
From: rowan.collins@gmail.com (Rowan Collins)

On 14/10/2014 14:50, Andrea Faulds wrote:
>> 2. What is currently missing in that regard?
> Unicode string support.

I know that was probably deliberately flippant, but I think there is a 
genuine question to be asked here. A lot of people talk about "Unicode 
support" like they talk about "XPath support"; but XPath is a single 
specification you can adhere to, whereas Unicode is a whole lot more 
(and less) than that.

What it probably means to most people is "string functions which do what 
I expect with a vast range of obscure Unicode code point sequences". 
Those expectations need to be documented *before* an API is written; 
otherwise we end up with a whole load of functions which use a Unicode 
library but don't actually provide the tools that people need.

> If you want to see a pragmatic, actually working, work-in-progress attempt at better PHP unicode support, see this: https://github.com/krakjoe/ustring

It looks like a good prototype, but glancing at the documentation, I'm 
not clear exactly what the assumptions of some of the functions are.

There's a lot of talk of "characters", which is a *very* slippery notion 
in Unicode; charAt() returns a single code point, and $length returns a 
number of code points. This makes me wonder if it will pass "the noël 
test" [1] - does a combining diacritic move onto a different letter when 
you run ->reverse()?
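
To make the test concrete, here is a rough sketch in Python (chosen only 
because its strings are sequences of code points; it says nothing about 
how ustring itself behaves). The clusters() helper is a hypothetical 
simplification of mine: it glues combining marks onto the preceding code 
point, which falls well short of the real grapheme cluster rules.

```python
import unicodedata


def clusters(text):
    """Approximate grapheme clusters: a base code point plus any
    combining marks that follow it. NOT the full UAX #29 algorithm."""
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch   # attach the mark to the previous cluster
        else:
            out.append(ch)  # start a new cluster
    return out


# "noël" spelled with e + U+0308 COMBINING DIAERESIS
noel = "noe\u0308l"

naive = noel[::-1]                           # reverses code points
careful = "".join(reversed(clusters(noel)))  # reverses clusters

print(naive == "l\u0308eon")    # True: the diaeresis now sits on the "l"
print(careful == "le\u0308on")  # True: the diaeresis stays on the "e"
```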

As I've mentioned before, a lot of the time what people actually want to 
deal with is "grapheme clusters" - the kind of thing that you'd think of 
as a character if you were writing by hand. Most people, if asked the 
length of the string "noël", would answer 4, but there may be 5 code 
points, if the "ë" is stored as "e" followed by U+0308 COMBINING 
DIAERESIS. (That's not just a case of normalisation choices; most 
combinations of letter+diacritic have no single code point, which is why 
the combining forms exist.)
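
A small illustration of the two counts, again in Python rather than PHP 
and with function names I have made up for the purpose. The grapheme 
count here is only an approximation (it counts code points that are not 
combining marks), so treat it as a sketch of the distinction rather than 
a correct segmentation.

```python
import unicodedata


def code_point_count(text):
    # A Python str is a sequence of code points, so len() counts them.
    return len(text)


def grapheme_count(text):
    # Approximation only: count code points that are not combining marks.
    return sum(1 for ch in text if not unicodedata.combining(ch))


decomposed = "noe\u0308l"  # e + U+0308 COMBINING DIAERESIS
composed = unicodedata.normalize("NFC", decomposed)

print(code_point_count(decomposed), grapheme_count(decomposed))  # 5 4
print(code_point_count(composed), grapheme_count(composed))      # 4 4

# Normalisation cannot always help: there is no precomposed
# "q with diaeresis", so NFC leaves two code points in place.
q = unicodedata.normalize("NFC", "q\u0308")
print(code_point_count(q), grapheme_count(q))                    # 2 1
```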

A good Unicode string API should probably give clear labels and choices 
for such things - $string->codePointAt(3) is not the same as 
$string->graphemeAt(3), $string->codePointCount is not the same as 
$string->graphemeCount, and so forth. A single property $length seems 
more user-friendly, until the user finds it means something different to 
what they wanted.

Similarly, an automatic __toString() function is handy, but what 
encoding does it output, and why? UTF-8? The same encoding that the 
string was constructed with?
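
The choice matters because the same text becomes different bytes under 
each encoding, so an implicit conversion has to pick one silently. A 
minimal Python illustration (the three encodings are just examples I 
picked):

```python
text = "no\u00ebl"  # "noël" with the precomposed U+00EB

utf8 = text.encode("utf-8")       # ë takes two bytes
utf16 = text.encode("utf-16-le")  # every code point here takes two bytes
latin1 = text.encode("latin-1")   # ë takes one byte

print(len(utf8), len(utf16), len(latin1))  # 5 8 4
print(utf8 == latin1)                      # False
```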

If I know that my database is expecting UTF-8, I probably want to say 
$string->getByteString('UTF-8'). I may also want to say 
$string->getByteStringWithMaxLength('UTF-8', 20) to fit as many whole 
graphemes as possible into a 20-byte binary field; something that 
neither $string->substring(0, 20)->getByteString('UTF-8') nor substr( 
$string->getByteString('UTF-8'), 0, 20 ) can do. The first counts 20 
"characters", which may encode to more than 20 bytes; the second counts 
bytes, so it can cut a code point or a grapheme in half.
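
Roughly what I have in mind, sketched in Python. Everything here is 
hypothetical: byte_string_with_max_length() stands in for the method 
name above, clusters() is a crude approximation of grapheme 
segmentation (base code point plus following combining marks), and the 
sketch assumes an encoding that does not prepend a byte order mark.

```python
import unicodedata


def clusters(text):
    """Approximate grapheme clusters; NOT the full UAX #29 algorithm."""
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out


def byte_string_with_max_length(text, encoding, max_bytes):
    """Encode as many whole clusters as fit into max_bytes."""
    out = b""
    for cluster in clusters(text):
        encoded = cluster.encode(encoding)
        if len(out) + len(encoded) > max_bytes:
            break
        out += encoded
    return out


# 10 graphemes, 20 code points, 30 bytes in UTF-8
text = "e\u0308" * 10

fitted = byte_string_with_max_length(text, "utf-8", 20)
by_code_points = text[:20].encode("utf-8")  # 20 code points, too long
by_bytes = text.encode("utf-8")[:20]        # 20 bytes, cut mid-sequence

print(len(fitted))          # 18: six whole graphemes
print(len(by_code_points))  # 30: overflows the 20-byte field
try:
    by_bytes.decode("utf-8")
except UnicodeDecodeError:
    print("byte slice is no longer valid UTF-8")
```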

In short, we can only abstract so much - supporting Unicode 
automatically means supporting its complexity, not just pretending it's 
a really big version of ASCII.

[1] http://mortoray.com/2013/11/27/the-string-type-is-broken/

-- 
Rowan Collins
[IMSoP]