Newsgroups: php.internals
In-Reply-To: <543D64E5.8000706@gmail.com>
Date: Tue, 14 Oct 2014 20:51:21 +0100
Cc: internals@lists.php.net
Message-ID: <69D87398-4BE9-483C-95D3-1AC1A77C6A39@ajf.me>
References: <543CE705.7030203@gmail.com> <4575A816-43F4-462D-8150-A2D35516D914@ajf.me> <543D64E5.8000706@gmail.com>
To: Rowan Collins <rowan.collins@gmail.com>
Subject: Re: [PHP-DEV] Unicode support
From: ajf@ajf.me (Andrea Faulds)


On 14 Oct 2014, at 19:01, Rowan Collins <rowan.collins@gmail.com> wrote:

>
>> If you want to see a pragmatic, actually working, work-in-progress
>> attempt at better PHP unicode support, see this:
>> https://github.com/krakjoe/ustring
>
> It looks like a good prototype, but glancing at the documentation, I'm
> not clear exactly what the assumptions of some of the functions are.
>
> There's a lot of talk of "characters", which is a *very* slippery
> notion in Unicode; charAt() returns a single code point, and $length
> returns a number of code points. This makes me wonder if it will pass
> "the noël test" [1] - does a combining diacritic move onto a different
> letter when you run ->reverse()?
>
> As I've mentioned before, a lot of the time what people actually want
> to deal with is "grapheme clusters" - the kind of thing that you'd think
> of as a character if you were writing by hand. Most people, if asked the
> length of the string "noël", would answer 4, but there may be 5 code
> points. (That's not just a case of normalisation choices; most
> combinations of letter+diacritic have no single code point, that's why
> the combining forms exist.)
>
> A good Unicode string API should probably give clear labels and
> choices for such things - $string->codePointAt(3) is not the same as
> $string->graphemeAt(3), $string->codePointCount is not the same as
> $string->graphemeCount, and so forth. A single property $length seems
> more user-friendly, until the user finds it means something different to
> what they wanted.

This is true. The documentation ought to talk about code points but
doesn’t. Length is primarily needed for iterating through strings and
the like. If you want the length in characters, you probably need to
implement your own algorithm, as it really depends on your specific use
case.
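
To make the three notions of "length" concrete, here is a minimal sketch using PHP's mbstring and intl extensions plus PCRE's \X escape. It is not ustring code and says nothing about how ustring behaves; it assumes both extensions are loaded and that the input is UTF-8, and the variable names are only illustrative.

```php
<?php
// "noël" spelled with a combining diaeresis: n, o, e, U+0308, l.
$noel = "noe\xCC\x88l";

$bytes      = strlen($noel);             // UTF-8 bytes
$codePoints = mb_strlen($noel, 'UTF-8'); // code points
$graphemes  = grapheme_strlen($noel);    // grapheme clusters

// Reverse one code point at a time: the diaeresis ends up after the "l".
$parts       = preg_split('//u', $noel, -1, PREG_SPLIT_NO_EMPTY);
$byCodePoint = implode('', array_reverse($parts));

// Reverse one grapheme cluster at a time: "e" and its diaeresis stay together.
preg_match_all('/\X/u', $noel, $matches);
$byGrapheme = implode('', array_reverse($matches[0]));

var_dump($bytes, $codePoints, $graphemes);
var_dump(bin2hex($byCodePoint), bin2hex($byGrapheme));
```

The hex dump is there because the two reversed strings can look alike in a terminal; the byte order shows which letter the combining mark now follows.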

It will, however, always produce valid UTF-8 strings for output. That’s
better than the standard string functions, which can mangle UTF-8.
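
As an illustration of the kind of mangling meant here, the sketch below cuts the same string with the byte-based substr() and with mb_substr(). It uses mbstring as a stand-in for any code-point-aware API (it is not ustring code) and assumes UTF-8 input.

```php
<?php
// "noël" with a precomposed ë (U+00EB, two bytes in UTF-8: C3 AB).
$noel = "no\xC3\xABl";

// substr() counts bytes, so a 3-byte cut lands in the middle of "ë".
$mangled = substr($noel, 0, 3);

// mb_substr() counts code points, so the result is still valid UTF-8.
$intact = mb_substr($noel, 0, 3, 'UTF-8');

var_dump(mb_check_encoding($mangled, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($intact, 'UTF-8'));  // bool(true)
```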

> Similarly, an automatic __toString() function is handy, but what
> encoding does it output, and why? UTF-8? The same encoding that the
> string was constructed with?

Always UTF-8.

> If I know that my database is expecting UTF-8, I probably want to say
> $string->getByteString('UTF-8').

You can do that.

> I may also want to say $string->getByteStringWithMaxLength('UTF-8',
> 20) to fit an exact number of graphemes into a 20-byte binary space;
> something that neither $string->substring(0, 20)->getByteString('UTF-8')
> nor substr( $string->getByteString('UTF-8'), 0, 20 ) can do.

I’m not sure quite how you’d do that. There might be a function in
mbstring for that.
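
For reference, two existing functions come close, sketched below under the assumption of UTF-8 input with the mbstring and intl extensions loaded. The helper names are hypothetical and neither is part of ustring: mb_strcut() truncates by bytes at a code point boundary, while intl's grapheme_extract() with GRAPHEME_EXTR_MAXBYTES truncates by bytes at a grapheme cluster boundary, which is closer to what is being asked for.

```php
<?php
// Hypothetical helpers; assume UTF-8 input, mbstring and intl loaded.

// Cuts at a code point boundary: never emits half of a multi-byte
// sequence, but may separate a base letter from its combining mark.
function truncateBytesByCodePoint(string $s, int $maxBytes): string
{
    return mb_strcut($s, 0, $maxBytes, 'UTF-8');
}

// Cuts at a grapheme cluster boundary: a letter and its combining
// marks are either kept together or dropped together.
function truncateBytesByGrapheme(string $s, int $maxBytes): string
{
    $out = grapheme_extract($s, $maxBytes, GRAPHEME_EXTR_MAXBYTES);
    return $out === false ? '' : $out;
}

// n, o, e, U+0308 (combining diaeresis), l: six bytes in UTF-8.
$noel = "noe\xCC\x88l";

var_dump(bin2hex(truncateBytesByCodePoint($noel, 4)));
var_dump(bin2hex(truncateBytesByGrapheme($noel, 4)));
```

Both helpers return at most $maxBytes bytes; they differ only in where they are allowed to cut, so the grapheme-based version can return a shorter string for the same budget.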

> In short, we can only abstract so much - supporting Unicode
> automatically means supporting its complexity, not just pretending it's
> a really big version of ASCII.

Sure. But just handling code points safely is hard enough as it is. This
handles that. It doesn’t handle characters, sure, but it’s a start.
And for many applications, you do not need to handle characters.
--
Andrea Faulds
http://ajf.me/