Newsgroups: php.internals
From: ajf@ajf.me (Andrea Faulds)
To: Rowan Collins
Cc: internals@lists.php.net
Subject: Re: [PHP-DEV] Unicode support
Date: Tue, 14 Oct 2014 20:51:21 +0100

On 14 Oct 2014, at 19:01, Rowan Collins wrote:

>> If you want to see a pragmatic, actually working, work-in-progress attempt at better PHP unicode support, see this: https://github.com/krakjoe/ustring
>
> It looks like a good prototype, but glancing at the documentation, I'm not clear exactly what the assumptions of some of the functions are.
>
> There's a lot of talk of "characters", which is a *very* slippery notion in Unicode; charAt() returns a single code point, and $length returns a number of code points. This makes me wonder if it will pass "the noël test" [1] - does a combining diacritic move onto a different letter when you run ->reverse()?
>
> As I've mentioned before, a lot of the time what people actually want to deal with is "grapheme clusters" - the kind of thing that you'd think of as a character if you were writing by hand. Most people, if asked the length of the string "noël", would answer 4, but there may be 5 code points. (That's not just a case of normalisation choices; most combinations of letter+diacritic have no single code point, that's why the combining forms exist.)
>
> A good Unicode string API should probably give clear labels and choices for such things - $string->codePointAt(3) is not the same as $string->graphemeAt(3), $string->codePointCount is not the same as $string->graphemeCount, and so forth. A single property $length seems more user-friendly, until the user finds it means something different to what they wanted.

This is true. It ought to talk about code points but doesn’t. Length is primarily needed for iterating through strings and the like.
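
To make the code point versus grapheme distinction concrete, here is a minimal sketch in Python (not PHP, and not the ustring API). The helper `grapheme_clusters` is a hypothetical name, and it only approximates grapheme clusters by attaching combining marks to the preceding base character; full segmentation follows the rules in Unicode Standard Annex #29, which this does not implement.

```python
import unicodedata


def grapheme_clusters(text):
    """Approximate grapheme clusters (assumption: base code point + trailing
    combining marks). This ignores ZWJ sequences, Hangul jamo, and so on."""
    clusters = []
    for ch in text:
        # unicodedata.combining() returns a non-zero class for combining marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# "noël" spelled with a plain "e" followed by U+0308 COMBINING DIAERESIS.
noel = "noe\u0308l"

code_point_count = len(noel)                    # counts code points
grapheme_count = len(grapheme_clusters(noel))   # counts approximate graphemes

# Reversing code points moves the diaeresis onto the "l".
naive_reverse = noel[::-1]

# Reversing cluster by cluster keeps the diaeresis on the "e".
cluster_reverse = "".join(reversed(grapheme_clusters(noel)))
```

With this spelling of the string, the code point count is 5 and the approximate grapheme count is 4, which is the gap the quoted message describes.
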
If you want the length in characters, you probably need to implement your own algorithm, as it really depends on your specific use case.

It will, however, always produce valid UTF-8 strings for output. That’s better than the standard string functions, which can mangle UTF-8.

> Similarly, an automatic __toString() function is handy, but what encoding does it output, and why? UTF-8? The same encoding that the string was constructed with?

Always UTF-8.

> If I know that my database is expecting UTF-8, I probably want to say $string->getByteString('UTF-8').

You can do that.

> I may also want to say $string->getByteStringWithMaxLength('UTF-8', 20) to fit an exact number of graphemes into a 20-byte binary space; something that neither $string->substring(0, 20)->getByteString('UTF-8') nor substr( $string->getByteString('UTF-8'), 0, 20 ) can do.

I’m not quite sure how you’d do that. There might be a function in mbstring for that.

> In short, we can only abstract so much - supporting Unicode automatically means supporting its complexity, not just pretending it's a really big version of ASCII.

Sure. But just handling code points safely is hard enough as it is. This handles that. It doesn’t handle characters, sure, but it’s a start. And for many applications, you do not need to handle characters.

-- 
Andrea Faulds
http://ajf.me/
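
For the byte-limited case raised in the quoted text (fitting whole graphemes into a fixed number of UTF-8 bytes), one possible approach is sketched below in Python. The names `grapheme_clusters` and `utf8_bytes_with_max_length` are hypothetical and are not part of ustring or mbstring; the grapheme detection is the same rough approximation (base character plus combining marks), not full UAX #29 segmentation.

```python
import unicodedata


def grapheme_clusters(text):
    """Approximate grapheme clusters (assumption: base code point + trailing
    combining marks)."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def utf8_bytes_with_max_length(text, max_bytes):
    """Encode as many whole clusters as fit into max_bytes of UTF-8."""
    out = b""
    for cluster in grapheme_clusters(text):
        encoded = cluster.encode("utf-8")
        if len(out) + len(encoded) > max_bytes:
            break  # the next cluster would not fit in full, so stop here
        out += encoded
    return out


# "noël" with a combining diaeresis: the "e" + U+0308 cluster takes 3 bytes.
text = "noe\u0308l"

# A plain byte slice cuts U+0308 in half and leaves invalid UTF-8.
naive = text.encode("utf-8")[:4]

# The cluster-aware version stops before the cluster that does not fit.
safe = utf8_bytes_with_max_length(text, 4)
```

Here `naive` ends in a lone lead byte and cannot be decoded, while `safe` holds only the clusters that fit completely. In PHP itself, mbstring's `mb_strcut()` cuts by byte count on character boundaries, and the intl extension's `grapheme_extract()` with `GRAPHEME_EXTR_MAXBYTES` appears to be the closest built-in to the grapheme-aware behaviour sketched here.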