Newsgroups: php.internals
From: rowan.collins@gmail.com (Rowan Collins)
To: internals@lists.php.net
Subject: Re: [PHP-DEV] Unicode support
Date: Tue, 14 Oct 2014 19:01:09 +0100
Message-ID: <543D64E5.8000706@gmail.com>
In-Reply-To: <4575A816-43F4-462D-8150-A2D35516D914@ajf.me>
References: <543CE705.7030203@gmail.com> <4575A816-43F4-462D-8150-A2D35516D914@ajf.me>

On 14/10/2014 14:50, Andrea Faulds wrote:
>> 2. What is currently missing in that regard?
> Unicode string support.

I know that was probably deliberately flippant, but I think there is a genuine question to be asked here. A lot of people talk about "Unicode support" like they talk about "XPath support"; but XPath is an API you can adhere to, whereas Unicode is a whole lot more (and less) than that.

What it probably means to most people is "string functions which do what I expect with a vast range of obscure Unicode code point sequences". Those expectations need to be documented *before* an API is written; otherwise we end up with a whole load of functions which use a Unicode library but don't actually provide the tools that people need.

> If you want to see a pragmatic, actually working, work-in-progress attempt at better PHP unicode support, see this: https://github.com/krakjoe/ustring

It looks like a good prototype, but glancing at the documentation, I'm not clear exactly what the assumptions of some of the functions are. There's a lot of talk of "characters", which is a *very* slippery notion in Unicode; charAt() returns a single code point, and $length returns a number of code points.
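
To make the gap between those units concrete, here is a minimal sketch in plain PHP (7.0 or later, for the "\u{...}" escape). It is not ustring's API: codePoints() and graphemes() are hypothetical helpers written only for this illustration, and PCRE's \X escape is an assumption standing in for whatever segmentation a real implementation would use.

```php
<?php
// "noël" spelled with a combining diaeresis: n, o, e, U+0308, l.
$word = "noe\u{0308}l";

// Hypothetical helper: split a UTF-8 string into code points.
// An empty pattern with the /u modifier matches between code points.
function codePoints(string $s): array
{
    return preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
}

// Hypothetical helper: split a UTF-8 string into grapheme clusters.
// \X matches one extended grapheme cluster in PCRE.
function graphemes(string $s): array
{
    preg_match_all('/\X/u', $s, $matches);
    return $matches[0];
}

printf("bytes:       %d\n", strlen($word));            // 6
printf("code points: %d\n", count(codePoints($word))); // 5
printf("graphemes:   %d\n", count(graphemes($word)));  // 4
```

Three different answers to "how long is this string?", and only the last one matches what most people would say.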

Counting in code points makes me wonder whether it will pass "the noël test" [1] - does a combining diacritic move onto a different letter when you run ->reverse()?

As I've mentioned before, a lot of the time what people actually want to deal with is "grapheme clusters" - the kind of thing that you'd think of as a character if you were writing by hand. Most people, if asked the length of the string "noël", would answer 4, but there may be 5 code points. (That's not just a case of normalisation choices; most combinations of letter+diacritic have no single code point, which is why the combining forms exist.)

A good Unicode string API should probably give clear labels and choices for such things - $string->codePointAt(3) is not the same as $string->graphemeAt(3), $string->codePointCount is not the same as $string->graphemeCount, and so forth. A single property $length seems more user-friendly, until the user finds it means something different to what they wanted.

Similarly, an automatic __toString() function is handy, but what encoding does it output, and why? UTF-8? The same encoding that the string was constructed with? If I know that my database is expecting UTF-8, I probably want to say $string->getByteString('UTF-8'). I may also want to say $string->getByteStringWithMaxLength('UTF-8', 20) to fit as many whole graphemes as possible into a 20-byte binary space; something that neither $string->substring(0, 20)->getByteString('UTF-8') nor substr( $string->getByteString('UTF-8'), 0, 20 ) can do.

In short, we can only abstract so much - supporting Unicode automatically means supporting its complexity, not just pretending it's a really big version of ASCII.

[1] http://mortoray.com/2013/11/27/the-string-type-is-broken/

-- 
Rowan Collins
[IMSoP]
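
P.S. A rough sketch of the reversal and byte-limit points above, in plain PHP (7.0 or later). All three function names are hypothetical, invented for this illustration; none of them exists in ustring or in PHP itself, and PCRE's \X escape is again only an assumed stand-in for proper grapheme segmentation.

```php
<?php
// Hypothetical: reverse by code point, which detaches combining marks.
function reverseCodePoints(string $s): string
{
    return implode('', array_reverse(preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY)));
}

// Hypothetical: reverse by grapheme cluster, keeping marks on their base letter.
function reverseGraphemes(string $s): string
{
    preg_match_all('/\X/u', $s, $m);
    return implode('', array_reverse($m[0]));
}

// Hypothetical: keep as many whole grapheme clusters as fit in $maxBytes of UTF-8.
function truncateToBytes(string $utf8, int $maxBytes): string
{
    preg_match_all('/\X/u', $utf8, $m);
    $out = '';
    foreach ($m[0] as $cluster) {
        if (strlen($out) + strlen($cluster) > $maxBytes) {
            break; // the next cluster would overflow the byte budget
        }
        $out .= $cluster;
    }
    return $out;
}

$word = "noe\u{0308}l"; // n, o, e, U+0308 (combining diaeresis), l

var_dump(reverseCodePoints($word) === "l\u{0308}eon"); // true: the diaeresis is now on the "l"
var_dump(reverseGraphemes($word) === "le\u{0308}on");  // true: it stays on the "e"

// A 4-byte budget: a raw substr() cuts U+0308 in half and leaves invalid UTF-8.
var_dump(preg_match('//u', substr($word, 0, 4))); // false
var_dump(truncateToBytes($word, 4));              // string(2) "no"
```

The truncation helper gives up two usable bytes rather than emit half a grapheme, which is the trade-off I would expect something like getByteStringWithMaxLength() to make.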