Newsgroups: php.internals
Xref: news.php.net php.internals:78213
In-Reply-To: <5446C552.2080702@gmail.com>
Date: Tue, 21 Oct 2014 23:21:37 +0100
Cc: internals@lists.php.net
Message-ID: <2EAD2B07-1469-440B-9843-BB1D5CD56E68@ajf.me>
References:
 <1413875212.2624.3.camel@localhost.localdomain> <5446C552.2080702@gmail.com>
To: Rowan Collins
X-Mailer: Apple Mail (2.1990.1)
Subject: Re: [PHP-DEV] [RFC] UString
From: ajf@ajf.me (Andrea Faulds)

> On 21 Oct 2014, at 21:42, Rowan Collins wrote:
>
> The only case I can see where a default encoding would be sensible would
> be where source code itself is in a different encoding, so that
> u('literal string') works as expected.

This is only a good idea if we can somehow make it file-local. Otherwise
if one library uses Latin-1 and another uses UTF-8 for some reason, bang!

> 2) Clarify relationship to a "byte string"
>
> Most of the API acts like this is an abstract object representing a bunch
> of Unicode code points. As such, I'm not sure what getCodepage() does - a
> code page (or more properly encoding) is a property of a stream of bytes,
> so has no meaning in this context, surely? The internal implementation
> could use UTF-8, UTF-16, or some made-up encoding (like Perl6's "NFG"
> system) and the user should never need to know (other than to understand
> performance implications).
>
> On the other hand, when you *do* want a stream of bytes, the class doesn't
> seem to have an explicit way to get one. The (currently undocumented)
> behaviour is apparently to spit out UTF-8 if cast to a string, but it
> would be nice to have an explicit function which could be passed a
> parameter in order to serialise to, say, UTF-16, instead.

I agree on both these points. ->toBytes or ->encode with an explicit
charset parameter would be good. I don’t see the point of getCodepage().

> 3) The Grapheme Question
>
> This has been raised a few times, so I won't labour the point, just
> mention my current thinking.
>
> Unicode is complicated. Partly, that's because of a series of compromises
> in its design; but partly, it's because writing systems are complicated,
> and Unicode tries harder than most previous systems to acknowledge that.
> So, there's a tradeoff to be made between giving users what they think
> they need, thus hiding the messy details, and giving users the power to
> do things right, in a more complex way.
>
> There is also a namespace mess if you insist on every function and
> property having to declare what level of abstraction it's talking about -
> e.g. $codePointLength instead of $length.
>
> An idea I've been toying with is rather than having one class representing
> the slippery notion of "a Unicode string", having (at least) two, closely
> tied, classes: CodePointString (roughly = UString right now) and
> GraphemeString (a higher level abstraction tied to the same internal
> representation).
>
> I intend to mock this up as a set of interfaces at some point, but the
> basic idea is that you could write this:
>
> // Get an abstract object from a byte string, probably a GraphemeString, parsing the input as UTF-8
> $str = u('some text');
> // Perform an operation that explicitly deals in Code Points
> $str = $str->asCodePoints()->normalise('NFC');
> // Get information using a higher level of abstraction
> $length = $str->asGraphemes()->length;
> // Perform a high-level mutation, then convert right back to a concrete string of bytes
> echo $str->asGraphemes()->reverse()->asByteString('UTF-16');
>
> Calling asGraphemes() on a GraphemeString or asCodePoints() on a
> CodePointString would be legal but a no-op, so it would be safe to accept
> both as input to a function, then switch to whichever level the task
> required.
>
> I'm not sure if this finds a good balance between complexity and
> user-friendliness, and would welcome anyone's thoughts.

I’d rather have some grapheme-specific functions and some code point
functions on the same class. Make array-like indexing with [] be by code
points as you may be able to do that in constant time, and because there
might be multiple approaches to choosing graphemes.
Have ->codepointAt(), but also ->nthGrapheme() or something like it.
There’s no need for grapheme versions of all functions, but some would
need them. Though your approach has its own merits.

--
Andrea Faulds
http://ajf.me/
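
To make the single-class idea above a bit more concrete, here is a minimal sketch in plain PHP. Nearly everything in it is an assumption for illustration: the class name UnicodeText, the methods length() and graphemeCount(), and the use of PCRE and iconv are hypothetical and are not part of the UString RFC. Only the names codepointAt(), nthGrapheme() and toBytes() come from the discussion, and their signatures here are guesses.

```php
<?php
// Illustrative sketch only; not the UString implementation.
// Assumes PHP 7.4+ with PCRE Unicode support and the iconv extension.
final class UnicodeText
{
    /** @var string[] One UTF-8 encoded code point per element. */
    private array $codePoints;

    public function __construct(string $utf8)
    {
        // The empty pattern with /u splits between code points and
        // returns false if the input is not valid UTF-8.
        $parts = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);
        if ($parts === false) {
            throw new InvalidArgumentException('Input is not valid UTF-8');
        }
        $this->codePoints = $parts;
    }

    /** Number of code points. */
    public function length(): int
    {
        return count($this->codePoints);
    }

    /** Constant-time lookup by code point index. */
    public function codepointAt(int $index): string
    {
        if (!isset($this->codePoints[$index])) {
            throw new OutOfRangeException("No code point at index $index");
        }
        return $this->codePoints[$index];
    }

    /** @return string[] Grapheme clusters as segmented by PCRE's \X. */
    private function graphemes(): array
    {
        preg_match_all('/\X/u', implode('', $this->codePoints), $matches);
        return $matches[0];
    }

    /** Number of grapheme clusters. */
    public function graphemeCount(): int
    {
        return count($this->graphemes());
    }

    /** Linear-time lookup: the string is re-segmented on every call. */
    public function nthGrapheme(int $n): string
    {
        $graphemes = $this->graphemes();
        if (!isset($graphemes[$n])) {
            throw new OutOfRangeException("No grapheme at index $n");
        }
        return $graphemes[$n];
    }

    /** Serialise to bytes in an explicitly named charset. */
    public function toBytes(string $charset): string
    {
        $bytes = iconv('UTF-8', $charset, implode('', $this->codePoints));
        if ($bytes === false) {
            throw new RuntimeException("Cannot encode as $charset");
        }
        return $bytes;
    }
}

// "cafe" followed by U+0301 COMBINING ACUTE ACCENT.
$text = new UnicodeText("cafe\u{301}");
echo $text->length(), "\n";         // 5 code points
echo $text->graphemeCount(), "\n";  // 4 graphemes
```

Storing one code point per array element is what makes codepointAt() constant time in this sketch, while nthGrapheme() has to rescan the string. It also relies on one particular segmentation rule (whatever PCRE's \X does), which is the sort of choice a real API would need to pin down.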