Newsgroups: php.internals
From: pthreads@pthreads.org (Joe Watkins)
To: Rowan Collins
Cc: internals@lists.php.net
Subject: Re: [PHP-DEV] [RFC] UString
Date: Thu, 23 Oct 2014 09:18:33 +0100
Message-ID: <1414052313.2624.40.camel@localhost.localdomain>
In-Reply-To: <5446C552.2080702@gmail.com>
References: <1413875212.2624.3.camel@localhost.localdomain> <5446C552.2080702@gmail.com>

On Tue, 2014-10-21 at 21:42 +0100, Rowan Collins wrote:
> On 21/10/2014 08:06, Joe Watkins wrote:
> > Morning internalz,
> >
> > https://wiki.php.net/rfc/ustring
> >
> > This is the result of work done by a few of us, we won't be opening any
> > vote in a fortnight. We have a long time before 7, there is no rush
> > whatever.
> >
> > Now seems like a good time to start the conversation so we can hash out
> > the details, or get on with other things ;)
> >
> > Cheers
> > Joe
> >
> >
>
> I think this looks like a really great start at creating something
> actually useful, rather than getting stuck at the drawing board. I like
> that the scope is quite small initially - where does the "single
> responsibility" of a class that represents a string end, anyway? :)
>
> A few opinions:
>
> 1) Global / static defaults are bad.
>
> The existence of the setDefaultCodepage method feels like an
> anti-pattern to me. It means libraries can't rely on this class working
> the same way in two different host environments, or even at two
> re-entries in the same program. Effectively, if you don't know what the
> second argument to the constructor will default to, you can't actually
> treat it as optional unless you're writing monolithic code.
> This is a common pattern in PHP, but http_build_query() would be so much
> more pleasant if I could safely call it with 1 argument instead of 3.
>
> I think the default should be hard-coded to UTF-8, which according to
> previous discussion is always the default *output* encoding, so would
> mean this would always work:
> $aUString = new UString( (string)$aUString );
> Any other encoding will be dependent on, and known from, the context
> where the object is created - if grabbing data from an HTTP request, a
> header should tell them; if from a database, a connection parameter; and
> so on.
>

Could be true, it feels quite horrible to me today too, I think someone
else suggested it, but it might have been me. I'll look at doing
something about that ...

> The only case I can see where a default encoding would be sensible would
> be where source code itself is in a different encoding, so that
> u('literal string') works as expected. I guess if we ever went down the
> route of special literal syntax like u'literal string', the declared
> source encoding could be used.
>
> Actually, the u() shortcut function appears to be missing the encoding
> parameter completely; is this deliberate?
>

Fixed that.

> 2) Clarify relationship to a "byte string"
>
> Most of the API acts like this is an abstract object representing a
> bunch of Unicode code points. As such, I'm not sure what getCodepage()
> does - a code page (or more properly encoding) is a property of a stream
> of bytes, so has no meaning in this context, surely? The internal
> implementation could use UTF-8, UTF-16, or some made-up encoding (like
> Perl6's "NFG" system) and the user should never need to know (other than
> to understand performance implications).
>
> On the other hand, when you *do* want a stream of bytes, the class
> doesn't seem to have an explicit way to get one.
> The (currently undocumented) behaviour is apparently to spit out UTF-8
> if cast to a string, but it would be nice to have an explicit function
> which could be passed a parameter in order to serialise to, say, UTF-16,
> instead.
>

I reused the terminology used by ICU, it made sense in their
documentation.

So we want a ::getBytes or something like that ... I'll do that ...

> 3) The Grapheme Question
>
> This has been raised a few times, so I won't labour the point, just
> mention my current thinking.
>
> Unicode is complicated. Partly, that's because of a series of
> compromises in its design; but partly, it's because writing systems are
> complicated, and Unicode tries harder than most previous systems to
> acknowledge that. So, there's a tradeoff to be made between giving users
> what they think they need, thus hiding the messy details, and giving
> users the power to do things right, in a more complex way.
>
> There is also a namespace mess if you insist on every function and
> property having to declare what level of abstraction it's talking about
> - e.g. $codePointLength instead of $length.
>
> An idea I've been toying with is rather than having one class
> representing the slippery notion of "a Unicode string", having (at
> least) two, closely tied, classes: CodePointString (roughly = UString
> right now) and GraphemeString (a higher level abstraction tied to the
> same internal representation).
>
> I intend to mock this up as a set of interfaces at some point, but the
> basic idea is that you could write this:
>
> // Get an abstract object from a byte string, probably a GraphemeString,
> // parsing the input as UTF-8
> $str = u('some text');
> // Perform an operation that explicitly deals in Code Points
> $str = $str->asCodePoints()->normalise('NFC');
> // Get information using a higher level of abstraction
> $length = $str->asGraphemes()->length;
> // Perform a high-level mutation, then convert right back to a concrete
> // string of bytes
> echo $str->asGraphemes()->reverse()->asByteString('UTF-16');
>
> Calling asGraphemes() on a GraphemeString or asCodePoints() on a
> CodePointString would be legal but a no-op, so it would be safe to
> accept both as input to a function, then switch to whichever level the
> task required.
>
> I'm not sure if this finds a good balance between complexity and
> user-friendliness, and would welcome anyone's thoughts.
>

I'd rather higher level stuff existed at a higher level. I'd rather
solve for ustring the problems that are solved for normal strings and
leave the rest up to whatever the framework/component/library wants to
do.

> --
> Rowan Collins
> [IMSoP]
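
For anyone following the code point versus grapheme distinction in point
3, here is a minimal sketch in plain PHP. It does not use UString or
anything else from the RFC: describe_utf8() is a hypothetical helper
name, and the sketch assumes valid UTF-8 input and a PHP 7 runtime. It
relies only on strlen() and preg_match_all() with PCRE's /u modifier and
\X escape.

```php
<?php
// Hypothetical helper, not part of the UString RFC.
// Reports the three "lengths" of a UTF-8 encoded byte string.
function describe_utf8(string $utf8): array
{
    return [
        // strlen() counts bytes, regardless of encoding.
        'bytes' => strlen($utf8),
        // With /u, "." matches one code point; /s lets it match newlines too.
        'codePoints' => preg_match_all('/./us', $utf8),
        // \X matches one extended grapheme cluster.
        'graphemes' => preg_match_all('/\X/u', $utf8),
    ];
}

// "e" followed by U+0301 COMBINING ACUTE ACCENT, written as raw UTF-8 bytes.
$info = describe_utf8("e\xCC\x81");
```

For that input the three counts differ: 3 bytes, 2 code points, and 1
grapheme, which is why a bare $length is ambiguous.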