Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:78260
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Received-SPF: error (pb1.pair.com: domain pthreads.org from 209.85.212.177 cause and error)
Message-ID: <1414052313.2624.40.camel@localhost.localdomain>
To: Rowan Collins <rowan.collins@gmail.com>
Cc: internals@lists.php.net
Date: Thu, 23 Oct 2014 09:18:33 +0100
In-Reply-To: <5446C552.2080702@gmail.com>
References: <1413875212.2624.3.camel@localhost.localdomain>
	 <5446C552.2080702@gmail.com>
Content-Type: text/plain; charset="UTF-8"
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Subject: Re: [PHP-DEV] [RFC] UString
From: pthreads@pthreads.org (Joe Watkins)

On Tue, 2014-10-21 at 21:42 +0100, Rowan Collins wrote:
> On 21/10/2014 08:06, Joe Watkins wrote:
> > Morning internalz,
> >
> > 	https://wiki.php.net/rfc/ustring
> >
> > 	This is the result of work done by a few of us, we won't be opening any
> > vote in a fortnight. We have a long time before 7, there is no rush
> > whatever.
> >
> > 	Now seems like a good time to start the conversation so we can hash out
> > the details, or get on with other things ;)
> >
> > Cheers
> > Joe
> >
> >
> 
> I think this looks like a really great start at creating something 
> actually useful, rather than getting stuck at the drawing board. I like 
> that the scope is quite small initially - where does the "single 
> responsibility" of a class that represents a string end, anyway? :)
> 
> A few opinions:
> 
> 1) Global / static defaults are bad.
> 
> The existence of the setDefaultCodepage method feels like an 
> anti-pattern to me. It means libraries can't rely on this class working 
> the same way in two different host environments, or even at two 
> re-entries in the same program. Effectively, if you don't know what the 
> second argument to the constructor will default to, you can't actually 
> treat it as optional unless you're writing monolithic code. This is a 
> common pattern in PHP, but http_build_query() would be so much more 
> pleasant if I could safely call it with 1 argument instead of 3.
> 
> I think the default should be hard-coded to UTF-8, which according to 
> previous discussion is always the default *output* encoding, so would 
> mean this would always work: $aUString = new UString( (string)$aUString 
> ); Any other encoding will be dependent on, and known from, the context 
> where the object is created - if grabbing data from an HTTP request, a 
> header should tell them; if from a database, a connection parameter; and 
> so on.
> 

That could well be true; it feels quite horrible to me today too. I
think someone else suggested it, though it might have been me.

I'll look at doing something about that ...
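
To make the concern concrete, here is a minimal sketch that uses
mbstring's global default encoding as a stand-in for
setDefaultCodepage(). It does not use UString itself, and the two
function names are hypothetical. The same input gives different answers
depending on what the host environment set, unless the encoding is
passed explicitly:

```php
<?php
// Stand-in for the problem: mbstring's global default encoding plays the
// role of a static default codepage. Requires the mbstring extension.

// Hypothetical library function that relies on the global default.
function lengthUsingDefault(string $bytes): int
{
    return mb_strlen($bytes);
}

// Hypothetical library function that names its encoding explicitly.
function lengthAsUtf8(string $bytes): int
{
    return mb_strlen($bytes, 'UTF-8');
}

$bytes = "\xC3\xA9"; // U+00E9 encoded as UTF-8 (two bytes)

mb_internal_encoding('UTF-8');
$underUtf8Host = lengthUsingDefault($bytes);   // 1

mb_internal_encoding('ISO-8859-1');
$underLatin1Host = lengthUsingDefault($bytes); // 2: same input, new answer

$explicit = lengthAsUtf8($bytes);              // 1, whatever the host set
```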

> The only case I can see where a default encoding would be sensible would 
> be where source code itself is in a different encoding, so that 
> u('literal string') works as expected. I guess if we ever went down the 
> route of special literal syntax like u'literal string', the declared 
> source encoding could be used.
> 
> Actually, the u() shortcut function appears to be missing the encoding 
> parameter completely; is this deliberate?
> 

Fixed that.

> 2) Clarify relationship to a "byte string"
> 
> Most of the API acts like this is an abstract object representing a 
> bunch of Unicode code points. As such, I'm not sure what getCodepage() 
> does - a code page (or more properly encoding) is a property of a stream 
> of bytes, so has no meaning in this context, surely? The internal 
> implementation could use UTF-8, UTF-16, or some made-up encoding (like 
> Perl6's "NFG" system) and the user should never need to know (other than 
> to understand performance implications).
> 
> On the other hand, when you *do* want a stream of bytes, the class 
> doesn't seem to have an explicit way to get one. The (currently 
> undocumented) behaviour is apparently to spit out UTF-8 if cast to a 
> string, but it would be nice to have an explicit function which could be 
> passed a parameter in order to serialise to, say, UTF-16, instead.
> 

I reused the terminology that ICU uses; it made sense in their
documentation.

So we want a ::getBytes or something like that ... I'll do that ...
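
As a rough sketch of what such a method might do, the free function
below serialises UTF-8 text to a caller-chosen encoding with mbstring.
The name getBytes() and its signature are assumptions for illustration
only, not the RFC's API:

```php
<?php
// Hypothetical helper, not part of the RFC: serialise UTF-8 text to a
// caller-chosen encoding. Requires the mbstring extension.
function getBytes(string $utf8, string $encoding = 'UTF-8'): string
{
    return mb_convert_encoding($utf8, $encoding, 'UTF-8');
}

$text = "\xC3\xA9"; // U+00E9 encoded as UTF-8

$asUtf8  = bin2hex(getBytes($text));             // "c3a9"
$asUtf16 = bin2hex(getBytes($text, 'UTF-16BE')); // "00e9"
```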

> 3) The Grapheme Question
> 
> This has been raised a few times, so I won't labour the point, just 
> mention my current thinking.
> 
> Unicode is complicated. Partly, that's because of a series of 
> compromises in its design; but partly, it's because writing systems are 
> complicated, and Unicode tries harder than most previous systems to 
> acknowledge that. So, there's a tradeoff to be made between giving users 
> what they think they need, thus hiding the messy details, and giving 
> users the power to do things right, in a more complex way.
> 
> There is also a namespace mess if you insist on every function and 
> property having to declare what level of abstraction it's talking about 
> - e.g. $codePointLength instead of $length.
> 
> An idea I've been toying with is rather than having one class 
> representing the slippery notion of "a Unicode string", having (at 
> least) two, closely tied, classes: CodePointString (roughly = UString 
> right now) and GraphemeString (a higher level abstraction tied to the 
> same internal representation).
> 
> I intend to mock this up as a set of interfaces at some point, but the 
> basic idea is that you could write this:
> 
> // Get an abstract object from a byte string, probably a GraphemeString,
> // parsing the input as UTF-8
> $str = u('some text');
> // Perform an operation that explicitly deals in Code Points
> $str = $str->asCodePoints()->normalise('NFC');
> // Get information using a higher level of abstraction
> $length = $str->asGraphemes()->length;
> // Perform a high-level mutation, then convert right back to a concrete
> // string of bytes
> echo $str->asGraphemes()->reverse()->asByteString('UTF-16');
> 
> Calling asGraphemes() on a GraphemeString or asCodePoints() on a 
> CodePointString would be legal but a no-op, so it would be safe to 
> accept both as input to a function, then switch to whichever level the 
> task required.
> 
> I'm not sure if this finds a good balance between complexity and 
> user-friendliness, and would welcome anyone's thoughts.
> 

I'd rather higher-level stuff existed at a higher level. I'd rather
solve for ustring the problems that are already solved for normal
strings, and leave the rest up to whatever the
framework/component/library wants to do.
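
For reference, the two levels being discussed can already be observed
with existing extensions. This is only an illustration using mbstring
and intl, not UString or the proposed GraphemeString; it counts the same
text as bytes, code points and graphemes:

```php
<?php
// Requires the mbstring and intl extensions.
$decomposed = "e\xCC\x81"; // "e" followed by U+0301 COMBINING ACUTE ACCENT

$bytes      = strlen($decomposed);             // 3
$codePoints = mb_strlen($decomposed, 'UTF-8'); // 2
$graphemes  = grapheme_strlen($decomposed);    // 1

// Normalising to NFC composes the pair into the single code point U+00E9.
$composed = Normalizer::normalize($decomposed, Normalizer::FORM_C);
$composedCodePoints = mb_strlen($composed, 'UTF-8'); // 1
```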

> -- 
> Rowan Collins
> [IMSoP]