Newsgroups: php.internals
From: derick@php.net (Derick Rethans)
Date: Sun, 1 Mar 2015 20:38:54 +0000 (GMT)
To: Rowan Collins
Cc: PHP Developers Mailing List
In-Reply-To: <54F201A4.7070506@gmail.com>
Subject: Re: [PHP-DEV] [RFC] UString

On Sat, 28 Feb 2015, Rowan Collins wrote:

> On 28/02/2015 06:48, Joe Watkins wrote:
> > Morning internals,
> >
> > This is just a quick note to announce my intention to ready this RFC
> > for voting next week.
> >
> > I know I'm a little late maybe, I was real sick most of last week, so
> > couldn't do anything useful.
> >
> > A couple of us intend to fix outstanding issues on github and those
> > raised here, tidy the RFC and open the vote for 7.
> >
> > I would ask anyone interested to scan through this thread and announce
> > concerns that are not mentioned asap.
>
> I still think this class is trying to do several jobs, and not doing any of
> them very well, and I fear that people will see this class and expect it to
> solve problems which it actually ignores.
>
> Here are some concrete use cases I would like a simple interface to solve for
> me:
>
> - Take text from an ISO 8859-2 data source, pass it through generic text
> filters, and pass it to a UTF-16 data target.
> - Given a long string of Unicode text, give me a valid UTF-8 string which fits
> into a buffer with fixed byte size; i.e. give me the largest number of whole
> code points which fit into that number of bytes once encoded.
> - As above, but without stripping diacritics off the last character of the
> resulting string, i.e. give me the largest number of whole graphemes which
> fit.
> - Split a string into equal sized chunks of readable characters (graphemes),
> regardless of how many bytes or code points each chunk contains.
>
> UString currently falls short of all of these:
>
> - I can specify my input encoding (in the constructor or helper method,
> over-riding a static default, which is equivalent to ext/mbstring's global
> setting), but not my output encoding (there is no method to ask for a byte
> representation other than a string cast, which by definition has no
> parameters).

Yeah, there should be an output method to convert to a target encoding.

> - I can ask for a fixed number of code points, but don't know how many bytes
> these will take until I cast to a UTF-8 string.

As I said before, indexes into strings should not be done on code
points, as the following would then break the characters:

    $s = new Text("Ås");   // decomposed: "A" + U+030A + "s"
    echo $s->substring(1);

The output would be:

     ̊s

(a stray combining ring, U+030A, followed by "s"). Whereas:

    $s = new Text("Ås");    // precomposed: U+00C5 + "s"
    echo $s->substring(1);

would output "s". The two strings look identical, so getting different
results is not what people would expect.

> - I can't manipulate anything at the grapheme level at all, even though this
> is the most meaningful level of operation in most cases.

Yes - graphemes should be the base blocks, not code points.

> Things it does do:
>
> - a handful of methods give meaningful international text support: toUpper(),
> toLower(), trim()
> - some methods could be done on byte strings if I ensure they're all in UTF-8:
> replace(), contains(), startsWith(), endsWith(), repeat()

That doesn't always work when you have graphemes, or text in different
normalisation forms. I.e., it should consider Å (U+00C5) and Å (U+0041 +
U+030A) the same for contains and startsWith; that is, it should handle
normalisation for comparison.

> - there may be limited situations where I want to dive into the code points
> which make up a string, although I can't think of many: $length, pad(),
> indexOf(), lastIndexOf(), charAt(), replaceSlice()

Break iterators on either code points, or graphemes, might work here?
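
To make the code point versus grapheme distinction concrete, here is a
minimal sketch using functions that already ship with PHP (ext/mbstring and
ext/intl) rather than the proposed class. It is not how UString or Text
would be implemented; containsNormalised() is a hypothetical helper name,
and the "\u{...}" escapes assume PHP 7.

```php
<?php
// Sketch only: requires ext/mbstring and ext/intl.

$decomposed  = "A\u{030A}s"; // "A" + U+030A COMBINING RING ABOVE + "s"
$precomposed = "\u{00C5}s";  // U+00C5 + "s"

// Code point indexing splits the combining mark from its base letter.
$byCodePoint = mb_substr($decomposed, 1, null, 'UTF-8'); // "\u{030A}s"

// Grapheme indexing treats "A" + U+030A as a single unit.
$byGrapheme = grapheme_substr($decomposed, 1);           // "s"

// Hypothetical helper: bring both sides to NFC before a byte-wise search,
// so precomposed and decomposed spellings compare as equal.
function containsNormalised(string $haystack, string $needle): bool
{
    $h = Normalizer::normalize($haystack, Normalizer::FORM_C);
    $n = Normalizer::normalize($needle, Normalizer::FORM_C);
    return $n === '' || strpos($h, $n) !== false;
}

// A break iterator walks the string one grapheme cluster at a time.
$it = IntlBreakIterator::createCharacterInstance();
$it->setText($decomposed);
$graphemes = iterator_to_array($it->getPartsIterator(), false);
// $graphemes is ["A\u{030A}", "s"]
```

Normalising to NFC only helps where a precomposed form exists; a byte-wise
search can still match part of a grapheme that has none, so a full solution
would also have to check grapheme boundaries.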
> - remaining methods avoid me creating invalid UTF-8, but don't help me
> much with real-life text: chunk(), split(), substring()
> - I can ask what codepage my Unicode string is in; I don't even understand
> what this means
>
> I think an efficient OO wrapper around ICU is a great idea, but more
> thought needs to go into what methods are exposed, and how people are
> going to use them in real code.

Yes - I agree. I think this current proposal is a good start, but it
needs to be worked out a little bit more before I think we should vote
on it, however much I would like to see something like this in PHP.

cheers,
Derick

-- 
http://derickrethans.nl | http://xdebug.org
Like Xdebug? Consider a donation: http://xdebug.org/donate.php
twitter: @derickr and @xdebug
Posted with an email client that doesn't mangle email: alpine
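
For reference, the four use cases Rowan lists earlier in the thread
(transcoding, truncating to a byte budget on code point or grapheme
boundaries, and chunking by grapheme) can be roughly sketched with existing
functions. This is an illustration under stated assumptions, not part of the
RFC: all four function names are hypothetical, the code relies on
ext/mbstring plus PCRE's \X (extended grapheme cluster) escape, and the
"\u{...}" escapes assume PHP 7.

```php
<?php
// Sketch only: hypothetical helpers built on ext/mbstring and PCRE.

// Largest prefix of whole code points that fits in $maxBytes of UTF-8.
function truncateToCodePoints(string $utf8, int $maxBytes): string
{
    return mb_strcut($utf8, 0, $maxBytes, 'UTF-8');
}

// Largest prefix of whole grapheme clusters that fits in $maxBytes.
function truncateToGraphemes(string $utf8, int $maxBytes): string
{
    preg_match_all('/\X/u', $utf8, $m);
    $out = '';
    foreach ($m[0] as $grapheme) {
        if (strlen($out) + strlen($grapheme) > $maxBytes) {
            break;
        }
        $out .= $grapheme;
    }
    return $out;
}

// Split into chunks of $size graphemes, whatever their byte length.
function chunkGraphemes(string $utf8, int $size): array
{
    preg_match_all('/\X/u', $utf8, $m);
    return array_map('implode', array_chunk($m[0], $size));
}

// Decode from $from, run a text filter on UTF-8, then encode to $to.
function transcode(string $bytes, string $from, string $to, callable $filter): string
{
    $utf8 = mb_convert_encoding($bytes, 'UTF-8', $from);
    return mb_convert_encoding($filter($utf8), $to, 'UTF-8');
}

$text = "cafe\u{0301}s"; // "e" + U+0301 COMBINING ACUTE ACCENT, 7 bytes in all

$cp = truncateToCodePoints($text, 5); // "cafe": the accent is stripped
$gr = truncateToGraphemes($text, 5);  // "caf": the whole last grapheme is dropped
$ch = chunkGraphemes($text, 2);       // ["ca", "fe\u{0301}", "s"]

// 0xB1 is U+0105 in ISO 8859-2; upper-cased it becomes U+0104.
$upper = function (string $s): string {
    return mb_strtoupper($s, 'UTF-8');
};
$utf16 = transcode("\xB1", 'ISO-8859-2', 'UTF-16LE', $upper); // "\x04\x01"
```

The difference between $cp and $gr is the point of the second and third use
cases: cutting on code points keeps the result valid UTF-8 but can change the
last visible character, while cutting on graphemes cannot.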