Newsgroups: php.internals
Message-ID: <54F201A4.7070506@gmail.com>
Date: Sat, 28 Feb 2015 17:57:56 +0000
To: internals@lists.php.net
Subject: Re: [PHP-DEV] [RFC] UString
From: rowan.collins@gmail.com (Rowan Collins)

On 28/02/2015 06:48, Joe Watkins wrote:
> Morning internals,
>
>      This is just a quick note to announce my intention to ready this RFC
> for voting next week.
>
>      I know I'm a little late maybe, I was real sick most of last week, so
> couldn't do anything useful.
>
>      A couple of us intend to fix outstanding issues on github and those
> raised here, tidy the RFC and open the vote for 7.
>
>     I would ask anyone interested to scan through this thread and announce
> concerns that are not mentioned asap.

I still think this class is trying to do several jobs and doing none 
of them very well. I fear that people will see it and expect it to 
solve problems which it actually ignores.

Here are some concrete use cases I would like a simple interface to 
solve for me:

- Take text from an ISO 8859-2 data source, pass it through generic 
text filters, and pass it to a UTF-16 data target.
- Given a long string of Unicode text, give me a valid UTF-8 string 
which fits into a buffer with fixed byte size; i.e. give me the largest 
number of whole code points which fit into that number of bytes once 
encoded.
- As above, but without stripping diacritics off the last character of 
the resulting string, i.e. give me the largest number of whole graphemes 
which fit.
- Split a string into equal sized chunks of readable characters 
(graphemes), regardless of how many bytes or code points each chunk 
contains.
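
To make the first two cases concrete, here is a rough sketch in Python 
(chosen only because it is self-contained; it is not a proposal for the 
UString API). The function names transcode() and truncate_utf8() are 
made up for illustration, and max_bytes is assumed to be non-negative.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes from the source encoding, re-encode in the target."""
    return data.decode(source).encode(target)


def truncate_utf8(text: str, max_bytes: int) -> bytes:
    """Return the longest UTF-8 prefix of text that fits in max_bytes
    without cutting a code point in half (max_bytes assumed >= 0)."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return encoded
    cut = max_bytes
    # Continuation bytes look like 10xxxxxx; back off until the byte
    # just past the cut starts a new code point.
    while cut > 0 and (encoded[cut] & 0xC0) == 0x80:
        cut -= 1
    return encoded[:cut]
```

Note that truncate_utf8() only respects code point boundaries: it can 
still separate a base letter from a combining accent that follows it, 
which is exactly the problem in the third case.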

UString currently falls short of all of these:

- I can specify my input encoding (in the constructor or a helper 
method, overriding a static default which is equivalent to 
ext/mbstring's global setting), but not my output encoding: the only 
way to ask for a byte representation is a string cast, which by 
definition takes no parameters.
- I can ask for a fixed number of code points, but I don't know how 
many bytes these will take until I cast to a UTF-8 string.
- I can't manipulate anything at the grapheme level at all, even though 
this is the most meaningful level of operation in most cases.
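
As an illustration of what grapheme-level operations could look like, 
here is a rough Python sketch. It is a deliberate simplification: 
clusters() treats a base character plus any following combining marks 
as one unit, which is only an approximation of real grapheme 
segmentation (it does not handle Hangul jamo, emoji sequences and so 
on; ICU's break iterators do that properly). All three function names 
are hypothetical.

```python
import unicodedata


def clusters(text: str) -> list:
    """Approximate graphemes: attach combining marks to the
    character before them. NOT full Unicode segmentation."""
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out


def truncate_clusters(text: str, max_bytes: int) -> bytes:
    """Largest run of whole clusters whose UTF-8 form fits max_bytes."""
    kept = bytearray()
    for cluster in clusters(text):
        encoded = cluster.encode("utf-8")
        if len(kept) + len(encoded) > max_bytes:
            break
        kept += encoded
    return bytes(kept)


def chunk_clusters(text: str, size: int) -> list:
    """Split text into chunks of `size` clusters each (last may be short)."""
    units = clusters(text)
    return ["".join(units[i:i + size]) for i in range(0, len(units), size)]
```

With the input "cafe" + U+0301 + "s" and a 5-byte limit, 
truncate_clusters() keeps "caf" rather than a bare "e" with its accent 
stripped, because the accented letter needs 3 bytes and no longer fits.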

Things it does do:

- a handful of methods give meaningful international text support: 
toUpper(), toLower(), trim()
- some methods would work just as well on plain byte strings, provided 
I ensure they're all in UTF-8: replace(), contains(), startsWith(), 
endsWith(), repeat()
- there may be limited situations where I want to dive into the code 
points which make up a string, although I can't think of many: $length, 
pad(), indexOf(), lastIndexOf(), charAt(), replaceSlice()
- the remaining methods stop me from creating invalid UTF-8, but don't 
help me much with real-life text: chunk(), split(), substring()
- I can ask what codepage my Unicode string is in; I don't even 
understand what this means
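
On the second point above, a quick Python sketch of why those 
operations do not need code point awareness: in valid UTF-8, the 
encoding of one character never appears inside the encoding of 
another, so searching, replacing and repeating give the same answer on 
raw bytes as on decoded text. The helper same_on_bytes() is 
hypothetical, and it assumes a non-empty needle and that both strings 
use the same normalisation form.

```python
def same_on_bytes(haystack: str, needle: str) -> bool:
    """Check that text-level and UTF-8 byte-level operations agree
    (needle assumed non-empty)."""
    h = haystack.encode("utf-8")
    n = needle.encode("utf-8")
    return (
        (needle in haystack) == (n in h)
        and haystack.startswith(needle) == h.startswith(n)
        and haystack.endswith(needle) == h.endswith(n)
        and (haystack * 2).encode("utf-8") == h * 2
        and haystack.replace(needle, "?").encode("utf-8")
        == h.replace(n, b"?")
    )
```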

I think an efficient OO wrapper around ICU is a great idea, but more 
thought needs to go into what methods are exposed, and how people are 
going to use them in real code.

Regards,
-- 
Rowan Collins
[IMSoP]