Newsgroups: php.internals
Message-ID: <54F201A4.7070506@gmail.com>
Date: Sat, 28 Feb 2015 17:57:56 +0000
To: internals@lists.php.net
Subject: Re: [PHP-DEV] [RFC] UString
From: rowan.collins@gmail.com (Rowan Collins)

On 28/02/2015 06:48, Joe Watkins wrote:
> Morning internals,
>
> This is just a quick note to announce my intention to ready this RFC
> for voting next week.
>
> I know I'm a little late maybe, I was real sick most of last week, so
> couldn't do anything useful.
>
> A couple of us intend to fix outstanding issues on github and those
> raised here, tidy the RFC and open the vote for 7.
>
> I would ask anyone interested to scan through this thread and announce
> concerns that are not mentioned asap.

I still think this class is trying to do several jobs, and not doing any of them very well, and I fear that people will see this class and expect it to solve problems which it actually ignores.

Here are some concrete use cases I would like a simple interface to solve for me:

- Take text from an ISO 8859-2 data source, pass it through generic text filters, and pass it to a UTF-16 data target.
- Given a long string of Unicode text, give me a valid UTF-8 string which fits into a buffer with fixed byte size; i.e. give me the largest number of whole code points which fit into that number of bytes once encoded.
- As above, but without stripping diacritics off the last character of the resulting string, i.e. give me the largest number of whole graphemes which fit.
- Split a string into equal sized chunks of readable characters (graphemes), regardless of how many bytes or code points each chunk contains.

UString currently falls short of all of these:

- I can specify my input encoding (in the constructor or helper method, over-riding a static default, which is equivalent to ext/mbstring's global setting), but not my output encoding (there is no method to ask for a byte representation other than a string cast, which by definition has no parameters).
- I can ask for a fixed number of code points, but don't know how many bytes these will take until I cast to a UTF-8 string.
- I can't manipulate anything at the grapheme level at all, even though this is the most meaningful level of operation in most cases.

Things it does do:

- a handful of methods give meaningful international text support: toUpper(), toLower(), trim()
- some methods could be done on byte strings if I ensure they're all in UTF-8: replace(), contains(), startsWith(), endsWith(), repeat()
- there may be limited situations where I want to dive into the code points which make up a string, although I can't think of many: $length, pad(), indexOf(), lastIndexOf(), charAt(), replaceSlice()
- remaining methods avoid me creating invalid UTF-8, but don't help me much with real-life text: chunk(), split(), substring()
- I can ask what codepage my Unicode string is in; I don't even understand what this means.

I think an efficient OO wrapper around ICU is a great idea, but more thought needs to go into what methods are exposed, and how people are going to use them in real code.

Regards,

--
Rowan Collins
[IMSoP]
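
The use cases listed above (transcoding, and fitting text into a byte budget at the code point level versus the grapheme level) can be sketched as follows. This is only an illustration in Python, not the UString API or any PHP function; the names transcode, fit_code_points and fit_with_marks are hypothetical. The grapheme handling is a deliberate simplification that only keeps combining marks attached to their base character, whereas real grapheme clusters follow the fuller rules of Unicode UAX #29.

```python
import unicodedata


def transcode(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes from one encoding and re-encode them in another."""
    return data.decode(source).encode(target)


def fit_code_points(text: str, max_bytes: int) -> str:
    """Longest prefix of whole code points whose UTF-8 form fits max_bytes."""
    cut = text.encode("utf-8")[:max_bytes]
    # The slice may end inside a multi-byte sequence. Because the input was
    # valid UTF-8, "ignore" drops only that incomplete trailing sequence.
    return cut.decode("utf-8", errors="ignore")


def fit_with_marks(text: str, max_bytes: int) -> str:
    """As fit_code_points, but never split a base character from its marks.

    ASSUMPTION: a "grapheme" here is a base character plus any following
    combining marks. Real grapheme clusters (UAX #29) cover more cases.
    """
    end = len(fit_code_points(text, max_bytes))
    # If the first code point that did not fit is a combining mark, the last
    # cluster in the prefix is incomplete: drop its marks and its base too.
    if end < len(text) and unicodedata.combining(text[end]):
        while end > 0 and unicodedata.combining(text[end - 1]):
            end -= 1
        end = max(end - 1, 0)
    return text[:end]
```

With the input "cafe" followed by U+0301 (combining acute accent) and a 5-byte budget, fit_code_points returns "cafe", silently losing the accent, while fit_with_marks returns "caf". That difference is the gap between the second and third use cases.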