Newsgroups: php.internals
Date: Sun, 1 Mar 2015 20:25:47 +0000 (GMT)
From: derick@php.net (Derick Rethans)
To: Joe Watkins
Cc: PHP Internals
Subject: Re: [PHP-DEV] [RFC] UString

Hey Joe,

I
think there are a few issues with the proposal, although I like the
general idea. I've had the tab with the RFC open since October... but
never looked at it until now :-/. So, a few comments:

- UString as a name.

I think I am going to prefer "Text" as a class name. Unicode (and
intl/icu) have lots of operators acting on items containing unicode
strings. But they are really pieces of text. For example sentences, word
break iterators, etc. UString *feels* clunky, and not "standard". If
it's going to be part of PHP core, then we should pick a "core" name. (I
might prefer String, but that's going to cause a whole lot of issues
obviously).

- "Needs More Methods"

I had a look at the API that this section links to, and I miss operators
like iterators: over words, sentences, characters, etc. Basically the
functionality of
http://docs.php.net/manual/en/class.intlbreakiterator.php,
http://docs.php.net/manual/en/class.intlrulebasedbreakiterator.php and
http://docs.php.net/manual/en/class.intlcodepointbreakiterator.php

I realize intl already implements this, but it's really beneficial to
have for a "Text" class - especially for replacing functionality where
people now loop over a string with a character index.

- "Not a full String API Replacement"

I would certainly expect more from it than just the UnicodeString API.
Perhaps not for a first iteration, but certainly for subsequent
versions. Things like transliterations, and specifically iterators would
be high on my list.

- "Patch"

toUpper/toLower: there is a missing one for toTitle.

- In the code's README:

"Note: UString is interchangable with zend strings for method parameters
and can be cast for output/conversion to zend strings"

How does that work? And what would it convert to?

- How are "characters" counted?

Is a character a Code Point, or is a character a base character +
combining diacritics?
In the first form, A + ° is considered as two characters; in the second
option, just one. For wordwrap, splice, substring, it is really
important that only the *full sequence* is considered as a character.
And hence, a character really should be the full sequence. The text in
"charAt" seems to contradict that, and that is a mistake.

In the original PHP 6 we didn't do that due to performance reasons, but
that point is moot now as only people who opt into using "Text" will
suffer from this.

- "trim"

What is a leading or trailing space? Is it just U+0020, or other Unicode
defined space characters as well? (U+00A0 comes to mind here)

- What is "UG(defaultpad)" about?

- For the code:

  - there is some interesting, non-standard whitespacing going on:
    - { goes on the next line after a func decl
    - sometimes 4 spaces instead of a tab are used for indentation

- Why is there no __toString()?

- How can other extensions, not really making use of "Text", use these
  strings (as UTF-8 strings, for example)?

cheers,
Derick

On Sat, 28 Feb 2015, Joe Watkins wrote:

> Morning internals,
>
> This is just a quick note to announce my intention to ready this RFC
> for voting next week.
>
> I know I'm a little late maybe, I was real sick most of last week, so
> couldn't do anything useful.
>
> A couple of us intend to fix outstanding issues on github and those
> raised here, tidy the RFC and open the vote for 7.
>
> I would ask anyone interested to scan through this thread and announce
> concerns that are not mentioned asap.
>
> Cheers
> Joe
>
> On Fri, Oct 24, 2014 at 3:01 PM, Chris Wright wrote:
>
> > On 24 October 2014 07:03, Joe Watkins wrote:
> >
> >> On Thu, 2014-10-23 at 12:54 -0700, Stas Malyshev wrote:
> >> > Hi!
> >> >
> >> > > P.S. u() is a bad name, will break lots of code, i.e.
> >> >
> >> > Maybe __u()? It's a bit ugly but you're not allowed to use __ so it's safe.
> >> >
> >>
> >> /me cringes ...
> >>
> >> I wonder how much of a problem it really is, usually when we say some
> >> function name is a problem is because of hundreds and hundreds of
> >> results on github.
> >>
> >> If it's a huge problem then we should rename it, if we have to dig
> >> around for a single project that's incompatible, or even a handful, then
> >> it's not really a problem.
> >>
> >> Cheers
> >> Joe
> >
> >
> > I can see this being something relatively common. While I personally would
> > never do it, there are a few reasons I can think of that people *might* do
> > it:
> >
> > - Wrapper for creating HTML output
> > - urlencode() shortcut
> > - (obviously) various unicode-related things
> >
> > Searching on codesearch [1] revealed (amongst a few other hits on the
> > first page) another interesting use of it in the hhvm test suite [2]. It's
> > difficult to search for this because all the available public search
> > engines that I know of do fuzzy matching.
> >
> > Sorry. This sucks, because every other option we have for this is sucks.
> >
> > On the bright side, anything chosen could always be aliased at the top of
> > the file:
> >
> > use function __u as u;
> >
> > This also sucks, but it sucks a little bit less because the collisions are
> > avoided - or at least, avoided in such a way that the onus is on the user -
> > and one can still have the sane name.
> >
> > First-class support at the syntax level (presumably $foo = u"unicode
> > string" since we already have $foo = b"binary string") would IMO be better
> > and (hopefully?) a long-term goal, but I am aware that it is - and probably
> > should be - outside the scope of the current proposal.
> >
> > [1] https://searchcode.com/?q=function+u+lang%3Aphp
> > [2]
> > https://github.com/facebook/hhvm/blob/master/hphp/test/slow/ext_icu/uspoof.php#L13
> >
>

-- 
http://derickrethans.nl | http://xdebug.org
Like Xdebug?
Consider a donation: http://xdebug.org/donate.php
twitter: @derickr and @xdebug
Posted with an email client that doesn't mangle email: alpine
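
A footnote on the 'How are "characters" counted?' point above: a minimal
PHP sketch of the difference between counting code points and counting
full sequences (grapheme clusters). It assumes PHP 7 or later with the
mbstring and intl extensions loaded. The helper names countCodePoints()
and countGraphemes() are hypothetical and are not part of the UString
patch or the RFC, and using U+030A COMBINING RING ABOVE as the combining
mark is an assumption about what the "A + °" example intends.

```php
<?php
// Hypothetical helpers for illustration only; not part of UString.

// Counts Unicode code points (the "first form" in the message above).
function countCodePoints(string $s): int
{
    return mb_strlen($s, 'UTF-8');
}

// Counts grapheme clusters: a base character plus its combining marks
// is counted once (the "second form" in the message above).
function countGraphemes(string $s): int
{
    return grapheme_strlen($s);
}

// "A" followed by U+030A COMBINING RING ABOVE; displays as a single glyph.
$text = "A\u{030A}";

echo strlen($text), " bytes\n";                 // 3 (UTF-8 encoded)
echo countCodePoints($text), " code points\n";  // 2
echo countGraphemes($text), " graphemes\n";     // 1

// Cutting by code point splits the sequence and drops the ring;
// cutting by grapheme keeps the full sequence together.
$byCodePoint = mb_substr($text, 0, 1, 'UTF-8');
$byGrapheme  = grapheme_substr($text, 0, 1);
```

With code point semantics, a substring or wordwrap operation can separate
the base character from its diacritic, which is the behaviour the message
argues against for a "Text" class.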