Newsgroups: php.internals
Xref: news.php.net php.internals:84121
References: <1413875212.2624.3.camel@localhost.localdomain> <54469840.3070708@sugarcrm.com>
<1414051917.2624.35.camel@localhost.localdomain> <1414060726.2624.60.camel@localhost.localdomain> <1414072403.3228.3.camel@kuechenschabe> <87D717D5-273B-4A32-A3E5-83EBDFD314CB@ajf.me> <1414077690.3228.12.camel@kuechenschabe> <54495CF6.30608@sugarcrm.com> <1414130585.2624.64.camel@localhost.localdomain>
Date: Mon, 2 Mar 2015 05:59:39 +0900
To: Derick Rethans
Cc: Joe Watkins, PHP Internals
Subject: Re: [PHP-DEV] [RFC] UString
From: yohgaki@ohgaki.net (Yasuo Ohgaki)

Hi Joe and Derick,

On Mon, Mar 2, 2015 at 5:25 AM, Derick Rethans wrote:

> I think there are a few issues with the proposal, although I like the
> general idea. I've had the tab with the RFC open since October... but
> never looked at it until now :-/. So, a few comments:
>
> - UString as a name.
>
> I think I am going to prefer "Text" as a class name. Unicode (and
> intl/icu) have lots of operators acting on items containing unicode
> strings. But they are really pieces of text. For example sentences, word
> break iterators, etc. UString *feels* clunky, and not "standard". If
> it's going to be part of PHP core, then we should pick a "core" name. (I
> might prefer String, but that's going to cause a whole lot of issues
> obviously).

I think it is better to store "string/text" data in one fixed
encoding/codepage. Conversion between Unicode encodings is cheap (cheap
compared to conversion to other encodings such as SJIS, EUC, ISO-2022,
etc.), but UTF-8 is the better choice because:

- PHP's PCRE functions support Unicode only through UTF-8
- SQLite stores text as UTF-8 by default
- PHP uses UTF-8 as the default charset now
- Recent web apps use UTF-8 as their encoding
- A single encoding for stored text/strings is simpler
- Considering normalization, having UTF-8 with NFC is less confusing
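To illustrate the normalization point, here is a minimal sketch. It assumes PHP 7+ (for the `\u{...}` escape) and relies only on PCRE's UTF-8 mode. The helper names `countCodePoints()` and `countGraphemes()` are hypothetical and are not part of the UString patch; ext/intl's `Normalizer` is used only if it happens to be loaded.

```php
<?php
// Two spellings of the same visible character "é".
$nfc = "\u{00E9}";   // NFC: U+00E9 LATIN SMALL LETTER E WITH ACUTE
$nfd = "e\u{0301}";  // NFD: "e" followed by U+0301 COMBINING ACUTE ACCENT

// Hypothetical helper: count code points in a UTF-8 string.
function countCodePoints(string $s): int
{
    // With /u, "." matches one code point; /s lets it match newlines too.
    return (int) preg_match_all('/./su', $s);
}

// Hypothetical helper: count user-perceived characters.
function countGraphemes(string $s): int
{
    // \X matches one extended grapheme cluster (base + combining marks).
    return (int) preg_match_all('/\X/u', $s);
}

var_dump($nfc === $nfd);          // bool(false): the byte sequences differ
var_dump(strlen($nfc));           // int(2) bytes
var_dump(strlen($nfd));           // int(3) bytes
var_dump(countCodePoints($nfc));  // int(1)
var_dump(countCodePoints($nfd));  // int(2)
var_dump(countGraphemes($nfc));   // int(1)
var_dump(countGraphemes($nfd));   // int(1)

// If ext/intl is available, normalizing to NFC makes the two comparable.
if (class_exists('Normalizer')) {
    var_dump(Normalizer::normalize($nfd, Normalizer::FORM_C) === $nfc);
}
```

With mixed normalization forms, byte comparison and code point counts disagree even though the text looks identical, which is why storing everything as UTF-8 in NFC is less confusing.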

However, I don't mind much if any encoding can be stored in a
"Text"/"UString" object. IIRC, Ruby does this and does not have much
trouble with it. If we support multiple encodings, we need to decide
how cases like these behave:

  $new = $str_utf8 . $str_sjis;       // Is $new UTF-8 or SJIS? Raise an error?
  $new = $str_nfc . $str_nfd;         // Is $new NFC, NFD, or mixed? Raise an error?
  $new = $str_utf16le . $str_utf16be; // What is $new? How is the BOM handled?

> - "Needs More Methods"
>
> I had a look at the API that that links to, and I miss operators like
> iterators. Over words, sentences, characters, etc. Basically the
> functionality of
> http://docs.php.net/manual/en/class.intlbreakiterator.php,
> http://docs.php.net/manual/en/class.intlrulebasedbreakiterator.php and
> http://docs.php.net/manual/en/class.intlcodepointbreakiterator.php
>
> I realize intl already implements this, but it's really beneficial to
> have for a "Text" class - especially for replacing functionality where
> people now look over a string - with a character index.

Some features are missing, but we may be able to implement most of them
before release.

> - "Not a full String API Replacement"
>
> I would certainly expect more from it than just the UnicodeString API.
> Perhaps not for a first iteration, but certainly for subsequent
> versions. Things like transliterations, and specifically iterators would
> be high on my list.

Sounds good.

> - "Patch"
>
> toUpper/toLower, there is a missing one for toTitle
>
> - In the code's README:
>
> "Note: UString is interchangeable with zend strings for method parameters
> and can be cast for output/conversion to zend strings"
>
> How does that work? And what would it convert to?

I guess Joe means it uses zend_string internally?

> - How are "characters" counted?
>
> Is a character a Code Point, or is a character a base character +
> combining diacritics. In the first form, A + ° is considered as two
> characters, in the second option, just one.
> For wordwrap, splice,
> substring, it is really important that only the *full sequence* is
> considered as a character. And hence, a character really should be the
> full sequence. The text in "charAt" seems to contradict that, and that
> is a mistake.

This is one reason I prefer NFC.

> In the original PHP 6 we didn't do that due to performance reasons, but
> that point is moot now as only people who opt into using "Text" will
> suffer from this.
>
> - "trim"
>
> What is a leading or trailing space? Is it just U+0020, or other Unicode
> defined space characters as well? (U+00A0 comes to mind here)

It is better to trim any kind of "space".

> - What is "UG(defaultpad)," about?
>
> - For the code:
>
>   - there is some interesting, non standard whitespacing going on:
>
>     - { goes on next line after a func decl
>     - sometimes 4 spaces instead of a tab are used for indentation,
>
> - Why is there no __toString() ?

If it is missing, __toString() should be added.

> - How can other extensions, not really making use of "Text", use their
>   strings (as UTF8 strings f.e.)

I agree that the internal API needs improvement.

Overall, I think this is a good starting point once the basic issue is
resolved. The most important question is whether it supports a single
encoding or multiple encodings for stored text/strings. There are many
things programmers need to know if multiple encodings are supported, but
I don't object strongly to supporting them. It would be nice to be able
to handle SJIS, ISO-2022, etc. natively.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net