Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:84121
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Received-SPF: pass (pb1.pair.com: domain gmail.com designates 209.85.216.172 as permitted sender)
MIME-Version: 1.0
Sender: yohgaki@gmail.com
In-Reply-To: <alpine.DEB.2.11.1503012002140.5941@whisky.home.derickrethans.nl>
References: <1413875212.2624.3.camel@localhost.localdomain>
 <CA+9eiLtVUsbRQA9S0eagVfOc3aovhuak=QQmnBKp_BmgN4n28w@mail.gmail.com>
 <CAGKNXQEYquY_YQvTRv7uAhaycyCNG1FRwX_GYZ1dnHfhC-biZg@mail.gmail.com>
 <54469840.3070708@sugarcrm.com> <1414051917.2624.35.camel@localhost.localdomain>
 <CA+9eiLsC1dD9vp=uWqdYTXj7hC=mBnOMSs5B6KFngOQPd1m4OA@mail.gmail.com>
 <1414060726.2624.60.camel@localhost.localdomain> <1414072403.3228.3.camel@kuechenschabe>
 <87D717D5-273B-4A32-A3E5-83EBDFD314CB@ajf.me> <1414077690.3228.12.camel@kuechenschabe>
 <54495CF6.30608@sugarcrm.com> <1414130585.2624.64.camel@localhost.localdomain>
 <CAGAGxbaKK4P2D5epQTzEzNx3jcBP7XCPexGA4gyA3hdx3p+UKg@mail.gmail.com>
 <CAL=_i_nDz0XoVA5d4gqinYg9GinJUMW55+HLQ4y9O3Nj_URoxw@mail.gmail.com> <alpine.DEB.2.11.1503012002140.5941@whisky.home.derickrethans.nl>
Date: Mon, 2 Mar 2015 05:59:39 +0900
Message-ID: <CAGa2bXY+hJBwhe3DGpLj2JpGxu0c2+N9dzhH+TpNX480W0-A=w@mail.gmail.com>
To: Derick Rethans <derick@php.net>
Cc: Joe Watkins <pthreads@pthreads.org>, PHP Internals <internals@lists.php.net>
Content-Type: multipart/alternative; boundary=001a11433500a49e2405104063ed
Subject: Re: [PHP-DEV] [RFC] UString
From: yohgaki@ohgaki.net (Yasuo Ohgaki)

--001a11433500a49e2405104063ed
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Hi Joe and Derick,

On Mon, Mar 2, 2015 at 5:25 AM, Derick Rethans <derick@php.net> wrote:

> I think there are a few issues with the proposal, although I like the
> general idea. I've had the tab with the RFC open since October... but
> never looked at it until now :-/. So, a few comments:
>
> - UString as a name.
>
> I think I am going to prefer "Text" as a class name. Unicode (and
> intl/icu) have lots of operators acting on items containing unicode
> strings. But they are really pieces of text. For example sentences, word
> break iterators, etc. UString *feels* clunky, and not "standard". If
> it's going to be part of PHP core, then we should pick a "core" name. (I
> might prefer String, but that's going to cause a whole lot of issues
> obviously).
>

I think it's better to store "string/text" data in one fixed
encoding/codepage. Conversion between Unicode encodings is cheap
(cheap compared to conversion to other encodings such as SJIS, EUC,
ISO-2022, etc.), but UTF-8 is still the better choice because

 - PCRE only supports UTF-8
 - SQLite only supports UTF-8
 - PHP uses UTF-8 as the default now
 - Recent web apps use UTF-8 as their encoding
 - Single encoding for stored text/string is simpler
 - Considering normalization, having UTF-8 with NFC is less confusing.

However, I don't mind too much if a "Text"/"UString" object is allowed
to store any encoding. IIRC, Ruby does this without much trouble.

If we support multiple encodings, we have to decide what these produce:

$new = $str_utf8 . $str_sjis; // $new is UTF-8 or SJIS? Raise error?
$new = $str_nfc . $str_nfd; // $new is NFC or NFD, mixed? Raise error?
$new = $str_utf16le . $str_utf16be; // $new is ?? How is the BOM handled?
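
To make the first question concrete, here is a minimal sketch using plain
PHP strings and the mbstring extension. It is not the RFC's API: the helper
name concat_as_utf8() is hypothetical, and "always convert to UTF-8" is
just one possible answer, picked for illustration. It assumes the file is
saved as UTF-8.

```php
<?php
// Hypothetical helper (not part of the UString RFC): convert every operand
// to UTF-8 before concatenating. Requires the mbstring extension.
function concat_as_utf8(array $parts): string
{
    $out = '';
    foreach ($parts as [$bytes, $encoding]) {
        // mb_convert_encoding(string, to_encoding, from_encoding)
        $out .= mb_convert_encoding($bytes, 'UTF-8', $encoding);
    }
    return $out;
}

$str_utf8 = 'PHP: ';
$str_sjis = mb_convert_encoding('日本語', 'SJIS', 'UTF-8');

// Plain byte concatenation mixes two encodings in one byte string.
$naive = $str_utf8 . $str_sjis;
$naiveIsValidUtf8 = mb_check_encoding($naive, 'UTF-8');

// Converting first keeps the result in a single, known encoding.
$new = concat_as_utf8([[$str_utf8, 'UTF-8'], [$str_sjis, 'SJIS']]);
```

With byte strings the engine cannot know which encoding each operand uses,
so the caller has to pass it in. An object that records its own encoding
could do the conversion itself, or raise an error instead.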


> - "Needs More Methods"
>
> I had a look at the API that that links to, and I miss operators like
> iterators. Over words, sentences, characters, etc. Basically the
> functionality of
> http://docs.php.net/manual/en/class.intlbreakiterator.php,
> http://docs.php.net/manual/en/class.intlrulebasedbreakiterator.php and
> http://docs.php.net/manual/en/class.intlcodepointbreakiterator.php
>
> I realize intl already implements this, but it's really beneficial to
> have for a "Text" class - especially for replacing functionality where
> people now look over a string - with a character index.
>
>
Some features are missing... We may be able to implement most of them
before release.


> - "Not a full String API Replacement"
>
> I would certainly expect more from it than just the UnicodeString API.
> Perhaps not for a first iteration, but certainly for subsequent
> versions. Things like transliterations, and specifically iterators would
> be high on my list.
>

Sounds good.


>
> - "Patch"
>
> toUpper/toLower, there is a missing one for toTitle
>
> - In the code's README:
>
> "Note: UString is interchangable with zend strings for method parameters
> and can be cast for output/conversion to zend strings"
>
> How does that work? And what would it convert to?
>

I guess Joe means it's using zend_string internally?


>
> - How are "characters" counted?
>
> Is a character a Code Point, or is a character a base character +
> combining diacritics. In the first form, A + ° is considered as
> characters, in the second option, just one. For wordwrap, splice,
> substring, it is really important that only the *full sequence* is
> considered as a character. And hence, a character really should be the
> full sequence. The text in "charAt" seems to contradict that, and that
> is a mistake.
>

This is one reason I prefer NFC.
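
A small sketch of the difference, using the existing mbstring and intl
extensions rather than the proposed class. The sample string (A followed by
U+030A COMBINING RING ABOVE) is my own choice of a comparable case, and the
"\u{...}" escapes need PHP 7.0 or later.

```php
<?php
// Requires the mbstring and intl extensions.
$decomposed = "A\u{030A}";   // 'A' followed by U+030A COMBINING RING ABOVE

$codePoints = mb_strlen($decomposed, 'UTF-8');   // counts code points
$graphemes  = grapheme_strlen($decomposed);      // counts grapheme clusters

// NFC composes the pair into the single code point U+00C5.
$composed = Normalizer::normalize($decomposed, Normalizer::FORM_C);
$composedCodePoints = mb_strlen($composed, 'UTF-8');

printf("%d %d %d\n", $codePoints, $graphemes, $composedCodePoints);
```

NFC only helps where a precomposed form exists, so counting grapheme
clusters is still needed for the general case.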


>
> In the original PHP 6 we didn't do that due to performance reasons, but
> that point is moot now as only people who opt into using "Text" will
> suffer from this.
>
> - "trim"
>
> What is a leading or trailing space? Is it just U+0020, or other Unicode
> defined space characters as well? (&nbsp;, U+00A0 comes to mind here)
>

I think any Unicode "space" character should be trimmed.
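
As a rough sketch of what that could mean with today's functions: the
helper name unicode_trim() is hypothetical, and treating "space" as the
Unicode separator category \p{Z} plus PCRE's \s is my assumption, not
something the RFC defines.

```php
<?php
// Hypothetical helper: strip leading and trailing Unicode separators (\p{Z})
// and whitespace matched by \s. Requires PCRE with UTF-8 support.
function unicode_trim(string $text): string
{
    // preg_replace() returns null on error (e.g. invalid UTF-8 input);
    // fall back to the unmodified input in that case.
    return preg_replace('/^[\p{Z}\s]+|[\p{Z}\s]+$/u', '', $text) ?? $text;
}

// U+00A0 NO-BREAK SPACE, U+3000 IDEOGRAPHIC SPACE, U+2003 EM SPACE
$padded = "\u{00A0}\u{3000} hello world \u{2003}";

$trimmedByBuiltin = trim($padded);        // byte-based, default character list
$trimmedByHelper  = unicode_trim($padded);
```

The built-in trim() leaves this input untouched because its default list
covers only a few ASCII characters, and the string starts and ends with
multi-byte spaces.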


>
> - What is "UG(defaultpad)," about?
>
> - For the code:
>
>   - there is some interesting, non-standard whitespacing going on:
>
>     - { goes on next line after a func decl
>     - sometimes 4 spaces instead of a tab are used for indentation,
>
> - Why is there no __toString() ?
>

If it is missing, __toString() should be added.
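
For reference, a minimal sketch of what __toString() provides. DemoText is
a hypothetical stand-in, not the RFC's UString, and it assumes the object
keeps its data as UTF-8.

```php
<?php
// Hypothetical stand-in for a text class; it is not the RFC's UString.
final class DemoText
{
    private $utf8;

    public function __construct(string $utf8)
    {
        $this->utf8 = $utf8;
    }

    // Called whenever the object is used where a string is expected.
    public function __toString(): string
    {
        return $this->utf8;
    }
}

$text = new DemoText('héllo');
$joined = 'say: ' . $text;          // implicit conversion via __toString()
$length = strlen((string) $text);   // explicit cast; strlen() counts bytes
```

The cast hands back raw bytes, so which encoding __toString() returns would
have to be defined if the object can hold more than one.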


>
> - How can other extensions, not really making use of "Text", use their
>   strings (as UTF8 strings f.e.)
>

I agree that the internal API needs improvement.

Overall, I think it's a good start once the basic issue is resolved.
The most important question is whether it supports a single encoding or
multiple encodings for stored text/strings.
There are many things programmers need to know if multiple encodings are
supported, but I don't object strongly to multiple encoding support. It
would be nice to be able to handle SJIS, ISO-2022, etc. natively.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

--001a11433500a49e2405104063ed--