[RFC] UString

10 years ago by Leigh — view source

Morning internalz,
    https://wiki.php.net/rfc/ustring

    This is the result of work done by a few of us, we won't be opening any
vote in a fortnight. We have a long time before 7, there is no rush
whatever.
    Now seems like a good time to start the conversation so we can hash out
the details, or get on with other things ;)

Breaks nothing, faster than mbstring, seems like win/win to me.

On the flip side, implementing UString as a scalar object would be inconsistent. At time of writing, array, int, float, bool, etc have no implementation available for this.

I agree it shouldn't be a scalar object, but how about some operator
overloading like the GMP object has, so that you don't have to cast to
string for expected behaviour with type coercion etc.

Right now there are user-space libraries out there that cover a lot more functionality than UString.

Do you need help implementing these? Do you think it would be
beneficial to briefly list which areas need attention on the RFC, so
they can be checked off over time?

Overall +1 on the concept.

10 years ago by Joe Watkins — view source

Morning internalz,
    https://wiki.php.net/rfc/ustring

    This is the result of work done by a few of us, we won't be opening any
vote in a fortnight. We have a long time before 7, there is no rush
whatever.
    Now seems like a good time to start the conversation so we can hash out
the details, or get on with other things ;)
Breaks nothing, faster than mbstring, seems like win/win to me.

On the flip side, implementing UString as a scalar object would be inconsistent. At time of writing, array, int, float, bool, etc have no implementation available for this.

I agree it shouldn't be a scalar object, but how about some operator
overloading like the GMP object has, so that you don't have to cast to
string for expected behaviour with type coercion etc.

Right now there are user-space libraries out there that cover a lot more functionality than UString.

Do you need help implementing these? Do you think it would be
beneficial to briefly list which areas need attention on the RFC, so
they can be checked off over time?

Overall +1 on the concept.

Morning Leigh,

ZEND_CONCAT is overloaded, as well as read_dimension and cast (to
string) handlers. This seems to cover everything, unless I missed
something ?

Cheers
Joe

10 years ago by Leigh — view source

ZEND_CONCAT is overloaded, as well as read_dimension and cast (to
string) handlers. This seems to cover everything, unless I missed
something ?

ZEND_CONCAT and ZEND_ASSIGN_CONCAT were my primary concerns, I didn't
see any mention of these in the RFC which is why I brought it up
(maybe it should be documented there).

May not be desirable at all, but obviously with ordinary strings we
can do int + "str containing int", and if the UString object
contains an int then int + (string)ustring will still achieve that.

My thought was to make the remaining operators that don't make sense
on an object implicitly cast to string before the operation takes
place.

Feel free to "do not want". :)
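
To illustrate the explicit cast mentioned above, here is a minimal sketch. NumericText is a hypothetical stand-in for UString (it only wraps a byte string and implements __toString()), and addToInt() is an illustrative helper, not part of the RFC.

```php
<?php
// Hypothetical stand-in for UString: it only wraps a byte string.
final class NumericText
{
    private $value;

    public function __construct(string $value)
    {
        $this->value = $value;
    }

    public function __toString(): string
    {
        return $this->value;
    }
}

// Cast to string first; ordinary numeric-string arithmetic then applies.
function addToInt(int $n, NumericText $text)
{
    return $n + (string) $text;
}

var_dump(addToInt(10, new NumericText('32'))); // int(42)
```

Implicit casting for the remaining operators would, in effect, make the (string) cast in addToInt() unnecessary.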

10 years ago by Zeev Suraski — view source

-----Original Message-----
From: Joe Watkins [mailto:pthreads@pthreads.org]
Sent: Tuesday, October 21, 2014 10:07 AM
To: internals@lists.php.net
Subject: [PHP-DEV] [RFC] UString

Morning internalz,

https://wiki.php.net/rfc/ustring

This is the result of work done by a few of us, we won't be opening
any vote in a fortnight. We have a long time before 7, there is no rush
whatever.

Now seems like a good time to start the conversation so we can
hash out the details, or get on with other things ;)

+1 from me. I think it's the right way to tackle Unicode.

Zeev

10 years ago by Lester Caine — view source

Now seems like a good time to start the conversation so we can hash out
the details, or get on with other things ;)

Does this address the problem of sorting array keys using a particular
language or collation?

--
Lester Caine - G8HFL

Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk

10 years ago by Joe Watkins — view source

Now seems like a good time to start the conversation so we can hash out
the details, or get on with other things ;)

Does this address the problem of sorting array keys using a particular
language or collation?

--
Lester Caine - G8HFL

Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk

No.

Cheers
Joe

10 years ago by Nicolas Grekas — view source

This is great, thanks for the work!
I think we should have an opinion on grapheme clusters and state it in
the RFC.

I do support the idea that PHP users need to handle "characters" in terms of
"graphemes". We need a core way to deal with code points of course, but
things like "reverse" have very low value without graphemes.
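
A minimal sketch of why "reverse" needs graphemes, assuming valid UTF-8 input. It uses only PCRE (the /u flag and \X, which matches one extended grapheme cluster) rather than UString itself; the function names are illustrative.

```php
<?php
// Reverse by code point: /u makes PCRE treat the subject as UTF-8.
function reverseCodePoints(string $utf8): string
{
    $points = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);
    return implode('', array_reverse($points));
}

// Reverse by grapheme cluster: \X matches one extended grapheme cluster.
function reverseGraphemes(string $utf8): string
{
    preg_match_all('/\X/u', $utf8, $matches);
    return implode('', array_reverse($matches[0]));
}

// "A" + combining ring above (U+030A), followed by "s".
$text = "A\u{030A}s";
var_dump(reverseCodePoints($text) === "s\u{030A}A"); // ring moves onto the "s"
var_dump(reverseGraphemes($text) === "sA\u{030A}");  // ring stays on the "A"
```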

toLower/toUpper also miss the Turkish specifics - or is the UString class
"locale" dependent?
Should we add "toCaseFold"? Where are the "i" versions of strpos, etc.? Do we
want them in core PHP 7? Another point we should add to the RFC.

For reference here is my grapheme cluster aware string handling:
https://github.com/nicolas-grekas/Patchwork-UTF8/blob/master/class/Patchwork/Utf8.php

and the same but turkish variant:
https://github.com/nicolas-grekas/Patchwork-UTF8/blob/master/class/Patchwork/TurkishUtf8.php

About Unicode equivalence:
For all the string matching functions (contains, startsWith, etc.), do they
handle Unicode equivalence?
How do we compare two UStrings? Does the == operator handle Unicode
equivalence? What is the way to go otherwise? Normalize beforehand on our
own?
The RFC should also address this IMHO (and state that collation/sorting
handling is out of scope).

Complex topic :)

Cheers,
Nicolas

10 years ago by Dmitry Stogov — view source

Hi Joe,

As an extension it looks fine.
I assume, you don't propose to use UString objects in engine and other
extensions.
Unfortunately, it's yet another incomplete solution.

It won't allow Unicode strings as array keys;
concatenation using "." (probably may be done),
no auto-conversion from/to script/output encoding,
no auto-conversion of strings coming from database extensions, etc

The "right" approach would be extending zend_string with "encoding" and
then adapting nearly all functions working with zend_string to take
"encoding" into account. But, of course, this is going to lead to a much more
complicated solution (with some slowdown).

If we don't care about a complete solution, the UString proposal may make
sense, at least as a faster replacement for ext/mbstring.

Thanks. Dmitry.

10 years ago by Philip Hofstetter — view source

Hello

Tangentially related:

It won't allow Unicode strings as array keys;

I wish there was a way for specific objects to opt into this.

Using __toString() we have something that mostly behaves just like a
string and can be used wherever a string is required - with the exception
of array keys.

I seem to remember some earlier discussion that led to this being
intentionally made impossible (and I understand why), but maybe there could
be support for another magic underscore method that's called when an object
is about to be put into an array as a key (or similar situations)

Philip

--
Sensational AG
Giesshübelstrasse 62c, Postfach 1966, 8021 Zürich
Tel. +41 43 544 09 60, Mobile +41 79 341 01 99
info@sensational.ch, http://www.sensational.ch

10 years ago by Florian Margaine — view source

Hi,

@Philip: please read the discussion that happened a month ago (and follow
up on it if necessary):
http://marc.info/?l=php-internals&m=141145952422734&w=2

Regards,

On Tue, Oct 21, 2014 at 11:19 AM, Philip Hofstetter <
phofstetter@sensational.ch> wrote:

Hello

Tangentially related:

It won't allow Unicode strings as array keys;

I wish there was a way for specific objects to opt into this.

Using __toString() we have something that mostly behaves just like a
string and can be used wherever a string is required - with the exception
of array keys.

I seem to remember some earlier discussion that led to this being
intentionally made impossible (and I understand why), but maybe there could
be support for another magic underscore method that's called when an object
is about to be put into an array as a key (or similar situations)

Philip

--
Sensational AG
Giesshübelstrasse 62c, Postfach 1966, 8021 Zürich
Tel. +41 43 544 09 60, Mobile +41 79 341 01 99
info@sensational.ch, http://www.sensational.ch

--
Florian Margaine

10 years ago by Stas Malyshev — view source

Hi!

I wish there was a way for specific objects to opt into this.

There will be, if __hashKey() or whatever would be the properly
bikeshedded name, becomes reality as discussed elsewhere. It shouldn't
be hard to do and it's exactly what many other languages do when trying
to use objects as keys for maps.

--
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/

10 years ago by Joe Watkins — view source

Hi!

I wish there was a way for specific objects to opt into this.

There will be, if __hashKey() or whatever would be the properly
bikeshedded name, becomes reality as discussed elsewhere. It shouldn't
be hard to do and it's exactly what many other languages do when trying
to use objects as keys for maps.

Not ready for discussion yet ...

https://wiki.php.net/rfc/hashkey

But it exists, I think it solves a problem for ustring in particular but
it solves the problem in general too. No time to write about it or
discuss it at this moment, but in pipeline, hopefully ...

Cheers
Joe

10 years ago by Dmitry Stogov — view source

this won't completely solve the problem, because array keys won't be
UString anymore.

Thanks. Dmitry.

10 years ago by Joe Watkins — view source

this won't completely solve the problem, because array keys won't be
UString anymore.

http://docs.oracle.com/javase/7/docs/api/java/lang/Object.html#hashCode()

Others solve this problem in exactly this way, the Java implementation
requires that you return an int.

The one in that draft will allow you to return any scalar. This is much
more suitable for PHP.

It doesn't solve the problem directly but allows the programmer to solve
it for themselves, just like Object.hashCode in Java.
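
A user-space sketch of the same idea, under stated assumptions: HashKeyable, Label and KeyedMap are hypothetical names, and the interface method only stands in for the magic method in the draft (restricted here to int|string keys). The map keeps the key object next to the value, so the object is still at hand when iterating.

```php
<?php
// Hypothetical interface; the hashkey draft proposes a magic method instead.
interface HashKeyable
{
    /** @return int|string scalar used as the underlying array key */
    public function hashKey();
}

// Illustrative key type, not part of any RFC.
final class Label implements HashKeyable
{
    private $text;

    public function __construct(string $text)
    {
        $this->text = $text;
    }

    public function hashKey()
    {
        return $this->text;
    }
}

// Map that stores the key object alongside the value.
final class KeyedMap
{
    private $entries = [];

    public function set(HashKeyable $key, $value): void
    {
        $this->entries[$key->hashKey()] = [$key, $value];
    }

    public function get(HashKeyable $key)
    {
        $hash = $key->hashKey();
        return isset($this->entries[$hash]) ? $this->entries[$hash][1] : null;
    }

    /** @return array list of [key object, value] pairs */
    public function pairs(): array
    {
        return array_values($this->entries);
    }
}
```

Two distinct Label objects with the same text address the same entry, which is the behaviour Object.hashCode gives map keys in Java.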

Cheers
Joe

10 years ago by johannes@schlueters.de — view source

It doesn't solve the problem directly but allows the programmer to solve
it for themselves, just like Object.hashCode in Java.

The point is that it won't work in this way:

$a = [ $ustring => $value ];
foreach ($a as $key => $v) {
    $key->ustring_method();
}

but one needs something along the lines of

$a = [ $ustring => $value ];
foreach ($a as $key => $v) {
    UString::fromHashCode($key)->ustring_method();
}

which likely loses object identity.

It works but is not really nice :-)
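
The identity loss can be sketched as follows. TextKey is a hypothetical wrapper standing in for UString, and the explicit (string) cast stands in for whatever a hash-key hook would return; both are assumptions for illustration only.

```php
<?php
// Hypothetical wrapper standing in for UString.
final class TextKey
{
    private $bytes;

    public function __construct(string $bytes)
    {
        $this->bytes = $bytes;
    }

    public function __toString(): string
    {
        return $this->bytes;
    }
}

// Rebuild wrapper objects from the keys of an ordinary PHP array.
function rewrapKeys(array $map): array
{
    $objects = [];
    foreach ($map as $key => $value) {
        // $key is a plain string here (or an int for numeric-looking keys).
        $objects[] = new TextKey((string) $key);
    }
    return $objects;
}

$original = new TextKey("h\u{00E9}llo");
$map = [(string) $original => 1];
$rewrapped = rewrapKeys($map)[0];

var_dump((string) $rewrapped === (string) $original); // same text
var_dump($rewrapped === $original);                   // bool(false): a new object
```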

johannes

10 years ago by Andrea Faulds — view source

It doesn't solve the problem directly but allows the programmer to solve
it for themselves, just like Object.hashCode in Java.

The point is that it won't work in this way:

$a = [ $ustring => $value ];
foreach ($a as $key => $v) {
    $key->ustring_method();
}

but one needs something along the lines of

$a = [ $ustring => $value ];
foreach ($a as $key => $v) {
    UString::fromHashCode($key)->ustring_method();
}

which likely loses object identity.

It works but is not really nice :-)

u($key)->split(',')->... works :)

--
Andrea Faulds
http://ajf.me/

10 years ago by johannes@schlueters.de — view source

It doesn't solve the problem directly but allows the programmer to solve
it for themselves, just like Object.hashCode in Java.

The point is that it won't work in this way:

$a = [ $ustring => $value ];
foreach ($a as $key => $v) {
    $key->ustring_method();
}

but one needs something along the lines of

$a = [ $ustring => $value ];
foreach ($a as $key => $v) {
    UString::fromHashCode($key)->ustring_method();
}

which likely loses object identity.

It works but is not really nice :-)

u($key)->split(',')->... works :)

That's something different from the original example, though, and makes
this behave not like an integral part of the language.

The proper solution would be a unicode type, but PHP 6 showed that this
is not going to work out. This is way better than what we have right
now, though, and a good step in the right direction. We can probably
integrate it into the core language more and more.

My point is to stress that this is incomplete, as Dmitry said, and that
we should not take this alone as the final solution forever.

johannes

P.S. u() is a bad name, will break lots of code, i.e.
https://code.openhub.net/file?fid=wRj6MYm-GPDxPidisWYoLa23wFc&cid=CCYlIMOwTks&s=fndef%3Au&pp=0&fl=PHP&ff=1&filterChecked=true&fp=126888&mp,=1&ml=1&me=1&md=1&projSelected=true#L0 will give "weird" runtime behavior as their definition is guarded by a function_exists check but both functions do completely different things..

10 years ago by Stas Malyshev — view source

Hi!

P.S. u() is a bad name, will break lots of code, i.e.

Maybe __u()? It's a bit ugly but you're not allowed to use __ so it's safe.

--
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/

10 years ago by Andrea Faulds — view source

P.S. u() is a bad name, will break lots of code, i.e.

Maybe __u()? It's a bit ugly but you're not allowed to use __ so it's safe.

I don't like that. This might sound crazy, but what about adding Unicode string literals to the parser, e.g. u"foo bar\u{202e}你好"? If the UString extension isn't available, just error. It wouldn't be the first time we had disableable syntax features (``), and this avoids any possible conflicts.

Andrea Faulds
http://ajf.me/

10 years ago by Joe Watkins — view source

Hi!

P.S. u() is a bad name, will break lots of code, i.e.

Maybe __u()? It's a bit ugly but you're not allowed to use __ so it's safe.

/me cringes ...

I wonder how much of a problem it really is. Usually, when we say some
function name is a problem, it is because of hundreds and hundreds of
results on github.

If it's a huge problem then we should rename it; if we have to dig
around for a single project that's incompatible, or even a handful, then
it's not really a problem.

Cheers
Joe

10 years ago by Chris Wright — view source

Hi!

P.S. u() is a bad name, will break lots of code, i.e.

Maybe __u()? It's a bit ugly but you're not allowed to use __ so it's
safe.

/me cringes ...

I wonder how much of a problem it really is, usually when we say some
function name is a problem is because of hundreds and hundreds of
results on github.

If it's a huge problem then we should rename it, if we have to dig
around for a single project that's incompatible, or even a handful, then
it's not really a problem.

Cheers
Joe

I can see this being something relatively common. While I personally would
never do it, there are a few reasons I can think of that people might do
it:

- Wrapper for creating <u> HTML output
- urlencode() shortcut
- (obviously) various unicode-related things

Searching on codesearch [1] revealed (amongst a few other hits on the first
page) another interesting use of it in the hhvm test suite [2]. It's
difficult to search for this because all the available public search
engines that I know of do fuzzy matching.

Sorry. This sucks, because every other option we have for this sucks too.

On the bright side, anything chosen could always be aliased at the top of
the file:

use function __u as u;

This also sucks, but it sucks a little bit less because the collisions are
avoided - or at least, avoided in such a way that the onus is on the user -
and one can still have the sane name.

First-class support at the syntax level (presumably $foo = u"unicode
string" since we already have $foo = b"binary string") would IMO be better
and (hopefully?) a long-term goal, but I am aware that it is - and probably
should be - outside the scope of the current proposal.

[1] https://searchcode.com/?q=function+u+lang%3Aphp
[2]
https://github.com/facebook/hhvm/blob/master/hphp/test/slow/ext_icu/uspoof.php#L13

10 years ago by Joe Watkins — view source

Morning internals,

This is just a quick note to announce my intention to ready this RFC
for voting next week.

I know I'm a little late maybe, I was real sick most of last week, so
couldn't do anything useful.

A couple of us intend to fix outstanding issues on github and those
raised here, tidy the RFC and open the vote for 7.

I would ask anyone interested to scan through this thread and announce
concerns that are not mentioned asap.

Cheers
Joe

10 years ago by Rowan Collins — view source

Morning internals,
 This is just a quick note to announce my intention to ready this RFC
for voting next week.
 I know I'm a little late maybe, I was real sick most of last week, so
couldn't do anything useful.
 A couple of us intend to fix outstanding issues on github and those
raised here, tidy the RFC and open the vote for 7.
I would ask anyone interested to scan through this thread and announce
concerns that are not mentioned asap.

I still think this class is trying to do several jobs, and not doing any
of them very well, and I fear that people will see this class and expect
it to solve problems which it actually ignores.

Here are some concrete use cases I would like a simple interface to
solve for me:

- Take text from an ISO 8859-2 data source, pass it through generic
  text filters, and pass it to a UTF-16 data target.
- Given a long string of Unicode text, give me a valid UTF-8 string
  which fits into a buffer with fixed byte size; i.e. give me the largest
  number of whole code points which fit into that number of bytes once
  encoded.
- As above, but without stripping diacritics off the last character of
  the resulting string, i.e. give me the largest number of whole graphemes
  which fit.
- Split a string into equal sized chunks of readable characters
  (graphemes), regardless of how many bytes or code points each chunk
  contains.

UString currently falls short of all of these:

- I can specify my input encoding (in the constructor or helper method,
  over-riding a static default, which is equivalent to ext/mbstring's
  global setting), but not my output encoding (there is no method to ask
  for a byte representation other than a string cast, which by definition
  has no parameters).
- I can ask for a fixed number of code points, but don't know how many
  bytes these will take until I cast to a UTF-8 string.
- I can't manipulate anything at the grapheme level at all, even though
  this is the most meaningful level of operation in most cases.
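
The byte-budget and chunking use cases can be sketched with PCRE alone, assuming valid UTF-8 input. None of these functions exist in UString; the names are illustrative, and \X is PCRE's extended-grapheme-cluster matcher.

```php
<?php
// Append whole pieces until the next one would exceed the byte budget.
function fitPieces(array $pieces, int $maxBytes): string
{
    $out = '';
    foreach ($pieces as $piece) {
        if (strlen($out) + strlen($piece) > $maxBytes) {
            break;
        }
        $out .= $piece;
    }
    return $out;
}

// Largest prefix of whole code points that fits into $maxBytes.
function fitCodePoints(string $utf8, int $maxBytes): string
{
    return fitPieces(preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY), $maxBytes);
}

// Largest prefix of whole graphemes that fits into $maxBytes.
function fitGraphemes(string $utf8, int $maxBytes): string
{
    preg_match_all('/\X/u', $utf8, $matches);
    return fitPieces($matches[0], $maxBytes);
}

// Chunks of $size graphemes each, whatever their byte or code point count.
function chunkGraphemes(string $utf8, int $size): array
{
    preg_match_all('/\X/u', $utf8, $matches);
    return array_map(function (array $chunk) {
        return implode('', $chunk);
    }, array_chunk($matches[0], $size));
}
```

With the input "aA\u{030A}b" and a budget of 2 bytes, fitCodePoints() keeps "aA" (the diacritic is stripped), while fitGraphemes() keeps only "a".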

Things it does do:

- a handful of methods give meaningful international text support:
  toUpper(), toLower(), trim()
- some methods could be done on byte strings if I ensure they're all in
  UTF-8: replace(), contains(), startsWith(), endsWith(), repeat()
- there may be limited situations where I want to dive into the code
  points which make up a string, although I can't think of many: $length,
  pad(), indexOf(), lastIndexOf(), charAt(), replaceSlice()
- remaining methods avoid me creating invalid UTF-8, but don't help me
  much with real-life text: chunk(), split(), substring()
- I can ask what codepage my Unicode string is in; I don't even
  understand what this means

I think an efficient OO wrapper around ICU is a great idea, but more
thought needs to go into what methods are exposed, and how people are
going to use them in real code.

Regards,

Rowan Collins
[IMSoP]

10 years ago by Derick Rethans — view source

Morning internals,
 This is just a quick note to announce my intention to ready this RFC
for voting next week.
 I know I'm a little late maybe, I was real sick most of last week, so
couldn't do anything useful.
 A couple of us intend to fix outstanding issues on github and those
raised here, tidy the RFC and open the vote for 7.
I would ask anyone interested to scan through this thread and announce
concerns that are not mentioned asap.
I still think this class is trying to do several jobs, and not doing any of
them very well, and I fear that people will see this class and expect it to
solve problems which it actually ignores.

Here are some concrete use cases I would like a simple interface to solve for
me:

Take text from an ISO 8859-2 data source, pass it through generic text
filters, and pass it to a UTF-16 data target.

Given a long string of Unicode text, give me a valid UTF-8 string which fits
into a buffer with fixed byte size; i.e. give me the largest number of whole
code points which fit into that number of bytes once encoded.

As above, but without stripping diacritics off the last character of the
resulting string, i.e. give me the largest number of whole graphemes which
fit.

Split a string into equal sized chunks of readable characters (graphemes),
regardless of how many bytes or code points each chunk contains.

UString currently falls short of all of these:

I can specify my input encoding (in the constructor or helper method,
over-riding a static default, which is equivalent to ext/mbstring's global
setting), but not my output encoding (there is no method to ask for a byte
representation other than a string cast, which by definition has no
parameters).

Yeah, there should be an output method to convert to a target encoding.

I can ask for a fixed number of code points, but don't know how many bytes
these will take until I cast to a UTF-8 string.

As I said before, indexes into strings should not be done on code
points, as the following would then break the characters (here Å is
written as A followed by the combining ring, U+0041 + U+030A):

$s = new Text("Ås");
echo $s->substring(1);

The output would be: ̊

Whereas, with the precomposed Å (U+00C5):

$s = new Text("Ås");
echo $s->substring(1);

would output "s".

Which is not what people would expect.
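
The difference can be reproduced without the proposed class. The sketch below assumes valid UTF-8 input and uses PCRE only; substringByCodePoint() and substringByGrapheme() are illustrative names, not methods of Text or UString.

```php
<?php
// Substring counted in code points.
function substringByCodePoint(string $utf8, int $start): string
{
    $points = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);
    return implode('', array_slice($points, $start));
}

// Substring counted in grapheme clusters (\X matches one cluster).
function substringByGrapheme(string $utf8, int $start): string
{
    preg_match_all('/\X/u', $utf8, $matches);
    return implode('', array_slice($matches[0], $start));
}

$decomposed  = "A\u{030A}s"; // A + combining ring above, then s
$precomposed = "\u{00C5}s";  // single code point U+00C5, then s

// Code points: the two spellings of the same text give different results.
var_dump(substringByCodePoint($decomposed, 1) === "\u{030A}s"); // orphaned ring
var_dump(substringByCodePoint($precomposed, 1) === "s");

// Graphemes: both spellings give "s".
var_dump(substringByGrapheme($decomposed, 1) === "s");
var_dump(substringByGrapheme($precomposed, 1) === "s");
```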

I can't manipulate anything at the grapheme level at all, even though this
is the most meaningful level of operation in most cases.

Yes - graphemes should be the base blocks, not code points.

Things it does do:

a handful of methods give meaningful international text support: toUpper(),
toLower(), trim()

some methods could be done on byte strings if I ensure they're all in UTF-8:
replace(), contains(), startsWith(), endsWith(), repeat()

That doesn't always work when you have graphemes, or text in different
normalisation forms. Ie, it should consider Å U+00C5 and Å (U+0041 +
U+030A) the same for contains and startsWith — ie, handle normalisation
for comparison.
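
A rough sketch of normalising before comparison, assuming ext/intl is loaded (it provides the Normalizer class). The helper names are hypothetical, and NFC is just one possible choice of form.

```php
<?php
// Normalise to NFC; Normalizer::normalize() returns false on invalid input.
function toNfc(string $utf8): string
{
    $normalized = Normalizer::normalize($utf8, Normalizer::FORM_C);
    if ($normalized === false) {
        throw new InvalidArgumentException('Input could not be normalised');
    }
    return $normalized;
}

// contains() on normalised copies of both strings.
function containsNormalized(string $haystack, string $needle): bool
{
    $needle = toNfc($needle);
    return $needle === '' || strpos(toNfc($haystack), $needle) !== false;
}

// startsWith() on normalised copies of both strings.
function startsWithNormalized(string $haystack, string $prefix): bool
{
    $prefix = toNfc($prefix);
    return strncmp(toNfc($haystack), $prefix, strlen($prefix)) === 0;
}
```

A plain strpos() of "A\u{030A}" in "\u{00C5}s" finds nothing, because the bytes differ; containsNormalized() treats the two spellings as the same text.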

there may be limited situations where I want to dive into the code points
which make up a string, although I can't think of many: $length, pad(),
indexOf(), lastIndexOf(), charAt(), replaceSlice()

Break iterators on either code points, or graphemes, might work here?

remaining methods avoid me creating invalid UTF-8, but don't help me
much with real-life text: chunk(), split(), substring() - I can ask
what codepage my Unicode string is in; I don't even understand what
this means

I think an efficient OO wrapper around ICU is a great idea, but more
thought needs to go into what methods are exposed, and how people are
going to use them in real code.

Yes - I agree. I think this current proposal is a good start, but it
needs to be worked out a little bit more before I think we should vote
on it, however much I would like to see something like this in PHP.

cheers,
Derick

--
http://derickrethans.nl | http://xdebug.org
Like Xdebug? Consider a donation: http://xdebug.org/donate.php
twitter: @derickr and @xdebug
Posted with an email client that doesn't mangle email: alpine

10 years ago by Yasuo Ohgaki — view source

Hi Joe,

This is just a quick note to announce my intention to ready this RFC
for voting next week.
I know I'm a little late maybe, I was real sick most of last week, so
couldn't do anything useful.
A couple of us intend to fix outstanding issues on github and those
raised here, tidy the RFC and open the vote for 7.

I would ask anyone interested to scan through this thread and announce
concerns that are not mentioned asap.

I appreciate your proposal!
Rowan pointed out some important things. I don't understand the details, as
I haven't read your code yet. I'll try to read it and comment in a few days.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

10 years ago by Yasuo Ohgaki — view source

Hi Joe,

On Sat, Feb 28, 2015 at 3:48 PM, Joe Watkins pthreads@pthreads.org
wrote:
This is just a quick note to announce my intention to ready this RFC
for voting next week.
I know I'm a little late maybe, I was real sick most of last week, so
couldn't do anything useful.
A couple of us intend to fix outstanding issues on github and those
raised here, tidy the RFC and open the vote for 7.

I would ask anyone interested to scan through this thread and announce
concerns that are not mentioned asap.
I appreciate your proposal!
Rowan pointed out some important things. I don't understand details as I
don't read your code yet. I'll try to read and comment in a few days.

I guess you would like to start voting today or tomorrow, so I briefly read
your code.
I think your approach is good. I like that UString is always UTF-8 by
default, regardless of other settings, i.e. default_charset and
internal_encoding.

I see a few key APIs missing that would be critical for multibyte character
handling, such as string length, string width, normalization, string
conversions like Zenkaku to Hankaku, and an encoding (codepage) converter.
However, all of these may be added later, as they are already implemented
in ICU.

I think it may be better for UString to use UTF-8 always, to make users'
lives a little simpler. Your constructor only has a codepage setting, which
is used as the UString's codepage in order to support other codepages
(encodings).

Rather than supporting various encodings, I think the constructor needs an
encoding (codepage) conversion feature. The codepage parameter is better
used as a "from encoding (codepage)" parameter, converting any encoding
(codepage) to UTF-8. If conversion fails, it should raise an exception.
It's better to have a forgiving API for malformed strings only if the user
explicitly asks for it.

Constructor may be

public function __construct([string $string [, string $source_codepage
[, string $substitute_char]]]);

$source_codepage is the source string's encoding (codepage), and $string is
always converted to UTF-8.
If $substitute_char is omitted, raise an exception for an invalid $string.
If $substitute_char is specified (it can be '', the empty string), convert
$string according to $source_codepage and just remove/replace invalid byte
sequences in $string.

With this constructor, the string stored in a UString object is always valid
UTF-8. Any character encoding (including UTF-16/32 and the 200 encoding
names supported by ICU) may be used for the source string.
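
A minimal sketch of the "always UTF-8 inside, convert at the edges" idea. Assumptions: Utf8Text is a hypothetical class, ext/mbstring stands in for ICU's converters, and only the strict path (exception on invalid input, no substitution) is shown. An output method is included because a string cast cannot take a target encoding.

```php
<?php
// Hypothetical class; mbstring stands in for ICU's converters here.
final class Utf8Text
{
    private $bytes;

    private function __construct(string $validUtf8)
    {
        $this->bytes = $validUtf8;
    }

    // Strict named constructor: invalid input raises, nothing is substituted.
    public static function fromEncoding(string $bytes, string $sourceEncoding = 'UTF-8'): self
    {
        if (!mb_check_encoding($bytes, $sourceEncoding)) {
            throw new InvalidArgumentException("Input is not valid $sourceEncoding");
        }
        return new self(mb_convert_encoding($bytes, 'UTF-8', $sourceEncoding));
    }

    // Byte representation in a caller-chosen target encoding.
    public function toEncoding(string $targetEncoding): string
    {
        return mb_convert_encoding($this->bytes, $targetEncoding, 'UTF-8');
    }

    // The string cast always yields UTF-8.
    public function __toString(): string
    {
        return $this->bytes;
    }
}
```

For example, Utf8Text::fromEncoding("\xB3", 'ISO-8859-2') holds the UTF-8 bytes for U+0142, and toEncoding('UTF-16BE') returns "\x01\x42".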

Since there will be no variable codepage setting for UString objects,
the following may be removed.

public static function getDefaultCodepage();
public static function setDefaultCodepage(string $codepage);

ICU uses "codepage" to mean "character encoding", but it may be better to
say "character encoding", as people are not used to ICU terminology.

This is what I thought. I didn't read your code carefully, so I might be
wrong. Please correct me if I'm mistaken.

I suppose there are other people working on simpler Unicode string
libraries. I would like to hear their opinions.

BTW, we really need byte_len(). strlen() is just a confusing API... It's
outside the scope of this RFC, though.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

10 years ago by Yasuo Ohgaki — view source

Hi Joe,

public function __construct([string $string [, string $source_codepage
[, string $substitute_char]]]);

One additional comment on the constructor: it should have a default
normalization. I think it should be NFC, as most systems use it. (OSX uses
NFD for filenames! I hate it, and most Japanese developers hate it.)

The API may be

public function __construct([string $string [, string $source_codepage [,
string $substitute_char [, $normalization]]]]);

If $substitute_char is NULL, disallow invalid encoding.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

10 years ago by Derick Rethans — view source

Hi Joe,

public function __construct([string $string [, string $source_codepage
[, string $substitute_char]]]);

One additional comment on the constructor. It should apply a default
normalization. I think it should be NFC, as most systems use it. (OS X
uses NFD for filenames! I hate it, and most Japanese developers hate
it.)

The API may be

public function __construct([string $string [, string $source_codepage [,
string $substitute_char [, $normalization]]]]);

I wouldn't make normalization a constructor option, and it certainly
shouldn't be done by default. I would suggest separate (mutable) methods
to convert between normalisation forms.

If $substitute_char is NULL, disallow invalid encoding.

I don't think substitutions (i.e., data loss) should be allowed at all. This
should throw an immediate exception. If you really want this, I suggest
adding a factory method for it, e.g. Text::createWithSubstitutions,
or whatever better name.
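
A rough sketch of the API shape suggested here, with a strict constructor and an explicitly named lossy factory. This is not the RFC's implementation: the class body is hypothetical, it accepts UTF-8 input only, and it relies on preg_match() with the /u modifier plus mb_scrub() from ext/mbstring.

```php
<?php
declare(strict_types=1);

// Hypothetical sketch of the API shape only; the RFC's class is UString and
// its real behaviour may differ.
final class Text
{
    /** @var string always valid UTF-8 */
    private $utf8;

    public function __construct(string $utf8)
    {
        // preg_match() with the /u modifier fails on malformed UTF-8.
        if (preg_match('//u', $utf8) !== 1) {
            throw new InvalidArgumentException('Input is not valid UTF-8');
        }
        $this->utf8 = $utf8;
    }

    /** Lossy on purpose: the name makes the data loss visible at the call site. */
    public static function createWithSubstitutions(string $bytes): self
    {
        // mb_scrub() replaces invalid sequences with mbstring's current
        // substitute character ("?" unless configured otherwise).
        return new self(mb_scrub($bytes, 'UTF-8'));
    }

    public function __toString(): string
    {
        return $this->utf8;
    }
}

$repaired = Text::createWithSubstitutions("ab\xFFcd");
```

The design point is that a plain `new Text(...)` can never lose data silently; anyone who wants repair has to ask for it by name.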

cheers,
Derick

10 years ago by Derick Rethans — view source

Hey Joe,

I think there are a few issues with the proposal, although I like the
general idea. I've had the tab with the RFC open since October... but
never looked at it until now :-/. So, a few comments:

UString as a name.

I think I am going to prefer "Text" as a class name. Unicode (and
intl/icu) have lots of operators acting on items containing unicode
strings. But they are really pieces of text. For example sentences, word
break iterators, etc. UString feels clunky, and not "standard". If
it's going to be part of PHP core, then we should pick a "core" name. (I
might prefer String, but that's going to cause a whole lot of issues
obviously).

"Needs More Methods"

I had a look at the API that it links to, and I miss operators like
iterators. Over words, sentences, characters, etc. Basically the
functionality of
http://docs.php.net/manual/en/class.intlbreakiterator.php,
http://docs.php.net/manual/en/class.intlrulebasedbreakiterator.php and
http://docs.php.net/manual/en/class.intlcodepointbreakiterator.php

I realize intl already implements this, but it's really beneficial to
have for a "Text" class - especially for replacing functionality where
people now loop over a string with a character index.

"Not a full String API Replacement"

I would certainly expect more from it than just the UnicodeString API.
Perhaps not for a first iteration, but certainly for subsequent
versions. Things like transliterations, and specifically iterators would
be high on my list.

"Patch"

toUpper/toLower, there is a missing one for toTitle

In the code's README:

"Note: UString is interchangable with zend strings for method parameters
and can be cast for output/conversion to zend strings"

How does that work? And what would it convert to?

How are "characters" counted?

Is a character a code point, or is a character a base character plus its
combining diacritics? In the first form, A + ° is considered as two
characters; in the second, just one. For wordwrap, splice, and
substring, it is really important that only the full sequence is
considered as a character. Hence, a character really should be the
full sequence. The text in "charAt" seems to contradict that, and that
is a mistake.
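
The difference between the two definitions can be shown with PCRE alone, as a sketch: under the /u modifier "." matches one code point, while \X matches one extended grapheme cluster (a base character plus its combining marks). The function names are hypothetical, and U+030A (combining ring above) is assumed as the second code point.

```php
<?php
declare(strict_types=1);

// Hypothetical helpers; neither name comes from the UString patch.
function count_code_points(string $utf8): int
{
    $n = preg_match_all('/./su', $utf8); // "." with /u matches one code point
    if ($n === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $n;
}

function count_graphemes(string $utf8): int
{
    $n = preg_match_all('/\X/u', $utf8); // \X matches one grapheme cluster
    if ($n === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $n;
}

$text = "A\u{030A}"; // "A" followed by a combining ring above

$codePoints = count_code_points($text); // 2
$graphemes  = count_graphemes($text);   // 1
```

A substring or wordwrap built on code points could cut between the A and its ring; one built on clusters could not.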

In the original PHP 6 we didn't do that for performance reasons, but
that point is moot now, as only people who opt into using "Text" will
suffer from this.

"trim"

What is a leading or trailing space? Is it just U+0020, or other Unicode
defined space characters as well? (The no-break space, U+00A0, comes to mind here.)
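
To illustrate the question: PHP's built-in trim() strips only a fixed set of ASCII bytes by default, so a no-break space survives it. The unicode_trim() function below is a hypothetical sketch of the broader interpretation, using the Unicode separator property \p{Z} through PCRE; it is not what the UString patch does.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: trims ASCII whitespace plus every code point in the
// Unicode "separator" category (which includes U+00A0).
function unicode_trim(string $utf8): string
{
    $trimmed = preg_replace('/\A[\p{Z}\s]+|[\p{Z}\s]+\z/u', '', $utf8);
    if ($trimmed === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $trimmed;
}

$padded = "\u{00A0}hello\u{00A0}";

$asciiTrimmed   = trim($padded);         // unchanged: U+00A0 is not in trim()'s default list
$unicodeTrimmed = unicode_trim($padded); // "hello"
```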

What is "UG(defaultpad)," about?

For the code:
- there is some interesting, non-standard whitespacing going on:
  - { goes on the next line after a func decl
  - sometimes 4 spaces instead of a tab are used for indentation

Why is there no __toString()?

How can other extensions, not really making use of "Text", use these
strings (as UTF-8 strings, for example)?

cheers,
Derick

Morning internals,
This is just a quick note to announce my intention to ready this RFC
for voting next week.
I know I'm a little late maybe, I was real sick most of last week, so
couldn't do anything useful.
A couple of us intend to fix outstanding issues on github and those
raised here, tidy the RFC and open the vote for 7.

I would ask anyone interested to scan through this thread and announce
concerns that are not mentioned asap.

Cheers
Joe

Hi!

P.S. u() is a bad name, will break lots of code, i.e.

Maybe __u()? It's a bit ugly but you're not allowed to use __ so it's
safe.

/me cringes ...

I wonder how much of a problem it really is, usually when we say some
function name is a problem is because of hundreds and hundreds of
results on github.

If it's a huge problem then we should rename it, if we have to dig
around for a single project that's incompatible, or even a handful, then
it's not really a problem.

Cheers
Joe

I can see this being something relatively common. While I personally would
never do it, there are a few reasons I can think of that people might do
it:

Wrapper for creating <u> HTML output

urlencode() shortcut

(obviously) various unicode-related things

Searching on codesearch [1] revealed (amongst a few other hits on the
first page) another interesting use of it in the hhvm test suite [2]. It's
difficult to search for this because all the available public search
engines that I know of do fuzzy matching.

Sorry. This sucks, because every other option we have for this sucks.

On the bright side, anything chosen could always be aliased at the top of
the file:

use function __u as u;

This also sucks, but it sucks a little bit less because the collisions are
avoided - or at least, avoided in such a way that the onus is on the user -
and one can still have the sane name.
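
A minimal sketch of the aliasing idea, assuming the helper ends up being called __u(). The function body is a hypothetical stand-in that only validates UTF-8 and returns the string, since the real helper would return a UString object.

```php
<?php
use function __u as u;

// Hypothetical stand-in for the RFC's helper: it validates and returns a
// plain string instead of constructing a UString.
function __u(string $bytes): string
{
    if (preg_match('//u', $bytes) !== 1) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $bytes;
}

// u() is resolved to __u() at compile time, and only within this file.
$label = u("caf\u{00E9}");
```

Because the alias is file-local, a project that already defines its own u() simply omits the use statement and nothing collides.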

First-class support at the syntax level (presumably $foo = u"unicode
string" since we already have $foo = b"binary string") would IMO be better
and (hopefully?) a long-term goal, but I am aware that it is - and probably
should be - outside the scope of the current proposal.

[1] https://searchcode.com/?q=function+u+lang%3Aphp
[2]
https://github.com/facebook/hhvm/blob/master/hphp/test/slow/ext_icu/uspoof.php#L13

--
http://derickrethans.nl | http://xdebug.org
Like Xdebug? Consider a donation: http://xdebug.org/donate.php
twitter: @derickr and @xdebug
Posted with an email client that doesn't mangle email: alpine

10 years ago by Florian Margaine — view source

Hi,

On 1 March 2015 21:26, "Derick Rethans" derick@php.net wrote:

Hey Joe,

I think there are a few issues with the proposal, although I like the
general idea. I've had the tab with the RFC open since October... but
never looked at it until now :-/. So, a few comments:

UString as a name.

I think I am going to prefer "Text" as a class name. Unicode (and
intl/icu) have lots of operators acting on items containing unicode
strings. But they are really pieces of text. For example sentences, word
break iterators, etc. UString feels clunky, and not "standard". If
it's going to be part of PHP core, then we should pick a "core" name. (I
might prefer String, but that's going to cause a whole lot of issues
obviously).

Isn't this "solved" if we use \php\String?

--

Cheers,
Florian Margaine

10 years ago by Yasuo Ohgaki — view source

Hi Florian,

On Mon, Mar 2, 2015 at 5:57 AM, Florian Margaine florian@margaine.com
wrote:

On 1 March 2015 21:26, "Derick Rethans" derick@php.net wrote:

Hey Joe,

I think there are a few issues with the proposal, although I like the
general idea. I've had the tab with the RFC open since October... but
never looked at it until now :-/. So, a few comments:

UString as a name.

I think I am going to prefer "Text" as a class name. Unicode (and
intl/icu) have lots of operators acting on items containing unicode
strings. But they are really pieces of text. For example sentences, word
break iterators, etc. UString feels clunky, and not "standard". If
it's going to be part of PHP core, then we should pick a "core" name. (I
might prefer String, but that's going to cause a whole lot of issues
obviously).

Isn't this "solved" if we use \php\String?

I suppose we need the "Context Sensitive Lexer" for "String", but I guess it
will pass.

Let's at least use a namespace for new internal classes.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

10 years ago by Yasuo Ohgaki — view source

Hi Joe and Derick,

I think there are a few issues with the proposal, although I like the
general idea. I've had the tab with the RFC open since October... but
never looked at it until now :-/. So, a few comments:

UString as a name.

I think I am going to prefer "Text" as a class name. Unicode (and
intl/icu) have lots of operators acting on items containing unicode
strings. But they are really pieces of text. For example sentences, word
break iterators, etc. UString feels clunky, and not "standard". If
it's going to be part of PHP core, then we should pick a "core" name. (I
might prefer String, but that's going to cause a whole lot of issues
obviously).

I think it's better to store "string/text" data in one fixed encoding/codepage.
Although conversion between Unicode encodings is cheap (I mean cheap compared
to conversion to other encodings such as SJIS, EUC, ISO-2022, etc.), UTF-8
is better because:

- PCRE only supports UTF-8
- SQLite only supports UTF-8
- PHP uses UTF-8 as the default now
- Recent web apps use UTF-8 as their encoding
- A single encoding for stored text/strings is simpler
- Considering normalization, having UTF-8 with NFC is less confusing

However, I don't mind too much allowing any encoding to be stored in a "Text"/
"UString" object. IIRC, Ruby does this and does not have much of a problem.

If we support multiple encodings, we should decide how cases like these behave:

$new = $str_utf8 . $str_sjis; // Is $new UTF-8 or SJIS? Raise an error?
$new = $str_nfc . $str_nfd; // Is $new NFC, NFD, or mixed? Raise an error?
$new = $str_utf16le . $str_utf16be; // What is $new? How is the BOM handled?
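
The first of these cases can be reproduced today with plain byte strings and ext/mbstring, which shows why it needs an answer. This is only an illustration of the problem; how a multi-encoding UString would actually behave is exactly the open question.

```php
<?php
$word = "\u{65E5}\u{672C}\u{8A9E}"; // "日本語" as UTF-8

$utf8 = $word;
$sjis = mb_convert_encoding($word, 'SJIS', 'UTF-8'); // same text, SJIS bytes

// Naive concatenation mixes two encodings in one byte string.
$mixed        = $utf8 . $sjis;
$mixedIsValid = mb_check_encoding($mixed, 'UTF-8'); // false

// Converting to one encoding before concatenating avoids the problem.
$fixed        = $utf8 . mb_convert_encoding($sjis, 'UTF-8', 'SJIS');
$fixedIsValid = mb_check_encoding($fixed, 'UTF-8'); // true
```

With a single internal encoding the conversion happens once, at construction, and the question never arises at concatenation time.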

"Needs More Methods"

I had a look at the API that it links to, and I miss operators like
iterators. Over words, sentences, characters, etc. Basically the
functionality of
http://docs.php.net/manual/en/class.intlbreakiterator.php,
http://docs.php.net/manual/en/class.intlrulebasedbreakiterator.php and
http://docs.php.net/manual/en/class.intlcodepointbreakiterator.php

I realize intl already implements this, but it's really beneficial to
have for a "Text" class - especially for replacing functionality where
people now loop over a string with a character index.

There are missing features... We may implement most of them before
release.

"Not a full String API Replacement"

I would certainly expect more from it than just the UnicodeString API.
Perhaps not for a first iteration, but certainly for subsequent
versions. Things like transliterations, and specifically iterators would
be high on my list.

Sounds good.

"Patch"

toUpper/toLower, there is a missing one for toTitle

In the code's README:

"Note: UString is interchangable with zend strings for method parameters
and can be cast for output/conversion to zend strings"

How does that work? And what would it convert to?

I guess Joe means it's using zend_string internally?

How are "characters" counted?

Is a character a code point, or is a character a base character plus its
combining diacritics? In the first form, A + ° is considered as two
characters; in the second, just one. For wordwrap, splice, and
substring, it is really important that only the full sequence is
considered as a character. And hence, a character really should be the
full sequence. The text in "charAt" seems to contradict that, and that
is a mistake.

One reason I prefer NFC.

In the original PHP 6 we didn't do that for performance reasons, but
that point is moot now as only people who opt into using "Text" will
suffer from this.

"trim"

What is a leading or trailing space? Is it just U+0020, or other Unicode
defined space characters as well? (The no-break space, U+00A0, comes to mind here.)

It is better to trim any kind of "space".

What is "UG(defaultpad)," about?

For the code:

- there is some interesting, non-standard whitespacing going on:
  - { goes on the next line after a func decl
  - sometimes 4 spaces instead of a tab are used for indentation

Why is there no __toString()?

If it is missing, __toString() should be added.

How can other extensions, not really making use of "Text", use these
strings (as UTF-8 strings, for example)?

I agree that the internal API needs improvement.

Overall, I think it's a good start, provided the basic issue is resolved.
The most important question is whether it supports a single encoding or
multiple encodings for stored text/strings.
There are many things programmers must know if multiple encodings are
supported, but I don't object strongly to multiple encoding support. It
would be nice to be able to handle SJIS, ISO-2022, etc. natively.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

10 years ago by Rowan Collins — view source

However, I don't mind too much allowing any encoding stored in "Text"/
"UString" object. IIRC, Ruby does this and have not much problem.

As I understand it, Ruby's string type is actually a whole bunch of
overloaded types, each responsible for re-implementing the various
methods available. This leads to a whole bunch of "partially supported"
encodings/codepages, which is a big pile of "leaky abstraction" for the
small benefit of removing re-encoding operations in a few scenarios.

Unicode is explicitly designed to supersede all previous encodings, so
it makes perfect sense to me to use it to internally represent what
the user just wants to think of as "text". The fact that within that
internal representation you need some byte-level encoding then leads to
the optimisation of using a byte-level encoding the user is likely to
use as input and output, i.e. UTF-8.

Regards,

--
Rowan Collins
[IMSoP]

10 years ago by Lester Caine — view source

This is just a quick note to announce my intention to ready this RFC

for voting next week.

Since there is nothing in this which needs any changes to the core then
surely it simply needs to exist in pecl until such time as a proper
replacement for unicode in core strings has been addressed? Since it
will still require intl to provide those areas it does not support, and
I question if we really need to provide yet another encoding converter.

A unicode string handler that just handles UTF8 strings may be yet
another stepping stone, but it still falls short of being able to
handle all of the internationalization problems and is simply an
alternate to mbstring so one either runs both, or sit down and convert
all the third party libraries to eliminate mbstring.

Like http extension, it's not essential that it's loaded by default, and
leaving it in pecl allows development outside that of the core?

--
Lester Caine - G8HFL

Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk

10 years ago by Yasuo Ohgaki — view source

Hi Lester,

This is just a quick note to announce my intention to ready this RFC
for voting next week.
Since there is nothing in this which needs any changes to the core then
surely it simply needs to exist in pecl until such time as a proper
replacement for unicode in core strings has been addressed? Since it
will still require intl to provide those areas it does not support, and
I question if we really need to provide yet another encoding converter.

A unicode string handler that just handles UTF8 strings may be yet
another stepping stone, but it still falls short of being able to
handle all of the internationalization problems and is simply an
alternate to mbstring so one either runs both, or sit down and convert
all the third party libraries to eliminate mbstring.

Like http extension, it's not essential that it's loaded by default, and
leaving it in pecl allows development outside that of the core?

Although the current code does not seem to have operator overloading like
GMP, I'm sure we'll have this before release, i.e.

$new = $some_ustring . 'abc'; // $new is UString object

To implement feature like this, it cannot be PECL.

My only concern with this RFC is performance. Since it is only loosely
integrated into PHP core, efficiency may suffer. I suppose other people are
working on simpler and tighter integration into core. Any comments on this?

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

10 years ago by Rowan Collins — view source

Although it seems current code does not have code like GMP. I'm sure
we'll have this before release. i.e.

$new = $some_ustring . 'abc'; // $new is UString object

To implement feature like this, it cannot be PECL.

Why not? I would have thought any extension can hook into the operator
overloading API that GMP uses, just as they can hook into other object
behaviours.

Is there some difference between how "bundled" and PECL extensions are
loaded that would prevent this?
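
For context on what the overloading would change, here is what concatenation does with an ordinary userland object today. Utf8Text is a hypothetical stand-in class; without an engine-level operator handler (the mechanism GMP uses internally), the "." operator falls back to __toString() and the result is a plain string, not an object.

```php
<?php
// Hypothetical stand-in class with no operator overloading.
final class Utf8Text
{
    /** @var string */
    private $utf8;

    public function __construct(string $utf8)
    {
        $this->utf8 = $utf8;
    }

    public function __toString(): string
    {
        return $this->utf8;
    }
}

$text = new Utf8Text('xyz');

// "." converts the object via __toString(), so the object type is lost.
$new      = $text . 'abc';
$isObject = $new instanceof Utf8Text; // false: $new is the string "xyzabc"
```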

Regards,

--
Rowan Collins
[IMSoP]

10 years ago by Yasuo Ohgaki — view source

Hi Rowan,

On Mon, Mar 2, 2015 at 6:32 AM, Rowan Collins rowan.collins@gmail.com
wrote:

Although it seems current code does not have code like GMP. I'm sure
we'll have this before release. i.e.

$new = $some_ustring . 'abc'; // $new is UString object

To implement feature like this, it cannot be PECL.

Why not? I would have thought any extension can hook into the operator
overloading API that GMP uses, just as they can hook into other object
behaviours.

Is there some difference between how "bundled" and PECL extensions are
loaded that would prevent this?

OK. I missed that the GMP improvement included generic operator overloading.
If the current implementation is good enough for UString, it could be PECL.
Otherwise, we could add the missing parts to core so that UString can live
in PECL.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

10 years ago by Rowan Collins — view source

 This is just a quick note to announce my intention to ready this RFC
for voting next week.
Since there is nothing in this which needs any changes to the core then
surely it simply needs to exist in pecl until such time as a proper
replacement for unicode in core strings has been addressed? Since it
will still require intl to provide those areas it does not support, and
I question if we really need to provide yet another encoding converter.
A unicode string handler that just handles UTF8 strings may be yet
another stepping stone, but it still falls short of being able to
handle all of the internationalization problems and is simply an
alternate to mbstring so one either runs both, or sit down and convert
all the third party libraries to eliminate mbstring.

Like http extension, it's not essential that it's loaded by default, and
leaving it in pecl allows development outside that of the core?

I think this is probably a good idea at this stage. It will give people
a chance to play around with it in an "experimental" state before
committing to maintaining a particular API.

Since there's no real BC break here, there's no reason it couldn't be
bundled into 7.1 if it was deemed ready by then, so it seems unwise to
rush into including it in 7.0 straight from what feels like a prototype
implementation.

Regards,

--
Rowan Collins
[IMSoP]

10 years ago by Yasuo Ohgaki — view source

Hi Joe and Rowan,

On Mon, Mar 2, 2015 at 6:37 AM, Rowan Collins rowan.collins@gmail.com
wrote:

 This is just a quick note to announce my intention to ready this RFC
for voting next week.
Since there is nothing in this which needs any changes to the core then
surely it simply needs to exist in pecl until such time as a proper
replacement for unicode in core strings has been addressed? Since it
will still require intl to provide those areas it does not support, and
I question if we really need to provide yet another encoding converter.

A unicode string handler that just handles UTF8 strings may be yet
another stepping stone, but it still falls short of being able to
handle all of the internationalization problems and is simply an
alternate to mbstring so one either runs both, or sit down and convert
all the third party libraries to eliminate mbstring.

Like http extension, it's not essential that it's loaded by default, and
leaving it in pecl allows development outside that of the core?
I think this is probably a good idea at this stage. It will give people a
chance to play around with it in an "experimental" state before committing
to maintaining a particular API.

Since there's no real BC break here, there's no reason it couldn't be
bundled into 7.1 if it was deemed ready by then, so it seems unwise to rush
into including it in 7.0 straight from what feels like a prototype
implementation.

Sounds reasonable.

Joe, I don't have much time, but I'm willing to help with UString
development.
I think it's better to keep it simple. Having a unified internal encoding
(NFC-normalized UTF-8 without a BOM) for the internal string representation
would be much simpler than multiple encodings.

We may consider various issues/ideas like these in the relatively long term.
http://websec.github.io/unicode-security-guide/character-transformations/
http://docs.parrot.org/parrot/latest/html/docs/pdds/pdd28_strings.pod.html

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

10 years ago by Yasuo Ohgaki — view source

Hi Joe and Rowan,

We used to have EXPERIMENTAL modules.
How about shipping this as an EXPERIMENTAL module in the source distribution?
It would get more attention and development would be faster.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

10 years ago by Andrea Faulds — view source

this won't completely solve the problem, because array keys won't be
UString anymore.

Sure, but unless we turn arrays into SplObjectStorage that won’t change. Nobody wants to touch arrays and make them support other key types. Heck, my bigint RFC doesn’t even do that.
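
A small sketch of the two workarounds available today, using a hypothetical Key class: casting to string gives value-based array keys but loses the object, while SplObjectStorage keeps the object but compares by identity rather than by value.

```php
<?php
// Hypothetical value object standing in for a UString-like class.
final class Key
{
    /** @var string */
    private $value;

    public function __construct(string $value)
    {
        $this->value = $value;
    }

    public function __toString(): string
    {
        return $this->value;
    }
}

$a = new Key('id');
$b = new Key('id'); // equal text, different object

// Workaround 1: cast to string. Keys compare by value, but are plain strings.
$byValue = [];
$byValue[(string) $a] = 1;
$byValue[(string) $b] = 2; // overwrites the first entry

// Workaround 2: SplObjectStorage. Keys stay objects, but compare by identity.
$byIdentity = new SplObjectStorage();
$byIdentity[$a] = 1;
$byIdentity[$b] = 2; // a second, separate entry
```

Neither behaves like "an array keyed by Unicode strings", which is the gap being discussed.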

--
Andrea Faulds
http://ajf.me/

10 years ago by Stas Malyshev — view source

Hi!

Not ready for discussion yet ...

https://wiki.php.net/rfc/hashkey

Hey, I've just started my own... https://wiki.php.net/rfc/objkey I guess
we should combine them :)

--
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/

10 years ago by Joe Watkins — view source

Hi!

Not ready for discussion yet ...

https://wiki.php.net/rfc/hashkey

Hey, I've just started my own... https://wiki.php.net/rfc/objkey I guess
we should combine them :)

Happy to port the patch already written to conform to your specification
(it more or less complies, other than the name); you are welcome to go ahead
and do the RFC bit?

Cheers
Joe

10 years ago by Joe Watkins — view source

Hi!

Not ready for discussion yet ...

https://wiki.php.net/rfc/hashkey

Hey, I've just started my own... https://wiki.php.net/rfc/objkey I guess
we should combine them :)

Done, branch @ http://github.com/krakjoe/php-src/compare/hashkey

Cheers
Joe

10 years ago by Joe Watkins — view source

> Hi Joe,
>
>
> As an extension it looks fine.
>
> I assume, you don't propose to use UString objects in engine and other
> extensions.

I'm not proposing it now, no.

> Unfortunately, it's yet another incomplete solution.
>
> It won't allow Unicode strings as array keys;

The engine doesn't allow that, couldn't we find a way of using objects
as array keys ?? It doesn't seem like a limitation of the extension, to
me ;)

> concatenation using "." (probably may be done),

That's already done.

> no auto-conversion from/to script/output encoding,

That could be arranged.

> no auto-conversion of strings coming from database extensions, etc

I'm not sure how important that is, it's not a big deal to create a new
object, nor would it be a big deal for those extensions that need to
always return unicode strings to do so.
>
> The "right" approach would be extending zend_string with "encoding"
> and then adapting nearly all functions working with zend_string to take
> "encoding" into account. But, of course, this is going to lead to a much
> more complicated solution (with some slowdown).

That seems a lot like bashing our head against a wall. We tried to
introduce support everywhere and it failed. Do we really want to step on
the performance gains introduced by recent changes by making all strings
unicode?

That doesn't seem like a sensible thing to want, at least right now.

Having UString doesn't stop us approaching the problem differently in
the future, but it would have to be a very different future to even make
sense to me.

> If we don't care about a complete solution, the UString proposal may make
> sense at least as a faster replacement for ext/mbstring.

As the RFC states, we are only approaching one problem, the problem that
ext/mbstring is not a good API.
>
> Thanks. Dmitry.

> On Tue, Oct 21, 2014 at 11:06 AM, Joe Watkins
> wrote:
> Morning internalz,
>
> https://wiki.php.net/rfc/ustring
>
> This is the result of work done by a few of us, we
> won't be opening any
> vote in a fortnight. We have a long time before 7, there is no
> rush
> whatever.
>
> Now seems like a good time to start the conversation
> so we can hash out
> the details, or get on with other things ;)
>
> Cheers
> Joe
>
>
> --
>
>

Cheers
Joe

10 years ago by Dmitry Stogov — view source

>
> > Hi Joe,
> >
> >
> > As an extension it looks fine.
> >
> > I assume, you don't propose to use UString objects in engine and other
> > extensions.
>
> I'm not proposing it now, no.
>
> > Unfortunately, it's yet another incomplete solution.
> >
> > It won't allow Unicode strings as array keys;
>
> The engine doesn't allow that, couldn't we find a way of using objects
> as array keys ?? It doesn't seem like a limitation of the extension, to
> me ;)
>
> > concatenation using "." (probably may be done),
>
> That's already done.
>
> > no auto-conversion from/to script/output encoding,
>
> That could be arranged.
>
> > no auto-conversion of strings coming from database extensions, etc
>
> I'm not sure how important that is, it's not a big deal to create a new
> object, nor would it be a big deal for those extensions that need to
> always return unicode strings to do so.
> >
> > The "right" approach would be extending zend_string with "encoding"
> > and then adapting nearly all functions working with zend_string to take
> > "encoding" into account. But, of course, this is going to lead to a much
> > more complicated solution (with some slowdown).
>
> That seems a lot like bashing our head against a wall. We tried to
> introduce support everywhere and it fails. Do we really want to step on
> the performance gains introduced by recent changes by making all strings
> unicode ?
>

Yeah :)
I'm not sure whether it should be done, and I wouldn't like to work on it in
the near future, but the zend_string approach should be easier to implement
than the separate IS_UNICODE + IS_STRING + IS_BINARY types in PHP6.

> That doesn't seem like a sensible thing to want, at least right now.
>
> Having UString doesn't stop us approaching the problem differently in
> the future, but it would have to be a very different future to even make
> sense to me.
>

Agree.

>
> > If we don't care about a complete solution, the UString proposal may make
> > sense at least as a faster replacement for ext/mbstring.
>
> As the RFC states, we are only approaching one problem, the problem that
> ext/mbstring is not a good API.
>

Then, it's fine.

One note regarding the implementation: why do you use C++ for ustring.cpp? I
understand it's necessary for the ICU backend, but if in the future you might
switch to another backend (which may not require C++), why use C++ for the
PHP extension part?

Thanks. Dmitry.


10 years ago by Joe Watkins

On Tue, Oct 21, 2014 at 1:25 PM, Joe Watkins pthreads@pthreads.org
wrote:

> > > Hi Joe,
> > >
> > > As an extension it looks fine.
> > >
> > > I assume, you don't propose to use UString objects in engine and other
> > > extensions.
> >
> > I'm not proposing it now, no.
> >
> > > Unfortunately, it's yet another incomplete solution.
> > >
> > > It won't allow Unicode strings as array keys;
> >
> > The engine doesn't allow that, couldn't we find a way of using objects
> > as array keys ?? It doesn't seem like a limitation of the extension, to
> > me ;)
> >
> > > concatenation using "." (probably may be done),
> >
> > That's already done.
> >
> > > no auto-conversion from/to script/output encoding,
> >
> > That could be arranged.
> >
> > > no auto-conversion of strings coming from database extensions, etc
> >
> > I'm not sure how important that is, it's not a big deal to create a new
> > object, nor would it be a big deal for those extensions that need to
> > always return unicode strings to do so.
> >
> > > The "right" approach would be extending zend_string with "encoding"
> > > and then adapting nearly all functions working with zend_string to take
> > > "encoding" into account. But, of course, this is going to lead to a much
> > > more complicated solution (with some slowdown).
> >
> > That seems a lot like bashing our head against a wall. We tried to
> > introduce support everywhere and it fails. Do we really want to step on
> > the performance gains introduced by recent changes by making all strings
> > unicode ?

> Yeah :)

You must like punishment :D

> I'm not sure whether it should be done, and I wouldn't like to work on it
> in the near future, but the zend_string approach should be easier to
> implement than the separate IS_UNICODE + IS_STRING + IS_BINARY types in
> PHP6.

The implementation might be simpler, but the effect would be the same, I
think.

I could be wrong, but nothing has changed so drastically that it will allow
us to absorb the kind of impact I think you are talking about.

> > That doesn't seem like a sensible thing to want, at least right now.
> >
> > Having UString doesn't stop us approaching the problem differently in
> > the future, but it would have to be a very different future to even make
> > sense to me.
>
> Agree.
>
> > > If we don't care about a complete solution, the UString proposal may
> > > make sense at least as a faster replacement for ext/mbstring.
> >
> > As the RFC states, we are only approaching one problem, the problem that
> > ext/mbstring is not a good API.
>
> Then, it's fine.
>
> One note regarding the implementation: why do you use C++ for ustring.cpp?
> I understand it's necessary for the ICU backend, but if in the future you
> might switch to another backend (which may not require C++), why use C++
> for the PHP extension part?

Totally possible that we'll have to change, or that we should change. A
few people have said they would like to write a backend so we'll see
what comes in and where that leads us.

> Thanks. Dmitry.

Cheers
Joe

10 years ago by Lester Caine

> > That seems a lot like bashing our head against a wall. We tried to
> > introduce support everywhere and it fails. Do we really want to step on
> > the performance gains introduced by recent changes by making all strings
> > unicode ?
>
> Yeah :)
> I'm not sure whether it should be done, and I wouldn't like to work on it in
> the near future, but the zend_string approach should be easier to implement
> than the separate IS_UNICODE + IS_STRING + IS_BINARY types in PHP6.

Isn't this the first discussion?

If we are going down the route of keeping PHP7 as ASCII-only in the core,
then ustring probably makes sense, but it does not address many of the
areas where Unicode is really needed. Handling Unicode content outside
the core is working reasonably at the moment; it is problems such as
using Unicode keys for arrays that are the main area where Unicode is
needed in PHP7, so a more embedded handling is needed, which may cut
across yet another content wrapper?

--
Lester Caine - G8HFL

Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk

10 years ago by Rowan Collins

Lester Caine wrote (on 21/10/2014):

> If we are going down the route of keeping PHP7 as ASCII-only in the core,
> then ustring probably makes sense, but it does not address many of the
> areas where Unicode is really needed.

Just a quick point: most of the core is not ASCII. PHP strings are byte
strings, completely divorced from any encoding. A few native functions
assume ISO-8859-1 (or possibly Windows CP1252), but mostly they just
juggle whichever bytes you give them.

The main exception I can think of is that numbers are often handled
specially, with digits and separators as defined by ASCII. But since
we're talking UTF-8, that doesn't need to change.
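
To make the byte-string point concrete, here is a minimal sketch. It assumes ext/mbstring is loaded, and the sample string is invented for illustration; it only shows that the plain string functions count and cut bytes, with no notion of characters.

```php
<?php
// "café" with the "é" written as explicit UTF-8 bytes, so the encoding of
// this source file does not matter.
$s = "caf\xC3\xA9";

// strlen() counts bytes: the two bytes of "é" are counted separately.
$byteLength = strlen($s);

// The same bytes interpreted as UTF-8 code points (ext/mbstring).
$codePointLength = mb_strlen($s, 'UTF-8');

// substr() cuts at a byte offset and can split a multi-byte sequence,
// leaving a string that is no longer valid UTF-8.
$truncated = substr($s, 0, 4);
$truncatedIsValid = mb_check_encoding($truncated, 'UTF-8');

printf("%d bytes, %d code points, truncated valid: %s\n",
    $byteLength, $codePointLength, var_export($truncatedIsValid, true));
```

Nothing in the string itself records that it is UTF-8; the caller has to supply that knowledge each time, which is the gap the thread is discussing.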

> Handling Unicode content outside
> the core is working reasonably at the moment; it is problems such as
> using Unicode keys for arrays that are the main area where Unicode is
> needed in PHP7, so a more embedded handling is needed, which may cut
> across yet another content wrapper?

I do think this is an important thing to consider, though. If this
extension is genuinely just meant as a more modern and more performant
way of doing things which mbstring and intl can already do, that needs
to be clear in the way it's documented and publicised. If this gets
publicised as "better Unicode support", users are naturally going to
expect UString objects to start appearing in core, and in other
extensions, and be disappointed that it's still just a toolbox for their
own string handling.

--
Rowan Collins
[IMSoP]

10 years ago by Lester Caine

> Lester Caine wrote (on 21/10/2014):
> > If we are going down the route of keeping PHP7 as ASCII-only in the core,
> > then ustring probably makes sense, but it does not address many of the
> > areas where Unicode is really needed.
>
> Just a quick point: most of the core is not ASCII. PHP strings are byte
> strings, completely divorced from any encoding. A few native functions
> assume ISO-8859-1 (or possibly Windows CP1252), but mostly they just
> juggle whichever bytes you give them.
>
> The main exception I can think of is that numbers are often handled
> specially, with digits and separators as defined by ASCII. But since
> we're talking UTF-8, that doesn't need to change.

Pierre had proposed restricting that to ascii as a way of addressing the
inconsistencies that arise because some areas do not currently make a
distinction.

> > Handling Unicode content outside
> > the core is working reasonably at the moment; it is problems such as
> > using Unicode keys for arrays that are the main area where Unicode is
> > needed in PHP7, so a more embedded handling is needed, which may cut
> > across yet another content wrapper?
>
> I do think this is an important thing to consider, though. If this
> extension is genuinely just meant as a more modern and more performant
> way of doing things which mbstring and intl can already do, that needs
> to be clear in the way it's documented and publicised. If this gets
> publicised as "better Unicode support", users are naturally going to
> expect UString objects to start appearing in core, and in other
> extensions, and be disappointed that it's still just a toolbox for their
> own string handling.

This is where a proper discussion on just what is trying to be achieved
is important, before discussing tangents?

--
Lester Caine - G8HFL


10 years ago by Stas Malyshev

Hi!

> Just a quick point: most of the core is not ASCII. PHP strings are byte
> strings, completely divorced from any encoding. A few native functions
> assume ISO-8859-1 (or possibly Windows CP1252), but mostly they just
> juggle whichever bytes you give them.

True, but not all extensions and functions behave this way. Some
(especially with intl, but not only) assume it's utf-8, for example, and
for some utf-8 is a changeable default, which in practice often becomes
the used encoding since people are not aware of the need to track their
encoding and most of them do use utf-8 anyway.

> The main exception I can think of is that numbers are often handled
> specially, with digits and separators as defined by ASCII. But since
> we're talking UTF-8, that doesn't need to change.

A more interesting case actually is, well, case conversion. We unknowingly
used locale-dependent lowercasing routines until the inevitable
encounter with the dreaded Turkish 'i'. At which point we switched to
forced ASCII. So identifiers in the engine are kind of assumed to be
ASCII, even though you can sometimes sneak non-ASCII past it and it
will work, but weirdly.
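
As an illustration of what "forced ASCII" lowercasing means, here is a userland sketch. The function name ascii_lower() is hypothetical and this is not the engine's actual code; it only shows the idea of mapping A-Z to a-z and leaving every other byte alone, whatever locale is active.

```php
<?php
// Hypothetical helper: lowercase only the ASCII letters A-Z.
// strtr() with two equal-length strings is a byte-for-byte translation,
// so the result does not depend on setlocale().
function ascii_lower(string $s): string
{
    return strtr(
        $s,
        'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
        'abcdefghijklmnopqrstuvwxyz'
    );
}

// An ordinary identifier is folded as expected.
$plain = ascii_lower('MyClass_ID');

// U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) is the two bytes C4 B0 in
// UTF-8. Neither byte is in A-Z, so it passes through unchanged.
$turkish = ascii_lower("\xC4\xB0D");

echo $plain, "\n", bin2hex($turkish), "\n";
```

The second call shows the "works, but weirdly" effect: the non-ASCII letter is accepted but never case-folded, so two spellings that a Turkish speaker would consider the same identifier stay distinct.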

--
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/

10 years ago by Pierre Joye

hi,

> Hi Joe,
>
> As an extension it looks fine.
> I assume, you don't propose to use UString objects in engine and other
> extensions.
> Unfortunately, it's yet another incomplete solution.

I have to agree here.

As much as I like what has been done here, having UString as part of
the engine or at least main/ may help tighter integration. I am also
not sure about the driver approach (I have to double check it again, as I
stopped following it a couple of weeks ago). Having UString in the
core is a great thing anyway. However, there is no mention of whether it
should always be enabled or not. I think it should always be enabled,
providing the basic Unicode string features by default. Having ICU as
a default dependency is not really an issue imho.

We discussed that with Joe in the early UString days but we did not
agree, mainly because he likes to keep UString independent, unbloated,
etc. I think it is possible to keep it simple while having it tightly
integrated in the core. Advanced features can be done either in intl
or in userland (if we can avoid having every single project doing its
own unicode string class... that would keep the performance
improvement along other annoying APIs differences).

> It won't allow Unicode strings as array keys;
> concatenation using "." (probably may be done),
> no auto-conversion from/to script/output encoding,
> no auto-conversion of strings coming from database extensions, etc
>
> The "right" approach would be extending zend_string with "encoding" and
> then adapting nearly all functions working with zend_string to take
> "encoding" into account. But, of course, this is going to lead to a much
> more complicated solution (with some slowdown).

Fully agree here too.

> If we don't care about a complete solution, the UString proposal may make
> sense at least as a faster replacement for ext/mbstring.

I agree here too. For one I do care about a complete solution, for the
basic Unicode features, integrated with the language.

--
Pierre

@pierrejoye | http://www.libgd.org

10 years ago by Rowan Collins

Dmitry Stogov wrote on 21/10/2014 10:01:

> The "right" approach would be extending zend_string with "encoding" and
> then adapting nearly all functions working with zend_string to take
> "encoding" into account. But, of course, this is going to lead to a much
> more complicated solution (with some slowdown).

Isn't that kind of what ext/mbstring does?

I think that treating Unicode as nothing more than an encoding, and
trying to hide all its complexity from the user, is not particularly
wise. Unicode isn't just "ASCII, but bigger", so keeping the same API
but making the implementation "work" with more characters isn't really
"Unicode support".

For instance, what does "allowing Unicode strings as array keys"
actually mean? We already allow pretty much any sequence of bytes as an
array key, so what we're actually talking about is that array-handling
functions should be somehow "Unicode aware". In the case of sorting
functions, that means a mechanism for selecting a collation, even if you
know how the strings are encoded.
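
A small sketch of what "any sequence of bytes as an array key" looks like in practice. The sample strings are invented, and the code uses only core functions; it shows that two spellings of the same visible character are distinct keys, and that the default string sort orders by byte value.

```php
<?php
// Two byte sequences that both render as "é".
$composed   = "\xC3\xA9";   // U+00E9, one code point
$decomposed = "e\xCC\x81";  // U+0065 followed by U+0301 (combining acute)

// As array keys they are simply two different byte strings.
$map = [$composed => 'composed', $decomposed => 'decomposed'];
$keyCount = count($map);

// sort() with SORT_STRING compares byte by byte, so "éclair" (which starts
// with byte 0xC3) lands after "zebra".
$words = ["zebra", "\xC3\xA9clair", "apple"];
sort($words, SORT_STRING);

print_r($words);
```

Making array functions "Unicode aware" would therefore involve at least a normalisation choice for keys and a collation choice for sorting, neither of which is implied by knowing the encoding alone.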

There are a handful of operations which have an obvious meaning under
Unicode - strtoupper(), for instance. It might be nice if those worked
transparently with UStrings, but I don't think that really constitutes
"complete Unicode support" either.

I think we're going to keep going round in circles unless we can really
pin down what it means for a language to "support Unicode".

Rowan Collins
[IMSoP]

10 years ago by Andrea Faulds

> Dmitry Stogov wrote on 21/10/2014 10:01:
> > The "right" approach would be extending zend_string with "encoding" and
> > then adapting nearly all functions working with zend_string to take
> > "encoding" into account. But, of course, this is going to lead to a much
> > more complicated solution (with some slowdown).
>
> Isn't that kind of what ext/mbstring does?
>
> I think that treating Unicode as nothing more than an encoding, and trying to hide all its complexity from the user, is not particularly wise. Unicode isn't just "ASCII, but bigger", so keeping the same API but making the implementation "work" with more characters isn't really "Unicode support".

I’m inclined to agree here. Having an encoding-aware zend_string vs. having a Unicode-aware string aren’t quite the same. Certain string operations are only possible for certain encodings, and by supporting any encoding we risk making things confusing. I’d rather we convert everything to Unicode.

Andrea Faulds
http://ajf.me/

10 years ago by Robert Stoll

Hi Joe,

I have not devoted myself to Unicode and thus cannot give you feedback on your implementation.
Nevertheless, I was wondering whether string interpolation is still supported by your solution (I couldn't find anything in the RFC about it, but maybe you thought it was implicitly given).

Cheers,
Robert

-----Original Message-----
From: Joe Watkins [mailto:pthreads@pthreads.org]
Sent: Tuesday, 21 October 2014 09:07
To: internals@lists.php.net
Subject: [PHP-DEV] [RFC] UString

Morning internalz,

https://wiki.php.net/rfc/ustring

This is the result of work done by a few of us, we won't be opening any vote in a fortnight. We have a long time
before 7, there is no rush whatever.

Now seems like a good time to start the conversation so we can hash out the details, or get on with other things ;)

Cheers
Joe

10 years ago by Joe Watkins

> Hi Joe,
>
> I have not devoted myself to Unicode and thus cannot give you feedback on your implementation.
> Nevertheless, I was wondering whether string interpolation is still supported by your solution (I couldn't find anything in the RFC about it, but maybe you thought it was implicitly given).
>
> Cheers,
> Robert

I did think it was implied; the extension readme mentions casting. I'll
mention it in the RFC ...

Cheers
Joe

10 years ago by Matteo Beccati

> Morning internalz,
>
> https://wiki.php.net/rfc/ustring
>
> This is the result of work done by a few of us, we won't be opening any
> vote in a fortnight. We have a long time before 7, there is no rush
> whatever.
>
> Now seems like a good time to start the conversation so we can hash out
> the details, or get on with other things ;)

Nice job!

However, doesn't ICU use UTF-16 by default, which is undesirable as most
of the time it requires converting from and to UTF-8?
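
The conversion in question can be sketched from userland. This assumes ext/mbstring is loaded and uses an invented sample string; it does not show what the UString extension does internally, only the UTF-8 to UTF-16 round trip that any UTF-16-based backend would imply for UTF-8 input and output.

```php
<?php
// "café" as explicit UTF-8 bytes.
$utf8 = "caf\xC3\xA9";

// Convert to UTF-16 (little-endian, no byte order mark) and back again.
$utf16     = mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8');
$roundTrip = mb_convert_encoding($utf16, 'UTF-8', 'UTF-16LE');

// Four characters: 5 bytes in UTF-8, 4 two-byte code units in UTF-16.
$utf8Bytes  = strlen($utf8);
$utf16Bytes = strlen($utf16);

printf("UTF-8: %d bytes, UTF-16: %d bytes, lossless: %s\n",
    $utf8Bytes, $utf16Bytes, var_export($roundTrip === $utf8, true));
```

The round trip is lossless, so the concern raised here is about the cost of converting on the way in and out, not about correctness.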

Cheers

Matteo Beccati

Development & Consulting - http://www.beccati.com/

10 years ago by Christian Schneider

On 21.10.2014 at 09:06, Joe Watkins pthreads@pthreads.org wrote:

> https://wiki.php.net/rfc/ustring
>
> This is the result of work done by a few of us, we won't be opening any
> vote in a fortnight. We have a long time before 7, there is no rush
> whatever.

I have one concern I want to bring up: The RFC proposes a helper function u() to generate UStrings.

As this is a very handy function name for all sorts of utility functions (as a matter of fact, we use it to create and sanitize URL strings to be embedded into HTML), I would assume that more than one project has a name clash there.

Maybe something like _u() could be used instead? Or do you have better alternatives for this?

PS: UString is also in the global name space but should be less of a problem I'd imagine.

Regards,

Chris

10 years ago by Michael Wallner

> On 21.10.2014 at 09:06, Joe Watkins pthreads@pthreads.org wrote:
> > https://wiki.php.net/rfc/ustring
> >
> > This is the result of work done by a few of us, we won't be opening any
> > vote in a fortnight. We have a long time before 7, there is no rush
> > whatever.
>
> I have one concern I want to bring up: The RFC proposes a helper function
> u() to generate UStrings.
>
> As this is a very handy function name for all sorts of utility functions
> (as a matter of fact, we use it to create and sanitize URL strings to be
> embedded into HTML), I would assume that more than one project has a name
> clash there.
>
> Maybe something like _u() could be used instead? Or do you have better
> alternatives for this?
>
> PS: UString is also in the global name space but should be less of a
> problem I'd imagine.

With the "use function" support, that could be located in a namespace.

But something else: wasn't there a big concern in another thread regarding
codepoint/grapheme support, like with $ustring->length()?

--
Regards,
Mike

10 years ago by Andrea Faulds

> I have one concern I want to bring up: The RFC proposes a helper function u() to generate UStrings.
>
> As this is a very handy function name for all sorts of utility functions (as a matter of fact, we use it to create and sanitize URL strings to be embedded into HTML), I would assume that more than one project has a name clash there.
>
> Maybe something like _u() could be used instead? Or do you have better alternatives for this?
>
> PS: UString is also in the global name space but should be less of a problem I'd imagine.

I think we should reserve some way to do Unicode strings. I’d want u"foo", but we’re not adding literals, so u("foo") it is.

Also, bear in mind that namespaces mean you can still have your own u() if it’s in your namespace (\u).
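
A minimal sketch of the namespace point, assuming a project moves its own helper into a namespace. The namespace App\Html and the body of this u() are hypothetical stand-ins for the URL helper described earlier in the thread; no UString extension is needed to run it.

```php
<?php
namespace App\Html;

// A project's own u() helper (hypothetical): percent-encodes a URL fragment.
function u(string $url): string
{
    return rawurlencode($url);
}

// Inside this namespace an unqualified call resolves to App\Html\u() first.
$local = u('a b');

// The function is registered under its qualified name only, so a global u()
// provided by an extension would be a separate symbol, reachable as \u().
$qualifiedExists = \function_exists('App\Html\u');
$globalExists    = \function_exists('u');

echo $local, "\n";
```

Code in other files can pull the project helper in with `use function App\Html\u;`, which is the "use function" route mentioned in the thread.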

Andrea Faulds
http://ajf.me/

10 years ago by Andrea Faulds

So, one thing which I think is worth bringing up is code points vs. characters/graphemes.

This came up in another recent thread about Unicode on internals. While code-point manipulation is all well and good, we also need grapheme manipulation functions. Could we add these? That would make the API more useful.

On that note, ->charAt ought to be ->codepointAt to avoid being misleading.
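
The difference between the two units can be shown with existing functions. This sketch assumes ext/mbstring is loaded and relies on PCRE's \X escape, which matches one extended grapheme cluster; the sample string is invented and nothing here is part of the proposed UString API.

```php
<?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT: displayed as a single "é".
$s = "e\u{0301}";

// Three bytes in UTF-8.
$bytes = strlen($s);

// Two code points.
$codePoints = mb_strlen($s, 'UTF-8');

// One user-perceived character: count the \X matches in UTF-8 mode.
$graphemes = preg_match_all('/\X/u', $s);

printf("%d bytes, %d code points, %d grapheme\n", $bytes, $codePoints, $graphemes);
```

A method named charAt() that indexes by code point would return the bare "e" or the bare accent here, which is why the more explicit name is suggested.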

Andrea Faulds
http://ajf.me/

10 years ago by Sara Golemon

> Morning internalz,
>
> https://wiki.php.net/rfc/ustring
>
> This is the result of work done by a few of us, we won't be opening any
> vote in a fortnight. We have a long time before 7, there is no rush
> whatever.

The backend abstraction seems overengineered to me. It could also lead to inconsistencies in behavior if ICU and Windows implement something in subtly different ways.

Since we're linking ICU for the rest of the intl extension anyway, it seems to me like we should just focus on it as an ICU wrapper.

Also, I'd propose a minor amendment to this RFC that other intl classes be extended to support taking UString instances as arguments (avoiding the implicit conversion to UTF8). That work doesn't have to gate adoption of the base implementation, it'd just be useful to decide at the same time if we want to do so.

-Sara

10 years ago by Joe Watkins

> > Morning internalz,
> >
> > https://wiki.php.net/rfc/ustring
> >
> > This is the result of work done by a few of us, we won't be opening any
> > vote in a fortnight. We have a long time before 7, there is no rush
> > whatever.
>
> The backend abstraction seems overengineered to me. It could also lead to inconsistencies in behavior if ICU and Windows implement something in subtly different ways.
>
> Since we're linking ICU for the rest of the intl extension anyway, it seems to me like we should just focus on it as an ICU wrapper.
>
> Also, I'd propose a minor amendment to this RFC that other intl classes be extended to support taking UString instances as arguments (avoiding the implicit conversion to UTF8). That work doesn't have to gate adoption of the base implementation, it'd just be useful to decide at the same time if we want to do so.
>
> -Sara

Actually I agree, I just needed a few people to say WTF.

Backend gone, we are gonna use ICU, rfc/ext updated.

INTL is still an open question yeah, preference noted.

Cheers
Joe

10 years ago by Stas Malyshev

Hi!

> https://wiki.php.net/rfc/ustring
>
> This is the result of work done by a few of us, we won't be opening any
> vote in a fortnight. We have a long time before 7, there is no rush
> whatever.

Couple of thoughts:

I like the idea of having a unicode string class. May be a way to
figure out the right way to do it without messing up the whole core.

I wish there were more description of which API this class provides.
If it's planned to be direct copy of UnicodeString, some of the
operations there are not how PHP strings usually work (i.e. in-place
modification) and it's not really enough to make it useful - e.g. what
if I need to do regexps on it, for example? Or does it cover whole
mbstring API too? What about something mbstring doesn't cover, like
ucfirst or strrev?

Do we really need different encodings, different backends and so on,
internally? Note that each backend has its own quirks, limitations and
bugs, and there's nothing worse than dealing with unpredictable set of
dependencies. The user cares what they send into the class and what
comes out, but very rarely they care what happens inside - why not just
do it one way everywhere?

--
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/

10 years ago by Joe Watkins

> Hi!
>
> > https://wiki.php.net/rfc/ustring
> >
> > This is the result of work done by a few of us, we won't be opening any
> > vote in a fortnight. We have a long time before 7, there is no rush
> > whatever.
>
> Couple of thoughts:
>
> I like the idea of having a unicode string class. May be a way to
> figure out the right way to do it without messing up the whole core.
>
> I wish there were more description of which API this class provides.
> If it's planned to be direct copy of UnicodeString, some of the
> operations there are not how PHP strings usually work (i.e. in-place
> modification) and it's not really enough to make it useful - e.g. what
> if I need to do regexps on it, for example? Or does it cover whole
> mbstring API too? What about something mbstring doesn't cover, like
> ucfirst or strrev?

API on github in readme.

Regexp not covered yet, ICU has a nicer Matcher/Pattern API like Java's,
I'm not sure what to do there, an ICU based API could certainly be
introduced.

> Do we really need different encodings, different backends and so on,
> internally? Note that each backend has its own quirks, limitations and
> bugs, and there's nothing worse than dealing with unpredictable set of
> dependencies. The user cares what they send into the class and what
> comes out, but very rarely they care what happens inside - why not just
> do it one way everywhere?

No, actually, I don't think we do. It was over complicating something
simple, so I removed the backend abstraction and will work towards
solving the rest too.

We'll use ICU, because it is battle tested like nothing else and keeps
everything simple ... it doesn't make sense to introduce a possibly
unstable and, as you rightly say, different API with its own quirks.

Cheers
Joe

10 years ago by Rowan Collins

> Morning internalz,
>
> https://wiki.php.net/rfc/ustring
>
> This is the result of work done by a few of us, we won't be opening any
> vote in a fortnight. We have a long time before 7, there is no rush
> whatever.
>
> Now seems like a good time to start the conversation so we can hash out
> the details, or get on with other things ;)
>
> Cheers
> Joe

I think this looks like a really great start at creating something
actually useful, rather than getting stuck at the drawing board. I like
that the scope is quite small initially - where does the "single
responsibility" of a class that represents a string end, anyway? :)

A few opinions:

Global / static defaults are bad.

The existence of the setDefaultCodepage method feels like an
anti-pattern to me. It means libraries can't rely on this class working
the same way in two different host environments, or even at two
re-entries in the same program. Effectively, if you don't know what the
second argument to the constructor will default to, you can't actually
treat it as optional unless you're writing monolithic code. This is a
common pattern in PHP, but http_build_query() would be so much more
pleasant if I could safely call it with 1 argument instead of 3.

I think the default should be hard-coded to UTF-8, which according to
previous discussion is always the default output encoding, so would
mean this would always work:

    $aUString = new UString( (string)$aUString );

Any other encoding will be dependent on, and known from, the context
where the object is created - if grabbing data from an HTTP request, a
header should tell them; if from a database, a connection parameter; and
so on.

The only case I can see where a default encoding would be sensible would
be where source code itself is in a different encoding, so that
u('literal string') works as expected. I guess if we ever went down the
route of special literal syntax like u'literal string', the declared
source encoding could be used.

Actually, the u() shortcut function appears to be missing the encoding
parameter completely; is this deliberate?

Clarify relationship to a "byte string"

Most of the API acts like this is an abstract object representing a
bunch of Unicode code points. As such, I'm not sure what getCodepage()
does - a code page (or more properly encoding) is a property of a stream
of bytes, so has no meaning in this context, surely? The internal
implementation could use UTF-8, UTF-16, or some made-up encoding (like
Perl6's "NFG" system) and the user should never need to know (other than
to understand performance implications).

On the other hand, when you do want a stream of bytes, the class
doesn't seem to have an explicit way to get one. The (currently
undocumented) behaviour is apparently to spit out UTF-8 if cast to a
string, but it would be nice to have an explicit function which could be
passed a parameter in order to serialise to, say, UTF-16, instead.

The Grapheme Question

This has been raised a few times, so I won't labour the point, just
mention my current thinking.

Unicode is complicated. Partly, that's because of a series of
compromises in its design; but partly, it's because writing systems are
complicated, and Unicode tries harder than most previous systems to
acknowledge that. So, there's a tradeoff to be made between giving users
what they think they need, thus hiding the messy details, and giving
users the power to do things right, in a more complex way.

There is also a namespace mess if you insist on every function and
property having to declare what level of abstraction it's talking about,
e.g. $codePointLength instead of $length.

An idea I've been toying with is rather than having one class
representing the slippery notion of "a Unicode string", having (at
least) two, closely tied, classes: CodePointString (roughly = UString
right now) and GraphemeString (a higher level abstraction tied to the
same internal representation).

I intend to mock this up as a set of interfaces at some point, but the
basic idea is that you could write this:

// Get an abstract object from a byte string, probably a GraphemeString,
// parsing the input as UTF-8
$str = u('some text');
// Perform an operation that explicitly deals in Code Points
$str = $str->asCodePoints()->normalise('NFC');
// Get information using a higher level of abstraction
$length = $str->asGraphemes()->length;
// Perform a high-level mutation, then convert right back to a concrete
// string of bytes
echo $str->asGraphemes()->reverse()->asByteString('UTF-16');

Calling asGraphemes() on a GraphemeString or asCodePoints() on a
CodePointString would be legal but a no-op, so it would be safe to
accept both as input to a function, then switch to whichever level the
task required.
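
A minimal sketch of how the two views might share one representation, assuming ext/mbstring and ext/intl are loaded. AbstractText, CodePointString, GraphemeString and their methods are hypothetical names taken from the idea above; none of them exist in PHP or in the RFC, and length() is a method here rather than the property used in the example.

```php
<?php
// Hypothetical sketch: none of these classes exist in PHP or in the RFC.
abstract class AbstractText
{
    /** @var string UTF-8 bytes shared by both views */
    protected $utf8;

    public function __construct(string $utf8)
    {
        $this->utf8 = $utf8;
    }

    public function asCodePoints(): CodePointString
    {
        // No-op when the object is already a code point view.
        return $this instanceof CodePointString ? $this : new CodePointString($this->utf8);
    }

    public function asGraphemes(): GraphemeString
    {
        // No-op when the object is already a grapheme view.
        return $this instanceof GraphemeString ? $this : new GraphemeString($this->utf8);
    }

    public function asByteString(string $encoding = 'UTF-8'): string
    {
        return mb_convert_encoding($this->utf8, $encoding, 'UTF-8');
    }

    abstract public function length(): int;
}

final class CodePointString extends AbstractText
{
    public function length(): int
    {
        return mb_strlen($this->utf8, 'UTF-8');
    }
}

final class GraphemeString extends AbstractText
{
    public function length(): int
    {
        return grapheme_strlen($this->utf8);
    }
}

// "e" followed by U+0301 COMBINING ACUTE ACCENT.
$str = new GraphemeString("e\u{0301}");
echo $str->asCodePoints()->length(), ' code points, ', $str->asGraphemes()->length(), " grapheme\n";
```

Both views wrap the same UTF-8 bytes, so switching between them costs an object allocation at most, and only the definition of "length" changes.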

I'm not sure if this finds a good balance between complexity and
user-friendliness, and would welcome anyone's thoughts.

--
Rowan Collins
[IMSoP]

10 years ago by Andrea Faulds

The only case I can see where a default encoding would be sensible would be where source code itself is in a different encoding, so that u('literal string') works as expected.

This is only a good idea if we can somehow make it file-local. Otherwise if one library uses Latin-1 and another uses UTF-8 for some reason, bang!

Clarify relationship to a "byte string"

Most of the API acts like this is an abstract object representing a bunch of Unicode code points. As such, I'm not sure what getCodepage() does - a code page (or more properly encoding) is a property of a stream of bytes, so has no meaning in this context, surely? The internal implementation could use UTF-8, UTF-16, or some made-up encoding (like Perl6's "NFG" system) and the user should never need to know (other than to understand performance implications).

On the other hand, when you do want a stream of bytes, the class doesn't seem to have an explicit way to get one. The (currently undocumented) behaviour is apparently to spit out UTF-8 if cast to a string, but it would be nice to have an explicit function which could be passed a parameter in order to serialise to, say, UTF-16, instead.

I agree on both these points. ->toBytes or ->encode with an explicit charset parameter would be good. I don’t see the point of getCodepage().
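
For illustration, an explicit serialisation step could look roughly like the sketch below. toBytes() is a hypothetical free function standing in for the suggested method, it assumes the text is held as UTF-8, and it relies on mb_convert_encoding() from ext/mbstring.

```php
<?php
// Hypothetical helper: serialise UTF-8 text into an explicitly named encoding.
function toBytes(string $utf8, string $charset): string
{
    return mb_convert_encoding($utf8, $charset, 'UTF-8');
}

$text  = "h\u{00E9}llo";              // "héllo", written with an escape so the file encoding does not matter
$utf16 = toBytes($text, 'UTF-16BE');  // big-endian UTF-16, no byte order mark

echo strlen($text), ' bytes as UTF-8, ', strlen($utf16), " bytes as UTF-16BE\n";
```

The caller names the target encoding at the point of serialisation, so nothing depends on a global default.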

The Grapheme Question

This has been raised a few times, so I won't labour the point, just mention my current thinking.

Unicode is complicated. Partly, that's because of a series of compromises in its design; but partly, it's because writing systems are complicated, and Unicode tries harder than most previous systems to acknowledge that. So, there's a tradeoff to be made between giving users what they think they need, thus hiding the messy details, and giving users the power to do things right, in a more complex way.

There is also a namespace mess if you insist on every function and property having to declare what level of abstraction it's talking about - e.g. $codePointLength instead of $length.

An idea I've been toying with is rather than having one class representing the slippery notion of "a Unicode string", having (at least) two, closely tied, classes: CodePointString (roughly = UString right now) and GraphemeString (a higher level abstraction tied to the same internal representation).

I intend to mock this up as a set of interfaces at some point, but the basic idea is that you could write this:

// Get an abstract object from a byte string, probably a GraphemeString, parsing the input as UTF-8
$str = u('some text');
// Perform an operation that explicitly deals in Code Points
$str = $str->asCodePoints()->normalise('NFC');
// Get information using a higher level of abstraction
$length = $str->asGraphemes()->length;
// Perform a high-level mutation, then convert right back to a concrete string of bytes
echo $str->asGraphemes()->reverse()->asByteString('UTF-16');

Calling asGraphemes() on a GraphemeString or asCodePoints() on a CodePointString would be legal but a no-op, so it would be safe to accept both as input to a function, then switch to whichever level the task required.

I'm not sure if this finds a good balance between complexity and user-friendliness, and would welcome anyone's thoughts.

I’d rather have some grapheme-specific functions and some code point functions on the same class. Make array-like indexing with [] be by code points as you may be able to do that in constant time, and because there might be multiple approaches to choosing graphemes. Have ->codepointAt(), but also ->nthGrapheme() or something like it. There’s no need for grapheme versions of all functions, but others would need them.

Though your approach has its own merits.
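
To make the difference between the two kinds of indexing concrete, here is a small sketch using functions that already exist in ext/mbstring and ext/intl. It is only an approximation of what ->codepointAt() and ->nthGrapheme() might return; those method names are suggestions from this thread, not an existing API.

```php
<?php
// "n", then "e" + U+0301 COMBINING ACUTE ACCENT, then "e".
$text = "ne\u{0301}e";

// Index 1 by code point: the bare base letter, without its accent.
$codePoint = mb_substr($text, 1, 1, 'UTF-8');

// Index 1 by grapheme cluster: the base letter together with its accent.
$grapheme = grapheme_substr($text, 1, 1);

var_dump($codePoint === 'e', $grapheme === "e\u{0301}");
```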

Andrea Faulds
http://ajf.me/

10 years ago by Rowan Collins

On 21 Oct 2014, at 21:42, Rowan Collins rowan.collins@gmail.com
wrote:

The only case I can see where a default encoding would be sensible
would be where source code itself is in a different encoding, so that
u('literal string') works as expected.

This is only a good idea if we can somehow make it file-local.
Otherwise if one library uses Latin-1 and another uses UTF-8 for some
reason, bang!

Yes, I used the word "declared" advisedly, because I was thinking it could take its default encoding (if we were to go down the route of special literal syntax rather than wrapper-function) from the existing declare(encoding='...') directive, rather than a global variable or setting.

http://php.net/manual/en/control-structures.declare.php#control-structures.declare.encoding

10 years ago by Joe Watkins

Morning internalz,
https://wiki.php.net/rfc/ustring

This is the result of work done by a few of us, we won't be opening any
vote in a fortnight. We have a long time before 7, there is no rush
whatever.
Now seems like a good time to start the conversation so we can hash out
the details, or get on with other things ;)

Cheers
Joe

I think this looks like a really great start at creating something
actually useful, rather than getting stuck at the drawing board. I like
that the scope is quite small initially - where does the "single
responsibility" of a class that represents a string end, anyway? :)

A few opinions:

Global / static defaults are bad.

The existence of the setDefaultCodepage method feels like an
anti-pattern to me. It means libraries can't rely on this class working
the same way in two different host environments, or even at two
re-entries in the same program. Effectively, if you don't know what the
second argument to the constructor will default to, you can't actually
treat it as optional unless you're writing monolithic code. This is a
common pattern in PHP, but http_build_query() would be so much more
pleasant if I could safely call it with 1 argument instead of 3.

I think the default should be hard-coded to UTF-8, which according to
previous discussion is always the default output encoding, so would
mean this would always work: $aUString = new UString( (string)$aUString
); Any other encoding will be dependent on, and known from, the context
where the object is created - if grabbing data from an HTTP request, a
header should tell them; if from a database, a connection parameter; and
so on.

Could be true, it feels quite horrible to me today too, I think someone
else suggested it, but it might have been me.

I'll look at doing something about that ...
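
The http_build_query() complaint quoted above can be illustrated with a short sketch. The default separator is taken from the arg_separator.output ini setting, so portable code tends to pass all three arguments; the parameter values here are made up for the example.

```php
<?php
$params = ['q' => 'ustring', 'page' => 2];

// Relies on the arg_separator.output ini setting of the host environment.
$implicit = http_build_query($params);

// Pins the separator explicitly, so the result is the same everywhere.
$explicit = http_build_query($params, '', '&');

echo $explicit, "\n";
```

A hard-coded UTF-8 default for the UString constructor would avoid the same trap: the one-argument call would mean the same thing in every environment.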

The only case I can see where a default encoding would be sensible would
be where source code itself is in a different encoding, so that
u('literal string') works as expected. I guess if we ever went down the
route of special literal syntax like u'literal string', the declared
source encoding could be used.

Actually, the u() shortcut function appears to be missing the encoding
parameter completely; is this deliberate?

Fixed that.

Clarify relationship to a "byte string"

Most of the API acts like this is an abstract object representing a
bunch of Unicode code points. As such, I'm not sure what getCodepage()
does - a code page (or more properly encoding) is a property of a stream
of bytes, so has no meaning in this context, surely? The internal
implementation could use UTF-8, UTF-16, or some made-up encoding (like
Perl6's "NFG" system) and the user should never need to know (other than
to understand performance implications).

On the other hand, when you do want a stream of bytes, the class
doesn't seem to have an explicit way to get one. The (currently
undocumented) behaviour is apparently to spit out UTF-8 if cast to a
string, but it would be nice to have an explicit function which could be
passed a parameter in order to serialise to, say, UTF-16, instead.

I reused the terminology used by ICU, it made sense in their
documentation.

So we want a ::getBytes or something like that ... I'll do that ...

The Grapheme Question

This has been raised a few times, so I won't labour the point, just
mention my current thinking.

Unicode is complicated. Partly, that's because of a series of
compromises in its design; but partly, it's because writing systems are
complicated, and Unicode tries harder than most previous systems to
acknowledge that. So, there's a tradeoff to be made between giving users
what they think they need, thus hiding the messy details, and giving
users the power to do things right, in a more complex way.

There is also a namespace mess if you insist on every function and
property having to declare what level of abstraction it's talking about
- e.g. $codePointLength instead of $length.

An idea I've been toying with is rather than having one class
representing the slippery notion of "a Unicode string", having (at
least) two, closely tied, classes: CodePointString (roughly = UString
right now) and GraphemeString (a higher level abstraction tied to the
same internal representation).

I intend to mock this up as a set of interfaces at some point, but the
basic idea is that you could write this:

// Get an abstract object from a byte string, probably a GraphemeString,
// parsing the input as UTF-8
$str = u('some text');
// Perform an operation that explicitly deals in Code Points
$str = $str->asCodePoints()->normalise('NFC');
// Get information using a higher level of abstraction
$length = $str->asGraphemes()->length;
// Perform a high-level mutation, then convert right back to a concrete
// string of bytes
echo $str->asGraphemes()->reverse()->asByteString('UTF-16');

Calling asGraphemes() on a GraphemeString or asCodePoints() on a
CodePointString would be legal but a no-op, so it would be safe to
accept both as input to a function, then switch to whichever level the
task required.

I'm not sure if this finds a good balance between complexity and
user-friendliness, and would welcome anyone's thoughts.

I'd rather higher level stuff existed at a higher level. I'd rather
solve for ustring the problems that are solved for normal strings and
leave the rest up to whatever the framework/component/library wants
to do.

--
Rowan Collins
[IMSoP]

10 years ago by Rowan Collins

Joe Watkins wrote on 23/10/2014 09:18:

I'd rather higher level stuff existed at a higher level. I'd rather
solve for ustring the problems that are solved for normal strings and
leave the rest up to whatever the framework/component/library wants
to do.

It's not really higher level in terms of the problem being solved, it's
the same functions applied to a higher abstraction of what "string"
means. It doesn't make much sense to say that u($foo)->length "solves
the same problem as" strlen($foo), but grapheme_strlen($foo) is somehow
"higher level". They're three different definitions of the word "length"
which can be applied to the same string, and it would be nice if they
were all accessible through the same API.

I get the feeling people are thinking of grapheme functions as something
exotic and hard to implement, but ext/intl seems to have a very
straight-forward set of functions for them:
http://php.net/manual/en/ref.intl.grapheme.php
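
As a quick illustration of the three definitions of "length", here is a sketch using only existing functions, assuming ext/mbstring and ext/intl are loaded. mb_strlen() stands in for the code point count that u($foo)->length would report.

```php
<?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT: one user-perceived character.
$text = "e\u{0301}";

$bytes      = strlen($text);              // UTF-8 bytes
$codePoints = mb_strlen($text, 'UTF-8');  // Unicode code points
$graphemes  = grapheme_strlen($text);     // grapheme clusters

printf("%d bytes, %d code points, %d grapheme(s)\n", $bytes, $codePoints, $graphemes);
// prints "3 bytes, 2 code points, 1 grapheme(s)"
```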

The two-interfaces idea was just to get over the naming problem of
prefixing everything with codePointX or graphemeX, and wouldn't actually
require a separate data structure under the hood.

Rowan Collins
[IMSoP]

10 years ago by Nikita Popov

Morning internalz,
    https://wiki.php.net/rfc/ustring

    This is the result of work done by a few of us, we won't be opening any
vote in a fortnight. We have a long time before 7, there is no rush
whatever.
    Now seems like a good time to start the conversation so we can hash out
the details, or get on with other things ;)

I'm not totally convinced by this proposal. We already have quite a number
of extensions that deal with unicode text in one way or another (at least
intl, mbstring and iconv). This adds yet another way of dealing with this
issue - a way that will have to be combined with at least two other
extensions (mbstring or iconv for input handling and conversion) and intl
for any non-trivial operations. There's nothing wrong with adding another
approach for unicode handling per se, but I'd like to have more emphasis on
how this integrates with existing functionality and why it is implemented
separately from it (especially intl), etc.

On a more general note, I'd appreciate it if RFCs proposing the inclusion
of extensions moved more of their content into the actual RFC, as opposed
to being thin wrappers around the extension README/docs. We had this issue
with the pecl_http RFC and the same applies here. I think the suggested API
is a pretty important aspect of the proposal and as such should be included
in the RFC and maybe also commented a bit ;)

Nikita

10 years ago by Pierre Joye

Morning internalz,
    https://wiki.php.net/rfc/ustring

    This is the result of work done by a few of us, we won't be opening any
vote in a fortnight. We have a long time before 7, there is no rush
whatever.
    Now seems like a good time to start the conversation so we can hash out
the details, or get on with other things ;)

I'm not totally convinced by this proposal. We already have quite a number
of extensions that deal with unicode text in one way or another (at least
intl, mbstring and iconv). This adds yet another way of dealing with this
issue - a way that will have to be combined with at least two other
extensions (mbstring or iconv for input handling and conversion) and intl
for any non-trivial operations. There's nothing wrong with adding another
approach for unicode handling per se, but I'd like to have more emphasis on
how this integrates with existing functionality and why it is implemented
separately from it (especially intl), etc.

On a more general note, I'd appreciate it if RFCs proposing the inclusion
of extensions moved more of their content into the actual RFC, as opposed
to being thin wrappers around the extension README/docs. We had this issue
with the pecl_http RFC and the same applies here. I think the suggested API
is a pretty important aspect of the proposal and as such should be included
in the RFC and maybe also commented a bit ;)

Full ack. Both paragraphs.

As of now, and given that the previous discussions pointed out the same
issues (minus the RFC one, which is important, but a detail), I am also
not convinced this is the way to tackle Unicode text support. It should
either be part of intl (with a separate RFC proposing to always enable
intl for 7) or part of main. Main has the advantage of providing easier
integration with other extensions.

Cheers,

Pierre

@pierrejoye | http://www.libgd.org

10 years ago by Sara Golemon

    https://wiki.php.net/rfc/ustring

    This is the result of work done by a few of us, we won't be opening any
vote in a fortnight. We have a long time before 7, there is no rush
whatever.
    Now seems like a good time to start the conversation so we can hash out
the details, or get on with other things ;)

Curious what the current state of the UString RFC is. I've got a
functionality request for HHVM to wrap icu::UnicodeString and was
hoping to match PHP behavior if any plans had been made, and lo...
here's a plan!

I'm not totally convinced by this proposal. We already have quite a number
of extensions that deal with unicode text in one way or another (at least
intl, mbstring and iconv). This adds yet another way of dealing with this
issue - a way that will have to be combined with at least two other
extensions (mbstring or iconv for input handling and conversion) and intl
for any non-trivial operations. There's nothing wrong with adding another
approach for unicode handling per se, but I'd like to have more emphasis on
how this integrates with existing functionality and why it is implemented
separately from it (especially intl), etc.

I think (hope) that Joe's intention was to make it as an extension for
proof of concept, but make it part of the intl extension when it comes
to full adoption by the runtime. If not, let's talk about making that
the intent, because intl is where this belongs.

For my bikeshedding part, I'd recommend against the u() function
helper as it pollutes the global function namespace and takes a very
fundamental name. intl\u() might be worth considering, but we'll need
to have a discussion about namespacing for the intl extension as a
whole (separate topic).

I'd also recommend "IntlString" rather than "UString" as nearly all
the Intl classes follow this convention. The one notable exception
being UConverter (which yes, I added, and I regret the departure in
naming).

Otherwise, while there's room to quibble about specific API names and
arguments, the general concept seems a no-brainer.

-Sara

10 years ago by Joe Watkins

Morning Sara,

Curious what the current state of the UString RFC is. I've got a
functionality request for HHVM to wrap icu::UnicodeString and was
hoping to match PHP behavior if any plans had been made, and lo...
here's a plan!

I was (semi) convinced by Dmitry that the superior implementation is one
for Zend, so I backed off ...

I think (hope) that Joe's intention was to make it as an extension for
proof of concept, but make it part of the intl extension when it comes
to full adoption by the runtime. If not, let's talk about making that
the intent, because intl is where this belongs.

The folder the source code is in makes no nevermind; the real issue with
integration is changing all of intl, and lots of other stuff, to accept
UString, since casting to a basic type, while acceptable for simple tests,
would get extremely wasteful for an application of any complexity.

Another possible issue is engine integration:

$string = (UString) $someString;
$string = (UString) "someString";

These aren't very different to 'new UString', but for an integrated
solution, kind of expected to work.

I don't know what the solutions are to these problems, I'm all ears ...

Cheers
Joe

10 years ago by Andreas Heigl

Hi Joe.

On 01.07.15 at 07:36, Joe Watkins wrote:

[..]

Another possible issue is engine integration:
$string = (UString) $someString;
$string = (UString) "someString";
These aren't very different to 'new UString', but for an integrated
solution, kind of expected to work.

Why would that be expected behaviour? I mean I can't do

$date = (DateTime) $timestring;

after all, can I? But I can use

$date = new DateTime($timestring);

Just my 2 cents.

Cheers

Andreas

                                                          ,,,
                                                         (o o)

+---------------------------------------------------------ooO-(_)-Ooo-+
| Andreas Heigl |
| mailto:andreas@heigl.org N 50°22'59.5" E 08°23'58" |
| http://andreas.heigl.org http://hei.gl/wiFKy7 |
+---------------------------------------------------------------------+
| http://hei.gl/root-ca |
+---------------------------------------------------------------------+

10 years ago by Joe Watkins

Morning,

Why would that be expected behaviour? I mean I can't do

$date = (DateTime) $timestring;

No, but you can't do this either:

$string = (string) $datetime;

But can do:

$string = (string) $ustring;

Where $ustring is instanceof UString.

Even if you never write $string = (string) $ustring, the engine will
perform the same action all the time, whenever you pass a UString to
anything expecting string.

It feels like a complete implementation should support both casts.
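
A rough sketch of the existing direction, for reference. Text is a hypothetical stand-in for UString that models only the string conversion; the point is that the engine calls __toString() both for an explicit cast and when the object is passed where a string is expected (in the default, non-strict typing mode).

```php
<?php
// Hypothetical stand-in for UString; only the string conversion is modelled.
final class Text
{
    private $bytes;

    public function __construct(string $bytes)
    {
        $this->bytes = $bytes;
    }

    public function __toString(): string
    {
        return $this->bytes;
    }
}

$text = new Text('abc');

echo (string) $text, "\n";  // explicit cast
echo strlen($text), "\n";   // implicit conversion by the engine
```

There is no equivalent hook for the opposite direction, which is why (UString) $someString would need engine support.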

Cheers
Joe

10 years ago by Sara Golemon

Another possible issue is engine integration:
$string = (UString) $someString;
$string = (UString) "someString";

That sounds like a cool idea to discuss as a completely separate,
unrelated RFC, and not specific to UString.

e.g. $obj = (ClassName)$arg; /* turns into */ $obj = new ClassName($arg);

So you could use casting with any class which supports single-argument
constructors.

But again, orthogonal to this RFC.

-Sara

10 years ago by Aaron Piotrowski

Another possible issue is engine integration:

$string = (UString) $someString;
$string = (UString) "someString";

That sounds like a cool idea to discuss as a completely separate,
unrelated RFC, and not specific to UString.

e.g. $obj = (ClassName)$arg; /* turns into */ $obj = new ClassName($arg);

So you could use casting with any class which supports single-argument
constructors.

But again, orthogonal to this RFC.

-Sara

Expanding on this idea, a separate RFC could propose a magic __cast($value) static method that would be called for code like below:

$obj = (ClassName) $scalarOrObject; // Invokes ClassName::__cast($scalarOrObject);

This would allow UString to implement casting a string to a UString and allow users to implement such behavior with their own classes.

However, I would not implement such casting syntax for UString only. Being able to write $ustring = (UString) $string; without the ability to do so for other classes would be unusual and confusing in my opinion. If an RFC adding such behavior was implemented, UString could be updated to support casting.

Obviously a UString should be able to be cast to a scalar string using (string) $ustring. If performance is a concern, UString::__toString() should cache the result so multiple casts to the same object are quick.
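
A minimal sketch of the caching idea, assuming ext/mbstring. CachedText is a hypothetical class, and the choice of UTF-16BE as the internal form is only an assumption made so that the conversion has some cost; the public counter exists purely to make the caching observable.

```php
<?php
// Hypothetical class: keeps text as UTF-16BE internally and caches its UTF-8 form.
final class CachedText
{
    private $utf16;
    private $utf8 = null;

    /** @var int number of real conversions performed, exposed for demonstration */
    public $conversions = 0;

    public function __construct(string $utf8)
    {
        $this->utf16 = mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8');
    }

    public function __toString(): string
    {
        if ($this->utf8 === null) {
            // Convert once; later casts reuse the cached bytes.
            $this->utf8 = mb_convert_encoding($this->utf16, 'UTF-8', 'UTF-16BE');
            $this->conversions++;
        }
        return $this->utf8;
    }
}

$text   = new CachedText("caf\u{00E9}");
$first  = (string) $text;
$second = (string) $text;

echo $text->conversions, " conversion(s) for two casts\n";
```

The cache is only safe this simply if the object is immutable; a mutable string would have to invalidate it on every change.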

Aaron Piotrowski

10 years ago by Anatol Belski

Hi,

-----Original Message-----
From: Aaron Piotrowski [mailto:aaron@icicle.io]
Sent: Wednesday, July 1, 2015 9:00 PM
To: Sara Golemon
Cc: pthreads@pthreads.org; internals@lists.php.net
Subject: Re: [PHP-DEV] [RFC] UString

On Tue, Jun 30, 2015 at 10:36 PM, Joe Watkins pthreads@pthreads.org
wrote:

Another possible issue is engine integration:

$string = (UString) $someString;
$string = (UString) "someString";

That sounds like a cool idea to discuss as a completely separate,
unrelated RFC, and not specific to UString.

e.g. $obj = (ClassName)$arg; /* turns into */ $obj = new ClassName($arg);

So you could use casting with any class which supports single-argument
constructors.

But again, orthogonal to this RFC.

-Sara

Expanding on this idea, a separate RFC could propose a magic __cast($value)
static method that would be called for code like below:

$obj = (ClassName) $scalarOrObject; // Invokes ClassName::__cast($scalarOrObject);

This would allow UString to implement casting a string to a UString and allow
users to implement such behavior with their own classes.

However, I would not implement such casting syntax for UString only. Being able
to write $ustring = (UString) $string; without the ability to do so for other
classes would be unusual and confusing in my opinion. If an RFC adding such
behavior was implemented, UString could be updated to support casting.

Obviously a UString should be able to be cast to a scalar string using (string)
$ustring. If performance is a concern, UString::__toString() should cache the
result so multiple casts to the same object are quick.

One way of doing this is already there, thanks to
https://wiki.php.net/rfc/operator_overloading_gmp . Consider

$n = gmp_init(42); var_dump($n, (int)$n);

However, the other way round could be done on a case-by-case basis, IMHO.
While it could make sense for class vs scalar, casting class to class is
quite an unpredictable thing.

While users could implement it, how is it handled with arbitrary objects?
How would it map properties, would those classes need to implement the same
interface, et cetera? We're not in C at this point, where we would just
force a block of memory to be interpreted as we want.

Regards

Anatol

10 years ago by Aaron Piotrowski

Hello,

I was thinking that the __cast() static method would examine the parameter given, then use that value to build a new object and return it or return null (which would then result in the engine throwing an Error saying that $scalarOrObject could not be cast to ClassName). It was just a suggestion to see what others thought because someone suggested supporting casting syntax such as $ustring = (UString) $scalarString. I don’t really care for either method though (__cast() or enabling casting just for UString), as they don't offer any advantage over writing new UString($string) or UString::fromString($string).
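
For comparison, the named-constructor alternative might look like the sketch below. Label is a hypothetical class (not UString itself), the UTF-8 validation is an assumption about what such a factory would want to check, and mb_check_encoding() comes from ext/mbstring.

```php
<?php
// Hypothetical class illustrating a fromString() named constructor.
final class Label
{
    private $value;

    private function __construct(string $value)
    {
        $this->value = $value;
    }

    public static function fromString(string $value): self
    {
        // Reject input that cannot be interpreted as UTF-8 instead of returning null.
        if (!mb_check_encoding($value, 'UTF-8')) {
            throw new InvalidArgumentException('Input is not valid UTF-8');
        }
        return new self($value);
    }

    public function value(): string
    {
        return $this->value;
    }
}

$label = Label::fromString('some text');
echo $label->value(), "\n";

try {
    Label::fromString("\xFF");  // a lone 0xFF byte is never valid UTF-8
    $rejected = false;
} catch (InvalidArgumentException $e) {
    $rejected = true;
}
```

Everything a cast hook would do (inspect the argument, build the object or fail) fits in an ordinary static method, which is the point made above.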

Aaron Piotrowski

10 years ago by "Ivan Enderlin"@Hoa

Hello :-),

Just a small detail. Please, choose another name. The Hoa\String
https://packagist.org/packages/hoa/string library has been renamed to
Hoa\Ustring because of PHP7. So, please, don't force us to rename the
library again ;-).

Moreover, this library provides an API that is useful for daily use and
can be inspiring. Please, see
http://hoa-project.net/Literature/Hack/Ustring.html.

Regards.

10 years ago by Andreas Heigl — view source

Hi.

On 02.07.15 at 15:43, "Ivan Enderlin"@Hoa wrote:

Hello :-),

Just a small detail. Please, choose another name. The Hoa\String
https://packagist.org/packages/hoa/string library has been renamed to
Hoa\Ustring because of PHP7. So, please, don't force us to rename the
library again ;-).

What's the issue with the name?

As far as I can see, there's no problem at all: there's UString and
then there's Hoa\UString. Different namespace, no issue.

Or am I missing something?

Cheers

Andreas

--
,,,
(o o)
+---------------------------------------------------------ooO-(_)-Ooo-+
| Andreas Heigl |
| mailto:andreas@heigl.org N 50°22'59.5" E 08°23'58" |
| http://andreas.heigl.org http://hei.gl/wiFKy7 |
+---------------------------------------------------------------------+
| http://hei.gl/root-ca |
+---------------------------------------------------------------------+

10 years ago by "Ivan Enderlin"@Hoa — view source

I fear it will be a reserved keyword.

10 years ago by Kalle Sommer Nielsen — view source

Hi Ivan

2015-07-02 15:48 GMT+02:00 "Ivan Enderlin"@Hoa ivan.enderlin@hoa-project.net:

I fear it will be a reserved keyword.

Internally defined classes, such as UConverter or stdClass, are not
reserved keywords; they are not an actual part of the language but a
part of the library. Code like the example below is perfectly valid,
meaning the example you gave will continue to work as long as it remains
within a namespace:

C:\dev\php-src>php -r "namespace stdlib; class stdclass { }
var_dump(get_class(new stdclass), get_class(new \stdClass));"
string(15) "stdlib\stdclass"
string(8) "stdClass"

--
regards,

Kalle Sommer Nielsen
kalle@php.net

10 years ago by Sara Golemon — view source

On Thu, Jul 2, 2015 at 6:43 AM, "Ivan Enderlin"@Hoa
ivan.enderlin@hoa-project.net wrote:

Just a small detail. Please, choose another name. The Hoa\String
https://packagist.org/packages/hoa/string library has been renamed to
Hoa\Ustring because of PHP7. So, please, don't force us to rename the
library again ;-).

As others have replied, there's no need for concern on that front:
\UString and Hoa\UString can live side by side.

However, I would like to bump my earlier suggestion to go with
"IntlString" and make this functionality be part of the intl
extension.

I'd also recommend "IntlString" rather than "UString" as nearly all
the Intl classes follow this convention. The one notable exception
being UConverter (which yes, I added, and I regret the departure in
naming).

-Sara

-- Lester Caine - G8HFL

I don't like that. This might sound crazy, but what about adding Unicode string literals to the parser, e.g. u"foo bar\u{202e}你好"? If the UString extension isn't available, just error. It wouldn't be the first time we had disableable syntax features (``), and this avoids any possible conflicts.
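For context on the literal idea above: PHP 7 already accepts the \u{...} code point escape inside ordinary double-quoted strings, although the result is a plain byte string rather than a UString. A minimal sketch, assuming PHP 7.0 or later; the u"..." prefix is only the proposal and is not used here, and the variable names are illustrative:

```php
<?php
// Ordinary PHP 7 string: each \u{...} expands to the UTF-8 bytes of that code point.
// The last two escapes spell the two CJK characters from the example above.
$s = "foo bar\u{202e}\u{4f60}\u{597d}";

// strlen() counts bytes, so each of the three non-ASCII code points counts as 3.
$byteLength = strlen($s);

// Counting matches of /./su counts code points instead of bytes.
$codePointLength = preg_match_all('/./su', $s);

var_dump($byteLength);      // int(16)
var_dump($codePointLength); // int(10)
```

The gap between the two numbers is the kind of mismatch a dedicated Unicode string type or literal would be meant to hide.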

Regards,

I think we're going to keep going round in circles unless we can really pin down what it means for a language to "support Unicode".

Cheers

Also, bear in mind that namespaces mean you can still have your own u() if it’s in your namespace (\u).

On that note, ->charAt ought to be ->codepointAt to avoid being misleading.

Though your approach has its own merits.

The two-interfaces idea was just to get over the naming problem of prefixing everything with codePointX or graphemeX, and wouldn't actually require a separate data structure under the hood.
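To illustrate why the charAt naming and the codePointX/graphemeX split matter, here is a rough sketch of the two views of the same string. The helper names codePoints() and graphemes() are hypothetical and not part of the RFC; the sketch uses PCRE's /u mode and \X only because they ship with core PHP, whereas the intl grapheme_* functions would be the more usual tool:

```php
<?php
// Hypothetical helpers for illustration; neither is part of the UString RFC.

// Split a UTF-8 string into code points (one array entry per code point).
function codePoints(string $s): array
{
    return preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
}

// Split a UTF-8 string into extended grapheme clusters using PCRE's \X.
function graphemes(string $s): array
{
    preg_match_all('/\X/u', $s, $matches);
    return $matches[0];
}

$s = "e\u{0301}a"; // "e" + COMBINING ACUTE ACCENT + "a"

var_dump(count(codePoints($s))); // int(3)
var_dump(count(graphemes($s)));  // int(2)

// Index 1 in the code point view is the bare combining accent,
// which is not what most users would call a "character".
var_dump(codePoints($s)[1] === "\u{0301}"); // bool(true)
```

A method called charAt(1) that returned the lone accent would be surprising, which is the argument for naming it after code points or offering a separate grapheme-based view.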

Cheers,

Andreas
