[RFC] IntlChar class and intl_char_*() functions

10 years ago by Lester Caine

While playing around with Andrea's unicode literals syntax proposal, I
was reminded of just how little of ICU is exposed. I've put up a
short proposal for adding IntlChar exporting these APIs as static
methods (with a matching non-oop interface).

https://wiki.php.net/rfc/intl.char

Isn't the problem here that while ICU is perhaps the obvious way
forward, there is still no decision that it will be the base for other
developments? Other proposals are looking for a lighter solution to the
problem? I'd make a case for using the UTF8 configuration of ICU as the
base for all the unicode developments, but can understand that this may
not play well with other installations of ICU on a system?

--
Lester Caine - G8HFL

Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk

10 years ago by Rowan Collins

Lester Caine wrote on 25/11/2014 10:00:

While playing around with Andrea's unicode literals syntax proposal, I
was reminded of just how little of ICU is exposed. I've put up a
short proposal for adding IntlChar exporting these APIs as static
methods (with a matching non-oop interface).

https://wiki.php.net/rfc/intl.char

Isn't the problem here that while ICU is perhaps the obvious way
forward, there is still no decision that it will be the base for other
developments? Other proposals are looking for a lighter solution to the
problem? I'd make a case for using the UTF8 configuration of ICU as the
base for all the unicode developments, but can understand that this may
not play well with other installations of ICU on a system?

I think this is just adding some functionality to the existing ext/intl
[http://php.net/intl], which is basically a PHP wrapper around ICU
already, so doesn't have much bearing on further Unicode developments.

--
Rowan Collins
[IMSoP]

10 years ago by Derick Rethans

While playing around with Andrea's unicode literals syntax proposal, I
was reminded of just how little of ICU is exposed. I've put up a
short proposal for adding IntlChar exporting these APIs as static
methods (with a matching non-oop interface).

https://wiki.php.net/rfc/intl.char

I think the RFC could benefit from a few examples on how you would use
this.

cheers,
Derick

10 years ago by Andrea Faulds

While playing around with Andrea's unicode literals syntax proposal, I
was reminded of just how little of ICU is exposed. I've put up a
short proposal for adding IntlChar exporting these APIs as static
methods (with a matching non-oop interface).

https://wiki.php.net/rfc/intl.char

Is there really a need to have both an “OOP” and “non-OOP” interface here? If it’s all static methods, why not just make it functions?

--
Andrea Faulds
http://ajf.me/

10 years ago by Sara Golemon

Is there really a need to have both an “OOP” and “non-OOP” interface here? If it’s all static methods, why not just make it functions?

No, there isn't, but everything else in ext/intl has this duality, so
I'm offering it in the initial RFC. If others agree that it's not
necessary, then I am ONLY TOO HAPPY to rip it the eff out.

-Sara

10 years ago by Stanislav Malyshev

Hi!

While playing around with Andrea's unicode literals syntax proposal, I
was reminded of just how little of ICU is exposed. I've put up a
short proposal for adding IntlChar exporting these APIs as static
methods (with a matching non-oop interface).

https://wiki.php.net/rfc/intl.char

Sounds like a good idea, even for 5.6.

--
Stas Malyshev
smalyshev@gmail.com

10 years ago by Stanislav Malyshev

Hi!

While playing around with Andrea's unicode literals syntax proposal, I
was reminded of just how little of ICU is exposed. I've put up a
short proposal for adding IntlChar exporting these APIs as static
methods (with a matching non-oop interface).

https://wiki.php.net/rfc/intl.char

Looking into this and also reading the \u{} proposal, I also thought -
do we have a programmatic way of doing what \u would do? I.e. if we
assume $x holds a Unicode codepoint value (i.e., an integer) do we have
a good built-in way to generate the corresponding utf8 sequence?
If not, then I think this class may be a good place to put such a
function in.

Stas Malyshev
smalyshev@gmail.com

10 years ago by Andrea Faulds

Looking into this and also reading the \u{} proposal, I also thought -
do we have a programmatic way of doing what \u would do? I.e. if we
assume $x holds a Unicode codepoint value (i.e., an integer) do we have
a good built-in way to generate the corresponding utf8 sequence?
If not, then I think this class may be a good place to put such a
function in.

You mean something along the lines of JavaScript’s String.fromCharCode (or, in ES6, String.fromCodePoint)?

One of the nice things about that function is that it can take multiple codes. So I can do String.fromCodePoint(65, 66, 67) to get “ABC”.

If we add that, we should also have an analogue of JavaScript’s String.charCodeAt/String.codePointAt to do the operation in reverse.

Andrea Faulds
http://ajf.me/

10 years ago by Rowan Collins

Andrea Faulds wrote on 27/11/2014 13:48:

Looking into this and also reading the \u{} proposal, I also thought -
do we have a programmatic way of doing what \u would do? I.e. if we
assume $x holds a Unicode codepoint value (i.e., an integer) do we have
a good built-in way to generate the corresponding utf8 sequence?
If not, then I think this class may be a good place to put such a
function in.

You mean something along the lines of JavaScript’s String.fromCharCode (or, in ES6, String.fromCodePoint)?

One of the nice things about that function is that it can take multiple codes. So I can do String.fromCodePoint(65, 66, 67) to get “ABC”.

If we add that, we should also have an analogue of JavaScript’s String.charCodeAt/String.codePointAt to do the operation in reverse.

We already have the single-byte versions: chr() and ord(). It's been on
my to do list for a while to rewrite the manual pages for those, which
currently have a whole lot of misleading references to ASCII.

10 years ago by Stanislav Malyshev

Hi!

We already have the single-byte versions: chr() and ord(). It's been on

Not really. chr(0xA1) is a byte with value 0xA1. The function I propose
would instead produce "\xC2\xA1".
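To make the difference concrete, here is a minimal sketch of such a function in plain PHP (PHP 7 or later for the type declarations). The name codepoint_to_utf8() is hypothetical and is not taken from the RFC; it only illustrates the standard UTF-8 bit layout next to what chr() does.

```php
<?php
// Hypothetical helper, not part of PHP or the RFC: encode a single
// Unicode code point (an integer) as a UTF-8 byte sequence.
function codepoint_to_utf8(int $cp): string
{
    // Reject values outside the Unicode range and UTF-16 surrogates.
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException("Not a Unicode scalar value: $cp");
    }
    if ($cp < 0x80) {            // 1 byte:  0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {           // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {         // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// chr() yields the single byte 0xA1; the helper yields the UTF-8 sequence.
var_dump(bin2hex(chr(0xA1)));               // string(2) "a1"
var_dump(bin2hex(codepoint_to_utf8(0xA1))); // string(4) "c2a1"
```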

--
Stas Malyshev
smalyshev@gmail.com

10 years ago by Rowan Collins

Hi!

We already have the single-byte versions: chr() and ord(). It's been
on

Not really. chr(0xA1) is a byte with value 0xA1. The function I propose
would instead produce "\xC2\xA1".

Ah, yes, so it would. According to MDN, the JS functions actually return UTF-16, so it's not really the same as those either - it needs to specifically return a UTF-8 byte sequence.

10 years ago by Stanislav Malyshev

Hi!

You mean something along the lines of JavaScript’s
String.fromCharCode (or, in ES6, String.fromCodePoint)?

Yes, exactly.

One of the nice things about that function is that it can take
multiple codes. So I can do String.fromCodePoint(65, 66, 67) to get
“ABC”.

Well, that is nice but if we have it for one we can always use
array_map() :) We could make it accept as many code points as we want,
of course.
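As a rough illustration of the array_map() approach, the sketch below wraps a single-code-point converter in a variadic function. It assumes ext/intl is loaded and uses IntlChar::chr() as it exists in the IntlChar class shipped with PHP 7; the wrapper name from_code_points() is hypothetical.

```php
<?php
// Hypothetical variadic wrapper, loosely modelled on JavaScript's
// String.fromCodePoint(). Assumes ext/intl provides IntlChar::chr().
function from_code_points(int ...$codePoints): string
{
    // Convert each code point to its UTF-8 sequence, then concatenate.
    return implode('', array_map('IntlChar::chr', $codePoints));
}

echo from_code_points(65, 66, 67), "\n"; // ABC
```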

If we add that, we should also have an analogue of JavaScript’s
String.charCodeAt/String.codePointAt to do the operation in reverse.

That'd be a bit harder since it's not clear what "at" means there -
byte? codepoint? grapheme? what about broken UTF-8 sequences? What if
you run this function on non-UTF-8 array? Etc. So this one is trickier,
the other direction is easy.
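The ambiguity is easy to see with a short string. The sketch below counts the same three-byte input as bytes, code points and grapheme clusters, and shows one way a broken sequence can be detected. It relies only on PCRE with the /u modifier; the variable names are illustrative.

```php
<?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT, written as raw UTF-8 bytes.
$s = "e\xCC\x81";

$bytes      = strlen($s);                  // counts bytes
$codePoints = preg_match_all('/./su', $s); // counts code points
$graphemes  = preg_match_all('/\X/u', $s); // counts grapheme clusters

printf("bytes=%d codepoints=%d graphemes=%d\n", $bytes, $codePoints, $graphemes);
// bytes=3 codepoints=2 graphemes=1

// A truncated sequence is not valid UTF-8, so matching with /u fails.
$broken = "\xC2";
var_dump(preg_match('//u', $broken)); // bool(false)
```

An "at" function has to pick one of these three units for its offset, and it has to decide what to return when the input is not valid UTF-8.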

--
Stas Malyshev
smalyshev@gmail.com

10 years ago by Andrea Faulds

If we add that, we should also have an analogue of JavaScript’s
String.charCodeAt/String.codePointAt to do the operation in reverse.

That'd be a bit harder since it's not clear what "at" means there -
byte? codepoint? grapheme? what about broken UTF-8 sequences? What if
you run this function on non-UTF-8 array? Etc. So this one is trickier,
the other direction is easy.

It'd work with codepoints, but it strikes me now that this would be a job for UStrings. We could have u($some_input)->codePointAt(7) alongside u($some_input)[7]. Since UString works with codepoints and is guaranteed to be Unicode, this works.

--
Andrea Faulds
http://ajf.me/

10 years ago by Sara Golemon

On Thu, Nov 27, 2014 at 12:22 AM, Stanislav Malyshev
smalyshev@gmail.com wrote:

Looking into this and also reading the \u{} proposal, I also thought -
do we have a programmatic way of doing what \u would do? I.e. if we
assume $x holds a Unicode codepoint value (i.e., an integer) do we have
a good built-in way to generate the corresponding utf8 sequence?
If not, then I think this class may be a good place to put such a
function in.

Yeah, that's some of what this class and/or set of functions would
expose. Numeric codepoint to utf8 sequence, codepoint name (e.g.
"LATIN CAPITAL LETTER P") to utf8 sequence, numeric codepoint to
codepoint name, utf8 sequence to numeric codepoint, etc...
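For reference, these conversions can be sketched with the method names of the IntlChar class as it later shipped in PHP 7 (ext/intl required). The draft discussed at this point in the thread may have differed. In the shipped class, charFromName() returns a code point rather than a UTF-8 sequence, so the sketch chains it with chr().

```php
<?php
// Assumes ext/intl with the IntlChar class as shipped in PHP 7.

// Numeric code point to UTF-8 sequence.
$utf8 = IntlChar::chr(0xA1);                     // "\xC2\xA1"

// Code point name to code point, then to a UTF-8 sequence.
$cp   = IntlChar::charFromName('LATIN CAPITAL LETTER P');
$p    = IntlChar::chr($cp);                      // "P"

// Numeric code point to code point name.
$name = IntlChar::charName(0x50);                // "LATIN CAPITAL LETTER P"

// UTF-8 sequence to numeric code point.
$back = IntlChar::ord("\xC2\xA1");               // 0xA1

printf("%s U+%04X %s %s U+%04X\n", bin2hex($utf8), $cp, $p, $name, $back);
```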

At this point I haven't heard any major directions so I'm going to
start work on implementing it so that we have a more firm base to
discuss and we can continue once I have that up for review.

-Sara

10 years ago by Sara Golemon

While playing around with Andrea's unicode literals syntax proposal, I
was reminded of just how little of ICU is exposed. I've put up a
short proposal for adding IntlChar exporting these APIs as static
methods (with a matching non-oop interface).

https://wiki.php.net/rfc/intl.char

Full implementation available now at
https://github.com/sgolemon/php-src/compare/intl.uchar
RFC updated to remove the functional API and clarify some things based
on what I learned while implementing.

Pay special attention to the #notes section regarding $limit which I
think is a somewhat non-PHP API.

-Sara

10 years ago by Stanislav Malyshev

Hi!

Full implementation available now at
https://github.com/sgolemon/php-src/compare/intl.uchar
RFC updated to remove the functional API and clarify some things based
on what I learned while implementing.

Thanks for your work! I think it may be good to make it a pull, since
it'd be easier to comment on that (and also Travis can say its word of
course :)

Pay special attention to the #notes section regarding $limit which I
think is a somewhat non-PHP API.

It says "some methods" but I found just one which is user-callable -
enumCharNames - and one that consumes callback with the same setup -
enumCharTypes - in the latter case I don't think it makes sense to
change anything since it's a callback for ICU. So, if I didn't miss
anything, it's one function, and one that has a matching callback
in the ICU API. So I think it's best to leave it as is, especially since PHP
doesn't have the concept of ranges as such...
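A short sketch of what the $limit convention means in practice, using IntlChar::enumCharNames() as it exists in the shipped ext/intl. The second argument is an exclusive upper bound, mirroring ICU's u_enumCharNames(), so the range below should cover U+0041 through U+0043 only. The $names array is illustrative.

```php
<?php
// Assumes ext/intl with IntlChar::enumCharNames() as shipped in PHP 7.
$names = [];

// Start is inclusive, limit is exclusive: 0x44 itself is not visited.
IntlChar::enumCharNames(0x41, 0x44, function ($codepoint, $nameChoice, $name) use (&$names) {
    $names[$codepoint] = $name;
});

foreach ($names as $codepoint => $name) {
    printf("U+%04X %s\n", $codepoint, $name);
}
```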

Stas Malyshev
smalyshev@gmail.com

10 years ago by Sara Golemon

On Tue, Dec 16, 2014 at 12:49 AM, Stanislav Malyshev
smalyshev@gmail.com wrote:

Thanks for your work! I think it may be good to make it a pull, since
it'd be easier to comment on that (and also Travis can say its word of
course :)

Can do!
https://github.com/php/php-src/pull/961

It says "some methods" but I found just one which is user-callable -
enumCharNames - and one that consumes callback with the same setup -
enumCharTypes - in the latter case I don't think it makes sense to
change anything since it's a callback for ICU. So, if I didn't miss
anything, it's one function and at that one that has matching callback
in ICU API. So I think it's best to leave it as is, especially that PHP
doesn't have the concept of ranges as such...

You've read it right and I agree with you. The less magic the better.
I just wanted to call it out since it lacks some intuitiveness.

-Sara

10 years ago by Sara Golemon

While playing around with Andrea's unicode literals syntax proposal, I
was reminded of just how little of ICU is exposed. I've put up a
short proposal for adding IntlChar exporting these APIs as static
methods (with a matching non-oop interface).

https://wiki.php.net/rfc/intl.char

FYI- I plan to open voting on this on Monday the 22nd with a 3-week
voting window (to account for the holiday). If you have any comments,
please make them over the next four days.

-Sara

10 years ago by Sara Golemon

While playing around with Andrea's unicode literals syntax proposal, I
was reminded of just how little of ICU is exposed. I've put up a
short proposal for adding IntlChar exporting these APIs as static
methods (with a matching non-oop interface).

https://wiki.php.net/rfc/intl.char

Better late than never, voting is open until 2015-01-16 06:00 UTC

10 years ago by Pierre Joye

While playing around with Andrea's unicode literals syntax proposal, I
was reminded of just how little of ICU is exposed. I've put up a
short proposal for adding IntlChar exporting these APIs as static
methods (with a matching non-oop interface).

https://wiki.php.net/rfc/intl.char

Better late than never, voting is open until 2015-01-16 06:00 UTC

Wohooo! :)

btw, it would be very nice to start a new thread for each phase, so it
does not go hidden in some random old threads. Thanks :)

--
Pierre

@pierrejoye | http://www.libgd.org

10 years ago by Sara Golemon

While playing around with Andrea's unicode literals syntax proposal, I
was reminded of just how little of ICU is exposed. I've put up a
short proposal for adding IntlChar exporting these APIs as static
methods (with a matching non-oop interface).

https://wiki.php.net/rfc/intl.char

Better late than never, voting is open until 2015-01-16 06:00 UTC

After a three-week voting period, IntlChar has been accepted 14-0.

I'll land it in both projects shortly.

-Sara
