[RFC] ICU UConverter implementation for ext/intl

12 years ago by Pierre Joye — view source

unread

hi Sara!

http://wiki.php.net/rfc/uconverter

Discuss!

Nice job!

Some comments, raw :) :

the UCNV prefix is not necessary, we are in the UConverter class
already, I would simply use LATIN_1, or ILLEGAL for exampe.
It is OO, we should or could use exception. Warning are not nice to
deal with and errors are quite common during conversion
I would add ERROR_ to the error related constants (UNASSIGNED & co), or?

Some questions, is 'latin1' the same than const UCNV_LATIN_1? or is
there any relation between the two?

Cheers,

Pierre

@pierrejoye | http://blog.thepimp.net | http://www.libgd.org

12 years ago by Sara Golemon — view source

unread

the UCNV prefix is not necessary, we are in the UConverter class
already, I would simply use LATIN_1, or ILLEGAL for exampe.

I would add ERROR_ to the error related constants (UNASSIGNED & co), or?

For the UConverterType enum I could maybe get behind that. For
UConverterCallbackReason I think we need something, but ERROR_ isn't
universally applicable since three are just status incidents
(UCNV_RESET, UCNV_CLOSE, UCNV_CLONE). Perhaps: REASON_* for these?
In that vein, I'd be tempted to make the UConverterType values be
TYPE_*

It is OO, we should or could use exception. Warning are not nice to
deal with and errors are quite common during conversion

We are. UConverterException is thrown everywhere but one place. That
one exception is the static method I added to give a non-oop-like API
for the most core functionality. Given its non-oopness, I left it
non-exceptiony. See Gustvo's notes about error handling below.

Some questions, is 'latin1' the same than const UCNV_LATIN_1? or is
there any relation between the two?

The UCNV_LATIN1 constant is a numeric value, so they're not the same,
no. The relationship is that when you instantiate a UConverter with
'latin1' (or 'iso-8859-1', or any of its aliases),
$conv->getSourceType() (or getDestinationType() as appropriate) will
return UConverter::UCNV_LATIN_1.

Each type in the UConverterType set uses a different algorithm
approach to do the character conversion. Sets like latin1, utf*,
ascii, etc... can be done faster than sets like 'koi8-r' because the
conversions are simple bitshifts (very simple in the case of
ascii/latin1). This just surfaces some info on what the converter is
doing internally and isn't much use beyond a bit of logging, in
practice.

btw, can you add the changes to config.w32 too pls?

Ah, right sorry, had meant to do that at the last minute. Will add
that in a bit,

Error handling is done in a different way from the rest of the extension.
Unusual as this may be, it would be a bad idea to introduce inconsistency
here. See the implementations in the other modules.

I do agree, I was hoping I could slip that one by. :p I'll take
another look at the rest of ext/intl and see about adapting my error
handling.

The tests have poor coverage.

The important bits are hit, but yeah, it could stand to have better
testing. I'll add some.

12 years ago by Pierre Joye — view source

unread

btw, can you add the changes to config.w32 too pls? Pretty much the
same than in .m4, so I can add it to our CI and valid the RFC patch
in a more handy manner :)

http://wiki.php.net/rfc/uconverter

Discuss!

--

--
Pierre

@pierrejoye | http://blog.thepimp.net | http://www.libgd.org

12 years ago by Ivan Enderlin @ Hoa — view source

unread

Very nice work.
Bravo!

http://wiki.php.net/rfc/uconverter

Discuss!

--
Ivan Enderlin
Developer of Hoa
http://hoa.42/ or http://hoa-project.net/

PhD. student at DISC/Femto-ST (Vesontio) and INRIA (Cassis)
http://disc.univ-fcomte.fr/ and http://www.inria.fr/

Member of HTML and WebApps Working Group of W3C
http://w3.org/

12 years ago by Gustavo Lopes — view source

unread

Em 2012-10-30 2:57, Sara Golemon escreveu:

http://wiki.php.net/rfc/uconverter

Discuss!

Good work. I do, however, have some concerns:

Error handling is done in a different way from the rest of the
extension.

Error handling in ext/intl is, admittedly, a little strange. Errors are
both stored globally (to be retrieved by intl_get_error_code() and
intl_get_error_message()) and in the object that raised them (through
::getErrorCode() and ::getErrorMessage()). The mechanism in ICU where
one can do several calls in succession and check the error code only at
the end is broken; the error codes are reset at the start of each call.
The errors may raise any kind of PHP error (depending on the value of a
INI setting, intl.error_level) or an exception (activated globally via
intl.use_exceptions). On failure, functions/methods return false
(constructors and factory methods return null).

Unusual as this may be, it would be a bad idea to introduce
inconsistency here. See the implementations in the other modules.

The tests have poor coverage.

--
Gustavo Lopes

12 years ago by Stas Malyshev — view source

unread

Hi!

http://wiki.php.net/rfc/uconverter

Discuss!

Looks nice. Some points:

transcode() accepts options, but there's no comparable way to set
options to the object. I think these APIs should be synchronized.
Imagine code keeping options in array/config object - it's be really
annoying to have two separate procedures to feed these to object and to
transcode().
Also, description of options would be helpful.
Shouldn't "Enumeration and lookup" methods be static? They look like
independent from encodings and don't use the object.
For "Advanced Use", I think "no error" condition should be the
default and not requiring explicit action.
I think error reporting should match other intl functions. It'd not
really be good if intl submodules would be all different in error
reporting.
What is $source parameter for callbacks?
Why toUCallback returns string but fromUCallback gets codepoint as
long? Shouldn't those be the same - i.e., if toU returns unicode
codepoint, it should be long? Or it can return multiple codepoints? In
which case it becomes confusing as we represent codepoints as both
string and long in the same API.
Link to ICU API from the RFC would be helpful for reviewers and later
docs, I think.

--
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227

12 years ago by Sara Golemon — view source

unread

transcode() accepts options, but there's no comparable way to set
options to the object. I think these APIs should be synchronized.
Imagine code keeping options in array/config object - it's be really
annoying to have two separate procedures to feed these to object and to
transcode().

transcode() having an $options parameter is to make up for the
instance version (convert()) being able to set those via instance
functions (setSubstChars()).

I don't picture a given app using both convert() and transcode(), the
latter only exists to placate those who are objectophobic.

Also, description of options would be helpful.

They're covered in the RFC: to_subst and from_subst under "Simple Use"

Shouldn't "Enumeration and lookup" methods be static? They look like
independent from encodings and don't use the object.

They are in the patch, I just forgot to note that in the RFC. Updated.

For "Advanced Use", I think "no error" condition should be the
default and not requiring explicit action.

If you take no action at all, then an error still exists. This is
consistent with the underlying API.

I think error reporting should match other intl functions. It'd not
really be good if intl submodules would be all different in error
reporting.

Mentioned in previous feedback, I plan to look at this again.

What is $source parameter for callbacks?

It's context for where in the conversion we are. $codeunit/$codepoint
is the specific element causing the problem, $source is the string
from that point forward.

Why toUCallback returns string but fromUCallback gets codepoint as
long? Shouldn't those be the same - i.e., if toU returns unicode
codepoint, it should be long? Or it can return multiple codepoints? In
which case it becomes confusing as we represent codepoints as both
string and long in the same API.

Actually (I left this out of the RFC), they both can return a large
number of types.

In the case of toUCallback, you can return a utf-8 string (most
reasonable Unicode representation to be returned as a char* string)
and the callback mechanism will make that into UChars to put into the
target string. You can return a long and it'll be treated as a single
Unicode codepoint (One UChar for BMP, 2 for higher planes). You can
also return an array of either of these types to specify a string in a
readable, but unicode friendly format, e.g. array("Espa", 0x00F1 /*
LATIN SMALL LETTER N WITH TILDE */, "ol") would be equivalent to
"Espa\xC3\xB1ol".

The same is true for fromUCallback() with the exception that the
values being returned are assumed to be in the target encoding. For
longs this means a single byte unsigned char which is appended to the
target as-is. Similarly strings are appended as-is.

As for input parameters: for toUCallback, $source and $codeUnits are
still in their original encoding and presented as-is for that
encoding. For fromUCallback(), the $source/$codePoint are in Unicode
(UChar/UTF16 internall) and can't be directly offered to PHP without
running into endianness issues. So the codepoint is provided as a
single UChar32 (avoiding the surrogate problem in the process), and
source is given as a series of UChar32 codepoints in a numerically
indexed array.

I'll add a section about callback input/return types to clarify this.

Link to ICU API from the RFC would be helpful for reviewers and later
docs, I think.

Added!

12 years ago by Sara Golemon — view source

unread

With the exception of renaming the UConverter::UCNV_* constants to
remove the UCNV_ prefix, I believe I've addressed the concerns thus
far. ((Waiting to hear if anyone else wants to weigh in on the
contants)) The RFC has been updated accordingly.+

12 years ago by Ilia Alshanetsky — view source

unread

After the updates it looks really good, very handy functionality to have.

With the exception of renaming the UConverter::UCNV_* constants to
remove the UCNV_ prefix, I believe I've addressed the concerns thus
far. ((Waiting to hear if anyone else wants to weigh in on the
contants)) The RFC has been updated accordingly.+

12 years ago by Adam Harvey — view source

unread

With the exception of renaming the UConverter::UCNV_* constants to
remove the UCNV_ prefix, I believe I've addressed the concerns thus
far. ((Waiting to hear if anyone else wants to weigh in on the
contants)) The RFC has been updated accordingly.+

I would say that unprefixing makes sense in general, particularly
since that's what happens in other intl classes (such as Collator).
Prefixing the callback reason constants with REASON_* seems like a
good compromise there — as you said upthread, they are different to
the other constants defined on UConverter. Beyond that, I think the
type constants would do fine as direct UConverter constants
(UConverter::UTF8, for instance, makes sense to me — in general,
you're using a converter because you want to deal with encoding
types).

I'm +1 on the functionality in general. The thought of another
encoding conversion API in PHP doesn't fill me with great joy given we
already have mbstring and iconv, but it does provide features neither
of those libraries provide: combined with the existing intl
functionality and the ever-increasing need for internationalisation
support, I'd like to think we might look at nudging intl towards being
the usual way of providing that functionality in PHP.

Thanks,

Adam, who is apparently in a run-on sentence kind of mood today.

12 years ago by Sara Golemon — view source

unread

Okay, I can be convinced. :)

Diff and RFC updated.

And I agree on the slight sense of unease at adding a third encoding
converter, but ICU is so far ahead of the other two that I really feel
we should encourage its use over iconv and mbstring.

-Sara

With the exception of renaming the UConverter::UCNV_* constants to
remove the UCNV_ prefix, I believe I've addressed the concerns thus
far. ((Waiting to hear if anyone else wants to weigh in on the
contants)) The RFC has been updated accordingly.+

I would say that unprefixing makes sense in general, particularly
since that's what happens in other intl classes (such as Collator).
Prefixing the callback reason constants with REASON_* seems like a
good compromise there — as you said upthread, they are different to
the other constants defined on UConverter. Beyond that, I think the
type constants would do fine as direct UConverter constants
(UConverter::UTF8, for instance, makes sense to me — in general,
you're using a converter because you want to deal with encoding
types).

I'm +1 on the functionality in general. The thought of another
encoding conversion API in PHP doesn't fill me with great joy given we
already have mbstring and iconv, but it does provide features neither
of those libraries provide: combined with the existing intl
functionality and the ever-increasing need for internationalisation
support, I'd like to think we might look at nudging intl towards being
the usual way of providing that functionality in PHP.

Thanks,

Adam, who is apparently in a run-on sentence kind of mood today.

12 years ago by Sara Golemon — view source

unread

The requisite one week minimum has passed and there have been no
further comments since my last update (which addressed all comments
thereto) so I'm opening the voting process on
https://wiki.php.net/rfc/uconverter

Cheers,
-Sara

12 years ago by Pierre Joye — view source

unread

Just for the record, it is one week minimum, not maximum :)

If you think that the implementation details are solved and ready to be
voted on, then heh, go ahead :)

The requisite one week minimum has passed and there have been no
further comments since my last update (which addressed all comments
thereto) so I'm opening the voting process on
https://wiki.php.net/rfc/uconverter

Cheers,
-Sara

--

--
Pierre

@pierrejoye | http://blog.thepimp.net | http://www.libgd.org

12 years ago by Sara Golemon — view source

unread

There doesn't seem to be any further discussion. I figured if someone
has more objections during the week this stays open for voting they
can be addressed, and if need be the voting can be reset and resumed.

-Sara

Just for the record, it is one week minimum, not maximum :)

If you think that the implementation details are solved and ready to be
voted on, then heh, go ahead :)

The requisite one week minimum has passed and there have been no
further comments since my last update (which addressed all comments
thereto) so I'm opening the voting process on
https://wiki.php.net/rfc/uconverter

Cheers,
-Sara

--

--
Pierre

@pierrejoye | http://blog.thepimp.net | http://www.libgd.org

12 years ago by Lester Caine — view source

unread

Sara Golemon wrote:

There doesn't seem to be any further discussion. I figured if someone
has more objections during the week this stays open for voting they
can be addressed, and if need be the voting can be reset and resumed.

This may already be covered elsewhere, but is there a return from the class that
indicates easily the number of codepoint errors?
It would be helpful if the likes of the error handling single line linked
through to the relevant material? I've got back to the intl documentation, but
it's not clear there what is returned if a codepoint error is detected.

--
Lester Caine - G8HFL

Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk

12 years ago by Sara Golemon — view source

unread

As far as what's returned by convert() in the case of codepoint
errors, I thought that was explained in the usage sections, but I'll
put some content into the error handling section as well (and link to
the existing intl functions related to error surfacing,

As to determining how many invalid codepoints were encountered, no.
There's not an API to surface that at ICU's level. I could add a
counter to the callback and always enable it, even when not
subclassing, but that's adding complexity where it in all likelihood
won't be used for something which can be done in userspace, so I'm
inclined against it.

-Sara

Sara Golemon wrote:

There doesn't seem to be any further discussion. I figured if someone
has more objections during the week this stays open for voting they
can be addressed, and if need be the voting can be reset and resumed.

This may already be covered elsewhere, but is there a return from the class
that indicates easily the number of codepoint errors?
It would be helpful if the likes of the error handling single line linked
through to the relevant material? I've got back to the intl documentation,
but it's not clear there what is returned if a codepoint error is detected.

--
Lester Caine - G8HFL

Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk

12 years ago by Lester Caine — view source

unread

Sara Golemon wrote:

As far as what's returned by convert() in the case of codepoint
errors, I thought that was explained in the usage sections, but I'll
put some content into the error handling section as well (and link to
the existing intl functions related to error surfacing,
The problem with the wiki is then cross linking to the rest of the
documentation. After 10 minutes I was still going around in circles ... once you
know WHERE to look finding things is easier.

As to determining how many invalid codepoints were encountered, no.
There's not an API to surface that at ICU's level. I could add a
counter to the callback and always enable it, even when not
subclassing, but that's adding complexity where it in all likelihood
won't be used for something which can be done in userspace, so I'm
inclined against it.
I thought that was the case, but again cross checking gets difficult. I'm more
or less UFT8 based now content wise, and on the whole can just bounce UTF8 back,
but identifying the number of 'extended' characters would be useful to pick up
strings that would become unreadable. The odd character missed can be
acceptable, but whole words are a different matter? Almost looking for a
'readability' scale rather than number of errors?

-Sara

Sara Golemon wrote:

There doesn't seem to be any further discussion. I figured if someone
has more objections during the week this stays open for voting they
can be addressed, and if need be the voting can be reset and resumed.

This may already be covered elsewhere, but is there a return from the class
that indicates easily the number of codepoint errors?
It would be helpful if the likes of the error handling single line linked
through to the relevant material? I've got back to the intl documentation,
but it's not clear there what is returned if a codepoint error is detected.

--
Lester Caine - G8HFL

Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk

12 years ago by Sara Golemon — view source

unread

It's been a few weeks since voting opened and it's unanimous so far.
If no one objects I'll plan on pushing into git on Friday.

The requisite one week minimum has passed and there have been no
further comments since my last update (which addressed all comments
thereto) so I'm opening the voting process on
https://wiki.php.net/rfc/uconverter

Cheers,
-Sara

[RFC] ICU UConverter implementation for ext/intl

Cheers,

-- Lester Caine - G8HFL

-- Lester Caine - G8HFL

-- Lester Caine - G8HFL

--
Lester Caine - G8HFL

--
Lester Caine - G8HFL

--
Lester Caine - G8HFL