[RFC] IntlCharsetDetector

9 years ago by Stanislav Malyshev — view source

unread

Hi!

With a light push from Stas, I've decided to go ahead and put up
IntlCharsetDetector for discussion.
https://wiki.php.net/rfc/intl.charset-detector

I'm still not personally convinced this API is trustworthy enough, but
it's worth a formal discussion period at least.

The API looks a bit strange - new IntlCharsetDetector($text) and then
detect(). Can't we just have detect([$text])?

--
Stas Malyshev
smalyshev@gmail.com

9 years ago by Sara Golemon — view source

unread

The API looks a bit strange - new IntlCharsetDetector($text) and then
detect(). Can't we just have detect([$text])?

I went with a direct wrapping of the underlying API because it always
feels like we regret adding magic eventually. It's trivial for some
composer installable library to wrap this into something nicer. In
fact, one probably WOULD use a userspace library in order to provide
fallback to mb_detect_encoding() on PHP < 7.1

That said, how do you feel about compromising by adding this function
in addition to the raw API?

function ucsdet_detect_encoding(string $text, string $hint = null,
bool $filter = false) {
$det = new IntlCharsetDetector($text);
if ($hint !== null) {
$det->setDeclaredEncoding($hint);
}
$det->enableInputFiltering($filter);
return $det->detect()['name'];
}

That'll give simplicity for the 80% case (I have a string, I want a
best guess) but still provide the true API for consideration of
confidence and/or other guesses.

-Sara

9 years ago by Stanislav Malyshev — view source

unread

Hi!

That said, how do you feel about compromising by adding this function
in addition to the raw API?

function ucsdet_detect_encoding(string $text, string $hint = null,
bool $filter = false) {
$det = new IntlCharsetDetector($text);
if ($hint !== null) {
$det->setDeclaredEncoding($hint);
}
$det->enableInputFiltering($filter);
return $det->detect()['name'];
}

That works too, but object one has its uses - e.g. preparing one
detector and detecting multiple texts, like library, etc. Of course, one
can do setText, but it's extra hop that doesn't seem to add anything.

I'd take both :)

Stas Malyshev
smalyshev@gmail.com

9 years ago by Fleshgrinder — view source

unread

The API looks a bit strange - new IntlCharsetDetector($text) and then
detect(). Can't we just have detect([$text])?

I went with a direct wrapping of the underlying API because it always
feels like we regret adding magic eventually. It's trivial for some
composer installable library to wrap this into something nicer. In
fact, one probably WOULD use a userspace library in order to provide
fallback to mb_detect_encoding() on PHP < 7.1

That said, how do you feel about compromising by adding this function
in addition to the raw API?

function ucsdet_detect_encoding(string $text, string $hint = null,
bool $filter = false) {
$det = new IntlCharsetDetector($text);
if ($hint !== null) {
$det->setDeclaredEncoding($hint);
}
$det->enableInputFiltering($filter);
return $det->detect()['name'];
}

That'll give simplicity for the 80% case (I have a string, I want a
best guess) but still provide the true API for consideration of
confidence and/or other guesses.

-Sara

I agree with Stanislav here. I think that the HHVM implementation is
much better because it enables dependency injection and does not fall
back to arbitrary native arrays. However, the detect method should take
an optional text argument directly, as Stanislav proposed. I understand
the argument that extending native interfaces sometimes led to problems
but simply copying stuff without thinking of better ways is also not
really solving all problems.

In our native language:

class IntlCharsetDetector {

    fn __construct(string $text = null);

    fn setText(string $text): void;

    fn setDeclaredEncoding(string $encoding): void;

    // I incorporated your proposal here because I think it is
    // extremely useful.
    fn detect(
        string $text = null,
        string $hint = null,
        bool $filter = false
    ): IntlEncodingMatch;

    fn detectAll(string $text = null): array<IntlEncodingMatch>;

    fn inputFilterEnabled(): bool;

    fn enableInputFilter(): void;

    fn disableInputFilter(): void;

    fn getDetectableCharsets(): array<string>;

    fn enableDetectableCharset(string $charset): void;

    fn disableDetectableCharset(string $charset): void;

    static fn getAllDetectableCharsets(): array<string>;

}

NOTE how I changed the methods that take bool arguments int he Java
implementation to direct calls. This removes the necessity to validate
the arguments of the caller and results in a very clean API that does
not require extensive documentation to be easily comprehensible.

class IntlEncodingMatch {

    fn isValid(): bool;

    fn getEncoding(): string;

    fn getConfidence(): int;

    fn getLanguage(): string;

    fn getUtf8(): string;

}

NOTE that I also use getUtf8 as does HHVM and not simply getString
as it is used in the Java implementation because again it reduces the
amount of documentation reading because it is clear what this method is
supposed to do (return). I did not write getUTF8 though because I
think it violates the /camelCase/ rule but that is not really clarified
in the PHP Coding Standards and thus hard to argue about.

Having the IntlEncodingMatch object as a result has the advantage that
one does not need to look up the documentation every time they want to
handle the result, no creation of the strings to access the offsets at
runtime, no error prone mistyping issues plus the added value of type
hinting against it.

That being said, the procedural functions could still return native
arrays. I do not see a reason why the procedural versions have to be 1:1
mapping to the OO versions.

--
Richard "Fleshgrinder" Fussenegger

9 years ago by Derick Rethans — view source

unread

With a light push from Stas, I've decided to go ahead and put up
IntlCharsetDetector for discussion.
https://wiki.php.net/rfc/intl.charset-detector

I'm still not personally convinced this API is trustworthy enough, but
it's worth a formal discussion period at least.

The RFC itself looks fine. I still believe that it's
too easy to just rely on the charset that it returns. I know there is a
confidence factor, but you know how lazy we are :). Unless it's
absolutely clear that it is merely a very badly educated guess, I think
I shall be voting no.

cheers,
Derick

9 years ago by Yasuo Ohgaki — view source

unread

Hi Sara,

With a light push from Stas, I've decided to go ahead and put up
IntlCharsetDetector for discussion.
https://wiki.php.net/rfc/intl.charset-detector

I'm still not personally convinced this API is trustworthy enough, but
it's worth a formal discussion period at least.

Things might have been changed, but as you've mentioned encoding
detection is unstable and ICU is poor compared to mbstring's detection
at least for Japanese encodings.

Developers should not rely on encoding detector, but they should validate
encoding.

Problem is there are cases that developers cannot determine used encoding...
If we are going to have this API, it would be better to validate string with
detected encoding by default and disable encoding validation optionally.
There are cases that developers have to deal with broken string data
on occasion.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

9 years ago by Sara Golemon — view source

unread

Things might have been changed, but as you've mentioned encoding
detection is unstable and ICU is poor compared to mbstring's detection
at least for Japanese encodings.

For me, the difference is that I expect further work to be done on
improving ICU, while I lack that confidence for mbstring. If the API
is in place early on, the library can improve underneath it to the
point it becomes more trustworthy later, but still be usable on older
versions of PHP (linked against newer libicu).

Maybe, I dunno. I lack the motivation to push this feature forward
atm, merely because it's not trust-worthy now.

Developers should not rely on encoding detector, but they should validate
encoding.

I think everyone agrees on that. :)

Problem is there are cases that developers cannot determine used encoding...
If we are going to have this API, it would be better to validate string with
detected encoding by default and disable encoding validation optionally.
There are cases that developers have to deal with broken string data
on occasion.

What do you have in mind? Full-on pre-request input filtering?
'cause that's never worked right (we tried really hard to make PHP6 do
that and it failed badly)

Or do you mean something like wrapping the ucsdet API in a coercer
function that only returned the original string if it detected at high
confidence and then validated against that detection? 'cause
honestly, that should also be left to the application IMO.

-Sara

9 years ago by Stanislav Malyshev — view source

unread

Hi!

For me, the difference is that I expect further work to be done on
improving ICU, while I lack that confidence for mbstring. If the API

My experience over the years has been that established supported
libraries like ICU usually have better track record in improving and
maintenance than more niche libraries, but it differs a lot from case to
case. I have no idea though how good/bad is ICU in detecting Asian
languages and encodings.

Developers should not rely on encoding detector, but they should validate
encoding.

I think everyone agrees on that. :)

True, but also incomplete. There's ideal case, and there's real world.
In ideal case, you know encodings of everything and everything is nicely
specified and shiny and rainbows and unicorns abound. Real data, though,
is messy and unpredictable and comes from places and practices that
makes one shudder. And when it comes to that we can either give the
developers at least something - an imperfect encoding detector, with all
caveats - or just ignore it and not give them anything, because it is
not matching our theories. and leave them to implement even worse hacks.
I think the former is much better approach.

And of course, detection and validation is a different thing. A text may
look like valid string in encoding A but actually be encoding B. "Tell
me if this data looks like Russian text in KOI-8 or Japanese text in
Shift-JIS" and "tell me if this is a valid or invalid UTF-8" are two
completely different tasks.

Stas Malyshev
smalyshev@gmail.com

9 years ago by Yasuo Ohgaki — view source

unread

Hi Sara,

Things might have been changed, but as you've mentioned encoding
detection is unstable and ICU is poor compared to mbstring's detection
at least for Japanese encodings.

For me, the difference is that I expect further work to be done on
improving ICU, while I lack that confidence for mbstring. If the API
is in place early on, the library can improve underneath it to the
point it becomes more trustworthy later, but still be usable on older
versions of PHP (linked against newer libicu).

Maybe, I dunno. I lack the motivation to push this feature forward
atm, merely because it's not trust-worthy now.

I can understand this.

Developers should not rely on encoding detector, but they should validate
encoding.

I think everyone agrees on that. :)

Problem is there are cases that developers cannot determine used encoding...
If we are going to have this API, it would be better to validate string with
detected encoding by default and disable encoding validation optionally.
There are cases that developers have to deal with broken string data
on occasion.

What do you have in mind? Full-on pre-request input filtering?
'cause that's never worked right (we tried really hard to make PHP6 do
that and it failed badly)

I'm not.

Or do you mean something like wrapping the ucsdet API in a coercer
function that only returned the original string if it detected at high
confidence and then validated against that detection? 'cause
honestly, that should also be left to the application IMO.

I don't have problem with this approach. Developers must be responsible
for this.

For normal web apps, developers must validate encoding if it is expected
one or not. Developers do not have guess encoding for most cases.

Developers may need to detect encoding for uploaded text files, for example.
If they use encoding detection, then they should validate text data by
detected encoding.

Experienced developers will detect encoding then validate text data with
detected encoding before saving uploaded text files. However, many developers
will detect encoding and assume text file char encoding is valid. This is the
reason why I suggest

detect encoding (This is done by using only the beginning of data usually)
then validate the text data by detected encoding
if validation is OK return encoding name, otherwise return error.

This would reduce chance of storing invalid text data in system.
It's not strictly required, but I think it is more developer friendly.
It's just a suggestion.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

9 years ago by Tom Worster — view source

unread

Things might have been changed, but as you've mentioned encoding
detection is unstable and ICU is poor compared to mbstring's detection
at least for Japanese encodings.

For me, the difference is that I expect further work to be done on
improving ICU,

Why do you expect that?

When I researched this problem some years ago I had the impression a
number of attempted solutions had been published and abandoned. I took
this to mean that there was a learning experience that ended with the
understanding that it's insoluble.

That's why I'm curious if you know of ongoing efforts in ICU. I took a
look and saw little activity in the last 10 years.

while I lack that confidence for mbstring. If the API
is in place early on, the library can improve underneath it to the
point it becomes more trustworthy later, but still be usable on older
versions of PHP (linked against newer libicu).

How would it becomes more trustworthy? A way to make it trustworthy
would need to exist. And somebody would have to work on it.

Tom