IntlCharsetDetector

9 years ago by Sara Golemon

The subject of character set detection (yes, I know, a hard problem to
solve) came up on SO chat, and Niki noticed that we don't yet wrap the
ICU UCharsetDetector API so I volunteered to put something together.

https://github.com/php/php-src/compare/master...sgolemon:intl.charsetdetector

The trouble is, for the WIDE majority of my test cases so far, ICU is
really bad at detecting character sets correctly (as I said, it's a
tough problem). In fact, the ICU manual admits that it doesn't even
look at all of the corpus text, and the "language detection" is a
byproduct not meant for actual language detection.

Given all that, I'm inclined to reject the idea of rolling this into
PHP for fear of just confusing users without actually adding any
value.

Thoughts?

-Sara

9 years ago by Derick Rethans

> The subject of character set detection (yes, I know, a hard problem to
> solve) came up on SO chat, and Niki noticed that we don't yet wrap the
> ICU UCharsetDetector API so I volunteered to put something together.

> https://github.com/php/php-src/compare/master...sgolemon:intl.charsetdetector

> The trouble is, for the WIDE majority of my test cases so far, ICU is
> really bad at detecting character sets correctly (as I said, it's a
> tough problem). In fact, the ICU manual admits that it doesn't even
> look at all of the corpus text, and the "language detection" is a
> byproduct not meant for actual language detection.

> Given all that, I'm inclined to reject the idea of rolling this into
> PHP for fear of just confusing users without actually adding any
> value.

> Thoughts?

I would advise against adding this.

As you say, it doesn't work properly. As a matter of fact, guessing
charsets, like timezones, is not possible. You need to know which
charset something is in. If not, you need to address that problem.

cheers,
Derick

9 years ago by Sebastian Bergmann

On 05.04.2016 at 11:05, Derick Rethans wrote:

> I would advise against adding this.

> As you say, it doesn't work properly. As a matter of fact, guessing
> charsets, like timezones, is not possible. You need to know which
> charset something is in. If not, you need to address that problem.

Agreed.

9 years ago by Bishop Bettini

On Wed, Apr 6, 2016 at 9:18 AM, Sebastian Bergmann <sebastian@php.net> wrote:

> On 05.04.2016 at 11:05, Derick Rethans wrote:

>> I would advise against adding this.

>> As you say, it doesn't work properly. As a matter of fact, guessing
>> charsets, like timezones, is not possible. You need to know which
>> charset something is in. If not, you need to address that problem.

> Agreed.

The problem is, developers are going to write code to guess character sets.

Ironically, PHPUnit attempts to detect UTF-8
https://github.com/sebastianbergmann/phpunit/blob/master/src/Util/String.php#L38-L70.
There is also no shortage of SO posts explaining other approaches. My
favorite is using a preg_match trick
http://stackoverflow.com/a/4407996/2908724.
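
For reference, a commonly cited form of a preg_match-based check relies on PCRE's /u modifier, which makes the match fail when the subject is not well-formed UTF-8. The sketch below is a minimal illustration only: the function name is_valid_utf8 is hypothetical, and whether it matches the linked answer exactly is an assumption. It validates rather than detects, so plain ASCII passes as well.

```php
<?php
// Hypothetical helper: reports whether $bytes is well-formed UTF-8.
// With the /u modifier, preg_match() returns false (not 0) when the
// subject is not valid UTF-8, so an empty pattern is enough.
function is_valid_utf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

var_dump(is_valid_utf8("caf\xC3\xA9")); // "café" in UTF-8: bool(true)
var_dump(is_valid_utf8("caf\xE9"));     // "café" in cp1252: bool(false)
```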

I'd rather we include the patch for a few reasons:

1. so that there's a modern "standard" method of doing so, and that
   "standard" method has plenty of documentation that points people to the
   limitations.
2. to completely expose the underlying ICU, rather than arbitrarily
   deciding one part isn't good for developers to use.
3. to provide an alternative to mb_detect_encoding.

While I can't say if this will or won't cause more user confusion, I do
believe this adds value: ICU provides a confidence metric, which no other
in-built or buildable solution (to my knowledge) provides.

9 years ago by Sara Golemon

> The problem is, developers are going to write code to guess character sets.

True. But they're going to put more faith in something in the
standard distribution, assuming it's passed muster.

> Ironically, PHPUnit attempts to detect UTF-8

Awwwwwwwwkward....

> I'd rather we include the patch for a few reasons:

> 1. so that there's a modern "standard" method of doing so, and that
>    "standard" method has plenty of documentation that points people to the
>    limitations.

In that spirit, how about we put in some stub documentation under the
intl extension with a paragraph or two on why UCharsetDetector isn't
wrapped, and why it's such a bad idea to try to solve the problem from
this end.

> 2. to completely expose the underlying ICU, rather than arbitrarily
>    deciding one part isn't good for developers to use.

Is it arbitrary though? The fact that coming up with test cases which
produce reasonable/expected results is half crap-shoot makes this an
evidence-based decision, not a capricious one.

> 3. to provide an alternative to mb_detect_encoding.

And again in that spirit, I think this is a good argument for going
E_DEPRECATED on mb_detect_encoding(). The entire conversation which
led to prototyping an IntlCharsetDetector extension came from the fact
that mb_detect_encoding() wasn't doing its job well. Rather than have
two supported, bad solutions, I think it'd be better to have one
deprecated (and thus unsupported) bad solution (which is only kept for
BC).

> While I can't say if this will or won't cause more user confusion, I do
> believe this adds value: ICU provides a confidence metric, which no other
> in-built or buildable solution (to my knowledge) provides.

The confidence metric is useful, but my spidey sense tells me that
it'll simply be ignored.

How about a compromise? I'll rework this patch into a standalone
extension and we PECLize it. If someone REALLY wants to throw caution
to the wind, they can, but they're on their own when it gives them
fugly results.

-Sara

9 years ago by Bishop Bettini

>> The problem is, developers are going to write code to guess character
>> sets.

> True. But they're going to put more faith in something in the
> standard distribution, assuming it's passed muster.

>> Ironically, PHPUnit attempts to detect UTF-8

> Awwwwwwwwkward....

>> I'd rather we include the patch for a few reasons:

>> 1. so that there's a modern "standard" method of doing so, and that
>>    "standard" method has plenty of documentation that points people to the
>>    limitations.

> In that spirit, how about we put in some stub documentation under the
> intl extension with a paragraph or two on why UCharsetDetector isn't
> wrapped, and why it's such a bad idea to try to solve the problem from
> this end.

>> 2. to completely expose the underlying ICU, rather than arbitrarily
>>    deciding one part isn't good for developers to use.

> Is it arbitrary though? The fact that coming up with test cases which
> produce reasonable/expected results is half crap-shoot makes this an
> evidence-based decision, not a capricious one.

>> 3. to provide an alternative to mb_detect_encoding.

> And again in that spirit, I think this is a good argument for going
> E_DEPRECATED on mb_detect_encoding(). The entire conversation which
> led to prototyping an IntlCharsetDetector extension came from the fact
> that mb_detect_encoding() wasn't doing its job well. Rather than have
> two supported, bad solutions, I think it'd be better to have one
> deprecated (and thus unsupported) bad solution (which is only kept for
> BC).

>> While I can't say if this will or won't cause more user confusion, I do
>> believe this adds value: ICU provides a confidence metric, which no other
>> in-built or buildable solution (to my knowledge) provides.

> The confidence metric is useful, but my spidey sense tells me that
> it'll simply be ignored.

> How about a compromise? I'll rework this patch into a standalone
> extension and we PECLize it. If someone REALLY wants to throw caution
> to the wind, they can, but they're on their own when it gives them
> fugly results.

What about forcing the consumer to stipulate minimal acceptable confidence?
The API would internally filter any matches with confidence strictly lower
than the given value. Along the lines of:

ucsdet_detect(IntlCharsetDetector $det, int $minimum_confidence): array
ucsdet_detect_all(IntlCharsetDetector $det, int $minimum_confidence): array

So the relatively reliable UTF-8 test
https://tools.ietf.org/html/rfc3629#section-4 could be written:

if ('UTF-8' === $detector->detect(100)) {
    // ...
}

This exposes the heuristics available in ICU and leaves the API flexible,
while forcing the consumer to consider the fact that this is statistical
reasoning, not decision.
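
The filtering idea can be mocked up in userland without the extension. In the sketch below, the candidate list stands in for whatever a detector would return; its shape (a charset name plus an integer confidence from 0 to 100, the range ICU reports) and the function name filter_by_confidence are assumptions for illustration, not part of the patch.

```php
<?php
// Hypothetical candidate shape: ['charset' => string, 'confidence' => int 0..100].
function filter_by_confidence(array $candidates, int $minimumConfidence): array
{
    // Drop matches whose confidence is strictly lower than the threshold.
    $kept = array_filter(
        $candidates,
        fn(array $c): bool => $c['confidence'] >= $minimumConfidence
    );
    // Highest confidence first; usort() also reindexes the array from 0.
    usort($kept, fn(array $a, array $b): int => $b['confidence'] <=> $a['confidence']);
    return $kept;
}

// Made-up numbers, purely for illustration.
$candidates = [
    ['charset' => 'ISO-8859-1', 'confidence' => 25],
    ['charset' => 'UTF-8', 'confidence' => 80],
    ['charset' => 'windows-1252', 'confidence' => 40],
];
$usable = filter_by_confidence($candidates, 50);
```

With a threshold of 50 only the UTF-8 candidate survives; with a threshold of 100 the result is empty, which the caller then has to handle explicitly.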

9 years ago by Stanislav Malyshev

Hi!

> As you say, it doesn't work properly. As a matter of fact, guessing
> charsets, like timezones, is not possible. You need to know which
> charset something is in. If not, you need to address that problem.

It is true that you can not detect charsets with 100% accuracy. It is,
however, also true that many charsets can be distinguished with enough
accuracy to make it useful, especially if you know the set of charsets
you are dealing with. E.g., Russian had about 5 commonly used encodings
before everybody started to use UTF-8, and several exotic ones. Being
able to detect at least the major ones while dealing with a
heterogeneous library of Russian-language texts is a great help. There
may be other cases like this.
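
As a rough illustration of detection within a known set, the sketch below decodes the input under each candidate encoding and keeps the one that yields the most lowercase Cyrillic letters. The scoring rule, the candidate list, and the function name guess_cyrillic_encoding are assumptions made for this example; it is a crude heuristic that needs mbstring, not what ICU does.

```php
<?php
// Crude illustration: pick, from a fixed candidate list, the encoding under
// which the decoded text contains the most lowercase Cyrillic letters.
function guess_cyrillic_encoding(string $bytes, array $candidates): ?string
{
    $best = null;
    $bestScore = 0;
    foreach ($candidates as $encoding) {
        // Skip encodings in which the bytes are not even well-formed.
        if (!mb_check_encoding($bytes, $encoding)) {
            continue;
        }
        $decoded = mb_convert_encoding($bytes, 'UTF-8', $encoding);
        // Count U+0430..U+044F (lowercase a..ya) and U+0451 (yo).
        $score = preg_match_all('/[\x{0430}-\x{044F}\x{0451}]/u', $decoded);
        if ($score > $bestScore) {
            $best = $encoding;
            $bestScore = $score;
        }
    }
    return $best; // null when no candidate produced any Cyrillic letter
}

$candidates = ['UTF-8', 'Windows-1251', 'KOI8-R', 'CP866', 'ISO-8859-5'];
// The Russian for "hello world", encoded as Windows-1251.
$sample = "\xEF\xF0\xE8\xE2\xE5\xF2 \xEC\xE8\xF0";
echo guess_cyrillic_encoding($sample, $candidates), "\n";
```

Ties go to the earliest candidate, and short or mostly uppercase inputs will fool it easily, which is the kind of limitation being discussed here.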

The point is even imperfect detection may be useful in certain
circumstances, and the detector being part of ICU hints that people find
it useful enough to spend time implementing and supporting it. We should
not ignore that.

--
Stas Malyshev
smalyshev@gmail.com

9 years ago by Fleshgrinder

> Hi!

>> As you say, it doesn't work properly. As a matter of fact, guessing
>> charsets, like timezones, is not possible. You need to know which
>> charset something is in. If not, you need to address that problem.

> It is true that you can not detect charsets with 100% accuracy. It is,
> however, also true that many charsets can be distinguished with enough
> accuracy to make it useful, especially if you know the set of charsets
> you are dealing with. E.g., Russian had about 5 commonly used encodings
> before everybody started to use UTF-8, and several exotic ones. Being
> able to detect at least the major ones while dealing with a
> heterogeneous library of Russian-language texts is a great help. There
> may be other cases like this.

> The point is even imperfect detection may be useful in certain
> circumstances, and the detector being part of ICU hints that people find
> it useful enough to spend time implementing and supporting it. We should
> not ignore that.

I have to agree with Stanislav here completely. Sebastian Bergmann has a
quirky userland detection in his own library, and I am sure there are
millions of others who have one too. Providing one quirky implementation
in core at least allows us to improve it over time, and userland
improves at the same time (although I doubt that it is possible to
improve this kind of detection to a point where it really works).

> What about forcing the consumer to stipulate minimal acceptable
> confidence?
> The API would internally filter any matches with confidence strictly lower
> than the given value. Along the lines of:

> ucsdet_detect(IntlCharsetDetector $det, int $minimum_confidence): array
> ucsdet_detect_all(IntlCharsetDetector $det, int $minimum_confidence): array

> So the relatively reliable UTF-8 test
> https://tools.ietf.org/html/rfc3629#section-4 could be written:

> if ('UTF-8' === $detector->detect(100)) {
>     // ...
> }

> This exposes the heuristics available in ICU and leaves the API flexible,
> while forcing the consumer to consider the fact that this is statistical
> reasoning, not decision.

This is actually not such a bad idea for creating awareness. At least
it is better than only documenting it, which probably only good devs
would read (and understand).

--
Richard "Fleshgrinder" Fussenegger

9 years ago by Sara Golemon

> The point is even imperfect detection may be useful in certain
> circumstances, and the detector being part of ICU hints that people find
> it useful enough to spend time implementing and supporting it. We should
> not ignore that.

Well, Stas, your informal thumbs up to the idea means enough to me to
at least formalize it into an RFC even though I was previously feeling
negative on it.

I may yet vote no on my own RFC after the discussion period, but as
you say it's worth considering the fact that someone thought it
reasonable enough to actually build into ICU...

-Sara

9 years ago by Tom Worster

>> The point is even imperfect detection may be useful in certain
>> circumstances, and the detector being part of ICU hints that people find
>> it useful enough to spend time implementing and supporting it. We should
>> not ignore that.

> Well, Stas, your informal thumbs up to the idea means enough to me to
> at least formalize it into an RFC even though I was previously feeling
> negative on it.

> I may yet vote no on my own RFC after the discussion period, but as
> you say it's worth considering the fact that someone thought it
> reasonable enough to actually build into ICU...

The general problem is impossible. If you constrain the question, for
example as Stas says by knowing the language and choosing between a
given set of codes, then you may have success. And I'm sure I'm not
alone in sometimes using a simple heuristic to choose between cp1252 and
utf8.
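
A minimal sketch of such a constrained heuristic, assuming the input is already known to be either UTF-8 or cp1252 and that mbstring is available. The function name is hypothetical, and this is one possible form of the heuristic, not necessarily the one meant above.

```php
<?php
// Hypothetical helper: normalise input that is known to be either UTF-8
// or Windows-1252 (cp1252) to UTF-8.
function cp1252_or_utf8_to_utf8(string $bytes): string
{
    // Well-formed UTF-8 is kept as-is; non-ASCII cp1252 text is rarely
    // valid UTF-8 by accident.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    // Otherwise fall back to the only other encoding we agreed to expect.
    return mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');
}
```

The constraint does the work here: the function answers "which of these two?", not "which encoding is this?".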

But this does not logically imply that ICU CharsetDetector is a suitable
solution in such cases or that it's a good API or a decent
implementation. Or that PHP should expose it. An SO chat doesn't
necessarily count as a feature request.

I'd rather people engineered real solutions specific to their
requirements than resort to any of the failed attempts to solve the
general problem.

Tom

9 years ago by Andrea Faulds

Derick Rethans wrote:

> As you say, it doesn't work properly. As a matter of fact, guessing
> charsets, like timezones, is not possible. You need to know which
> charset something is in. If not, you need to address that problem.

Indeed, 畂桳栠摩琠敨映捡獴!
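
The characters above are what a short ASCII sentence looks like when its bytes are misread as UTF-16LE, a well-known case of misdetection. A small sketch that reproduces the effect, assuming mbstring is available (the variable names are illustrative):

```php
<?php
// The ASCII bytes of a short English sentence, reinterpreted as UTF-16LE
// code units and re-encoded as UTF-8 for display.
$ascii = 'Bush hid the facts';
$misread = mb_convert_encoding($ascii, 'UTF-8', 'UTF-16LE');
echo $misread, "\n";
```

The 18 ASCII bytes pair up into 9 two-byte code units, each of which happens to land in the CJK range.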

--
Andrea Faulds
https://ajf.me/

P.S. Google it.