Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:92194
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Received-SPF: pass (pb1.pair.com: domain gmail.com designates 209.85.218.43 as permitted sender)
MIME-Version: 1.0
Reply-To: bishop@php.net
Sender: bishop.bettini@gmail.com
In-Reply-To: <CAESVnVrguq6k-9Tc7L7W=8RxFCg1QAEhDdxGqxF2M-i0FLjsLQ@mail.gmail.com>
References: <CAESVnVouJP7Cad-AgoT4_nGSDXE6sejE9tGY5EOJitaH7JKS8A@mail.gmail.com>
 <alpine.DEB.2.20.1604051004480.4094@whisky.home.derickrethans.nl>
 <57050CAB.1040302@php.net> <CAEYWF=6ACZKxJ5=T98d7q+7pBT-FjEevjDdP3OsKKZOLFLypGQ@mail.gmail.com>
 <CAESVnVrguq6k-9Tc7L7W=8RxFCg1QAEhDdxGqxF2M-i0FLjsLQ@mail.gmail.com>
Date: Mon, 11 Apr 2016 10:51:12 -0400
Message-ID: <CAEYWF=6gyh1z6_UswHSW5GKyjPRDwGcDVeitO6suNkcpcmB7VA@mail.gmail.com>
To: Sara Golemon <pollita@php.net>
Cc: Sebastian Bergmann <sebastian@php.net>, PHP internals <internals@lists.php.net>
Content-Type: multipart/alternative; boundary=001a114630ecc3eff2053036aed2
Subject: Re: [PHP-DEV] IntlCharsetDetector
From: bishop@php.net (Bishop Bettini)

--001a114630ecc3eff2053036aed2
Content-Type: text/plain; charset=UTF-8

On Fri, Apr 8, 2016 at 2:20 PM, Sara Golemon <pollita@php.net> wrote:

> On Thu, Apr 7, 2016 at 9:36 AM, Bishop Bettini <bishop@php.net> wrote:
> > The problem is, developers are going to write code to guess character
> > sets.
> >
> True.  But they're going to put more faith in something in the
> standard distribution, assuming it's passed muster.
>
> > Ironically, PHPUnit attempts to detect UTF-8
> >
> Awwwwwwwwkward....
>
> > I'd rather we include the patch for a few reasons:
> >
> > 1. so that there's a modern "standard" method of doing so, and that
> > "standard" method has plenty of documentation that points people to the
> > limitations.
> >
> In that spirit, how about we put in some stub documentation under the
> intl extension with a paragraph or two on why UCharsetDetector *isn't*
> wrapped, and why it's such a bad idea to try to solve the problem from
> this end.
>
> > 2. to completely expose the underlying ICU, rather than arbitrarily
> > deciding one part isn't good for developers to use.
> >
> Is it arbitrary though?  The fact that coming up with test cases which
> produce reasonable/expected results is half crap-shoot makes this an
> evidence based decision, not a capricious one.
>
> > 3. to provide an alternative to mb_detect_encoding.
> >
> And again in that spirit, I think this is a good argument for going
> E_DEPRECATED on mb_detect_encoding().  The entire conversation which
> led to prototyping an IntlCharsetDetector extension came from the fact
> that mb_detect_encoding() wasn't doing its job well.  Rather than have
> two supported, bad solutions, I think it'd be better to have one
> deprecated (and thus unsupported) bad solution (which is only kept for
> BC).
>
> > While I can't say if this will or won't cause more user confusion, I do
> > believe this adds value: ICU provides a confidence metric, which no other
> > in-built or buildable solution (to my knowledge) provides.
> >
> The confidence metric is useful, but my spidey sense tells me that
> it'll simply be ignored.
>
> How about a compromise.  I'll reorder this patch to be a standalone
> extension and we PECLize it.  If someone REALLY wants to throw caution
> to the wind, they can, but they're on their own when it gives them
> fugly results.


What about forcing the consumer to stipulate a minimum acceptable confidence?
The API would internally discard any match whose confidence is strictly lower
than the given value. Something along the lines of:

ucsdet_detect(IntlCharsetDetector $det, int $minimum_confidence): array
ucsdet_detect_all(IntlCharsetDetector $det, int $minimum_confidence): array

So the relatively reliable UTF-8 test
<https://tools.ietf.org/html/rfc3629#section-4> could be written:

if ('UTF-8' === $detector->detect(100)) {
    // ...
}

This exposes the heuristics available in ICU and leaves the API flexible,
while forcing the consumer to acknowledge that the result is a statistical
inference, not a definitive answer.
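
To make the filtering idea concrete, here is a rough userland sketch of what
"drop anything strictly below the threshold" could look like. Everything in it
is hypothetical: filter_by_confidence() and the candidate array shape are made
up for illustration, the sample confidences are invented, and it assumes
confidence is an integer from 0 to 100 as ICU reports it. It is not the
patch's API.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: keep only the candidates whose confidence is at
 * least $minimumConfidence, ordered from most to least confident.
 *
 * $candidates is an assumed shape, e.g.
 *   [['charset' => 'UTF-8', 'confidence' => 100], ...]
 */
function filter_by_confidence(array $candidates, int $minimumConfidence): array
{
    if ($minimumConfidence < 0 || $minimumConfidence > 100) {
        throw new InvalidArgumentException('confidence must be within 0..100');
    }

    // Discard matches strictly below the caller's threshold.
    $kept = array_filter(
        $candidates,
        static fn(array $c): bool => $c['confidence'] >= $minimumConfidence
    );

    // Most confident first; usort() also reindexes the array from 0.
    usort(
        $kept,
        static fn(array $a, array $b): int => $b['confidence'] <=> $a['confidence']
    );

    return $kept;
}

// Invented sample data standing in for a detector's raw output.
$candidates = [
    ['charset' => 'ISO-8859-1',   'confidence' => 35],
    ['charset' => 'UTF-8',        'confidence' => 100],
    ['charset' => 'windows-1252', 'confidence' => 60],
];

$matches = filter_by_confidence($candidates, 80);
// Only the UTF-8 candidate clears a threshold of 80.
```

The point of making the threshold a required argument is that the caller has
to pick a number, and so cannot pretend the answer is anything but a guess.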

--001a114630ecc3eff2053036aed2--