Newsgroups: php.internals
From: bishop@php.net (Bishop Bettini)
Reply-To: bishop@php.net
Date: Mon, 11 Apr 2016 10:51:12 -0400
To: Sara Golemon
Cc: Sebastian Bergmann, PHP internals
Subject: Re: [PHP-DEV] IntlCharsetDetector
References: <57050CAB.1040302@php.net>

On Fri, Apr 8, 2016 at 2:20 PM, Sara Golemon wrote:

> On Thu, Apr 7, 2016 at 9:36 AM, Bishop Bettini wrote:
> > The problem is, developers are going to write code to guess character
> > sets.
>
> True.  But they're going to put more faith in something in the
> standard distribution, assuming it's passed muster.
>
> > Ironically, PHPUnit attempts to detect UTF-8
>
> Awwwwwwwwkward....
>
> > I'd rather we include the patch for a few reasons:
> >
> > 1. so that there's a modern "standard" method of doing so, and that
> > "standard" method has plenty of documentation that points people to the
> > limitations.
>
> In that spirit, how about we put in some stub documentation under the
> intl extension with a paragraph or two on why UCharsetDetector *isn't*
> wrapped, and why it's such a bad idea to try to solve the problem from
> this end.
>
> > 2. to completely expose the underlying ICU, rather than arbitrarily
> > deciding one part isn't good for developers to use.
>
> Is it arbitrary though?  The fact that coming up with test cases which
> produce reasonable/expected results is half crap-shoot makes this an
> evidence based decision, not a capricious one.
>
> > 3. to provide an alternative to mb_detect_encoding.
>
> And again in that spirit, I think this is a good argument for going
> E_DEPRECATED on mb_detect_encoding().  The entire conversation which
> led to prototyping an IntlCharsetDetector extension came from the fact
> that mb_detect_encoding() wasn't doing its job well.  Rather than have
> two supported, bad solutions, I think it'd be better to have one
> deprecated (and thus unsupported) bad solution (which is only kept for
> BC).
>
> > While I can't say if this will or won't cause more user confusion, I do
> > believe this adds value: ICU provides a confidence metric, which no other
> > in-built or buildable solution (to my knowledge) provides.
>
> The confidence metric is useful, but my spidey sense tells me that
> it'll simply be ignored.
>
> How about a compromise.  I'll reorder this patch to be a standalone
> extension and we PECLize it.  If someone REALLY wants to throw caution
> to the wind, they can, but they're on their own when it gives them
> fugly results.

What about forcing the consumer to stipulate a minimum acceptable
confidence? The API would internally filter out any matches with confidence
strictly lower than the given value. Along the lines of:

    ucsdet_detect(IntlCharsetDetector $det, int $minimum_confidence): array
    ucsdet_detect_all(IntlCharsetDetector $det, int $minimum_confidence): array

So the relatively reliable UTF-8 test could be written:

    if ('UTF-8' === $detector->detect(100)) {
        // ...
    }

This exposes the heuristics available in ICU and leaves the API flexible,
while forcing the consumer to consider the fact that this is statistical
reasoning, not a decision.
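
A minimal userland sketch of the filtering step proposed above, assuming candidate matches arrive as a plain array of name/confidence pairs on ICU's 0 to 100 confidence scale. IntlCharsetDetector exists only as a proposed patch in this thread, so nothing here calls it; the function name `filter_by_confidence`, the array shape, and the sample values are all hypothetical and chosen only for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of ext/intl): keep only the candidate
 * matches whose confidence is at least the caller-supplied minimum.
 *
 * @param array<int, array{name: string, confidence: int}> $matches
 * @return array<int, array{name: string, confidence: int}>
 */
function filter_by_confidence(array $matches, int $minimumConfidence): array
{
    // Assumption: confidence is an integer from 0 to 100, as in ICU.
    if ($minimumConfidence < 0 || $minimumConfidence > 100) {
        throw new InvalidArgumentException('Minimum confidence must be between 0 and 100.');
    }

    // Drop every match strictly below the requested minimum.
    $kept = array_filter(
        $matches,
        static fn (array $m): bool => $m['confidence'] >= $minimumConfidence
    );

    // Most confident match first; usort() also reindexes from 0.
    usort(
        $kept,
        static fn (array $a, array $b): int => $b['confidence'] <=> $a['confidence']
    );

    return $kept;
}

// Made-up candidates standing in for detector output.
$candidates = [
    ['name' => 'ISO-8859-1',   'confidence' => 32],
    ['name' => 'UTF-8',        'confidence' => 100],
    ['name' => 'windows-1252', 'confidence' => 57],
];

$certain = filter_by_confidence($candidates, 100); // only the UTF-8 entry survives
$likely  = filter_by_confidence($candidates, 50);  // UTF-8, then windows-1252
```

Because the threshold is a required argument with no default, the caller cannot ask for a guess without first deciding how much uncertainty is acceptable, which is the point of the proposal.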