Newsgroups: php.internals
Subject: Re: [PHP-DEV] IntlCharsetDetector
From: derick@php.net (Derick Rethans)
Date: Tue, 5 Apr 2016 10:05:53 +0100 (BST)
To: Sara Golemon
Cc: PHP internals

On Mon, 4 Apr 2016, Sara Golemon wrote:

> The subject of character set detection (yes, I know, a hard problem to
> solve) came up on SO chat, and Niki noticed that we don't yet wrap the
> ICU UCharsetDetector API so I volunteered to put something together.
>
> https://github.com/php/php-src/compare/master...sgolemon:intl.charsetdetector
>
> The trouble is, for the WIDE majority of my test cases so far, ICU is
> really bad at detecting character sets correctly (as I said, it's a
> tough problem).
> In fact, the ICU manual admits that it doesn't even
> look at all of the corpus text, and the "language detection" is a
> byproduct not meant for actual language detection.
>
> Given all that, I'm inclined to reject the idea of rolling this into
> PHP for fear of just confusing users without actually adding any
> value.
>
> Thoughts?

I would advise against adding this. As you say, it doesn't work
properly. As a matter of fact, guessing charsets, like timezones, is not
possible. You need to know which charset something is in. If not, you
need to address *that* problem.

cheers,
Derick
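
Editorial illustration of the point made in the reply above, not part of the original message or of the proposed patch: a minimal sketch, in Python purely for brevity, of why the bytes alone cannot identify a charset. The sample bytes, the candidate list, and the helper `decode_all` are all assumptions chosen for the example; it does not use ICU or the UCharsetDetector API.

```python
# The same three bytes are valid text in several single-byte charsets,
# so nothing in the bytes themselves says which charset the author used.
data = b"\xe9\xf2\xe0"

# Hypothetical candidate list; any legacy 8-bit encodings would do.
candidates = ["latin-1", "cp1251", "koi8_r"]


def decode_all(raw, encodings):
    """Return {encoding: decoded text} for every encoding that accepts raw."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            # This encoding rejects the bytes, so it is ruled out.
            pass
    return results


decoded = decode_all(data, candidates)
for name, text in decoded.items():
    print(f"{name:10} -> {text!r}")
```

Every candidate decodes without error and each yields different text, so a detector can only rank guesses statistically. The reliable fix is to carry the charset alongside the data, as the reply argues.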