Newsgroups: php.internals
From: pollita@php.net (Sara Golemon)
To: PHP internals
Date: Mon, 4 Apr 2016 18:03:36 -0700
Subject: IntlCharsetDetector

The subject of character set detection (yes, I know, a hard problem to solve) came up on SO chat, and Niki noticed that we don't yet wrap the ICU UCharsetDetector API, so I volunteered to put something together.

https://github.com/php/php-src/compare/master...sgolemon:intl.charsetdetector

The trouble is that, for the vast majority of my test cases so far, ICU is really bad at detecting character sets correctly (as I said, it's a tough problem). In fact, the ICU manual admits that it doesn't even look at all of the corpus text, and that the "language detection" is a byproduct not meant for actual language detection.

Given all that, I'm inclined to reject the idea of rolling this into PHP, for fear of just confusing users without actually adding any value.

Thoughts?

-Sara