From: php@fleshgrinder.com (Fleshgrinder)
To: internals@lists.php.net, bishop@php.net, Stanislav Malyshev
Date: Mon, 11 Apr 2016 18:59:12 +0200
Subject: Re: [PHP-DEV] IntlCharsetDetector

On 4/11/2016 6:36 PM, Stanislav Malyshev wrote:
> Hi!
>
>>> As you say, it doesn't work properly. As a matter of fact, guessing
>>> charsets, like timezones, is not possible. You need to know which
>>> charset something is in. If not, you need to address *that* problem.
>
> It is true that you can not detect charsets with 100% accuracy. It is,
> however, also true that many charsets can be distinguished with enough
> accuracy to make it useful, especially if you know the set of charsets
> you are dealing with. E.g., Russian had about 5 commonly used encodings
> before everybody started to use UTF-8, and several exotic ones. Being
> able to detect at least the major ones while dealing with a
> heterogeneous library of Russian-language texts is a great help. There
> may be other cases like this.
>
> The point is even imperfect detection may be useful in certain
> circumstances, and detector being part of ICU hints that people find it
> useful enough to spend time implementing and supporting it. We should
> not ignore that.
>

I completely agree with Stanislav here. Sebastian Bergmann has a quirky
userland detection in his own library, and I am sure there are millions
of others who have one too. Providing one quirky implementation in core
at least allows us to improve it over time, and userland benefits at the
same time (although I doubt that this kind of detection can be improved
to the point where it really works).

On 4/11/2016 4:51 PM, Bishop Bettini wrote:
> What about forcing the consumer to stipulate minimal acceptable confidence?
> The API would internally filter any matches with confidence strictly lower
> than the given value. Along the lines of:
>
> ucsdet_detect(IntlCharsetDetector $det, int $minimum_confidence): array
> ucsdet_detect_all(IntlCharsetDetector $det, int $minimum_confidence): array
>
> So the relatively reliable UTF-8 test could be written:
>
> if ('UTF-8' === $detector->detect(100)) {
>     // ...
> }
>
> This exposes the heuristics available in ICU and leaves the API flexible,
> while forcing the consumer to consider the fact that this is statistical
> reasoning, not decision.
>

This is actually not a bad way to create awareness. It is at least
better than only documenting it, since documentation is probably read
(and understood) only by good devs.

-- 
Richard "Fleshgrinder" Fussenegger
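To make the "minimum confidence" idea a bit more concrete, here is a rough userland sketch of the filtering step Bishop describes. It is only an illustration under assumptions: filter_by_confidence() and the shape of the match arrays are hypothetical names of mine, the candidate values are made up, and nothing here is the proposed IntlCharsetDetector API or actual ICU output. The only thing taken from ICU is that confidence is reported on a 0 to 100 scale.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of ext/intl): keep only the matches whose
 * confidence is at least $minimumConfidence, best match first.
 *
 * @param array<int, array{charset: string, confidence: int}> $matches
 * @return array<int, array{charset: string, confidence: int}>
 */
function filter_by_confidence(array $matches, int $minimumConfidence): array
{
    if ($minimumConfidence < 0 || $minimumConfidence > 100) {
        throw new InvalidArgumentException('Confidence must be between 0 and 100.');
    }

    // Drop every match whose confidence is strictly lower than the threshold.
    $kept = array_values(array_filter(
        $matches,
        static fn (array $match): bool => $match['confidence'] >= $minimumConfidence
    ));

    // Sort the survivors so the most confident guess comes first.
    usort(
        $kept,
        static fn (array $a, array $b): int => $b['confidence'] <=> $a['confidence']
    );

    return $kept;
}

// Made-up candidates standing in for whatever a detector would return.
$candidates = [
    ['charset' => 'windows-1251', 'confidence' => 62],
    ['charset' => 'KOI8-R',       'confidence' => 35],
    ['charset' => 'UTF-8',        'confidence' => 100],
];

$strict = filter_by_confidence($candidates, 100); // only certain matches
$loose  = filter_by_confidence($candidates, 50);  // accept plausible guesses
```

The point of the mandatory second argument is the same as in the proposal: the caller has to pick a threshold explicitly and cannot pretend that the result is anything more than a statistical guess.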