Newsgroups: php.internals
Date: Wed, 27 Apr 2016 13:11:08 +0900
To: Sara Golemon
Cc: PHP internals
Subject: Re: [PHP-DEV] [RFC] IntlCharsetDetector
From: yohgaki@ohgaki.net (Yasuo Ohgaki)

Hi Sara,

On Wed, Apr 27, 2016 at 1:10 AM, Sara Golemon wrote:
> On Tue, Apr 26, 2016 at 2:06 AM, Yasuo Ohgaki wrote:
>> Things might have been changed, but as you've mentioned encoding
>> detection is unstable and ICU is poor compared to mbstring's detection
>> at least for Japanese encodings.
>>
> For me, the difference is that I expect further work to be done on
> improving ICU, while I lack that confidence for mbstring.  If the API
> is in place early on, the library can improve underneath it to the
> point it becomes more trustworthy later, but still be usable on older
> versions of PHP (linked against newer libicu).
>
> Maybe, I dunno.  I lack the motivation to push this feature forward
> atm, merely because it's not trust-worthy now.

I can understand this.

>
>> Developers should not rely on encoding detector, but they should validate
>> encoding.
>>
> I think everyone agrees on that. :)
>
>> Problem is there are cases that developers cannot determine used encoding...
>> If we are going to have this API, it would be better to validate string with
>> detected encoding by default and disable encoding validation optionally.
>> There are cases that developers have to deal with broken string data
>> on occasion.
>>
> What do you have in mind?  Full-on pre-request input filtering?
> 'cause that's never worked right (we tried really hard to make PHP6 do
> that and it failed badly)

I'm not proposing that.

>
> Or do you mean something like wrapping the ucsdet API in a coercer
> function that only returned the original string if it detected at high
> confidence and then validated against that detection?  'cause
> honestly, that should also be left to the application IMO.

I don't have a problem with this approach. Developers must be responsible
for this.

For normal web apps, developers must validate that the encoding is the
expected one. In most cases they do not have to guess the encoding.

Developers may need to detect the encoding of uploaded text files, for
example. If they use encoding detection, they should then validate the
text data against the detected encoding.

Experienced developers will detect the encoding and then validate the text
data against it before saving uploaded text files. However, many developers
will detect the encoding and simply assume that the file's character
encoding is valid. This is the reason why I suggest:

 - detect the encoding (usually done using only the beginning of the data)
 - then validate the text data against the detected encoding
 - if validation is OK, return the encoding name; otherwise return an error

This would reduce the chance of storing invalid text data in the system.
It's not strictly required, but I think it is more developer friendly.
It's just a suggestion.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net
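
The detect-then-validate suggestion above can be sketched as follows. This is
only an illustration in Python, not the proposed IntlCharsetDetector API: the
function names, the candidate list, and the 4096-byte prefix are all
assumptions, and a naive "first candidate that decodes" check stands in for a
real detector such as ICU's.

```python
import codecs
from typing import Optional, Sequence

# Assumption: the detector only looks at the beginning of the data.
PREFIX_BYTES = 4096


def detect_encoding(sample: bytes, candidates: Sequence[str]) -> Optional[str]:
    """Naive stand-in detector: return the first candidate that decodes the sample."""
    for name in candidates:
        decoder = codecs.getincrementaldecoder(name)(errors="strict")
        try:
            # final=False tolerates a multibyte character cut off at the
            # end of the sample.
            decoder.decode(sample, final=False)
        except UnicodeDecodeError:
            continue
        return name
    return None


def detect_and_validate(
    data: bytes,
    candidates: Sequence[str] = ("utf-8", "shift_jis", "euc_jp"),
    validate: bool = True,
    prefix_bytes: int = PREFIX_BYTES,
) -> Optional[str]:
    """Detect on a prefix, then (by default) validate the whole input.

    Returns the encoding name, or None when detection or validation fails.
    """
    name = detect_encoding(data[:prefix_bytes], candidates)
    if name is None:
        return None
    if validate:
        try:
            data.decode(name)  # strict decode of the full data
        except UnicodeDecodeError:
            return None
    return name
```

With validation on (the default), data whose beginning looks fine but which is
broken further in is rejected instead of being reported under the detected
encoding. Passing validate=False gives the detection-only behaviour for the
cases where broken data has to be handled on purpose.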