Newsgroups: php.internals
Date: Tue, 26 Apr 2016 09:10:05 -0700
To: Yasuo Ohgaki
Cc: PHP internals
Subject: Re: [PHP-DEV] [RFC] IntlCharsetDetector
From: pollita@php.net (Sara Golemon)

On Tue, Apr 26, 2016 at 2:06 AM, Yasuo Ohgaki wrote:
> Things might have been changed, but as you've mentioned encoding
> detection is unstable and ICU is poor compared to mbstring's detection
> at least for Japanese encodings.
>
For me, the difference is that I expect further work to be done on
improving ICU, while I lack that confidence for mbstring. If the API
is in place early on, the library can improve underneath it to the
point it becomes more trustworthy later, but still be usable on older
versions of PHP (linked against newer libicu). Maybe, I dunno. I
lack the motivation to push this feature forward atm, merely because
it's not trustworthy now.

> Developers should not rely on encoding detector, but they should validate
> encoding.
>
I think everyone agrees on that. :)

> Problem is there are cases that developers cannot determine used encoding...
> If we are going to have this API, it would be better to validate string with
> detected encoding by default and disable encoding validation optionally.
> There are cases that developers have to deal with broken string data
> on occasion.
>
What do you have in mind? Full-on pre-request input filtering?
'cause that's never worked right (we tried really hard to make PHP6 do
that and it failed badly)

Or do you mean something like wrapping the ucsdet API in a coercer
function that only returned the original string if it detected at high
confidence and then validated against that detection? 'cause honestly,
that should also be left to the application IMO.

-Sara
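
For illustration only, the "coercer" idea mentioned in the message (return the original string only when detection is high-confidence and the bytes then validate against the detected encoding) might look roughly like the sketch below. It is written in Python rather than PHP and does not use ICU. The names `coerce_if_confident`, `toy_detector`, and the threshold of 90 are assumptions made up for this example, and the toy detector is only a stand-in for a real detector such as ICU's ucsdet, which reports confidence on a 0-100 scale.

```python
from typing import Callable, Optional, Tuple

# Hypothetical detector signature: bytes -> (encoding name, confidence 0-100).
Detector = Callable[[bytes], Tuple[str, int]]


def coerce_if_confident(
    data: bytes, detect: Detector, min_confidence: int = 90
) -> Optional[bytes]:
    """Return the original bytes only if detection is confident AND they validate."""
    encoding, confidence = detect(data)
    if confidence < min_confidence:
        return None  # the detector is guessing; leave the decision to the caller
    try:
        # Strict decoding doubles as validation against the detected encoding.
        data.decode(encoding, errors="strict")
    except (UnicodeDecodeError, LookupError):
        return None  # invalid in the detected encoding, or encoding unknown here
    return data


def toy_detector(data: bytes) -> Tuple[str, int]:
    """Stand-in detector for the example; this is NOT how ICU detects charsets."""
    try:
        data.decode("utf-8")
        return ("utf-8", 100)
    except UnicodeDecodeError:
        return ("iso-8859-1", 30)  # low-confidence fallback guess
```

A caller would treat `None` as "could not establish the encoding" and apply its own policy, which matches the point in the message that this decision belongs to the application.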