Newsgroups: php.internals
Xref: news.php.net php.internals:92826
From: smalyshev@gmail.com (Stanislav Malyshev)
To: Sara Golemon, Yasuo Ohgaki
Cc: PHP internals
Subject: Re: [PHP-DEV] [RFC] IntlCharsetDetector
Date: Tue, 26 Apr 2016 15:51:10 -0700
Message-ID: <571FF0DE.9060600@gmail.com>

Hi!

> For me, the difference is that I expect further work to be done on
> improving ICU, while I lack that confidence for mbstring. If the API

My experience over the years has been that established, supported
libraries like ICU usually have a better track record of improvement
and maintenance than more niche libraries, but it differs a lot from
case to case. I have no idea, though, how good or bad ICU is at
detecting Asian languages and encodings.

>> Developers should not rely on encoding detector, but they should validate
>> encoding.
>>
> I think everyone agrees on that. :)

True, but also incomplete. There's the ideal case, and there's the real
world.
In the ideal case, you know the encodings of everything, and everything
is nicely specified and shiny, and rainbows and unicorns abound. Real
data, though, is messy and unpredictable and comes from places and
practices that make one shudder. And when it comes to that, we can
either give developers at least something - an imperfect encoding
detector, with all its caveats - or ignore the problem and give them
nothing because it does not match our theories, leaving them to
implement even worse hacks. I think the former is a much better
approach.

And of course, detection and validation are different things. A text
may look like a valid string in encoding A but actually be in encoding
B. "Tell me if this data looks like Russian text in KOI-8 or Japanese
text in Shift-JIS" and "tell me if this is valid or invalid UTF-8" are
two completely different tasks.

--
Stas Malyshev
smalyshev@gmail.com
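
To make the detection/validation distinction above concrete, here is a
minimal sketch in Python. It is only an illustration and not the API
proposed in the RFC; the helper name is_valid_utf8 and the sample string
are assumptions chosen for the example. Validation has a definite yes/no
answer, while the same bytes decode without error under several
single-byte encodings, so picking the intended one takes a guess about
the language.

```python
# Illustrative sketch only: Python stand-ins, not the PHP API under discussion.

def is_valid_utf8(data: bytes) -> bool:
    """Validation: does this byte string obey the UTF-8 rules? Yes or no."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# "Привет" ("hello" in Russian), encoded as Windows-1251 (hypothetical input).
sample = "Привет".encode("cp1251")

# Validation gives a clear answer: these bytes are not UTF-8.
print("valid UTF-8:", is_valid_utf8(sample))

# Detection is a different question. Each of these single-byte encodings
# accepts the bytes without error, yet each yields different text.
candidates = ["cp1251", "koi8_r", "latin_1"]
readings = {enc: sample.decode(enc) for enc in candidates}

for enc, text in readings.items():
    print(f"{enc:8} -> {text}")
```

Only one of the three readings is the intended Russian word, and nothing
in the byte-level rules of these encodings says which. That choice is
what a detector has to estimate from language statistics, which is why
it can only ever be a best guess.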