Newsgroups: php.internals
To: internals@lists.php.net
Date: Sun, 21 Mar 2021 18:13:11 +0000
Message-ID: <5f5fd136-e181-d5d3-fe40-1a4cc5c668f2@gmail.com>
In-Reply-To: <3a4d89fc-c5f8-4720-b2e0-f6f3c28684f9@www.fastmail.com>
Subject: Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?
From: rowan.collins@gmail.com (Rowan Tommins)

On 21/03/2021 16:51, Larry Garfield wrote:
> As Rowan notes, what people actually *want* most of the time is "I got this string from a user and have NFI what its encoding is, but my system needs UTF-8, so gimmie this string in UTF-8." And they use utf8_encode(), which then fails *sometimes* in exciting and mysterious ways, because that's not what it is.
>
> [...]
>
> If we're removing a bad answer to the problem, we should also replace it with a good answer.

This is indeed my main concern with complete deprecation. The problem is that detecting string encoding is a Really Hard Problem™.

The fundamental problem is that any sequence of bytes is valid in any single-byte encoding. If you're expecting printable characters only, you can rule out some candidates if you're lucky - e.g. if your string contains a byte in the range 0x80 to 0x9F, it's not any part of ISO 8859 - but the string "\xB0\xC0\xD0" is both valid and printable in any of dozens of 8-bit encodings.

I recently came across a Python library implementing a clever approach to the problem, which originated at Mozilla. Its concise FAQ is worth reading: https://chardet.readthedocs.io/en/latest/faq.html

The approach Mozilla came up with is to decide which encoding leads to something most likely to be natural human text - e.g. don't suggest an encoding common for Cyrillic if the result would be completely unpronounceable in Russian.

The only function I know of which even attempts encoding detection in PHP is mb_detect_encoding, and it does a pretty bad job. For instance:

    echo mb_detect_encoding("\x80500", ['Windows-1252', 'ISO-8859-15', 'ISO-8859-1']);

...picks ISO-8859-15, where 0x80 is a rarely-used control character, rather than Windows-1252, where it's the Euro symbol.
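
To make the "rule out some candidates" point concrete, here is a minimal sketch of the 0x80 to 0x9F check mentioned earlier. The name hasC1RangeByte is hypothetical, not a PHP built-in, and the check assumes the text is meant to contain printable characters only. It can only narrow the field; it says nothing about which encoding the string actually is.

```php
<?php
// Hypothetical helper (not part of PHP): reports whether $bytes contains any
// byte in 0x80-0x9F, a range that ISO 8859 does not assign to printable
// characters. If printable text is expected, a hit makes ISO 8859 unlikely.
function hasC1RangeByte(string $bytes): bool
{
    // No /u modifier, so PCRE compares individual bytes, not code points.
    return preg_match('/[\x80-\x9F]/', $bytes) === 1;
}

// 0x80 is present, so an ISO 8859 encoding is an unlikely candidate.
var_dump(hasC1RangeByte("\x80500"));      // bool(true)

// No byte in the range, so nothing is ruled out.
var_dump(hasC1RangeByte("\xB0\xC0\xD0")); // bool(false)
```

A false result is the common and unhelpful case: the string could still be in any of the dozens of 8-bit encodings where those bytes are printable.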

On the other hand, if you know what encoding you do have, either of the following will work fine:

    echo mb_convert_encoding("\x80500", 'UTF-8', 'Windows-1252');
    echo iconv('Windows-1252', 'UTF-8', "\x80500");

Either of these functions, passed ISO-8859-1 as the source encoding (for utf8_encode) or as the target encoding (for utf8_decode), can be used as a polyfill for correct uses of utf8_encode/utf8_decode, but neither is going to do the magic trick which people always *hope* those functions will.

Regards,

-- 
Rowan Tommins
[IMSoP]