Newsgroups: php.internals
From: rowan.collins@gmail.com (Rowan Tommins)
To: internals@lists.php.net
Subject: Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?
Date: Mon, 22 Mar 2021 15:41:27 +0000
Message-ID: <16ecfc31-33aa-4223-fb67-b5a4b5895f05@gmail.com>
In-Reply-To: <29d5329c-bea2-7944-4820-515d4a10ae86@alec.pl>
References: <693767b5-a25b-b4d9-f535-6b985bf26d67@gmail.com> <29d5329c-bea2-7944-4820-515d4a10ae86@alec.pl>

On 22/03/2021 15:04, Aleksander Machniak wrote:
> I'm using utf8_encode()/utf8_decode() to make input string safe to be
> stored in DB, and back.
> In most cases the input is utf-8, but it
> occasionally may contain "broken characters".

That is not what this function does, at all. The fact that its name makes you think that is exactly why I want to get rid of that name.

> $str = "グーグル谷歌中信фδοκιμήóźdźрöß😁😃";
>
> $this->assertSame($str, utf8_decode(utf8_encode($str)));

Let's write that out with a more descriptive function name:

$str = "グーグル谷歌中信фδοκιμήóźdźрöß😁😃";
$this->assertSame($str, utf8_to_latin1(latin1_to_utf8($str)));

Since Latin-1 does not contain any Chinese, Japanese, or emoji characters, running latin1_to_utf8 on that string is clearly nonsensical. The only reason it doesn't give you any errors is that every possible byte is a valid character in Latin-1, and every Latin-1 character has a Unicode code point.

So the "グ" is interpreted as three Latin-1 characters: E3, 82, and B0; those then become the corresponding Unicode code points U+00E3, U+0082, and U+00B0, represented in UTF-8. You then run utf8_to_latin1, and they get converted back.

That code will never do anything useful: the round trip returns the bytes it was given, so a string containing "broken characters" comes back just as broken.

Regards,

-- 
Rowan Tommins
[IMSoP]
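
To make the byte-level reasoning above concrete, here is a minimal sketch of the round trip. It is an illustration under assumptions, not code from the thread: it assumes the mbstring extension is loaded and uses mb_convert_encoding() between 'ISO-8859-1' and 'UTF-8' as a stand-in for the descriptive names latin1_to_utf8 / utf8_to_latin1, which are hypothetical. The sample string with a stray \xFF byte is also made up for the example.

```php
<?php
// Sketch only: assumes ext/mbstring. mb_convert_encoding() stands in for the
// hypothetical latin1_to_utf8() / utf8_to_latin1() names used in the message.

$str = "グ"; // U+30B0, stored as the three UTF-8 bytes E3 82 B0

// "Encode": every input byte is read as one Latin-1 character and
// re-expressed as the UTF-8 form of that code point.
$encoded = mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');

// "Decode": those code points are mapped back to single Latin-1 bytes.
$decoded = mb_convert_encoding($encoded, 'ISO-8859-1', 'UTF-8');

echo bin2hex($str), "\n";     // e382b0
echo bin2hex($encoded), "\n"; // c3a3c282c2b0 (U+00E3, U+0082, U+00B0)
echo bin2hex($decoded), "\n"; // e382b0

// Illustrative input containing a byte that is never valid in UTF-8.
$broken = "abc\xFF";
$roundTrip = mb_convert_encoding(
    mb_convert_encoding($broken, 'UTF-8', 'ISO-8859-1'),
    'ISO-8859-1',
    'UTF-8'
);

var_dump($roundTrip === $broken);                // bool(true): nothing was repaired
var_dump(mb_check_encoding($roundTrip, 'UTF-8')); // bool(false): still invalid UTF-8
```

If the goal is to detect or reject invalid UTF-8 before storing it, a validity check such as mb_check_encoding() addresses that directly; the encode/decode pair only undoes itself.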