Newsgroups: php.internals
To: PHP Internals
Date: Sun, 21 Mar 2021 14:18:20 +0000
Subject: What should we do with utf8_encode and utf8_decode?
From: rowan.collins@gmail.com (Rowan Tommins)

Hi all,

The functions utf8_encode and utf8_decode are historical oddities, which almost certainly would not be accepted if proposed today:

* Their names do not describe their functionality, which is to convert to/from one specific single-byte encoding. This leads to a common confusion that they can be used to "fix" UTF-8 encoding problems, which they generally make worse.
* That single-byte encoding is ISO 8859-1, not its common cousins Windows-1252 or ISO 8859-15. This means, for instance, that they do not handle the Euro sign: utf8_decode('€') returns '?' (i.e. unmappable), not "\x80" (Windows-1252) or "\xA4" (8859-15).

On the other hand, they are commonly used, both correctly and incorrectly, so removing them is not easy.

A previous proposal to remove them [1] resulted in Andrea making two significant improvements: moving them from ext/xml to ext/standard [2] and rewriting the documentation to explain them properly [3]. My genuine thanks for that.

However, it hasn't stopped people misunderstanding them, and quite reasonably: you shouldn't need to look up every function you use in the manual to make sure it actually does what its name suggests.

I can see three ways forward:

A) Raise a deprecation notice in 8.1, and remove in 9.0. Do not provide a specific replacement, but recommend people look at iconv() or mb_convert_encoding(). There is precedent for this, such as convert_cyr_string(), but it may frustrate those who are using the functions correctly.

B) Introduce new names, such as utf8_to_iso_8859_1 and iso_8859_1_to_utf8; immediately make those the primary names in the manual, with utf8_encode / utf8_decode as aliases. Raise deprecation notices for the old names, either immediately or in some future release. This gives a smoother upgrade path, but commits us to having these functions as outliers in our standard library.

C) Leave them alone forever. Treat it as the user's fault if they mess things up by misunderstanding them.

I am happy to put together an RFC for either A or B, if it has a chance of reaching consensus. I would really like to avoid option C.

[1] https://externals.io/message/95166
[2] https://github.com/php/php-src/pull/2160
[3] https://github.com/php/doc-en/commit/838941f6cce51f3beda16012eb497b26295a8238

Regards,

-- 
Rowan Tommins
[IMSoP]
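
To make the Euro-sign point and the "fixing makes it worse" confusion concrete, here is a minimal sketch. It assumes the mbstring extension is loaded. The two wrapper functions borrow the names floated in option B, but their bodies are purely an assumption built on mb_convert_encoding(); they are not how utf8_encode / utf8_decode are implemented.

```php
<?php
// Sketch only: assumes ext/mbstring is available.

// Hypothetical userland stand-ins for the names suggested in option B.
function utf8_to_iso_8859_1(string $utf8): string
{
    // Characters with no ISO 8859-1 equivalent are replaced by mbstring's
    // substitute character, which is '?' unless reconfigured.
    return mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
}

function iso_8859_1_to_utf8(string $latin1): string
{
    // Every byte is read as one ISO 8859-1 character and re-encoded as UTF-8.
    return mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
}

$euro = "\u{20AC}"; // the Euro sign, encoded as UTF-8

// The target encoding decides whether the Euro sign survives.
echo bin2hex(utf8_to_iso_8859_1($euro)), "\n";                           // 3f ('?')
echo bin2hex(mb_convert_encoding($euro, 'Windows-1252', 'UTF-8')), "\n"; // 80
echo bin2hex(mb_convert_encoding($euro, 'ISO-8859-15', 'UTF-8')), "\n";  // a4

// "Fixing" text that is already UTF-8 double-encodes it.
$eAcute = "\u{E9}"; // 'é' as UTF-8, bytes c3 a9
echo bin2hex(iso_8859_1_to_utf8($eAcute)), "\n";                         // c383c2a9
```

Results are printed as hex so they do not depend on the terminal's own encoding. The last line shows two bytes becoming four, which is the usual symptom when one of these functions is applied to text that was already UTF-8.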