Newsgroups: php.internals
Xref: news.php.net php.internals:113647
Date: Sun, 21 Mar 2021 20:31:06 +0530
To: Rowan Tommins
Cc: PHP Internals
Subject: Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?
From: ayesh@php.watch (Ayesh Karunaratne)

Thank you for opening this conversation. These functions have stung me
in the past, and I would be so happy to see them gone :)

Personally, I would very much like to go with Plan A.

- XML parsers that deal with non-UTF-8 character encodings frequently
  use these functions. However, any parser worth its salt is better off
  using mbstring or iconv, because these functions lack the Windows-1252
  support that is assumed elsewhere for ISO-8859-1. If we had a
  `utf8_encode` that supported Windows-1252, as is often expected, I
  think plan B would be the smoother upgrade.
- Among the top 1000 Packagist packages by downloads, stripe-php,
  phpcpd, pdepend, carbon, monolog, php-cs-fixer, htmlpurifier, and
  aws-php-sdk use `utf8_encode`.
  Some of these libraries depend on `ext-mbstring` or the Symfony
  mbstring polyfill, so even fewer libraries are left that cannot assume
  `iconv()` or `mb_convert_encoding()` is available.

On Sun, Mar 21, 2021 at 7:48 PM Rowan Tommins wrote:
>
> Hi all,
>
> The functions utf8_encode and utf8_decode are historical oddities, which
> almost certainly would not be accepted if proposed today:
>
> * Their names do not describe their functionality, which is to convert
> to/from one specific single-byte encoding. This leads to a common
> confusion that they can be used to "fix" UTF-8 encoding problems, which
> they generally make worse.
> * That single-byte encoding is ISO 8859-1, not its common cousins
> Windows-1252 or ISO 8859-15. This means, for instance, that they do not
> handle the Euro sign: utf8_decode('€') returns '?' (i.e. unmappable)
> not "\x80" (Windows-1252) or "\xA4" (8859-15)
>
> On the other hand, they are commonly used, both correctly and
> incorrectly, so removing them is not easy.
>
> A previous proposal to remove them [1] resulted in Andrea making two
> significant improvements: moving them from ext/xml to ext/standard [2]
> and rewriting the documentation to explain them properly [3]. My genuine
> thanks for that.
>
> However, it hasn't stopped people misunderstanding them, and quite
> reasonably: you shouldn't need to look up every function you use in the
> manual, to make sure it actually does what its name suggests.
>
>
> I can see three ways forward:
>
> A) Raise a deprecation notice in 8.1, and remove in 9.0. Do not provide
> a specific replacement, but recommend people look at iconv() or
> mb_convert_encoding(). There is precedent for this, such as
> convert_cyr_string(), but it may frustrate those who are using the
> functions correctly.
>
> B) Introduce new names, such as utf8_to_iso_8859_1 and
> iso_8859_1_to_utf8; immediately make those the primary names in the
> manual, with utf8_encode / utf8_decode as aliases.
> Raise deprecation notices for the old names, either immediately or in
> some future release. This gives a smoother upgrade path, but commits us
> to having these functions as outliers in our standard library.
>
> C) Leave them alone forever. Treat it as the user's fault if they mess
> things up by misunderstanding them.
>
>
> I am happy to put together an RFC for either A or B, if it has a chance
> of reaching consensus. I would really like to avoid option C.
>
>
> [1] https://externals.io/message/95166
> [2] https://github.com/php/php-src/pull/2160
> [3] https://github.com/php/doc-en/commit/838941f6cce51f3beda16012eb497b26295a8238
>
> Regards,
>
> --
> Rowan Tommins
> [IMSoP]
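
For anyone who wants to see the Euro-sign mismatch and the "makes it worse" double-encoding from the quoted message in bytes, here is a rough sketch. It assumes `ext-mbstring` is loaded and left at its default substitute character; the helper names `utf8_to_single_byte_hex` and `double_encode_hex` are made up for this illustration and are not part of any proposal in this thread.

```php
<?php
// Hypothetical helper: convert a UTF-8 string to a single-byte encoding
// and return the resulting bytes as hex. Assumes ext-mbstring is available.
function utf8_to_single_byte_hex(string $utf8, string $encoding): string
{
    // With default settings, mbstring substitutes '?' (0x3f) for
    // characters that have no mapping in the target encoding.
    return bin2hex(mb_convert_encoding($utf8, $encoding, 'UTF-8'));
}

// Hypothetical helper: treat bytes that are already UTF-8 as if they were
// ISO 8859-1 and convert them to UTF-8 again (the classic misuse).
function double_encode_hex(string $alreadyUtf8): string
{
    return bin2hex(mb_convert_encoding($alreadyUtf8, 'UTF-8', 'ISO-8859-1'));
}

$euro = "\u{20AC}"; // U+20AC EURO SIGN, UTF-8 bytes e2 82 ac

echo utf8_to_single_byte_hex($euro, 'ISO-8859-1'), "\n";   // 3f ('?', unmappable)
echo utf8_to_single_byte_hex($euro, 'Windows-1252'), "\n"; // 80
echo utf8_to_single_byte_hex($euro, 'ISO-8859-15'), "\n";  // a4

$eAcute = "\u{00E9}"; // U+00E9, UTF-8 bytes c3 a9
echo double_encode_hex($eAcute), "\n"; // c383c2a9, which displays as "Ã©"
```

The sketch deliberately uses `mb_convert_encoding()` rather than `utf8_encode()` / `utf8_decode()`, since the source and target encodings are then spelled out at the call site.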