Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:113655
MIME-Version: 1.0
References: <f313c9c4-f8a2-0b39-7499-30620d80cecd@gmail.com>
In-Reply-To: <f313c9c4-f8a2-0b39-7499-30620d80cecd@gmail.com>
Date: Sun, 21 Mar 2021 20:15:18 -0500
Message-ID: <CAESVnVoa2U1ZCsthWCt8Cu8Es-P8WWO=vky4TG5CmtaoumEmUA@mail.gmail.com>
To: Rowan Tommins <rowan.collins@gmail.com>
Cc: PHP Internals <internals@lists.php.net>
Content-Type: multipart/alternative; boundary="00000000000032962405be15cd6c"
Subject: Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?
From: pollita@php.net (Sara Golemon)

--00000000000032962405be15cd6c
Content-Type: text/plain; charset="UTF-8"

On Sun, Mar 21, 2021 at 9:18 AM Rowan Tommins <rowan.collins@gmail.com>
wrote:

> A) Raise a deprecation notice in 8.1, and remove in 9.0. Do not provide
> a specific replacement, but recommend people look at iconv() or
> mb_convert_encoding(). There is precedent for this, such as
> convert_cyr_string(), but it may frustrate those who are using the
> functions correctly.
>
> B) Introduce new names, such as utf8_to_iso_8859_1 and
> iso_8859_1_to_utf8; immediately make those the primary names in the
> manual, with utf8_encode / utf8_decode as aliases. Raise deprecation
> notices for the old names, either immediately or in some future release.
> This gives a smoother upgrade path, but commits us to having these
> functions as outliers in our standard library.
>
> C) Leave them alone forever. Treat it as the user's fault if they mess
> things up by misunderstanding them.
>
>
My preference is for a deprecation notice (though not necessarily removal,
ever; we can argue that part a little).

As for what users should use instead, there are obviously multiple options
already in core (which you referenced), but those all live in optional
extensions with third-party deps, so they can't be guaranteed to be
available the way utf8_encode()/utf8_decode() can (this was the point of
moving them out of ext/xml).
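
To illustrate the availability problem, here is a rough sketch of what
portable code has to do when it relies on those extensions. The function
name convert_latin1_with_extensions() is made up for this example;
mb_convert_encoding() and iconv() are the real functions, used with their
documented argument order.

```php
<?php
// Hypothetical helper: the name is illustrative only.
function convert_latin1_with_extensions(string $latin1): string
{
    // ext/mbstring is optional, so check before calling into it.
    if (function_exists('mb_convert_encoding')) {
        // Argument order: string, target encoding, source encoding.
        return mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
    }

    // ext/iconv is optional too, and depends on an iconv implementation.
    if (function_exists('iconv')) {
        // Argument order: source encoding, target encoding, string.
        return iconv('ISO-8859-1', 'UTF-8', $latin1);
    }

    // Neither extension is loaded: nothing left to fall back on.
    throw new RuntimeException('No encoding conversion extension available');
}
```

Every caller ends up carrying this kind of guard, which is exactly what an
always-available function avoids.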

While I'm normally in favor of userspace things belonging in userspace
(and this particular conversion is trivial, since each Latin-1 byte maps
1:1 to the Unicode code point of the same value), I'm actually willing to
see this added under a new, clearer name in ext/standard, since this is
something that has long been in use but is often used incorrectly.
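
For a sense of how small the userspace version is, here is a minimal
sketch of the Latin-1 to UTF-8 direction. The name latin1_to_utf8_sketch()
is hypothetical and is not a proposed API; it only shows the 1:1 mapping.

```php
<?php
// Hypothetical userspace helper; the name is illustrative only.
function latin1_to_utf8_sketch(string $latin1): string
{
    $utf8 = '';
    $length = strlen($latin1);

    for ($i = 0; $i < $length; $i++) {
        $byte = ord($latin1[$i]);

        if ($byte < 0x80) {
            // ASCII range: identical in both encodings.
            $utf8 .= $latin1[$i];
        } else {
            // U+0080..U+00FF: always a two-byte UTF-8 sequence.
            $utf8 .= chr(0xC0 | ($byte >> 6)) . chr(0x80 | ($byte & 0x3F));
        }
    }

    return $utf8;
}
```

Every input byte is valid Latin-1, so this direction has no failure case.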

As for details, I don't love iso_8859_1_to_utf8(), but we can use the
common alias for ISO-8859-1, latin1, and call the new functions
utf8_from_latin1() and utf8_to_latin1(), with the caveat that the latter
will throw a ValueError for code points which are out of range (one of the
more problematic issues with utf8_decode()).  That makes this not just a
simple rename for clarity, but what I'd consider a bug fix for an
unfortunately unfixable function.
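
A rough userspace sketch of the stricter behaviour I have in mind for the
UTF-8 to Latin-1 direction follows. The name utf8_to_latin1_sketch() is
hypothetical, the error message is invented, and the code assumes PHP 8.0+
for ValueError. For brevity it treats malformed UTF-8 the same way as an
out-of-range code point; a real implementation might want to distinguish
the two.

```php
<?php
// Hypothetical userspace helper; the name and message are illustrative only.
function utf8_to_latin1_sketch(string $utf8): string
{
    $latin1 = '';
    $length = strlen($utf8);

    for ($i = 0; $i < $length; $i++) {
        $byte = ord($utf8[$i]);

        if ($byte < 0x80) {
            // ASCII range: identical in both encodings.
            $latin1 .= $utf8[$i];
            continue;
        }

        // Only the lead bytes 0xC2 and 0xC3 encode U+0080..U+00FF.
        if (($byte === 0xC2 || $byte === 0xC3) && $i + 1 < $length) {
            $next = ord($utf8[$i + 1]);
            if (($next & 0xC0) === 0x80) {
                // Recombine the two payloads into a single Latin-1 byte.
                $latin1 .= chr((($byte & 0x03) << 6) | ($next & 0x3F));
                $i++; // Skip the continuation byte just consumed.
                continue;
            }
        }

        // Anything else is above U+00FF or not well-formed UTF-8.
        throw new ValueError(
            "Byte sequence at offset $i is not representable in ISO-8859-1"
        );
    }

    return $latin1;
}
```

With this, a string containing a euro sign ("\xE2\x82\xAC") raises a
ValueError instead of quietly coming back as "?".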

-Sara

--00000000000032962405be15cd6c--