Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:113650
User-Agent: Cyrus-JMAP/3.5.0-alpha0-206-g078a48fda5-fm-20210226.001-g078a48fd
Mime-Version: 1.0
Message-ID: <3a4d89fc-c5f8-4720-b2e0-f6f3c28684f9@www.fastmail.com>
In-Reply-To: <f313c9c4-f8a2-0b39-7499-30620d80cecd@gmail.com>
References: <f313c9c4-f8a2-0b39-7499-30620d80cecd@gmail.com>
Date: Sun, 21 Mar 2021 11:51:25 -0500
To: "php internals" <internals@lists.php.net>
Content-Type: text/plain;charset=utf-8
Content-Transfer-Encoding: quoted-printable
Subject: Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?
From: larry@garfieldtech.com ("Larry Garfield")

On Sun, Mar 21, 2021, at 9:18 AM, Rowan Tommins wrote:
> Hi all,
>
> The functions utf8_encode and utf8_decode are historical oddities, which
> almost certainly would not be accepted if proposed today:
>
> * Their names do not describe their functionality, which is to convert
> to/from one specific single-byte encoding. This leads to a common
> confusion that they can be used to "fix" UTF-8 encoding problems, which
> they generally make worse.
> * That single-byte encoding is ISO 8859-1, not its common cousins
> Windows-1252 or ISO 8859-15. This means, for instance, that they do not
> handle the Euro sign: utf8_decode('€') returns '?' (i.e. unmappable),
> not "\x80" (Windows-1252) or "\xA4" (8859-15).
>
> On the other hand, they are commonly used, both correctly and
> incorrectly, so removing them is not easy.
>
> A previous proposal to remove them [1] resulted in Andrea making two
> significant improvements: moving them from ext/xml to ext/standard [2]
> and rewriting the documentation to explain them properly [3]. My genuine
> thanks for that.
>
> However, it hasn't stopped people misunderstanding them, and quite
> reasonably: you shouldn't need to look up every function you use in the
> manual, to make sure it actually does what its name suggests.
>
>
> I can see three ways forward:
>
> A) Raise a deprecation notice in 8.1, and remove in 9.0. Do not provide
> a specific replacement, but recommend people look at iconv() or
> mb_convert_encoding(). There is precedent for this, such as
> convert_cyr_string(), but it may frustrate those who are using the
> functions correctly.
>
> B) Introduce new names, such as utf8_to_iso_8859_1 and
> iso_8859_1_to_utf8; immediately make those the primary names in the
> manual, with utf8_encode / utf8_decode as aliases. Raise deprecation
> notices for the old names, either immediately or in some future release.
> This gives a smoother upgrade path, but commits us to having these
> functions as outliers in our standard library.
>
> C) Leave them alone forever. Treat it as the user's fault if they mess
> things up by misunderstanding them.
>
>
> I am happy to put together an RFC for either A or B, if it has a chance
> of reaching consensus. I would really like to avoid option C.
>
>
> [1] https://externals.io/message/95166
> [2] https://github.com/php/php-src/pull/2160
> [3] https://github.com/php/doc-en/commit/838941f6cce51f3beda16012eb497b26295a8238
>
> Regards,

I lost several days of my life to exactly this problem, many years ago.  I am still triggered by it.

I am mostly OK with option A, but with a big caveat:

The root problem here is "You keep using that function.  I do not think it means what you think it means."

As Rowan notes, what people actually *want* most of the time is "I got this string from a user and have NFI what its encoding is, but my system needs UTF-8, so gimme this string in UTF-8."  And they use utf8_encode(), which then fails *sometimes* in exciting and mysterious ways, because that's not what it does.
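
To make that "fails sometimes" concrete, here is a minimal sketch. It assumes the mbstring extension is loaded, and it uses mb_convert_encoding() with an explicit ISO-8859-1 source as a stand-in for utf8_encode(), which should produce the same bytes. The helper name latin1_to_utf8() is made up for illustration.

```php
<?php
// Stand-in for utf8_encode(): every input byte is treated as one
// ISO 8859-1 character and re-encoded as UTF-8.
// (Hypothetical helper name; assumes ext/mbstring is available.)
function latin1_to_utf8(string $bytes): string
{
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "caf\xE9";      // "café" in ISO 8859-1: the input the function expects
$utf8   = "caf\xC3\xA9";  // "café" already in UTF-8: the input people often have

$ok      = latin1_to_utf8($latin1); // correct conversion
$mangled = latin1_to_utf8($utf8);   // double-encoded: displays as "cafÃ©"

var_dump($ok === $utf8);      // bool(true)
var_dump(bin2hex($mangled));  // string(14) "636166c383c2a9"
```

The call succeeds in both cases and raises nothing, which is why the bug only shows up later, and only for the strings that happened to be UTF-8 already.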

Removing utf8_encode() may keep people from misusing it, but that doesn't mean the problem space they were trying to solve goes away.  If anything, people who still don't realize that it's the wrong solution will get angry that we're taking away a "useful" tool and replacing it with "meh, go look at library X," which is admittedly a pretty rude answer.

If we're removing a bad answer to the problem, we should also replace it with a good answer.

Someone will, I'm sure, pop in at this point and declare "if you don't know the character encoding you're receiving, then you're doing it wrong and are already lost and we can't help you."  While that may be technically correct, it's also an entirely useless answer because strings received over HTTP very frequently do not tell you what their encoding is, or they lie about what their encoding is.  (The header may say it's ISO 8859, or UTF-8, or whatever, but someone copy-pasted from MS Word into a text box and now it's Windows-1252 within a wrapper that says ISO 8859 but is mostly UTF-8 except for the Windows-1252 part.  Like, that's literally the problem I lost several days to.)  "Your own fault" is not even an accurate answer at that point.
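
For that mixed case (mostly UTF-8 with stray Windows-1252 bytes), one possible approach is sketched below. It is a heuristic, not a fix: it keeps every well-formed UTF-8 sequence as-is and assumes that any leftover byte is Windows-1252. The function name repair_mixed_utf8() is hypothetical, and the sketch assumes ext/mbstring is loaded.

```php
<?php
// Hypothetical helper: keep well-formed UTF-8 sequences untouched and
// reinterpret every other byte as Windows-1252 (an assumption about the
// input, not something the data can confirm).
function repair_mixed_utf8(string $bytes): string
{
    // Alternatives 1-8 match ASCII runs and well-formed UTF-8 sequences
    // (per RFC 3629); the final capturing group catches one stray byte.
    $pattern = '/[\x00-\x7F]+'
        . '|[\xC2-\xDF][\x80-\xBF]'
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
        . '|\xED[\x80-\x9F][\x80-\xBF]'
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
        . '|[\xF1-\xF3][\x80-\xBF]{3}'
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2}'
        . '|(.)/s';

    $fixed = preg_replace_callback(
        $pattern,
        function (array $m): string {
            // $m[1] is only set when the stray-byte branch matched.
            return isset($m[1])
                ? mb_convert_encoding($m[1], 'UTF-8', 'Windows-1252')
                : $m[0];
        },
        $bytes
    );

    // preg_replace_callback() returns null on a PCRE error; fall back
    // to the untouched input in that case.
    return $fixed ?? $bytes;
}

// "café" in UTF-8 followed by Windows-1252 curly quotes pasted from Word.
$mixed = "caf\xC3\xA9 \x93hi\x94";
$fixed = repair_mixed_utf8($mixed);
var_dump(mb_check_encoding($fixed, 'UTF-8')); // bool(true)
```

The obvious weakness is that some Windows-1252 byte pairs are also valid UTF-8, so this can still guess wrong; it only narrows the damage.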

So if we're going to take away people's broken hammer, we need to be very clear about what hammer to use instead.

The initial answer is probably "here's how to use a series of mbstring functions together to produce a reasonably good guess-my-encoding-and-convert-to-utf8 routine" documentation.  Which... may exist, but if it does I've never found it.  So at bare minimum the utf8_encode() documentation needs to include a "use this code snippet instead" description, and not just link to the mbstring extension.  Glancing through the mbstring docs right now, it looks like it's not already a single function call, but some combination of several, and has some global flags that get set (via mb_detect_order()), I think.  It's not as easy to use as utf8_encode(), even if utf8_encode() is wrong.  That suggests we may want to try to simplify the mbstring API, or internalize some function that handles the most common case in a way that doesn't rely on global flags.
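
As a rough starting point for such a snippet, here is a minimal sketch that avoids global flags entirely. It assumes ext/mbstring is loaded. to_utf8_best_guess() is a hypothetical name, not an existing function, and the Windows-1252 fallback is an assumption about what non-UTF-8 web input most often is; it is a guess, not real detection.

```php
<?php
// Hypothetical helper, not part of PHP or mbstring.
// Strategy: trust input that is already valid UTF-8, otherwise assume a
// caller-supplied single-byte legacy encoding (Windows-1252 by default).
function to_utf8_best_guess(string $bytes, string $fallback = 'Windows-1252'): string
{
    // Valid UTF-8 is returned unchanged, which avoids the double-encoding
    // that utf8_encode() produces on input that is already UTF-8.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }

    // Anything else is converted from the assumed legacy encoding.
    // No mb_detect_order() or other global state is consulted.
    return mb_convert_encoding($bytes, 'UTF-8', $fallback);
}

$alreadyUtf8 = to_utf8_best_guess("caf\xC3\xA9"); // unchanged
$fromLatin   = to_utf8_best_guess("caf\xE9");     // becomes "caf\xC3\xA9"
$euro        = to_utf8_best_guess("\x80 5");      // becomes "\xE2\x82\xAC 5"
```

Defaulting to Windows-1252 rather than ISO 8859-1 is a design choice: the two agree on the printable range 0xA0-0xFF, and Windows-1252 additionally maps bytes such as 0x80 to the Euro sign, which is the case Rowan's example shows utf8_decode() getting wrong in the other direction.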

So, let's make that easier to use, so that we can change "this function is wrong, we're taking it away from you" to "this function is wrong, here's a way better alternative that you can use instead (while we quietly take the wrong one away from you while you're distracted by the new shiny)."

I don't know the mbstring API well enough to say what that alternative ideally looks like, but if we can answer that it would make killing off the old functions much more palatable.

--Larry Garfield