Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:116710
MIME-Version: 1.0
References: <f313c9c4-f8a2-0b39-7499-30620d80cecd@gmail.com> <3a4d89fc-c5f8-4720-b2e0-f6f3c28684f9@www.fastmail.com>
In-Reply-To: <3a4d89fc-c5f8-4720-b2e0-f6f3c28684f9@www.fastmail.com>
Date: Tue, 21 Dec 2021 15:20:45 -0800
Message-ID: <CABXx6-y=5-OV2PnOy5f8xr4Nu-=7rkz9c4Q51sfsDj5wb1XJhA@mail.gmail.com>
To: Larry Garfield <larry@garfieldtech.com>
Cc: php internals <internals@lists.php.net>
Content-Type: multipart/alternative; boundary="000000000000eb919d05d3b0417e"
Subject: Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?
From: wrossmann@gmail.com (Wade Rossmann)

--000000000000eb919d05d3b0417e
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Sun, Mar 21, 2021 at 9:52 AM Larry Garfield <larry@garfieldtech.com>
wrote:

> On Sun, Mar 21, 2021, at 9:18 AM, Rowan Tommins wrote:
> > Hi all,
> >
> > The functions utf8_encode and utf8_decode are historical oddities, which
> > almost certainly would not be accepted if proposed today:
> >
> > * Their names do not describe their functionality, which is to convert
> > to/from one specific single-byte encoding. This leads to a common
> > confusion that they can be used to "fix" UTF-8 encoding problems, which
> > they generally make worse.
> > * That single-byte encoding is ISO 8859-1, not its common cousins
> > Windows-1252 or ISO 8859-15. This means, for instance, that they do not
> > handle the Euro sign: utf8_decode('€') returns '?' (i.e. unmappable)
> > not "\x80" (Windows-1252) or "\xA4" (8859-15)
> >
> > On the other hand, they are commonly used, both correctly and
> > incorrectly, so removing them is not easy.
> >
> > A previous proposal to remove them [1] resulted in Andrea making two
> > significant improvements: moving them from ext/xml to ext/standard [2]
> > and rewriting the documentation to explain them properly [3]. My genuine
> > thanks for that.
> >
> > However, it hasn't stopped people misunderstanding them, and quite
> > reasonably: you shouldn't need to look up every function you use in the
> > manual, to make sure it actually does what its name suggests.
> >
> >
> > I can see three ways forward:
> >
> > A) Raise a deprecation notice in 8.1, and remove in 9.0. Do not provide
> > a specific replacement, but recommend people look at iconv() or
> > mb_convert_encoding(). There is precedent for this, such as
> > convert_cyr_string(), but it may frustrate those who are using the
> > functions correctly.
> >
> > B) Introduce new names, such as utf8_to_iso_8859_1 and
> > iso_8859_1_to_utf8; immediately make those the primary names in the
> > manual, with utf8_encode / utf8_decode as aliases. Raise deprecation
> > notices for the old names, either immediately or in some future release.
> > This gives a smoother upgrade path, but commits us to having these
> > functions as outliers in our standard library.
> >
> > C) Leave them alone forever. Treat it as the user's fault if they mess
> > things up by misunderstanding them.
> >
> >
> > I am happy to put together an RFC for either A or B, if it has a chance
> > of reaching consensus. I would really like to avoid option C.
> >
> >
> > [1] https://externals.io/message/95166
> > [2] https://github.com/php/php-src/pull/2160
> > [3] https://github.com/php/doc-en/commit/838941f6cce51f3beda16012eb497b26295a8238
> >
> > Regards,
>
> I lost several days of my life to exactly this problem, many years ago.  I
> am still triggered by it.
>
> I am mostly OK with option A, but with a big caveat:
>
> The root problem here is "You keep using that function.  I do not think it
> means what you think it means."
>
> As Rowan notes, what people actually *want* most of the time is "I got
> this string from a user and have NFI what its encoding is, but my system
> needs UTF-8, so gimmie this string in UTF-8."  And they use utf8_encode(),
> which then fails *sometimes* in exciting and mysterious ways, because
> that's not what it is.
>
> Removing utf8_encode() may keep people from misusing it, but that doesn't
> mean the problem space they were trying to solve goes away.  If anything,
> people who still don't realize that it's the wrong solution will get angry
> that we're taking away a "useful" tool and replacing it with "meh, go look
> at library X," which is admittedly a pretty rude answer.
>
> If we're removing a bad answer to the problem, we should also replace it
> with a good answer.
>
> Someone will, I'm sure, pop in at this point and declare "if you don't
> know the character encoding you're receiving, then you're doing it wrong
> and are already lost and we can't help you."  While that may be technically
> correct, it's also an entirely useless answer because strings received over
> HTTP very frequently do not tell you what their encoding is, or they lie
> about what their encoding is.  (The header may say it's ISO8859, or UTF8,
> or whatever, but someone copy-pasted from MS Word into a text box and now
> it's Windows-1252 within a wrapper that says ISO8859 but is mostly UTF8
> except for the Windows-1252 part.  Like, that's literally the problem I
> lost several days to.)  "Your own fault" is not even an accurate answer at
> that point.
>
> So if we're going to take away people's broken hammer, we need to be very
> clear about what hammer to use instead.
>
> The initial answer is probably "here's how to use a series of mb_string
> functions together to produce a reasonably good
> guess-my-encoding-and-convert-to-utf8 routine" documentation.  Which... may
> exist, but if it does I've never found it.  So at bare minimum the
> utf8_encode() documentation needs to include a "use this code snippet
> instead" description, and not just link to the mbstring extension.
> Glancing through the mbstring docs right now, it looks like it's not
> already a single function call, but some combination of several, and has
> some global flags that get set (via mb_detect_order()), I think.  It's not
> as easy to use as utf8_encode(), even if utf8_encode() is wrong.  That
> suggests we may want to try and simplify the mbstring API, or internalize
> some function that handles the most common case in a way that doesn't rely
> on global flags.
>
> So, let's make that easier to use, so that we can change "this function is
> wrong, we're taking it away from you" to "this function is wrong, here's a
> way better alternative that you can use instead (while we quietly take the
> wrong one away from you while you're distracted by the new shiny)."
>
> I don't know the mbstring API well enough to say what that alternative
> ideally looks like, but if we can answer that it would make killing off the
> old functions much more palatable.
>
> --Larry Garfield
>
As an encoding nerd and perennial complainer regarding these functions I
would like nothing more than to see them immediately disappear, but I do
recognize the BC-breaking potential for something like that. However, I do
have a suggestion that I've not seen mentioned yet that should at least
address some of the misconceptions that people get from the current
functions.

I would suggest adding optional source/destination encoding parameters to
the functions, e.g.:

utf8_encode(string $string, string $source_encoding = "ISO-8859-1")
utf8_decode(string $string, string $destination_encoding = "ISO-8859-1")

and, if you'll forgive the hand-waving due to my unfamiliarity with PHP
internals, they could simply be passed through to an underlying
mb_convert_encoding() call, e.g.:

mb_convert_encoding($string, 'UTF-8', $source_encoding)
mb_convert_encoding($string, $destination_encoding, 'UTF-8')
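
To make that a little more concrete, here is a minimal userland sketch. The
names utf8_encode_from() and utf8_decode_to() are made up for illustration,
since the built-in functions cannot be redeclared, and the sketch assumes
ext/mbstring is loaded. It is not meant to show how an in-core version would
actually be written.

```php
<?php
// Hypothetical userland stand-ins for the proposed signatures.
// Assumption: ext/mbstring is available.

function utf8_encode_from(string $string, string $source_encoding = 'ISO-8859-1'): string
{
    // Interpret $string as $source_encoding and re-encode it as UTF-8.
    return mb_convert_encoding($string, 'UTF-8', $source_encoding);
}

function utf8_decode_to(string $string, string $destination_encoding = 'ISO-8859-1'): string
{
    // Interpret $string as UTF-8 and re-encode it as $destination_encoding.
    return mb_convert_encoding($string, $destination_encoding, 'UTF-8');
}

$euro = "\xE2\x82\xAC"; // U+20AC EURO SIGN as UTF-8 bytes

// Windows-1252 stores the Euro sign at byte 0x80.
var_dump(utf8_encode_from("\x80", 'Windows-1252') === $euro); // bool(true)

// ISO-8859-15 stores it at byte 0xA4.
var_dump(utf8_decode_to($euro, 'ISO-8859-15') === "\xA4"); // bool(true)
```

With the encoding spelled out at the call site, the Euro sign case from
Rowan's mail becomes something the caller can express, rather than
something that quietly turns into '?'.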

This would preserve BC while also making the function header and
documentation much more descriptive of what the function actually does,
allow more flexible use of the functions, and potentially drive people to
use the mb_* functions instead. This could also be used as a gradual
pathway to deprecating the functions, where, for example, a deprecation
notice could be raised when the function is called without the
source/destination encoding explicitly given.

I know that there is also some resistance to the idea of requiring mbstring
as it is an optional extension, as well as resistance to bringing mbstring
into core due to design and/or history. This could be worked around by
[once again, apologies for the hand-waving] only requiring mbstring for
conversions involving an encoding other than ISO-8859-1 and falling back to
the existing implementation otherwise.
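
A rough sketch of that fallback, combined with a deprecation notice for calls
that omit the encoding, might look like the following. The function names are
hypothetical, and the Latin-1 loop merely stands in for the existing C
implementation; treat it as an illustration of the control flow rather than
a proposed patch.

```php
<?php
// Hypothetical helper standing in for the existing built-in behaviour:
// each ISO-8859-1 byte maps directly to the code point of the same value.
function latin1_to_utf8(string $string): string
{
    $out = '';
    $length = strlen($string);
    for ($i = 0; $i < $length; $i++) {
        $byte = ord($string[$i]);
        $out .= $byte < 0x80
            ? $string[$i]                                             // ASCII passes through
            : chr(0xC0 | ($byte >> 6)) . chr(0x80 | ($byte & 0x3F)); // two-byte UTF-8 sequence
    }
    return $out;
}

// Hypothetical name; the real change would live in utf8_encode() itself.
function utf8_encode_compat(string $string, ?string $source_encoding = null): string
{
    if ($source_encoding === null) {
        // Gradual deprecation path: nudge callers to state the encoding.
        trigger_error(
            'Calling utf8_encode_compat() without $source_encoding is deprecated',
            E_USER_DEPRECATED
        );
        $source_encoding = 'ISO-8859-1';
    }

    if (strcasecmp($source_encoding, 'ISO-8859-1') === 0) {
        // Default case: no dependency on mbstring.
        return latin1_to_utf8($string);
    }

    if (!extension_loaded('mbstring')) {
        throw new RuntimeException(
            "Converting from $source_encoding requires ext/mbstring"
        );
    }

    return mb_convert_encoding($string, 'UTF-8', $source_encoding);
}
```

Existing callers keep working and only see a deprecation notice, while
anyone who wants another encoding has to name it and have mbstring
installed.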

--000000000000eb919d05d3b0417e--