Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:116711
MIME-Version: 1.0
References: <f313c9c4-f8a2-0b39-7499-30620d80cecd@gmail.com>
 <3a4d89fc-c5f8-4720-b2e0-f6f3c28684f9@www.fastmail.com> <CABXx6-y=5-OV2PnOy5f8xr4Nu-=7rkz9c4Q51sfsDj5wb1XJhA@mail.gmail.com>
In-Reply-To: <CABXx6-y=5-OV2PnOy5f8xr4Nu-=7rkz9c4Q51sfsDj5wb1XJhA@mail.gmail.com>
Date: Tue, 21 Dec 2021 16:31:00 -0800
Message-ID: <CAKOpQSxxxmzT4a=EZfhBssFV9rM5_vjjh3Kpy9+bd1pzoOPsxQ@mail.gmail.com>
To: Wade Rossmann <wrossmann@gmail.com>
Cc: Larry Garfield <larry@garfieldtech.com>, php internals <internals@lists.php.net>
Content-Type: multipart/alternative; boundary="0000000000005f6c7805d3b13dff"
Subject: Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?
From: kris.craig@gmail.com (Kris Craig)

--0000000000005f6c7805d3b13dff
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Tue, Dec 21, 2021 at 3:21 PM Wade Rossmann <wrossmann@gmail.com> wrote:

> On Sun, Mar 21, 2021 at 9:52 AM Larry Garfield <larry@garfieldtech.com>
> wrote:
>
> > On Sun, Mar 21, 2021, at 9:18 AM, Rowan Tommins wrote:
> > > Hi all,
> > >
> > > The functions utf8_encode and utf8_decode are historical oddities, which
> > > almost certainly would not be accepted if proposed today:
> > >
> > > * Their names do not describe their functionality, which is to convert
> > > to/from one specific single-byte encoding. This leads to a common
> > > confusion that they can be used to "fix" UTF-8 encoding problems, which
> > > they generally make worse.
> > > * That single-byte encoding is ISO 8859-1, not its common cousins
> > > Windows-1252 or ISO 8859-15. This means, for instance, that they do not
> > > handle the Euro sign: utf8_decode('€') returns '?' (i.e. unmappable),
> > > not "\x80" (Windows-1252) or "\xA4" (8859-15).
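
The Euro sign case above can be reproduced with a minimal sketch. It assumes ext/mbstring is loaded and uses mb_convert_encoding() as a stand-in for utf8_decode(); the variable names are only illustrative.

```php
<?php
// Sketch only: assumes ext/mbstring is available.
$euro = "\u{20AC}"; // U+20AC EURO SIGN, i.e. the UTF-8 bytes E2 82 AC

// ISO 8859-1 has no Euro sign, so the character is unmappable and mbstring
// emits its configured substitute character ('?' unless it was changed).
$latin1 = mb_convert_encoding($euro, 'ISO-8859-1', 'UTF-8');

// Both "cousin" encodings do have a Euro sign, at different byte values.
$cp1252 = mb_convert_encoding($euro, 'Windows-1252', 'UTF-8'); // byte 0x80
$latin9 = mb_convert_encoding($euro, 'ISO-8859-15', 'UTF-8');  // byte 0xA4

printf("ISO-8859-1:   %s\n", bin2hex($latin1));
printf("Windows-1252: %s\n", bin2hex($cp1252));
printf("ISO-8859-15:  %s\n", bin2hex($latin9));
```

The point is only that the target encoding decides whether the character survives; the function name gives no hint of that.
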
> > >
> > > On the other hand, they are commonly used, both correctly and
> > > incorrectly, so removing them is not easy.
> > >
> > > A previous proposal to remove them [1] resulted in Andrea making two
> > > significant improvements: moving them from ext/xml to ext/standard [2]
> > > and rewriting the documentation to explain them properly [3]. My genuine
> > > thanks for that.
> > >
> > > However, it hasn't stopped people misunderstanding them, and quite
> > > reasonably: you shouldn't need to look up every function you use in the
> > > manual, to make sure it actually does what its name suggests.
> > >
> > >
> > > I can see three ways forward:
> > >
> > > A) Raise a deprecation notice in 8.1, and remove in 9.0. Do not provide
> > > a specific replacement, but recommend people look at iconv() or
> > > mb_convert_encoding(). There is precedent for this, such as
> > > convert_cyr_string(), but it may frustrate those who are using the
> > > functions correctly.
> > >
> > > B) Introduce new names, such as utf8_to_iso_8859_1 and
> > > iso_8859_1_to_utf8; immediately make those the primary names in the
> > > manual, with utf8_encode / utf8_decode as aliases. Raise deprecation
> > > notices for the old names, either immediately or in some future release.
> > > This gives a smoother upgrade path, but commits us to having these
> > > functions as outliers in our standard library.
> > >
> > > C) Leave them alone forever. Treat it as the user's fault if they mess
> > > things up by misunderstanding them.
> > >
> > >
> > > I am happy to put together an RFC for either A or B, if it has a chance
> > > of reaching consensus. I would really like to avoid option C.
> > >
> > >
> > > [1] https://externals.io/message/95166
> > > [2] https://github.com/php/php-src/pull/2160
> > > [3] https://github.com/php/doc-en/commit/838941f6cce51f3beda16012eb497b26295a8238
> > >
> > > Regards,
> >
> > I lost several days of my life to exactly this problem, many years ago.  I
> > am still triggered by it.
> >
> > I am mostly OK with option A, but with a big caveat:
> >
> > The root problem here is "You keep using that function.  I do not think it
> > means what you think it means."
> >
> > As Rowan notes, what people actually *want* most of the time is "I got
> > this string from a user and have NFI what its encoding is, but my system
> > needs UTF-8, so gimme this string in UTF-8."  And they use utf8_encode(),
> > which then fails *sometimes* in exciting and mysterious ways, because
> > that's not what it is for.
> >
> > Removing utf8_encode() may keep people from misusing it, but that doesn't
> > mean the problem space they were trying to solve goes away.  If anything,
> > people who still don't realize that it's the wrong solution will get angry
> > that we're taking away a "useful" tool and replacing it with "meh, go look
> > at library X," which is admittedly a pretty rude answer.
> >
> > If we're removing a bad answer to the problem, we should also replace it
> > with a good answer.
> >
> > Someone will, I'm sure, pop in at this point and declare "if you don't
> > know the character encoding you're receiving, then you're doing it wrong
> > and are already lost and we can't help you."  While that may be technically
> > correct, it's also an entirely useless answer because strings received over
> > HTTP very frequently do not tell you what their encoding is, or they lie
> > about what their encoding is.  (The header may say it's ISO8859, or UTF8,
> > or whatever, but someone copy-pasted from MS Word into a text box and now
> > it's Windows-1252 within a wrapper that says ISO8859 but is mostly UTF8
> > except for the Windows-1252 part.  Like, that's literally the problem I
> > lost several days to.)  "Your own fault" is not even an accurate answer at
> > that point.
> >
> > So if we're going to take away people's broken hammer, we need to be very
> > clear about what hammer to use instead.
> >
> > The initial answer is probably "here's how to use a series of mbstring
> > functions together to produce a reasonably good
> > guess-my-encoding-and-convert-to-utf8 routine" documentation.  Which... may
> > exist, but if it does I've never found it.  So at bare minimum the
> > utf8_encode() documentation needs to include a "use this code snippet
> > instead" description, and not just link to the mbstring extension.
> > Glancing through the mbstring docs right now, it looks like it's not
> > already a single function call, but some combination of several, and has
> > some global flags that get set (via mb_detect_order()), I think.  It's not
> > as easy to use as utf8_encode(), even if utf8_encode() is wrong.  That
> > suggests we may want to try and simplify the mbstring API, or internalize
> > some function that handles the most common case in a way that doesn't rely
> > on global flags.
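
One possible shape for such a "most common case" helper, as a rough sketch rather than a proposal: treat input that already validates as UTF-8 as UTF-8, and otherwise assume a single fallback encoding. The function name guess_to_utf8() and the Windows-1252 default are assumptions made for illustration; only mb_check_encoding() and mb_convert_encoding() are real mbstring calls, and neither depends on mb_detect_order().

```php
<?php
// Sketch only: assumes ext/mbstring is available.
// guess_to_utf8() is a hypothetical name, not an existing PHP function.
function guess_to_utf8(string $input, string $fallback = 'Windows-1252'): string
{
    // Input that is already well-formed UTF-8 is returned untouched.
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }

    // Otherwise assume the whole string is in the fallback encoding.
    // This is a guess: the bytes themselves cannot confirm it.
    return mb_convert_encoding($input, 'UTF-8', $fallback);
}

echo bin2hex(guess_to_utf8("caf\xE9")), "\n";   // Latin-1/CP1252 bytes in
echo bin2hex(guess_to_utf8("caf\u{E9}")), "\n"; // valid UTF-8 in
```

This does not handle the mixed case described above (mostly UTF-8 with stray Windows-1252 bytes inside one string); that would need per-sequence repair, which this sketch does not attempt.
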
> >
> > So, let's make that easier to use, so that we can change "this function is
> > wrong, we're taking it away from you" to "this function is wrong, here's a
> > way better alternative that you can use instead (while we quietly take the
> > wrong one away from you while you're distracted by the new shiny)."
> >
> > I don't know the mbstring API well enough to say what that alternative
> > ideally looks like, but if we can answer that it would make killing off the
> > old functions much more palatable.
> >
> > --Larry Garfield
> >
> >
> As an encoding nerd and perennial complainer regarding these functions I
> would like nothing more than to see them immediately disappear, but I do
> recognize the BC-breaking potential for something like that. However, I do
> have a suggestion that I've not seen mentioned yet that should at least
> address some of the misconceptions that people get from the current
> functions.
>
> I would suggest adding optional source/destination encoding parameters to
> the functions, e.g.:
>
> utf8_encode(string $string, string $source_encoding = "ISO-8859-1")
> utf8_decode(string $string, string $destination_encoding = "ISO-8859-1")
>
> and, if you'll forgive the hand-waving due to my unfamiliarity with PHP
> internals, they could simply be passed through to an underlying
> mb_convert_encoding() call. E.g.:
>
> mb_convert_encoding($string, 'UTF-8', $source_encoding)
> mb_convert_encoding($string, $destination_encoding, 'UTF-8')
>
> This would preserve BC while also making the function header and
> documentation much more descriptive of what the function actually does,
> allow more flexible use of the functions, and potentially drive people to
> use the mb_* functions instead. This could also be used as a gradual
> pathway to deprecating the functions, where, for example, a deprecation
> notice could be raised when the function is called without the
> source/destination encoding explicitly given.
>
> I know that there is also some resistance to the idea of requiring mbstring
> as it is an optional extension, as well as resistance to bringing mbstring
> into core due to design and/or history. This could be worked around by
> [once again, apologies for the hand-waving] only requiring mbstring for
> conversions involving an encoding other than ISO-8859-1 and falling back to
> the existing implementation otherwise.
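
For anyone who wants to try the suggested signatures before an RFC exists, here is a userland approximation. It is a sketch under assumptions: the names utf8_encode_from() and utf8_decode_to() are hypothetical (the built-in names cannot be redeclared), ext/mbstring is assumed to be loaded, and the ISO-8859-1 fallback without mbstring is left out.

```php
<?php
// Sketch only: assumes ext/mbstring is available.
// Both function names are hypothetical stand-ins for the proposed signatures.

// Convert from $source_encoding (ISO-8859-1 by default) to UTF-8.
function utf8_encode_from(string $string, string $source_encoding = 'ISO-8859-1'): string
{
    return mb_convert_encoding($string, 'UTF-8', $source_encoding);
}

// Convert from UTF-8 to $destination_encoding (ISO-8859-1 by default).
function utf8_decode_to(string $string, string $destination_encoding = 'ISO-8859-1'): string
{
    return mb_convert_encoding($string, $destination_encoding, 'UTF-8');
}

// With the defaults, the calls mirror today's utf8_encode() / utf8_decode().
echo bin2hex(utf8_encode_from("\xE9")), "\n";
// With an explicit encoding, the Euro sign survives the round trip.
echo bin2hex(utf8_decode_to("\u{20AC}", 'Windows-1252')), "\n";
```

Having the encoding visible in the signature is the main gain here: the default documents what the function really does, and an unknown encoding name raises an error in PHP 8 instead of silently producing garbage.
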
>

Now might be a good time to make this into an RFC.  :)

--Kris

--0000000000005f6c7805d3b13dff--