[RFC] Deprecate and Remove utf8_encode and utf8_decode

3 years ago by Rowan Tommins

Good $daypart everybody,

I would like to open discussion on an RFC to deprecate and later remove
the functions utf8_encode() and utf8_decode()

https://wiki.php.net/rfc/remove_utf8_decode_and_utf8_encode

This is not a straightforward decision, because these functions are not
actually broken, and do have a narrow set of legitimate use cases - they
convert between ISO 8859-1 (the single-byte encoding also known as
"Latin 1") and UTF-8.

However, their naming and implementation lead to them being misused so
frequently that I believe their inclusion in the language does more
harm than good. By deprecating and removing them, we can encourage users
to specify their source and target encodings explicitly using one of
several more flexible functions.
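
As a rough illustration of what "explicitly" means here, the sketch
below assumes ext/mbstring is loaded. The helper names latin1_to_utf8()
and utf8_to_latin1() are made up for this example and are not part of
the RFC. It also shows the typical misuse, where text that is already
UTF-8 gets converted a second time.

```php
<?php
// Sketch only: assumes ext/mbstring is loaded. Helper names are hypothetical.

// Same conversion as utf8_encode(), but with both encodings spelled out.
function latin1_to_utf8(string $bytes): string
{
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

// Same conversion as utf8_decode(), again with explicit encodings.
function utf8_to_latin1(string $text): string
{
    return mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');
}

$latin1 = "caf\xE9";                // "café" in ISO 8859-1 (4 bytes)
$utf8   = latin1_to_utf8($latin1);  // "caf\xC3\xA9" (5 bytes)

// Misuse: the input is already UTF-8, so each of its bytes is treated
// as a separate Latin-1 character and encoded again.
$double = latin1_to_utf8($utf8);    // "caf\xC3\x83\xC2\xA9"

echo bin2hex($utf8), "\n";
echo bin2hex($double), "\n";
```

iconv() takes the same two encoding names, with the source encoding
first.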

In previous discussions, alternatives have been proposed to improve
these functions, but I am not convinced this is in the long-term
interest of the language (see "Rejected Features" section in the RFC).

Note: one of the alternatives, UConverter::transcode, is currently
undocumented; I have written a documentation patch to address this and
other issues, regardless of the RFC's result:
https://github.com/php/doc-en/pull/1418

Your feedback on the RFC, and on the proposal itself, is welcome.

Regards,

--
Rowan Tommins
[IMSoP]

3 years ago by Craig Francis

> I would like to open discussion on an RFC to deprecate and later remove
> the functions utf8_encode() and utf8_decode()
>
> https://wiki.php.net/rfc/remove_utf8_decode_and_utf8_encode

Thanks Rowan.

Whenever I see these functions being used (including when I used them),
it's been, as you note, "commonly misunderstood" - so I'd be happy to see
the back of them.

Only query I have is about the availability of different functions... not
sure why, but the documentation says these are provided by the
"xml" extension, even though it looks like they are in
./standard/string.c (your pull request seems to correct this)... so I
assume projects have used these functions on the basis that they are always
available... I suppose you could argue that "iconv" is enabled by default,
so that's hopefully reliable (even though it can be disabled with
--without-iconv)... whereas "mbstring" and "intl" are non-default
extensions.

Craig

3 years ago by Rowan Tommins

> Only query I have is about the availability of different functions...
> not sure why, but the documentation says these are provided by the
> "xml" extension, even though it looks like they are in
> ./standard/string.c (your pull request seems to correct this)... so
> I assume projects have used these functions on the basis that they are
> always available... I suppose you could argue that "iconv" is enabled
> by default, so that's hopefully reliable (even though it can be
> disabled with --without-iconv)... whereas "mbstring" and "intl" are
> non-default extensions.

Yes, since 7.2, utf8_encode and utf8_decode have been always available;
before that, they were in ext/xml (which in practice meant nearly
always available). The fact that none of the alternatives are guaranteed
to be available is unfortunate, but by their nature they are large
amounts of code, so moving or replicating them in core is not really an
option.

I don't have hard facts to back it up, but my impression is that
ext/mbstring is quite commonly installed, and required by apps and
libraries. Unlike the other two, it has no system dependencies, because
the implementation is entirely in PHP's source tree.

I'm not sure how often iconv is enabled (default according to php.net
doesn't necessarily mean default according to Ubuntu / CentOS / cheap
shared hosting), but its functionality isn't very portable between
systems - for instance, 3v4l.org rejects 'ISO-8859-1' as an encoding
[https://3v4l.org/biGa8], but my local system accepts it, although both
report ICONV_IMPL as "glibc".

ext/intl is by far the most powerful of the three extensions, albeit
extremely poorly documented; but it may not be installed as often,
because that power comes from a large external library (ICU).

The bright side is that if you really do only need one encoding pair,
implementing in pure PHP is pretty trivial, and there are multiple
polyfills already out there. That leaves a minority of a minority of a
minority, who a) actually need Latin1 <-> UTF-8, and no other encodings;
b) can't rely on any of the three listed extensions; AND c) care enough
about performance that a pure PHP implementation is problematic.
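
To give an idea of how small such a pure-PHP implementation can be,
here is a minimal sketch. The function names are hypothetical, it is
not one of the existing polyfills, and the choice to emit "?" for
anything that cannot be represented is an assumption of this example.

```php
<?php
// Sketch of a pure-PHP conversion; function names are hypothetical.

// ISO 8859-1 -> UTF-8: every byte is one code point, U+0000..U+00FF.
function latin1_to_utf8_pure(string $bytes): string
{
    $out = '';
    $len = strlen($bytes);
    for ($i = 0; $i < $len; $i++) {
        $c = ord($bytes[$i]);
        if ($c < 0x80) {
            $out .= $bytes[$i];  // ASCII is unchanged
        } else {
            // Two-byte UTF-8 sequence: 110xxxxx 10xxxxxx
            $out .= chr(0xC0 | ($c >> 6)) . chr(0x80 | ($c & 0x3F));
        }
    }
    return $out;
}

// UTF-8 -> ISO 8859-1: anything outside U+0000..U+00FF, and any
// malformed sequence, becomes "?" in this sketch.
function utf8_to_latin1_pure(string $text): string
{
    $out = '';
    $len = strlen($text);
    for ($i = 0; $i < $len; $i++) {
        $c = ord($text[$i]);
        if ($c < 0x80) {
            $out .= $text[$i];
        } elseif (($c === 0xC2 || $c === 0xC3) && $i + 1 < $len
            && (ord($text[$i + 1]) & 0xC0) === 0x80) {
            // U+0080..U+00FF are exactly the sequences led by C2 or C3.
            $out .= chr((($c & 0x03) << 6) | (ord($text[$i + 1]) & 0x3F));
            $i++;
        } else {
            $out .= '?';
            // Skip any continuation bytes that follow.
            while ($i + 1 < $len && (ord($text[$i + 1]) & 0xC0) === 0x80) {
                $i++;
            }
        }
    }
    return $out;
}

echo bin2hex(latin1_to_utf8_pure("caf\xE9")), "\n";
```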

Regards,

--
Rowan Tommins
[IMSoP]

3 years ago by Craig Francis

> I don't have hard facts to back it up, but my impression is that
> ext/mbstring is quite commonly installed, and required by apps and
> libraries. Unlike the other two, it has no system dependencies, because
> the implementation is entirely in PHP's source tree.

Thanks for confirming those details Rowan.

I'm just wondering, and this would not be necessary... considering how most
systems need to deal with UTF-8 data today, could an argument be made for
enabling ext/mbstring by default?

I'm fairly sure that on Ubuntu and CentOS the php-mbstring package has to
be installed separately; whereas in my limited experience, cheap/shared
hosting tends to have it enabled.

Then mb_convert_encoding() could become the default suggested
alternative, and everyone could trust functions like mb_strlen() are
available as well.

Just for my own interest - technically mbstring uses libmbfl, but that's
already available from /ext/mbstring/libmbfl/

Craig

3 years ago by Rowan Tommins

> I'm just wondering, and this would not be necessary... considering how
> most systems need to deal with UTF-8 data today, could an argument be
> made for enabling ext/mbstring by default?
>
> I'm fairly sure that on Ubuntu and CentOS the php-mbstring package has
> to be installed separately; whereas in my limited experience,
> cheap/shared hosting tends to have it enabled.

Unfortunately, enabling by default in the distributed source files won't
make any difference to that situation, as anything that can be built as
a separate library file can (and seemingly will) be split into a
separate package in a binary distribution.

Making the extension always available (impossible to compile without it)
is a potential option, and I think has been suggested before; I'm not
sure of the exact pros and cons.

> everyone could trust functions like mb_strlen() are available as well.

I would personally encourage everyone to have ext/intl installed and use
grapheme_strlen() instead of mb_strlen(), because knowing whether a
particular instance of the string "Nguyễn" is written with 6, 7, or 8
code points is not nearly as useful as knowing that it looks like 6
"characters" to a user either way.

Regards,

--
Rowan Tommins
[IMSoP]

3 years ago by Craig Francis

> Making the extension always available (impossible to compile without it)
> is a potential option, and I think has been suggested before; I'm not
> sure of the exact pros and cons.

[...]

> I would personally encourage everyone to have ext/intl installed and use
> grapheme_strlen() instead of mb_strlen(), because knowing whether a
> particular instance of the string "Nguyễn" is written with 6, 7, or 8
> code points is not nearly as useful as knowing that it looks like 6
> "characters" to a user either way.

Good point.
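
The difference between counting code points and counting graphemes can
be seen in a short sketch. It uses mb_strlen() and grapheme_strlen()
when those extensions are loaded and otherwise falls back to PCRE, which
is an assumption of this example rather than anything proposed in the
thread; both helper names are hypothetical.

```php
<?php
// Sketch only. Uses ext/mbstring and ext/intl when they are loaded and
// falls back to PCRE otherwise; both helper names are hypothetical.

function code_point_count(string $s): int
{
    return function_exists('mb_strlen')
        ? mb_strlen($s, 'UTF-8')
        : (int) preg_match_all('/./su', $s);
}

function grapheme_count(string $s): int
{
    return function_exists('grapheme_strlen')
        ? (int) grapheme_strlen($s)
        : (int) preg_match_all('/\X/u', $s);
}

// Three ways of encoding the same visible text, "Nguyễn":
$forms = [
    'precomposed'      => "Nguy\u{1EC5}n",         // ễ as a single code point
    'partly composed'  => "Nguy\u{EA}\u{303}n",    // ê + combining tilde
    'fully decomposed' => "Nguye\u{302}\u{303}n",  // e + circumflex + tilde
];

foreach ($forms as $label => $s) {
    printf(
        "%-17s code points=%d graphemes=%d\n",
        $label,
        code_point_count($s),
        grapheme_count($s)
    );
}
```

The code point count comes out as 6, 7 and 8 for the three forms, while
the grapheme count is 6 each time.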

I would like something that can be relied on to convert a string's character
encoding... I assume it's a question of ext/mbstring having all of its
dependencies already present (easier to compile?), vs ext/intl potentially
being more useful (if a little bigger?).

Craig

3 years ago by Christoph M. Becker

>> Making the extension always available (impossible to compile without it)
>> is a potential option, and I think has been suggested before; I'm not
>> sure of the exact pros and cons.
>
> [...]
>
>> I would personally encourage everyone to have ext/intl installed and use
>> grapheme_strlen() instead of mb_strlen(), because knowing whether a
>> particular instance of the string "Nguyễn" is written with 6, 7, or 8
>> code points is not nearly as useful as knowing that it looks like 6
>> "characters" to a user either way.
>
> Good point.
>
> I would like something that can be relied on to convert a string's character
> encoding... I assume it's a question of ext/mbstring having all of its
> dependencies already present (easier to compile?), vs ext/intl potentially
> being more useful (if a little bigger?).

We cannot make any extension with external dependencies mandatory. If
we required ext/intl, we would have to bundle ICU, which is highly
unlikely to happen. Making ext/mbstring (without ext/mbregex) mandatory
would be an option, but there should be a separate RFC about that.

--
Christoph M. Becker

3 years ago by Alain D D Williams

> I would personally encourage everyone to have ext/intl installed and use
> grapheme_strlen() instead of mb_strlen(), because knowing whether a
> particular instance of the string "Nguyễn" is written with 6, 7, or 8
> code points is not nearly as useful as knowing that it looks like 6
> "characters" to a user either way.

Looking at the description of grapheme_strlen() I note that it can return null.
However it does not say why.

https://www.php.net/manual/en/function.grapheme-strlen.php

Digging in the code I see that it will return null if
intl_convert_utf8_to_utf16() fails. I think because of one of:

- U_BUFFER_OVERFLOW_ERROR means that *target buffer is not large enough
- U_STRING_NOT_TERMINATED_WARNING usually means that the input string is empty
- u_strFromUTF8() failing.

--
Alain Williams
Linux/GNU Consultant - Mail systems, Web sites, Networking, Programmer, IT Lecturer.
+44 (0) 787 668 0256 https://www.phcomp.co.uk/
Parliament Hill Computers Ltd. Registration Information: https://www.phcomp.co.uk/Contact.html
#include <std_disclaimer.h>

3 years ago by Rowan Tommins

> Looking at the description of grapheme_strlen() I note that it can return null.
> However it does not say why.

Huh, that feels like a bug to me, since it can also return false, which
is the more standard way of indicating failure.

The obvious failure case is an input string that's not valid UTF-8, e.g.
grapheme_strlen("\xFF"). It appears that currently returns null; so I'm
not actually sure how you'd trigger the false case.
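
Until the two failure values are reconciled, calling code can treat
them the same way. The wrapper below is only a sketch: grapheme_length()
is a hypothetical name, and mapping "ext/intl not loaded" to null as
well is an assumption made to keep the example self-contained.

```php
<?php
// Sketch only; grapheme_length() is a hypothetical wrapper name.
// Returns the grapheme count, or null if ext/intl is not loaded or the
// call fails (whether it signals that failure with null or with false).
function grapheme_length(string $s): ?int
{
    if (!function_exists('grapheme_strlen')) {
        return null;
    }
    $n = grapheme_strlen($s);
    return is_int($n) ? $n : null;
}

var_dump(grapheme_length("abc"));
var_dump(grapheme_length("\xFF"));  // not valid UTF-8
```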

Regards,

--
Rowan Tommins
[IMSoP]

3 years ago by Christoph M. Becker

>> Looking at the description of grapheme_strlen() I note that it can
>> return null.
>> However it does not say why.
>
> Huh, that feels like a bug to me, since it can also return false, which
> is the more standard way of indicating failure.

This is not the only function which returns null or false for different
kinds of failures.

> The obvious failure case is an input string that's not valid UTF-8, e.g.
> grapheme_strlen("\xFF"). It appears that currently returns null; so I'm
> not actually sure how you'd trigger the false case.

false is returned, if the internal function
grapheme_get_break_iterator() fails. I'm not sure under which
circumstances that function may fail, but from looking at the code and
the ICU documentation[1], it seems we're relying on functionality which
is deprecated as of ICU 52; it's probably a good idea to adjust the
code, and to possibly bump the ICU version requirements (currently,
ICU >= 50.1 is required).

[1]
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ubrk_8h.html#ad1917029dcd00164d18e9d77b73be195

--
Christoph M. Becker

3 years ago by Rowan Tommins

> Good $daypart everybody,
>
> I would like to open discussion on an RFC to deprecate and later
> remove the functions utf8_encode() and utf8_decode()
>
> https://wiki.php.net/rfc/remove_utf8_decode_and_utf8_encode

Hi all,

I have made some minor updates to this RFC, to explicitly recommend
mb_convert_encoding as the alternative. See the "Alternatives to Removed
Functionality" for the reasoning.

If there is no further feedback, I will open voting some time next week,
so if you have any concerns or queries please let me know.

Regards,

--
Rowan Tommins
[IMSoP]

3 years ago by Rowan Tommins

> I would like to open discussion on an RFC to deprecate and later
> remove the functions utf8_encode() and utf8_decode()
>
> https://wiki.php.net/rfc/remove_utf8_decode_and_utf8_encode

Final chance for feedback on this RFC, as revised, before I put it to a
vote.

There's been very little reaction on this thread, which I'm hoping means
everyone's either planning to vote "Yes" or just doesn't care...

Regards,

--
Rowan Tommins
[IMSoP]

3 years ago by Wade Rossmann

As someone who has spent a lot of time on IRC and StackOverflow answering
encoding questions over the last 10 years or so I am 100% behind this
change, even though I'm not able to vote for it.

Thank you for putting this RFC forward.

3 years ago by Juliette Reinders Folmer

> I would like to open discussion on an RFC to deprecate and later
> remove the functions utf8_encode() and utf8_decode()
>
> https://wiki.php.net/rfc/remove_utf8_decode_and_utf8_encode
>
> Final chance for feedback on this RFC, as revised, before I put it to
> a vote.
>
> There's been very little reaction on this thread, which I'm hoping
> means everyone's either planning to vote "Yes" or just doesn't care...
>
> Regards,

While I agree the functions are often used incorrectly, what worries me
about this RFC is that the only viable alternatives for these functions
are in two optional extensions, which in practice will mean lots of
projects will need to start polyfilling the old functions and/or start
including a polyfill for one of the extensions, as they can not rely on
those optional extensions being turned on.

3 years ago by Rowan Tommins

> While I agree the functions are often used incorrectly, what worries
> me about this RFC is that the only viable alternatives for these
> functions are in two optional extensions, which in practice will mean
> lots of projects will need to start polyfilling the old functions
> and/or start including a polyfill for one of the extensions, as they
> can not rely on those optional extensions being turned on.

While I certainly understand this concern, the more I look into it, the
less worried I am, because so many projects already require the mbstring
extension. The list includes projects like PHPUnit, Laravel, Drupal, and
phpBB. A lot of others will probably detect it at run-time, and disable
certain functionality - WooCommerce lists it as required only for
non-English sites, for instance.

There probably is a case for making mbstring always-on rather than
optional, but that perhaps requires a bigger discussion around what a
"minimum" PHP should include. Why is XML support not always enabled, for
instance?

If these functions were proposed today, under better names but the same
feature set, I don't think mbstring being optional would be accepted as
reasoning for adding them to core. So the only reason to keep them is if
they're widely (and successfully) used, which I've not found evidence for.

Regards,

--
Rowan Tommins
[IMSoP]

3 years ago by Christian Schneider

On 24.03.2022 at 10:32, Rowan Tommins <rowan.collins@gmail.com> wrote:

>> While I agree the functions are often used incorrectly, what worries me about this RFC is that the only viable alternatives for these functions are in two optional extensions, which in practice will mean lots of projects will need to start polyfilling the old functions and/or start including a polyfill for one of the extensions, as they can not rely on those optional extensions being turned on.

[...]

> If these functions were proposed today, under better names but the same feature set, I don't think mbstring being optional would be accepted as reasoning for adding them to core. So the only reason to keep them is if they're widely (and successfully) used, which I've not found evidence for.

This argument is somewhat broken: Not adding something because an alternative exists is not the same as removing something (requiring code changes) where an alternative exists.

Don't get me wrong: I'm on the fence about whether deprecating and later removing is the right thing to do; I'm not completely against it.

I have one issue with the wording in the RFC: While “Function utf8_encode is deprecated; check usage is correct and consider mb_convert_encoding or other replacement.” suggests replacing it, the part about checking the usage implies that if someone is sure about the correct usage it is fine to keep using utf8_encode(). But as the proposal wants to remove it for 9.0 I think this is somewhat misleading.

Regards,

Chris

3 years ago by Rowan Tommins

>> If these functions were proposed today, under better names but the same feature set, I don't think mbstring being optional would be accepted as reasoning for adding them to core. So the only reason to keep them is if they're widely (and successfully) used, which I've not found evidence for.
>
> This argument is somewhat broken: Not adding something because an alternative exists is not the same as removing something (requiring code changes) where an alternative exists.

Absolutely, and an earlier draft of my e-mail said that more explicitly,
but I edited for conciseness. I was trying to make the distinction that
we wouldn't want to tell new users "if you want to measure the length of
the string, install ext/strlen"; but telling new users "if you want to
convert encodings, install ext/mbstring" is probably reasonable.

You are absolutely right that that does not apply to existing uses
of the functions, and there will be some proportion of users / libraries
who were previously using these functions successfully, but do not have
mbstring installed / required. I think the usage is low enough, and the
install base of mbstring is high enough, that that won't affect many
people; but it's hard to prove that conclusively.

In the end, we have to weigh two costs: the cost to new users of
leaving these confusing functions in place; and the cost to some portion
of existing users of forcing them to change their code, and possibly
install ext-mbstring (or some alternative). Doing nothing is not
automatically better.

> I have one issue with the wording in the RFC: While “Function utf8_encode is deprecated; check usage is correct and consider mb_convert_encoding or other replacement.” suggests replacing it, the part about checking the usage implies that if someone is sure about the correct usage it is fine to keep using utf8_encode(). But as the proposal wants to remove it for 9.0 I think this is somewhat misleading.

That's a fair point. The intention was to encourage people to look at
whether they were using the function right in the first place, not
blindly replace it with mb_convert_encoding, since the whole point of
the deprecation is that they probably aren't. I'm not sure how to
concisely say "replace with mb_convert_encoding if you actually need to,
but maybe you can just delete this function call and your code will be
better".

Regards,

--
Rowan Tommins
[IMSoP]

3 years ago by Rowan Tommins

>> I have one issue with the wording in the RFC: While “Function
>> utf8_encode is deprecated; check usage is correct and consider
>> mb_convert_encoding or other replacement.” suggests replacing it,
>> the part about checking the usage implies that if someone is sure
>> about the correct usage it is fine to keep using utf8_encode(). But
>> as the proposal wants to remove it for 9.0 I think this is somewhat
>> misleading.
>
> That's a fair point. The intention was to encourage people to look at
> whether they were using the function right in the first place, not
> blindly replace it with mb_convert_encoding, since the whole point of
> the deprecation is that they probably aren't. I'm not sure how to
> concisely say "replace with mb_convert_encoding if you actually need
> to, but maybe you can just delete this function call and your code
> will be better".

Maybe I'm trying to be "too helpful" there. Should we just use the
generic deprecation message, and let people look up the in-depth
explanation in the manual?

Anyone have any thoughts on that?

Regards,

--
Rowan Tommins
[IMSoP]

3 years ago by Rowan Tommins

> Maybe I'm trying to be "too helpful" there. Should we just use the
> generic deprecation message, and let people look up the in-depth
> explanation in the manual?

I've dropped the custom deprecation message from the proposal, as
there's just not room for the subtlety I was trying to get across.

I will open voting this time tomorrow unless anyone has any further
comments.

Regards,

--
Rowan Tommins
[IMSoP]

3 years ago by Björn Larsson via internals

On 2022-03-23 at 14:46, Rowan Tommins wrote:

> I would like to open discussion on an RFC to deprecate and later
> remove the functions utf8_encode() and utf8_decode()
>
> https://wiki.php.net/rfc/remove_utf8_decode_and_utf8_encode
>
> Final chance for feedback on this RFC, as revised, before I put it to a
> vote.
>
> There's been very little reaction on this thread, which I'm hoping means
> everyone's either planning to vote "Yes" or just doesn't care...
>
> Regards,

Well, the use case I can provide is that we have a site with content
in ISO 8859-1 (Latin 1).

Now utf8_decode is used in the backend when the webapps and site
are communicating with the DB, e.g. in Ajax calls. The webapps are
built in JavaScript.

Our feedback would be that the usage of this function has had zero
issues or bugs in the more than five years that e.g. the apps have been
in production. So in this specific use case there is no benefit at the
moment from this removal. OTOH, the usage of this function is mostly
wrapped so it's not a big job changing it and it's good that the RFC
points out alternatives, where we would go for mbstring.

I also find the usage of both functions in one of our most important
Open Source libraries. How important they are there I don't know.

r//Björn

3 years ago by Rowan Tommins

> Well, the use case I can provide is that we have a site with content
> in ISO 8859-1 (Latin 1).

Thanks for the data point. Out of curiosity, what language is the site
in, and where are you based?

One of the things missing from Latin-1 is the Euro symbol (because it
wasn't invented yet), so I'm guessing your site doesn't need that?

> I also find the usage of both functions in one of our most important
> Open Source libraries. How important they are there I don't know.

Can you point me to the library in question? I'd be interested to see
how they're using it.

Regards,

--
Rowan Tommins
[IMSoP]

3 years ago by Björn Larsson via internals

On 2022-03-24 at 19:48, Rowan Tommins wrote:

>> Well, the use case I can provide is that we have a site with content
>> in ISO 8859-1 (Latin 1).
>
> Thanks for the data point. Out of curiosity, what language is the site
> in, and where are you based?

It's in Swedish and I'm located in Sweden.

> One of the things missing from Latin-1 is the Euro symbol (because it
> wasn't invented yet), so I'm guessing your site doesn't need that?

No, we are not using that one. In one specific case for payment we just
write euro ;-)

>> I also find the usage of both functions in one of our most important
>> Open Source libraries. How important they are there I don't know.
>
> Can you point me to the library in question? I'd be interested to see
> how they're using it.

Yes, it's the Revive Ad server. At the moment we are on v4, going for
v5.4 as soon as it's released.

Regards //Björn Larsson

3 years ago by php@beccati.com

Hi Bjorn,

>> Can you point me to the library in question? I'd be interested to see
>> how they're using it.
>
> Yes, it's the Revive Ad server. At the moment we are on v4, going for
> v5.4 as soon as it's released.

Thanks, I was supposed to check myself, but you beat me to it!

From a cursory look:

- some calls are wrapped in a function_exists check and are lower
  priority than mbstring or iconv.
- some are surely inappropriate usages.
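
A function_exists chain of that general shape might look like the
sketch below. to_utf8_from_latin1() is a hypothetical helper and is not
taken from the Revive code; the pure-PHP last resort is an assumption of
this example, standing in for the utf8_encode() call once it is gone.

```php
<?php
// Sketch only: to_utf8_from_latin1() is a hypothetical helper.
function to_utf8_from_latin1(string $bytes): string
{
    // First choice: ext/mbstring, with both encodings named explicitly.
    if (function_exists('mb_convert_encoding')) {
        return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
    }

    // Second choice: ext/iconv. Encoding names are not equally portable
    // across systems, so a failure falls through to the last resort.
    if (function_exists('iconv')) {
        $converted = @iconv('ISO-8859-1', 'UTF-8', $bytes);
        if ($converted !== false) {
            return $converted;
        }
    }

    // Last resort: pure PHP. Each byte >= 0x80 becomes a two-byte
    // UTF-8 sequence; ASCII bytes are left untouched.
    return (string) preg_replace_callback(
        '/[\x80-\xFF]/',
        static function (array $m): string {
            $c = ord($m[0]);
            return chr(0xC0 | ($c >> 6)) . chr(0x80 | ($c & 0x3F));
        },
        $bytes
    );
}

echo bin2hex(to_utf8_from_latin1("caf\xE9")), "\n";
```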

If not 5.4.0, I'll make sure the matter is fully taken care of in a
subsequent release.

So please, I'm all for burning utf8_en/decode with fire in 9.0.

Cheers

Matteo Beccati

Development & Consulting - http://www.beccati.com/