What should we do with utf8_encode and utf8_decode?

4 years ago by Benjamin Morel

I can see three ways forward:

A) Raise a deprecation notice in 8.1, and remove in 9.0. Do not provide
a specific replacement, but recommend people look at iconv() or
mb_convert_encoding(). There is precedent for this, such as
convert_cyr_string(), but it may frustrate those who are using the
functions correctly.

B) Introduce new names, such as utf8_to_iso_8859_1 and
iso_8859_1_to_utf8; immediately make those the primary names in the
manual, with utf8_encode / utf8_decode as aliases. Raise deprecation
notices for the old names, either immediately or in some future release.
This gives a smoother upgrade path, but commits us to having these
functions as outliers in our standard library.

Hi, I'm personally fine with A or B, both of which have pros & cons:

A is probably the cleanest way as, as you said, these functions should
never have existed (locked to a single encoding that will only benefit a
portion of users), but that's quite a BC break

B is less of a BC break as it gives users a chance to rename their
function calls, but leaves an oddity in the standard library

I'm a bit worried that either way, we'll start seeing some "polyfills"
appear on Packagist to re-introduce the old functions, but at least they
will be gone from the core.

— Benjamin

4 years ago by Ben Ramsey

I can see three ways forward:

A) Raise a deprecation notice in 8.1, and remove in 9.0. Do not provide
a specific replacement, but recommend people look at iconv() or
mb_convert_encoding(). There is precedent for this, such as
convert_cyr_string(), but it may frustrate those who are using the
functions correctly.

B) Introduce new names, such as utf8_to_iso_8859_1 and
iso_8859_1_to_utf8; immediately make those the primary names in the
manual, with utf8_encode / utf8_decode as aliases. Raise deprecation
notices for the old names, either immediately or in some future release.
This gives a smoother upgrade path, but commits us to having these
functions as outliers in our standard library.

Hi, I'm personally fine with A or B, both of which have pros & cons:

A is probably the cleanest way as, as you said, these functions should
never have existed (locked to a single encoding that will only benefit a
portion of users), but that's quite a BC break

B is less of a BC break as it gives users a chance to rename their
function calls, but leaves an oddity in the standard library

I'm a bit worried that either way, we'll start seeing some "polyfills"
appear on Packagist to re-introduce the old functions, but at least they
will be gone from the core.

I prefer option A, and the emergence of userland polyfills doesn’t worry
me. IMO, that’s the right way for the community to handle the BC break.

Cheers,
Ben

4 years ago by Ayesh Karunaratne

Thank you for opening this conversation, these functions have stung me
in the past, and I would be so happy to see them gone :)

Personally, I would very much like to go with Plan A.

XML parsers that deal with non-UTF-8 character encodings frequently
use these functions. However, any parser worth its salt is better off
using mbstring or iconv, because these functions lack the Windows-1252
support that is often assumed elsewhere for ISO-8859-1. If we had a
utf8_encode that supported Windows-1252, as is often expected, I think
plan B would be the smoother upgrade.

Among the top 1000 downloads on Packagist, stripe-php, phpcpd, pdepend,
carbon, monolog, php-cs-fixer, htmlpurifier, and aws-php-sdk use
utf8_encode. Some of these libraries depend on ext-mbstring or the
Symfony mbstring polyfill, so we are left with even fewer libraries
that cannot assume iconv() or mb_convert_encoding() is available.

Hi all,

The functions utf8_encode and utf8_decode are historical oddities, which
almost certainly would not be accepted if proposed today:

Their names do not describe their functionality, which is to convert
to/from one specific single-byte encoding. This leads to a common
confusion that they can be used to "fix" UTF-8 encoding problems, which
they generally make worse.

That single-byte encoding is ISO 8859-1, not its common cousins
Windows-1252 or ISO 8859-15. This means, for instance, that they do not
handle the Euro sign: utf8_decode('€') returns '?' (i.e. unmappable),
not "\x80" (Windows-1252) or "\xA4" (8859-15).
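
As a rough illustration of the Euro-sign case, the sketch below converts the same character to all three encodings. It deliberately uses mb_convert_encoding() rather than utf8_decode() itself, and assumes ext/mbstring is loaded with its default substitute character ('?'); the variable names are purely illustrative.

```php
<?php
// Assumption: ext/mbstring is available and mb_substitute_character()
// has not been changed from its default.
$euro = "\u{20AC}"; // the Euro sign, stored as the UTF-8 bytes E2 82 AC

// ISO 8859-1 has no Euro sign, so the character gets substituted.
$latin1 = mb_convert_encoding($euro, 'ISO-8859-1', 'UTF-8');

// The two "cousin" encodings each map it to a single byte.
$cp1252 = mb_convert_encoding($euro, 'Windows-1252', 'UTF-8');
$latin9 = mb_convert_encoding($euro, 'ISO-8859-15', 'UTF-8');

// Print the resulting bytes as hex so they are visible on any terminal.
printf("%s %s %s\n", bin2hex($latin1), bin2hex($cp1252), bin2hex($latin9));
```

Only the ISO 8859-1 conversion loses the character, which is the behaviour utf8_decode() is locked into.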

On the other hand, they are commonly used, both correctly and
incorrectly, so removing them is not easy.

A previous proposal to remove them [1] resulted in Andrea making two
significant improvements: moving them from ext/xml to ext/standard [2]
and rewriting the documentation to explain them properly [3]. My genuine
thanks for that.

However, it hasn't stopped people misunderstanding them, and quite
reasonably: you shouldn't need to look up every function you use in the
manual, to make sure it actually does what its name suggests.

I can see three ways forward:

A) Raise a deprecation notice in 8.1, and remove in 9.0. Do not provide
a specific replacement, but recommend people look at iconv() or
mb_convert_encoding(). There is precedent for this, such as
convert_cyr_string(), but it may frustrate those who are using the
functions correctly.

B) Introduce new names, such as utf8_to_iso_8859_1 and
iso_8859_1_to_utf8; immediately make those the primary names in the
manual, with utf8_encode / utf8_decode as aliases. Raise deprecation
notices for the old names, either immediately or in some future release.
This gives a smoother upgrade path, but commits us to having these
functions as outliers in our standard library.

C) Leave them alone forever. Treat it as the user's fault if they mess
things up by misunderstanding them.

I am happy to put together an RFC for either A or B, if it has a chance
of reaching consensus. I would really like to avoid option C.

[1] https://externals.io/message/95166
[2] https://github.com/php/php-src/pull/2160
[3] https://github.com/php/doc-en/commit/838941f6cce51f3beda16012eb497b26295a8238

Regards,

--
Rowan Tommins
[IMSoP]

--

To unsubscribe, visit: https://www.php.net/unsub.php

4 years ago by Larry Garfield

I lost several days of my life to exactly this problem, many years ago. I am still triggered by it.

I am mostly OK with option A, but with a big caveat:

The root problem here is "You keep using that function. I do not think it means what you think it means."

As Rowan notes, what people actually want most of the time is "I got this string from a user and have NFI what its encoding is, but my system needs UTF-8, so gimmie this string in UTF-8." And they use utf8_encode(), which then fails sometimes in exciting and mysterious ways, because that's not what it does.

Removing utf8_encode() may keep people from misusing it, but that doesn't mean the problem space they were trying to solve goes away. If anything, people who still don't realize that it's the wrong solution will get angry that we're taking away a "useful" tool and replacing it with "meh, go look at library X," which is admittedly a pretty rude answer.

If we're removing a bad answer to the problem, we should also replace it with a good answer.

Someone will, I'm sure, pop in at this point and declare "if you don't know the character encoding you're receiving, then you're doing it wrong and are already lost and we can't help you." While that may be technically correct, it's also an entirely useless answer because strings received over HTTP very frequently do not tell you what their encoding is, or they lie about what their encoding is. (The header may say it's ISO8859, or UTF8, or whatever, but someone copy-pasted from MS Word into a text box and now it's Windows-1252 within a wrapper that says ISO8859 but is mostly UTF8 except for the Windows-1252 part. Like, that's literally the problem I lost several days to.) "Your own fault" is not even an accurate answer at that point.

So if we're going to take away people's broken hammer, we need to be very clear about what hammer to use instead.

The initial answer is probably "here's how to use a series of mbstring functions together to produce a reasonably good guess-my-encoding-and-convert-to-utf8 routine" documentation. Which... may exist, but if it does I've never found it. So at bare minimum the utf8_encode() documentation needs to include a "use this code snippet instead" description, and not just link to the mbstring extension. Glancing through the mbstring docs right now, it looks like it's not already a single function call, but some combination of several, and has some global flags that get set (via mb_detect_order()), I think. It's not as easy to use as utf8_encode(), even if utf8_encode() is wrong. That suggests we may want to try and simplify the mbstring API, or internalize some function that handles the most common case in a way that doesn't rely on global flags.

So, let's make that easier to use, so that we can change "this function is wrong, we're taking it away from you" to "this function is wrong, here's a way better alternative that you can use instead (while we quietly take the wrong one away from you while you're distracted by the new shiny)."

I don't know the mbstring API well enough to say what that alternative ideally looks like, but if we can answer that it would make killing off the old functions much more palatable.

--Larry Garfield

4 years ago by Rowan Tommins

As Rowan notes, what people actually want most of the time is "I got this string from a user and have NFI what its encoding is, but my system needs UTF-8, so gimmie this string in UTF-8." And they use utf8_encode(), which then fails sometimes in exciting and mysterious ways, because that's not what it does.

[...]

If we're removing a bad answer to the problem, we should also replace it with a good answer.

This is indeed my main concern with complete deprecation. The problem is
that detecting string encoding is a Really Hard Problem™.

The fundamental problem is that any sequence of bytes is valid in any
single-byte encoding. If you're expecting printable characters only, you
can rule out some candidates if you're lucky - e.g. if your string
contains a byte in the range 0x80 to 0x9F, it's not any part of ISO 8859 -
but the string "\xB0\xC0\xD0" is both valid and printable in any of
dozens of 8-bit encodings.
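
To make that ambiguity concrete, here is a small sketch that decodes those same three bytes under a few single-byte encodings. It assumes ext/mbstring is loaded, and the list of candidate encodings is an arbitrary choice for illustration.

```php
<?php
// Assumption: ext/mbstring is available. The candidate list is arbitrary.
$bytes = "\xB0\xC0\xD0";
$decoded = [];

foreach (['ISO-8859-1', 'ISO-8859-5', 'Windows-1251'] as $encoding) {
    // Every candidate accepts the bytes without complaint; nothing in the
    // data itself says which interpretation the author intended.
    $decoded[$encoding] = mb_convert_encoding($bytes, 'UTF-8', $encoding);
    echo $encoding, ': ', $decoded[$encoding], "\n";
}
```

Each line of output is printable text, and each is different, so byte validity alone cannot pick a winner.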

I recently came across a Python library implementing a clever approach
to the problem, which originated at Mozilla. Its concise FAQ is worth
reading: https://chardet.readthedocs.io/en/latest/faq.html The approach
Mozilla came up with is to decide which encoding leads to something most
likely to be natural human text - e.g. don't suggest an encoding common
for Cyrillic if the result would be completely unpronounceable in Russian.

The only function I know of which even attempts encoding detection in
PHP is mb_detect_encoding, and it does a pretty bad job. For instance:

echo mb_detect_encoding("\x80500", ['Windows-1252', 'ISO-8859-15', 'ISO-8859-1']);

...picks ISO-8859-15, where 0x80 is a rarely-used control character,
rather than Windows-1252, where it's the Euro symbol.

On the other hand, if you know what encoding you do have, either of the
following will work fine:

echo mb_convert_encoding("\x80500", 'UTF-8', 'Windows-1252');
echo iconv('Windows-1252', 'UTF-8', "\x80500");

Either of these functions (passed ISO-8859-1) can be used as a polyfill
for correct uses of utf8_encode/utf8_decode, but neither is going to do
the magic trick which people always hope those functions will.

Regards,

--
Rowan Tommins
[IMSoP]

4 years ago by Kamil Tekiela

Option A, please.

I have never had a reason to use either of these two functions. I assume
there are plenty of valid applications for converting between ISO-8859-1
and UTF-8, but these functions cause more harm than good.

I have seen plenty of people use them, but I have never seen anyone use
them properly. Most of the time people use them to fix their mojibake text
when they forget to set the connection charset in PDO or mysqli. I was a
little surprised to learn that these functions had something to do with XML.

The reason why I consider them dangerous is that people using them are most
likely solving the wrong problem. The problem isn't the conversion from
ISO-8859-1 to UTF-8 but having the text in the wrong format in the first
place. They are used as some kind of magical solution that fixes an annoying
problem. I would have no quarrel with them if they were named correctly
though.

Another reason why I do not like these functions is that they let you shoot
yourself in the foot very easily. They don't warn about invalid or missing
code points, which often leads to more data corruption. When doing the same
with iconv you at least get a notice.

I think we really do not need to keep these functions. As for the
alternative that we can offer, iconv seems to be doing exactly the same
thing and even better. mb_convert_encoding does the same but also silently
ignores invalid characters. So we already offer plenty of alternatives. We
don't need to add anything new.

-- Kamil

4 years ago by Max Semenik

I think we really do not need to keep these functions. As for the
alternative that we can offer, iconv seems to be doing exactly the same
thing and even better. mb_convert_encoding does the same but also silently
ignores invalid characters. So we already offer plenty of alternatives. We
don't need to add anything new.

Just a quick reminder that it's possible to compile PHP without mbstring
and intl, which means that some hosts will provide PHP without these
extensions, and some packagers make them available as separate packages
that users can't or don't know how to install. Maybe we've got an
opportunity to think about making these extensions mandatory?

--
Best regards,
Max Semenik

4 years ago by Rowan Tommins

Just a quick reminder that it's possible to compile PHP without
mbstring and intl, which means that some hosts will provide PHP
without these extensions, and some packagers make them available as
separate packages that users can't or don't know how to install. Maybe
we've got an opportunity to think about making these extensions mandatory?

It's somewhat relevant that until PHP 7.2, it was also possible for
utf8_encode and utf8_decode to be missing, because they were in ext/xml,
which is also optional.

Bundling mbstring sounds great, until you look into the details of
what's in there and how it works. Its origin as a PHP 4 extension for
handling Japanese-specific character encodings is visible in parts of
its design - there's a lot of global state, and very little support for
the nuances of Unicode.

Bundling intl would be great, but it's a wrapper around ICU, which is
huge (because Unicode is complicated). I have read that incorporating
that into core was one of the icebergs that sunk PHP 6. It's also
extremely sparsely documented (if someone's looking for a project, it
would be great to fill in all the manual stubs with a few details from
the corresponding ICU documentation).

For what it's worth, it seems these would be the relevant polyfills:

function utf8_encode(string $string) {
    return UConverter::transcode($string, 'UTF8', 'ISO-8859-1');
}
function utf8_decode(string $string) {
    return UConverter::transcode($string, 'ISO-8859-1', 'UTF8');
}

Regards,

--
Rowan Tommins
[IMSoP]

4 years ago by bjorn.x.larsson@telia.com

On 2021-03-21 at 22:39, Rowan Tommins wrote:

Just a quick reminder that it's possible to compile PHP without
mbstring and intl, which means that some hosts will provide PHP
without these extensions, and some packagers make them available as
separate packages that users can't or don't know how to install. Maybe
we've got an opportunity to think about making these extensions
mandatory?

It's somewhat relevant that until PHP 7.2, it was also possible for
utf8_encode and utf8_decode to be missing, because they were in ext/xml,
which is also optional.

Bundling mbstring sounds great, until you look into the details of
what's in there and how it works. Its origin as a PHP 4 extension for
handling Japanese-specific character encodings is visible in parts of
its design - there's a lot of global state, and very little support for
the nuances of Unicode.

Bundling intl would be great, but it's a wrapper around ICU, which is
huge (because Unicode is complicated). I have read that incorporating
that into core was one of the icebergs that sunk PHP 6. It's also
extremely sparsely documented (if someone's looking for a project, it
would be great to fill in all the manual stubs with a few details from
the corresponding ICU documentation).

For what it's worth, it seems these would be the relevant polyfills:

function utf8_encode(string $string) {
    return UConverter::transcode($string, 'UTF8', 'ISO-8859-1');
}
function utf8_decode(string $string) {
    return UConverter::transcode($string, 'ISO-8859-1', 'UTF8');
}

Regards,

In our case we use the utf8_decode functions to convert from UTF8 in
the client to ISO-8859-1 on the server, since the site is encoded in
latin1.

Our usage of that function is working flawlessly, so for us it's super
important to have a clear migration path with a good polyfill!

r//Björn L

4 years ago by Rowan Tommins

Hi Björn,

In our case we use the utf8_decode functions to convert from UTF8 in
the client to ISO-8859-1 on the server, since the site is encoded in
latin1.

Our usage of that function is working flawlessly, so for us it's super
important to have a clear migration path with a good polyfill!

I realise you can't speak for anyone else, but as a point of interest,
would you be OK with a polyfill having a requirement on ext/mbstring or
ext/iconv, or would you have a strong preference for a replacement built
into the core (i.e. guaranteed available without any optional packages)?

Regards,

--
Rowan Tommins
[IMSoP]

4 years ago by bjorn.x.larsson@telia.com

On 2021-03-22 at 12:12, Rowan Tommins wrote:

Hi Björn,

In our case we use the utf8_decode functions to convert from UTF8 in
the client to ISO-8859-1 on the server, since the site is encoded in
latin1.

Our usage of that function is working flawlessly, so for us it's super
important to have a clear migration path with a good polyfill!

I realise you can't speak for anyone else, but as a point of interest,
would you be OK with a polyfill having a requirement on ext/mbstring or
ext/iconv, or would you have a strong preference for a replacement built
into the core (i.e. guaranteed available without any optional packages)?

Regards,

Well, both these extensions are part of our environment so I presume it
will also be so in the future.

Now if we have a polyfill dependent on these libraries, it's a question
of how these libraries are maintained and whether they are EOL. Just
speaking from a general point here. I'm slightly in favour of mbstring,
since I have some experience with it.

What's important for us is that the polyfill has a simple API and
doesn't have any surprises / side effects. I think, though, there is
a case for improving these functions and keeping them in the core.

We wrap these functions in one place, so it's relatively easy to change
the wrapper to accommodate new functionality in the utf8_* functions as
long as we get the same end result.

I also think one should consider which open-source libraries are
using these functions. E.g. the Revive ad server v5.2 uses both.

r//Björn L

4 years ago by Sara Golemon

On Mon, Mar 22, 2021 at 6:12 AM Rowan Tommins rowan.collins@gmail.com
wrote:

I realise you can't speak for anyone else, but as a point of interest,
would you be OK with a polyfill having a requirement on ext/mbstring or
ext/iconv, or would you have a strong preference for a replacement built
into the core (i.e. guaranteed available without any optional packages)?

Can you clarify what YOU mean by a polyfill? Because you're talking
about dependence on iconv/mbstring/icu which implies you want a polyfill
that does something other than what the original utf8_en/decode() functions
do, and those functions certainly do not need external dependencies.
They're really just not that complex.

-Sara

4 years ago by Nicolas Grekas

On Mon, 22 Mar 2021 at 14:14, Sara Golemon pollita@php.net wrote:

On Mon, Mar 22, 2021 at 6:12 AM Rowan Tommins rowan.collins@gmail.com
wrote:

I realise you can't speak for anyone else, but as a point of interest,
would you be OK with a polyfill having a requirement on ext/mbstring or
ext/iconv, or would you have a strong preference for a replacement built
into the core (i.e. guaranteed available without any optional packages)?

Can you clarify what YOU mean by a polyfill? Because you're talking
about dependence on iconv/mbstring/icu which implies you want a polyfill
that does something other than what the original utf8_en/decode() functions
do, and those functions certainly do not need external dependencies.
They're really just not that complex.

Shameless plug: the polyfill exists, without any dependency, see
https://github.com/symfony/polyfill-php72/blob/main/Php72.php

;)
Nicolas

4 years ago by Rowan Tommins

Shameless plug: the polyfill exists, without any dependency, see
https://github.com/symfony/polyfill-php72/blob/main/Php72.php

Ah, thanks for sharing that. I realised while trying to get to sleep
that a pure-PHP implementation would be fairly straightforward because
of the relationship between Latin1 and Unicode.
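
A minimal sketch of that idea follows, for the encoding direction only. It is not the Symfony polyfill's code; the function name is hypothetical. It relies only on the fact that each ISO 8859-1 byte value equals its Unicode code point, so every byte from 0x80 upwards becomes a two-byte UTF-8 sequence.

```php
<?php
// Hypothetical name, for illustration only; no extensions required.
function latin1_bytes_to_utf8(string $latin1): string
{
    $utf8 = '';
    $length = strlen($latin1);
    for ($i = 0; $i < $length; $i++) {
        $byte = ord($latin1[$i]);
        if ($byte < 0x80) {
            // ASCII is identical in ISO 8859-1 and UTF-8.
            $utf8 .= $latin1[$i];
        } else {
            // Code points U+0080..U+00FF: lead byte 110000xx, trail 10xxxxxx.
            $utf8 .= chr(0xC0 | ($byte >> 6)) . chr(0x80 | ($byte & 0x3F));
        }
    }
    return $utf8;
}

echo bin2hex(latin1_bytes_to_utf8("caf\xE9")), "\n";
```

The decoding direction needs a little more care, since it has to decide what to do with code points above U+00FF and with malformed input.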

Regards,

--
Rowan Tommins
[IMSoP]

3 years ago by Wade Rossmann

As an encoding nerd and perennial complainer regarding these functions I
would like nothing more than to see them immediately disappear, but I do
recognize the BC-breaking potential for something like that. However, I do
have a suggestion that I've not seen mentioned yet that should at least
address some of the misconceptions that people get from the current
functions.

I would suggest adding optional source/destination encoding parameters to
the functions, eg:

utf8_encode(string $string, string $source_encoding = "ISO-8859-1")
utf8_decode(string $string, string $destination_encoding = "ISO-8859-1")

and, if you'll forgive the hand-waving due to my unfamiliarity with PHP
internals, they could simply be passed through to an underlying
mb_convert_encoding() call. Eg:

mb_convert_encoding($string, 'UTF-8', $source_encoding)
mb_convert_encoding($string, $destination_encoding, 'UTF-8')

This would preserve BC while also making the function header and
documentation much more descriptive of what the function actually does,
allow more flexible use of the functions, and potentially drive people to
use the mb_* functions instead. This could also be used as a gradual
pathway to deprecating the functions, where, for example, a deprecation
notice could be raised when the function is called without the
source/destination encoding explicitly given.

I know that there is also some resistance to the idea of requiring mbstring
as it is an optional extension, as well as resistance to bringing mbstring
into core due to design and/or history. This could be worked around by
[once again, apology for handwaving] only requiring mbstring for
conversions involving an encoding other than ISO-8859-1 and falling back to
the existing implementation otherwise.
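
For illustration, the suggestion can be approximated in userland today. In the sketch below the function names are hypothetical stand-ins (the proposal itself would change utf8_encode and utf8_decode), and ext/mbstring is assumed to be loaded for every conversion, without the proposed fallback.

```php
<?php
// Hypothetical userland stand-ins for the proposed signatures.
// Assumption: ext/mbstring is loaded.
function to_utf8(string $string, string $source_encoding = 'ISO-8859-1'): string
{
    return mb_convert_encoding($string, 'UTF-8', $source_encoding);
}

function from_utf8(string $string, string $destination_encoding = 'ISO-8859-1'): string
{
    return mb_convert_encoding($string, $destination_encoding, 'UTF-8');
}

// With no second argument the behaviour mirrors the existing default.
echo bin2hex(to_utf8("\xE9")), "\n";

// Naming the source encoding makes the Windows-1252 Euro sign convertible.
echo bin2hex(to_utf8("\x80", 'Windows-1252')), "\n";
```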

3 years ago by Kris Craig

On Sun, Mar 21, 2021 at 9:52 AM Larry Garfield larry@garfieldtech.com
wrote:

Hi all,

The functions utf8_encode and utf8_decode are historical oddities,
which
almost certainly would not be accepted if proposed today:

Their names do not describe their functionality, which is to convert
to/from one specific single-byte encoding. This leads to a common
confusion that they can be used to "fix" UTF-8 encoding problems, which
they generally make worse.

That single-byte encoding is ISO 8859-1, not its common cousins
Windows-1252 or ISO 8859-15. This means, for instance, that they do
not
handle the Euro sign: utf8_decode('€') returns '?' (i.e. unmappable)
not "\x80" (Windows-1252) or "\xA4" (8859-15)

On the other hand, they are commonly used, both correctly and
incorrectly, so removing them is not easy.
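
To make the Euro sign example concrete, here is a small check, assuming ext/mbstring is loaded and its substitute character is left at the default "?". The helper name euro_as() is made up for illustration.

```php
<?php
// Hypothetical helper: encode the Euro sign in $encoding and return the bytes as hex.
function euro_as(string $encoding): string
{
    $euro = "\u{20AC}"; // "€" as UTF-8: E2 82 AC
    return bin2hex(mb_convert_encoding($euro, $encoding, 'UTF-8'));
}

if (extension_loaded('mbstring')) {
    echo euro_as('ISO-8859-1'), "\n";   // 3f, i.e. "?" (Latin-1 has no Euro sign)
    echo euro_as('Windows-1252'), "\n"; // 80
    echo euro_as('ISO-8859-15'), "\n";  // a4
}
```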

A previous proposal to remove them [1] resulted in Andrea making two
significant improvements: moving them from ext/xml to ext/standard [2]
and rewriting the documentation to explain them properly [3]. My
genuine
thanks for that.

However, it hasn't stopped people misunderstanding them, and quite
reasonably: you shouldn't need to look up every function you use in the
manual, to make sure it actually does what its name suggests.

I can see three ways forward:

A) Raise a deprecation notice in 8.1, and remove in 9.0. Do not provide
a specific replacement, but recommend people look at iconv() or
mb_convert_encoding(). There is precedent for this, such as
convert_cyr_string(), but it may frustrate those who are using the
functions correctly.

B) Introduce new names, such as utf8_to_iso_8859_1 and
iso_8859_1_to_utf8; immediately make those the primary names in the
manual, with utf8_encode / utf8_decode as aliases. Raise deprecation
notices for the old names, either immediately or in some future
release.
This gives a smoother upgrade path, but commits us to having these
functions as outliers in our standard library.

C) Leave them alone forever. Treat it as the user's fault if they mess
things up by misunderstanding them.

I am happy to put together an RFC for either A or B, if it has a chance
of reaching consensus. I would really like to avoid option C.

[1] https://externals.io/message/95166
[2] https://github.com/php/php-src/pull/2160
[3]

https://github.com/php/doc-en/commit/838941f6cce51f3beda16012eb497b26295a8238

Regards,

I lost several days of my life to exactly this problem, many years ago.
I
am still triggered by it.

I am mostly OK with option A, but with a big caveat:

The root problem here is "You keep using that function. I do not think
it
means what you think it means."

As Rowan notes, what people actually want most of the time is "I got
this string from a user and have NFI what its encoding is, but my system
needs UTF-8, so gimmie this string in UTF-8." And they use
utf8_encode(),
which then fails sometimes in exciting and mysterious ways, because
that's not what it is.

Removing utf8_encode() may keep people from misusing it, but that doesn't
mean the problem space they were trying to solve goes away. If anything,
people who still don't realize that it's the wrong solution will get
angry
that we're taking away a "useful" tool and replacing it with "meh, go
look
at library X," which is admittedly a pretty rude answer.

If we're removing a bad answer to the problem, we should also replace it
with a good answer.

Someone will, I'm sure, pop in at this point and declare "if you don't
know the character encoding you're receiving, then you're doing it wrong
and are already lost and we can't help you." While that may be
technically
correct, it's also an entirely useless answer because strings received
over
HTTP very frequently do not tell you what their encoding is, or they lie
about what their encoding is. (The header may say it's ISO8859, or UTF8,
or whatever, but someone copy-pasted from MS Word into a text box and now
it's Windows-1252 within a wrapper that says ISO8859 but is mostly UTF8
except for the Windows-1252 part. Like, that's literally the problem I
lost several days to.) "Your own fault" is not even an accurate answer
at
that point.

So if we're going to take away people's broken hammer, we need to be very
clear about what hammer to use instead.

The initial answer is probably "here's how to use a series of mbstring
functions together to produce a reasonably good
guess-my-encoding-and-convert-to-utf8 routine" documentation. Which...
may
exist, but if it does I've never found it. So at bare minimum the
utf8_encode() documentation needs to include a "use this code snippet
instead" description, and not just link to the mbstring extension.
Glancing through the mbstring docs right now, it looks like it's not
already a single function call, but some combination of several, and has
some global flags that get set (via mb_detect_order()), I think. It's
not
as easy to use as utf8_encode(), even if utf8_encode() is wrong. That
suggests we may want to try and simplify the mbstring API, or internalize
some function that handles the most common case in a way that doesn't
rely
on global flags.
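
As a starting point for that kind of snippet, a minimal sketch, assuming ext/mbstring and PHP 8: keep the input if it is already valid UTF-8, otherwise treat it as one explicitly named fallback encoding. The function name and the Windows-1252 default are assumptions, and it deliberately avoids mb_detect_order() and any other global state. It does not untangle a string that mixes encodings, which is the harder case described above.

```php
<?php
// Hypothetical helper: return $bytes as UTF-8 without relying on global mbstring settings.
function to_utf8_or_fallback(string $bytes, string $fallback = 'Windows-1252'): string
{
    // Already valid UTF-8 (this includes plain ASCII): return unchanged.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }

    // Not valid UTF-8: assume the caller's fallback encoding and convert.
    return mb_convert_encoding($bytes, 'UTF-8', $fallback);
}
```

The choice of fallback is a guess by design; the point is that the guess is written down at the call site instead of hidden in configuration.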

So, let's make that easier to use, so that we can change "this function
is
wrong, we're taking it away from you" to "this function is wrong, here's
a
way better alternative that you can use instead (while we quietly take
the
wrong one away from you while you're distracted by the new shiny)."

I don't know the mbstring API well enough to say what that alternative
ideally looks like, but if we can answer that it would make killing off
the
old functions much more palatable.

--Larry Garfield

--

To unsubscribe, visit: https://www.php.net/unsub.php

As an encoding nerd and perennial complainer regarding these functions I
would like nothing more than to see them immediately disappear, but I do
recognize the BC-breaking potential for something like that. However, I do
have a suggestion that I've not seen mentioned yet that should at least
address some of the misconceptions that people get from the current
functions.

I would suggest adding optional source/destination encoding parameters to
the functions, eg:

utf8_encode(string $string, string $source_encoding = "ISO-8859-1")
utf8_decode(string $string, string $destination_encoding = "ISO-8859-1")

and, if you'll forgive the hand-waving due to my unfamiliarity with PHP
internals, they could simply be passed through to an underlying
mb_convert_encoding() call. Eg:

mb_convert_encoding($string, 'UTF-8', $source_encoding)
mb_convert_encoding($string, $destination_encoding, 'UTF-8')

This would preserve BC while also making the function header and
documentation much more descriptive of what the function actually does,
allow more flexible use of the functions, and potentially drive people to
use the mb_* functions instead. This could also be used as a gradual
pathway to deprecating the functions, where, for example, a deprecation
notice could be raised when the function is called without the
source/destination encoding explicitly given.

I know that there is also some resistance to the idea of requiring mbstring
as it is an optional extension, as well as resistance to bringing mbstring
into core due to design and/or history. This could be worked around by
[once again, apology for handwaving] only requiring mbstring for
conversions involving an encoding other than ISO-8859-1 and falling back to
the existing implementation otherwise.

Now might be a good time to make this into an RFC. :)

--Kris

3 years ago by Rowan Tommins

I would suggest adding optional source/destination encoding parameters to
the functions, eg:

utf8_encode(string $string, string $source_encoding = "ISO-8859-1")
utf8_decode(string $string, string $destination_encoding = "ISO-8859-1")

That's an interesting idea, and definitely worth considering. In the
much longer term, we could make the parameter mandatory rather than
deprecating the entire function.

As you say, the challenge is how to implement the other encodings / what
to do if ext/mbstring is not installed. It would be very tempting to
support Windows-1252 directly, because it's just a few characters on top
of the existing mappings, and is so commonly mistaken for ISO-8859-1.
Anything else could then perhaps give a run-time error if ext/mbstring
wasn't found.
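
To give a sense of how small that addition is, here is a userland sketch of Windows-1252 to UTF-8 conversion that does not need ext/mbstring. The function name is hypothetical. The table lists the 27 bytes in the 0x80-0x9F range that Windows-1252 assigns to printable characters; the five unassigned bytes fall through to the Latin-1 rule, which is an assumption (it matches what browsers do).

```php
<?php
// Hypothetical sketch: Windows-1252 to UTF-8 as "Latin-1 plus a small table".
function windows1252_to_utf8_sketch(string $bytes): string
{
    // Bytes where Windows-1252 differs from ISO-8859-1.
    static $extra = [
        0x80 => "\u{20AC}", 0x82 => "\u{201A}", 0x83 => "\u{0192}", 0x84 => "\u{201E}",
        0x85 => "\u{2026}", 0x86 => "\u{2020}", 0x87 => "\u{2021}", 0x88 => "\u{02C6}",
        0x89 => "\u{2030}", 0x8A => "\u{0160}", 0x8B => "\u{2039}", 0x8C => "\u{0152}",
        0x8E => "\u{017D}", 0x91 => "\u{2018}", 0x92 => "\u{2019}", 0x93 => "\u{201C}",
        0x94 => "\u{201D}", 0x95 => "\u{2022}", 0x96 => "\u{2013}", 0x97 => "\u{2014}",
        0x98 => "\u{02DC}", 0x99 => "\u{2122}", 0x9A => "\u{0161}", 0x9B => "\u{203A}",
        0x9C => "\u{0153}", 0x9E => "\u{017E}", 0x9F => "\u{0178}",
    ];

    $out = '';
    $len = strlen($bytes);
    for ($i = 0; $i < $len; $i++) {
        $byte = ord($bytes[$i]);
        if ($byte < 0x80) {
            $out .= $bytes[$i];                 // ASCII
        } elseif (isset($extra[$byte])) {
            $out .= $extra[$byte];              // Windows-1252 specific character
        } else {
            // Same as Latin-1: byte N becomes code point U+00NN.
            $out .= chr(0xC0 | ($byte >> 6)) . chr(0x80 | ($byte & 0x3F));
        }
    }
    return $out;
}
```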

Now might be a good time to make this into an RFC. :)

I have a draft kicking around with a lot of analysis of current usage. I
will try to pick it back up after Christmas.

Regards,

--
Rowan Tommins
[IMSoP]

3 years ago by Andreas Heigl

Hey all.

I would suggest adding optional source/destination encoding parameters to
the functions, eg:

utf8_encode(string $string, string $source_encoding = "ISO-8859-1")
utf8_decode(string $string, string $destination_encoding = "ISO-8859-1")

That's an interesting idea, and definitely worth considering. In the
much longer term, we could make the parameter mandatory rather than
deprecating the entire function.

As you say, the challenge is how to implement the other encodings / what
to do if ext/mbstring is not installed. It would be very tempting to
support Windows-1252 directly, because it's just a few characters on top
of the existing mappings, and is so commonly mistaken for ISO-8859-1.
Anything else could then perhaps give a run-time error if ext/mbstring
wasn't found.

Now might be a good time to make this into an RFC. :)

I have a draft kicking around with a lot of analysis of current usage. I
will try to pick it back up after Christmas.

Regards,

To be quite honest: Despite the huge outcry that might provoke: I'd
rather remove them today than keep them or deprecate them. And I'd
declare the removal as a bug-fix!

Due to the way those functions are currently working they have caused
more harm than actually good. One had to very explicitly know what they
are doing to use them in the right way. And most certainly when they
worked as expected that was more likely due to sheer luck than because
someone knew what they were doing.

So giving those functions a continued lifetime either as an alias to
mb_convert_encoding or by implementing the conversion to/from
Windows-1252 would still leave people under the impression that it is a
magic function.

I'd rather prefer to get rid of them and point people to the proper way
of converting one character set to another one with all the possible
mishaps that will occur.

Just my 0.02€

Cheers

Andreas

                                                           ,,,
                                                          (o o)

+---------------------------------------------------------ooO-(_)-Ooo-+
| Andreas Heigl |
| mailto:andreas@heigl.org N 50°22'59.5" E 08°23'58" |
| https://andreas.heigl.org |
+---------------------------------------------------------------------+
| https://hei.gl/appointmentwithandreas |
+---------------------------------------------------------------------+

3 years ago by Andreas Heigl

Hey All.

Hey all.

[...]

Now might be a good time to make this into an RFC. :)

I have a draft kicking around with a lot of analysis of current usage.
I will try to pick it back up after Christmas.

I just dug a bit deeper on the subject and found this RFC from 2016:

 https://wiki.php.net/rfc/remove_utf_8_decode_encode

Perhaps we can just revive that one!

Cheers

Andreas

                                                           ,,,
                                                          (o o)

+---------------------------------------------------------ooO-(_)-Ooo-+
| Andreas Heigl |
| mailto:andreas@heigl.org N 50°22'59.5" E 08°23'58" |
| https://andreas.heigl.org |
+---------------------------------------------------------------------+
| https://hei.gl/appointmentwithandreas |
+---------------------------------------------------------------------+

3 years ago by Rowan Tommins

I just dug a bit deeper on the subject and found this RFC from 2016:

https://wiki.php.net/rfc/remove_utf_8_decode_encode

Perhaps we can just revive that one!

As I say, I have a draft with lots more detail in, which I will tidy up
after Christmas. I deliberately didn't link to it, because I want to
re-read it myself before letting other people comment on it, and don't
have the time right now.

My current inclination is to deprecate in 8.next, and remove in 9.0, but
I want to make sure the argument for that is solid before putting it to
a vote.

Regards,

--
Rowan Tommins
[IMSoP]

3 years ago by Hans Henrik Bergan

I wonder if anyone depends on utf8_* without also depending on mb_* ? I
imagine that is exceedingly rare

3 years ago by Rowan Tommins

I wonder if anyone depends on utf8_* without also depending on mb_* ? I
imagine that is exceedingly rare

On the contrary, anyone who uses mb_* functions is likely to use
mb_convert_encoding rather than utf8_encode and utf8_decode.

In fact, the only legitimate uses of the functions I've seen are as a
fallback for when ext/mbstring is not loaded, since they are always
available (since PHP 7.2; before that, they were oddly part of ext/xml).
There is a very small set of use cases where you really do know you have
or want ISO 8859-1, and they are the most portable implementation.

Regards,

--
Rowan Tommins
[IMSoP]

4 years ago by Sara Golemon

On Sun, Mar 21, 2021 at 9:18 AM Rowan Tommins rowan.collins@gmail.com
wrote:

A) Raise a deprecation notice in 8.1, and remove in 9.0. Do not provide
a specific replacement, but recommend people look at iconv() or
mb_convert_encoding(). There is precedent for this, such as
convert_cyr_string(), but it may frustrate those who are using the
functions correctly.

B) Introduce new names, such as utf8_to_iso_8859_1 and
iso_8859_1_to_utf8; immediately make those the primary names in the
manual, with utf8_encode / utf8_decode as aliases. Raise deprecation
notices for the old names, either immediately or in some future release.
This gives a smoother upgrade path, but commits us to having these
functions as outliers in our standard library.

C) Leave them alone forever. Treat it as the user's fault if they mess
things up by misunderstanding them.

My preference is for a deprecation notice (but not necessarily removal ever
-- We can argue that part a little).

As for what users should use instead, obviously there are multiple options
already in core (which you referenced), but those all have third party deps
and can't be guaranteed the way utf8_en/decode() can (this was the point of
moving them from xml).

While I'm normally in favor of userspace things belonging in userspace
(this particular conversion is trivial since it's a 1:1 mapping), I'm
actually willing to see this added under a new, clearer name in
ext/standard since this is something that's in long use, but used
incorrectly.

As for details, I don't love iso_8859_1_to_utf8(), but we can use the
common alias for iso-8859-1 known as latin1 and call the new functions:
utf8_from_latin1() and utf8_to_latin1() with the caveat that the latter will
throw a ValueError for codepoints which are out of range (one of the more
problematic issues with utf8_decode()). That makes this not just a simple
rename for clarity, but what I'd consider a bug-fix for an unfortunately
unfixable function.
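
A userland approximation of the stricter utf8_to_latin1() described here, assuming PHP 8 (for ValueError). The _sketch suffix marks it as illustrative only; throwing on invalid UTF-8 input is an extra assumption that the proposal does not spell out.

```php
<?php
// Hypothetical sketch: UTF-8 to ISO-8859-1 that throws instead of substituting "?".
function utf8_to_latin1_sketch(string $utf8): string
{
    // With the /u modifier PCRE validates the subject as UTF-8 and
    // preg_split() returns false if it is malformed.
    $chars = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new ValueError('Input is not valid UTF-8');
    }

    $out = '';
    foreach ($chars as $char) {
        if (strlen($char) === 1) {
            $out .= $char; // ASCII maps to itself
        } elseif (strlen($char) === 2 && ord($char[0]) <= 0xC3) {
            // Lead bytes C2 and C3 cover exactly U+0080..U+00FF.
            $out .= chr(((ord($char[0]) & 0x1F) << 6) | (ord($char[1]) & 0x3F));
        } else {
            throw new ValueError(sprintf('Character "%s" has no ISO-8859-1 equivalent', $char));
        }
    }
    return $out;
}
```

A caller that prefers the old substitution behaviour would have to catch the error and decide what to do with the whole string, which is the trade-off debated in the replies.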

-Sara

4 years ago by Hans Henrik Bergan

I would prefer to soft-deprecate them like we did with the mysql_ API,
where they do not generate E_DEPRECATED for quite some time, but the
documentation says
"this function is deprecated, instead use mb_convert_encoding ( $str ,
"UTF-8", "ISO-8859-1" ); or iconv("ISO-8859-1","UTF-8", $str)"
and then make them raise E_DEPRECATED in the distant future.

Rowan said "they are commonly used, both correctly and
incorrectly". In my experience they are not used correctly: the people who
use them are using them to convert Windows-1252 to UTF-8, not
ISO-8859-1...

4 years ago by Rowan Tommins

My preference is for a deprecation notice (but not necessarily removal
ever -- We can argue that part a little).

I'm strongly against any concept of "indefinite deprecation". I consider
any deprecation notice a commitment to remove the feature in the future,
even if a specific timeline for that removal is not given.

If we want to have a separate status of "will be kept indefinitely, but
you shouldn't use it", then we need a separate E_DISCOURAGED, or some
boilerplate in the manual which doesn't use the word "deprecated".

As for details, I don't love iso_8859_1_to_utf8(), but we can use the
common alias for iso-8859-1 known as latin1 and call the new
functions: utf8_from_latin1() and utf8_to_latin1() with the caveat
that the latter will throw a ValueError for codepoints which are out of
range (one of the more problematic issues with utf8_decode()). That
makes this not just a simple rename for clarity, but what I'd consider
a bug-fix for an unfortunately unfixable function.

While I can see the temptation here, I'm not sure who the target
audience for the new function would be:

* People who just want to replace calls to utf8_decode won't want to go
  through every call and make it exception safe.
* People who want to write a polyfill couldn't use it, because they
  wouldn't be able to recover the remainder of the string after an error
  is thrown.
* People who want transcoding without any optional extensions will be
  disappointed to find only this one encoding supported.

You'd effectively be adding a completely new core function just for
those people who work with Latin1 text, and are confident that it's not
Windows-1252 in disguise.

It's tempting to make any C1 control characters an error as well -
although technically valid in Latin1, these are very rarely used, and
it's much more likely that any bytes in that range are intended as
characters in Windows-1252. But that would feel very odd without having
a corresponding utf8_from_windows1252 function to use instead, at which
point we're into designing a whole new conversion library. And of
course, once you've got that UTF-8 string, you can't do much with it,
because PHP's native string functions are all byte-based, so you've
basically got to re-invent large chunks of ext/mbstring...

Regards,

--
Rowan Tommins
[IMSoP]

4 years ago by Sara Golemon

On Mon, Mar 22, 2021 at 5:24 AM Rowan Tommins rowan.collins@gmail.com
wrote:

I'm strongly against any concept of "indefinite deprecation". I consider
any deprecation notice a commitment to remove the feature in the future,
even if a specific timeline for that removal is not given.

I don't feel strongly about indefinite deprecation. If you wanna nuke it
in 9.0, have fun. I'm just saying I don't necessarily see the need to do
so. The problem being addressed here is that some users of this function
are probably misusing it, so it's worth putting guiderails on. I'm
hesitant to punish the ones who know exactly what they're doing as a result
of that well-meaning intention.

People who just want to replace calls to utf8_decode won't want to go
through every call and make it exception safe.

Then they shouldn't use these replacements, it's not for them. It's for
people using iso-8859-1.

People who want to write a polyfill couldn't use it, because they
wouldn't be able to recover the remainder of the string after an error
is thrown.

If you're writing a polyfill, then write a polyfill. The polyfill for the
old functions is trivial, I could have written it a dozen times in the
course of writing this email reply.
So this replacement is also not for them.

People who want transcoding without any optional extensions will be
disappointed to find only this one encoding supported.

This function isn't for them. It's for people using iso-8859-1.

There's a theme in here. :)

You'd effectively be adding a completely new core function just for
those people who work with Latin1 text, and are confident that it's not
Windows-1252 in disguise.

Yes. I'm specifically addressing the people who have been using
utf8_en/decode() correctly all this time. They shouldn't be punished for
the stupidity of others.

It's tempting to make any C1 control characters an error as well -
although technically valid in Latin1, these are very rarely used, and
it's much more likely that any bytes in that range are intended as
characters in Windows-1252. But that would feel very odd without having
a corresponding utf8_from_windows1252 function to use instead, at which
point we're into designing a whole new conversion library. And of
course, once you've got that UTF-8 string, you can't do much with it,
because PHP's native string functions are all byte-based, so you've
basically got to re-invent large chunks of ext/mbstring...

I disagree that you'd need to add utf8_from/to_windows1252 "for
completeness". The goal isn't to provide all possible conversion
utilities. The goal is only to not punish users by taking away a valid API
that they were using correctly (for those users who were using it
correctly).

-Sara

4 years ago by bjorn.x.larsson@telia.com

On 2021-03-22 at 14:10, Sara Golemon wrote:

I disagree that you'd need to add utf8_from/to_windows1252 "for
completeness". The goal isn't to provide all possible conversion
utilities. The goal is only to not punish users by taking away a valid API
that they were using correctly (for those users who were using it
correctly).

-Sara

Think I'm one such user :-) So keeping them and improving a little would
be fine with me!

r//Björn L

4 years ago by Rowan Tommins

People who just want to replace calls to utf8_decode won't want to go
through every call and make it exception safe.

Then they shouldn't use these replacements, it's not for them. It's
for people using iso-8859-1.

This is a non-sequitur. Someone using the function correctly to convert
to ISO 8859-1 may also be relying on the documented and consistent
error-handling behaviour. Substituting the character may not always be
the best approach, but in some cases it's more useful than discarding
the entire string, let alone aborting the entire process with an
unhandled Throwable.

The goal is only to not punish users by taking away a valid API that
they were using correctly (for those users who were using it correctly).

I'm sympathetic to that aim, but if the new function is not the same,
you are taking away the existing API, and introducing a new one.
Neither of the following seems like it would be accepted:

* Make utf8_decode() throw errors for unrepresentable characters.
* Introduce a function specifically for converting from UTF-8 to
  Latin-1, if we didn't already have one.

So it feels questionable to me to design a new function, which is
neither compatible with what we have, nor a reasonable addition on its
own merits.

Regards,

--
Rowan Tommins
[IMSoP]

4 years ago by Aleksander Machniak

Make utf8_decode() throw errors for unrepresentable characters.

I'm not sure I understand this, but it sounds like it would be a BC
break for my case.

I'm using utf8_encode()/utf8_decode() to make input string safe to be
stored in DB, and back. In most cases the input is utf-8, but it
occasionally may contain "broken characters".

$str = '';
for ($x=0; $x<256; $x++) {
    $str .= chr($x);
}

$this->assertSame($str, utf8_decode(utf8_encode($str)));

$str = "グーグル谷歌中信фδοκιμήóźdźрöß😁😃";

$this->assertSame($str, utf8_decode(utf8_encode($str)));

Could anyone point to a sample input that will not work with my use-case?

--
Aleksander Machniak
Kolab Groupware Developer [https://kolab.org]
Roundcube Webmail Developer [https://roundcube.net]

PGP: 19359DC1 # Blog: https://kolabian.wordpress.com

4 years ago by Kamil Tekiela

I'm using utf8_encode()/utf8_decode() to make input string safe to be
stored in DB, and back. In most cases the input is utf-8, but it
occasionally may contain "broken characters".

What exactly do you mean by making the input string safe? If I understand
correctly, utf8_decode(utf8_encode($str)) should just be an identity
function. Could you please explain the purpose of using these
functions in such a way?

4 years ago by Rowan Tommins

I'm using utf8_encode()/utf8_decode() to make input string safe to be
stored in DB, and back. In most cases the input is utf-8, but it
occasionally may contain "broken characters".

That is not what this function does, at all. The fact that its name
makes you think that is exactly why I want to get rid of that name.

 $str = "グーグル谷歌中信фδοκιμήóźdźрöß😁😃";

 $this->assertSame($str, utf8_decode(utf8_encode($str)));

Let's write that out with a more descriptive function name:

$str = "グーグル谷歌中信фδοκιμήóźdźрöß😁😃";

$this->assertSame($str, utf8_to_latin1(latin1_to_utf8($str)));

Since Latin-1 does not contain any Chinese, Japanese, or Emoji
characters, running latin1_to_utf8 on that string is clearly nonsensical.

The only reason it doesn't give you any errors is that every possible
byte is a valid character in Latin1, and every Latin1 character has a
Unicode code point. So the "グ" is interpreted as three Latin-1
characters: E3, 82, and B0; those then become the corresponding Unicode
code points U+00E3, U+0082, and U+00B0, represented in UTF-8. You then
run utf8_to_latin1, and they get converted back.

That code will never do anything useful.

Regards,

--
Rowan Tommins
[IMSoP]

4 years ago by Aleksander Machniak

That code will never do anything useful.

I already proved it is useful, regardless of its name/intention.

This is old code, not even mine, so maybe when it was written the PHP
documentation wasn't that clear about the functions' intention. Or the
intention was different.

ps. to Kamil,

We use utf8_encode() to make the string safe to be put in utf-8 database
column/table. We use utf8_decode() to convert that back to what it was
before.

The tests prove that the conversion is lossless.

--
Aleksander Machniak
Kolab Groupware Developer [https://kolab.org]
Roundcube Webmail Developer [https://roundcube.net]

PGP: 19359DC1 # Blog: https://kolabian.wordpress.com

4 years ago by Rowan Tommins

That code will never do anything useful.

I already proved it is useful, regardless of its name/intention.

You have proven no such thing. If that function is saving you from
errors, it is completely by accident.

The same effect can be achieved using base64_encode() and
base64_decode(), or bin2hex() and hex2bin(), or any other function that
takes a series of bytes and applies an arbitrary encoding to it.

It could also be achieved by using a binary column type in the database,
because the values you have stored are not useful as strings; they might
as well be encrypted.
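
For comparison, the same all-bytes round trip from the earlier test passes with bin2hex() and base64_encode() too, which is the point being made: being lossless says nothing about the stored value being meaningful text. This is only an illustration using standard functions.

```php
<?php
// Build a string containing every possible byte value, as in the earlier test.
$bytes = '';
for ($x = 0; $x < 256; $x++) {
    $bytes .= chr($x);
}

// Both encodings are reversible for arbitrary bytes, and both produce plain ASCII.
var_dump(hex2bin(bin2hex($bytes)) === $bytes);                   // bool(true)
var_dump(base64_decode(base64_encode($bytes), true) === $bytes); // bool(true)
```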

Given the sequence of bytes "\xE3\x82\xB0", which is a valid UTF-8
string representing U+30B0 KATAKANA LETTER GU グ, calling utf8_encode()
will result in the sequence of bytes "\xC3\xA3\xC2\x82\xC2\xB0", which
is the UTF-8 representation of the following Unicode code points:

U+00E3 LATIN SMALL LETTER A WITH TILDE ã
U+0082 CONTROL: BREAK PERMITTED HERE
U+00B0 DEGREE SIGN °

This is clearly gibberish, and bears no relationship to the original
string; it is what is generally referred to as "mojibake".
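
The byte sequences above can be reproduced with a few lines of plain PHP. latin1_to_utf8_sketch() is a hypothetical stand-in that applies the same byte-to-code-point rule as utf8_encode(), written out so the example does not depend on the function under discussion.

```php
<?php
// Hypothetical stand-in for utf8_encode(): treat every byte as a Latin-1 character.
function latin1_to_utf8_sketch(string $bytes): string
{
    $out = '';
    $len = strlen($bytes);
    for ($i = 0; $i < $len; $i++) {
        $byte = ord($bytes[$i]);
        $out .= $byte < 0x80
            ? $bytes[$i]
            : chr(0xC0 | ($byte >> 6)) . chr(0x80 | ($byte & 0x3F));
    }
    return $out;
}

$gu = "\xE3\x82\xB0"; // UTF-8 for U+30B0 KATAKANA LETTER GU
echo bin2hex($gu), "\n";                        // e382b0
echo bin2hex(latin1_to_utf8_sketch($gu)), "\n"; // c3a3c282c2b0
```

One character has become three, and the stored string is twice as long as the input.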

Regards,

--
Rowan Tommins
[IMSoP]

4 years ago by Chase Peeler — view source

unread

On Mon, Mar 22, 2021 at 1:22 PM Rowan Tommins rowan.collins@gmail.com
wrote:

That code will never do anything useful.
I already proved it is useful, regardless of its name/intention.

You have proven no such thing. If that function is saving you from
errors, it is completely by accident.

Even if it is by accident, removing or changing the behavior of the
function is guaranteed to take something that currently works (by skill or
by luck) and risk it no longer working.

--
Chase Peeler
chasepeeler@gmail.com

4 years ago by Rowan Tommins — view source

unread

Even if it is by accident, removing or changing the behavior of the
function is guaranteed to take something that currently works (by
skill or by luck) and risk it no longer working.

This is absolutely true. However, at some point you have to draw the
line between supported use cases, and requests to re-enable spacebar
heating: https://xkcd.com/1172/

I think using utf8_encode to store binary data in a text column crosses
that line: the code was added because of a misunderstanding of the
function, it works by accident, and there are plenty of better ways to
solve the actual problem.

Just to be clear, the trick Aleksander and Alexandru stumbled on doesn't
just work for "corrupted UTF-8"; you could store a JPEG in a text column
by using utf8_encode(file_get_contents($image_file)). It's probably best
not to, though.

I also agree that users should have a clear guide to how to replace
their current usages. Fortunately, there are at least 4 other ways of
writing this functionality in PHP (iconv, mbstring, intl, and the
Symfony polyfill).
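
As a rough sketch of what such a guide might show, here are three of those routes applied to the same ISO-8859-1 input. The sample string is an assumption for illustration, and each call is guarded because the extensions are optional; the Symfony polyfill is not shown.

```php
<?php
$latin1   = "caf\xE9";     // "café" encoded as ISO-8859-1
$expected = "caf\xC3\xA9"; // the same text encoded as UTF-8

$results = [];

if (function_exists('iconv')) {
    // iconv(from, to, string)
    $results['iconv'] = iconv('ISO-8859-1', 'UTF-8', $latin1);
}
if (function_exists('mb_convert_encoding')) {
    // mb_convert_encoding(string, to, from)
    $results['mbstring'] = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
}
if (class_exists('UConverter')) {
    // UConverter::transcode(string, to, from)
    $results['intl'] = UConverter::transcode($latin1, 'UTF-8', 'ISO-8859-1');
}

foreach ($results as $name => $utf8) {
    echo $name, ': ', bin2hex($utf8), "\n";
}
```

Unlike utf8_encode(), all three name the source and target encodings explicitly, so the call site says what conversion is intended. Note that the argument order differs between iconv and the other two.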

Regards,

--
Rowan Tommins
[IMSoP]

4 years ago by drealecs@gmail.com — view source

unread

That code will never do anything useful.

I already proved it is useful, regardless of its name/intention.

This is old code, not even mine, so maybe when it's been written the PHP
documentation wasn't that clear about the function(s) intention. Or the
intention was different.

ps. to Kamil,

We use utf8_encode() to make the string safe to be put in utf-8 database
column/table. We use utf8_decode() to convert that back to what it was
before.

I just searched and found a hotfix I did a few years ago (when I was also
dumber) and the fix was just adding a utf8_encode to some data received in
$_POST before being sent to a logging service. And a utf8_decode after
reading it for further parsing.
The logging service storage was using a mysql database and the specific
column was declared TEXT instead of BLOB.
Apparently the fix is still in place.

The tests prove that the conversion is lossless.

There could have been better ways to fix it.
json_encode / json_decode would have worked just the same.

The problem was that the quickly identified cause was a non-UTF-8 string
being stored in a UTF-8 text column, and the solution was implemented
on the basis that utf8_decode/encode sounded like a good idea when
time was limited; and knowledge too, in my case.
I think it would be great to deprecate them somehow.

Regards,
Alex

4 years ago by drealecs@gmail.com — view source

unread

On Mon, Mar 22, 2021 at 7:24 PM Alexandru Pătrănescu drealecs@gmail.com
wrote:

There could have been better ways to fix it.
json_encode / json_decode would have worked just the same.

Nope, strings in a JSON object must be UTF-8.
As Rowan mentioned, base64_encode would have worked. But that means one
quarter of the available max column space would be lost as a downside.
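
A small sketch of both points, assuming a made-up Latin-1 payload: json_encode() rejects the raw bytes, while a base64-wrapped copy encodes and round-trips at the cost of the 4/3 size increase.

```php
<?php
$latin1 = "caf\xE9"; // ISO-8859-1 bytes, not valid UTF-8

// With default flags, json_encode() fails on malformed UTF-8.
$json = json_encode(['message' => $latin1]);
var_dump($json);                                 // bool(false)
var_dump(json_last_error() === JSON_ERROR_UTF8); // bool(true)

// Wrapping the bytes in base64 first gives JSON something it can carry.
$wrapped = json_encode(['message' => base64_encode($latin1)]);
$decoded = json_decode($wrapped, true);
$back    = base64_decode($decoded['message'], true);
var_dump($back === $latin1);                     // bool(true)

// base64 emits 4 output bytes for every 3 input bytes (padded).
echo strlen($latin1), ' -> ', strlen(base64_encode($latin1)), "\n";
// 4 -> 8
```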

Regards,
Alex

4 years ago by Rowan Tommins — view source

unread

As Rowan mentioned, base64_encode would have worked. But that means one
quarter of the available max column space would be lost as a downside.

Depending on the data, abusing Latin1-to-UTF8 translation can easily
result in a longer string than base64.

$str = '🤡🤡';

echo strlen(base64_encode($str));
// 12

echo strlen(utf8_encode($str));
// 16

Regards,

--
Rowan Tommins
[IMSoP]

4 years ago by Sara Golemon — view source

unread

$str = "グーグル谷歌中信фδοκιμήóźdźрöß😁😃";

$this->assertSame($str, utf8_decode(utf8_encode($str)));

Woah. Yeah. No. Don't do that.
Doing that is what's wrong with utf8_en/decode().
Doing that convinces me that Rowan is right and we should deprecate then
remove those functions without offering a simple replacement.
Christ's sake... no.
