Hi all,
utf8_decode() and utf8_encode() are not needed and are causing more
problems than they solve.
https://wiki.php.net/rfc/remove_utf_8_decode_encode
Proposal
- Document their deprecation now
- Remove them in 7.2
I think only a few users are using them, and they shouldn't have problems
using mbstring/iconv/intl functions instead.
Any comments?
Regards,
--
Yasuo Ohgaki
yohgaki@ohgaki.net
On GitHub, utf8_encode has ~500,000 results and utf8_decode has ~400,000.
I also think that 7.2 shouldn't introduce a BC break like that. Maybe 8.0.
Currently I think it is a good shortcut, but I really don't know
if it makes sense to keep it (UTF-16 and others are not implemented, only
UTF-8; on the other hand, UTF-8 seems to be more common).
2016-08-15 0:17 GMT-03:00 Yasuo Ohgaki yohgaki@ohgaki.net:
> Hi all,
> utf8_decode() and utf8_encode() are not needed and are causing more
> problems than they solve.
> https://wiki.php.net/rfc/remove_utf_8_decode_encode
> Proposal
> - Document their deprecation now
> - Remove them in 7.2
> I think only a few users are using them, and they shouldn't have problems
> using mbstring/iconv/intl functions instead.
> Any comments?
> Regards,
> --
> Yasuo Ohgaki
> yohgaki@ohgaki.net
--
David Rodrigues
Hi!
> Hi all,
> utf8_decode() and utf8_encode() are not needed and are causing more
> problems than they solve.
Why do you think they are not needed?
Also, the manual says "utf8_encode — Encodes an ISO-8859-1 string to
UTF-8". If somebody uses an unknown function without even glancing at the
first line of the manual, they deserve all the trouble they get. I mean,
it's not a PhD-level course, it's a single line to read, and not that
long a one either, just 6 words. Many beer labels provide more reading
material :)
So, I'm not sure we need to spend time on this. Yes, it duplicates iconv
and recode. So what? It hurts no one.
> https://wiki.php.net/rfc/remove_utf_8_decode_encode
> Proposal
> - Document their deprecation now
> - Remove them in 7.2
The RFC says "7.1 - Remove utf8_decode() and utf8_encode()".
> I think only a few users are using them, and they shouldn't have problems
> using mbstring/iconv/intl functions instead.
I would like to see some justification for this - was any research
performed on how many users use them?
Stas Malyshev
smalyshev@gmail.com
> Hi!
>> Hi all,
>> utf8_decode() and utf8_encode() are not needed and are causing more
>> problems than they solve.
> Why do you think they are not needed?
I'm guessing this came about from their mention on another thread. To
quote myself:
> utf8_decode()/utf8_encode() are, at best, extremely misleading names.
> Many uses of them in my experience go something like this: "I have an
> encoding problem, it's something to do with UTF-8, I'll try
> utf8_encode; hm, that didn't work, I'll try utf8_decode instead".
> They are certainly used in many, many places; but I would wager that
> almost all of those uses are broken because they make no effort to
> confirm that 8859-1 is the right source / target encoding. I think
> deprecating (but not removing) them might be sensible, because it would
> discourage this broken logic.
They are also trivially polyfillable; and again, anyone doing so will
probably realise they misunderstood them and shouldn't be using them in
the first place.
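To illustrate the "trivially polyfillable" point, here is a minimal pure-PHP sketch. The function names `polyfill_latin1_to_utf8()` and `polyfill_utf8_to_latin1()` are hypothetical, chosen for this example only, and the handling of malformed UTF-8 is a simplification that may not match the built-in functions byte for byte.

```php
<?php
// Hypothetical polyfill names; not part of PHP and not proposed in this thread.

/** Encode an ISO-8859-1 byte string as UTF-8 (the job utf8_encode() does). */
function polyfill_latin1_to_utf8(string $s): string
{
    $out = '';
    $len = strlen($s);
    for ($i = 0; $i < $len; $i++) {
        $byte = ord($s[$i]);
        if ($byte < 0x80) {
            $out .= $s[$i]; // ASCII is identical in both encodings
        } else {
            // U+0080..U+00FF become two bytes: 110000xx 10xxxxxx
            $out .= chr(0xC0 | ($byte >> 6)) . chr(0x80 | ($byte & 0x3F));
        }
    }
    return $out;
}

/** Convert UTF-8 to ISO-8859-1; anything outside U+0000..U+00FF becomes '?'. */
function polyfill_utf8_to_latin1(string $s): string
{
    $out = '';
    $len = strlen($s);
    for ($i = 0; $i < $len; $i++) {
        $byte = ord($s[$i]);
        if ($byte < 0x80) {
            $out .= $s[$i];
            continue;
        }
        $hasTrail = $i + 1 < $len && (ord($s[$i + 1]) & 0xC0) === 0x80;
        // Lead bytes 0xC2 and 0xC3 cover exactly U+0080..U+00FF
        if (($byte === 0xC2 || $byte === 0xC3) && $hasTrail) {
            $out .= chr((($byte & 0x03) << 6) | (ord($s[$i + 1]) & 0x3F));
            $i++;
            continue;
        }
        // Unrepresentable or malformed: emit '?' and skip trailing bytes
        $out .= '?';
        while ($i + 1 < $len && (ord($s[$i + 1]) & 0xC0) === 0x80) {
            $i++;
        }
    }
    return $out;
}

echo bin2hex(polyfill_latin1_to_utf8("caf\xE9")), PHP_EOL; // 636166c3a9
```

Writing it out makes the hidden assumption visible: the input is treated as ISO-8859-1 no matter what it actually contains, which is exactly the check most callers skip.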
Regards,
Rowan Collins
[IMSoP]
Hi!
> utf8_decode()/utf8_encode() are, at best, extremely misleading names.
> Many uses of them in my experience go something like this: "I have an
> encoding problem, it's something to do with UTF-8, I'll try
> utf8_encode; hm, that didn't work, I'll try utf8_decode instead".
I still think that if one does it without even reading the one-line
description in the manual, they deserve what they get.
I know people will click on any binary that says "please click on me,
honestly I'm not a trojan", but aren't programmers supposed to know a little
better? Like not running functions when they don't know what they do,
without even reading the one-line description?
> They are certainly used in many, many places; but I would wager that
> almost all of those uses are broken because they make no effort to
> confirm that 8859-1 is the right source / target encoding. I think
> deprecating (but not removing) them might be sensible, because it would
> discourage this broken logic.
Deprecating - maybe, since there are better alternatives to it anyway
(like iconv or recode), but I don't see much point in removing unless we
have no usage at all.
--
Stas Malyshev
smalyshev@gmail.com
> Hi all,
> utf8_decode() and utf8_encode() are not needed and are causing more
> problems than they solve.
> https://wiki.php.net/rfc/remove_utf_8_decode_encode
> Proposal
> - Document their deprecation now
> - Remove them in 7.2
> I think only a few users are using them, and they shouldn't have problems
> using mbstring/iconv/intl functions instead.
> Any comments?
Many people are concerned about BC. That's reasonable.
How about documenting their deprecation now, without mentioning when they
will be removed? They may exist for a long time, but users are better off
using generic encoding conversion functions.
The additional note for them might be:
utf8_encode()/utf8_decode() are deprecated in favor of other encoding
conversion features, e.g. iconv(), the UConverter class and
mb_convert_encoding().
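For reference, a short sketch of what the three alternatives named in that note look like for the ISO-8859-1 to UTF-8 direction. It assumes nothing about which extensions are installed, so each call is guarded; the variable names are illustrative only.

```php
<?php
$latin1 = "caf\xE9"; // "café" encoded as ISO-8859-1

$results = [];
if (function_exists('mb_convert_encoding')) {
    // mbstring: target encoding first, then source encoding
    $results['mbstring'] = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
}
if (function_exists('iconv')) {
    // iconv: source encoding first, then target encoding
    $results['iconv'] = iconv('ISO-8859-1', 'UTF-8', $latin1);
}
if (class_exists('UConverter')) {
    // intl: target encoding first, then source encoding
    $results['intl'] = UConverter::transcode($latin1, 'UTF-8', 'ISO-8859-1');
}

foreach ($results as $extension => $utf8) {
    echo $extension, ': ', bin2hex($utf8), PHP_EOL;
}
```

Unlike utf8_encode(), every one of these forces the caller to name the source encoding, so the ISO-8859-1 assumption is visible at the call site.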
Regards,
--
Yasuo Ohgaki
yohgaki@ohgaki.net
Hi Yasuo
2016-08-16 0:00 GMT+02:00 Yasuo Ohgaki yohgaki@ohgaki.net:
> Many people are concerned about BC. That's reasonable.
> How about documenting their deprecation now, without mentioning when they
> will be removed? They may exist for a long time, but users are better off
> using generic encoding conversion functions.
I think that is a good idea. But instead of using PHP_FE_DEP, maybe
use an inline php_error_docref() to point out possible
alternatives (a short pointer, like: "this function is deprecated,
please look at alternatives such as extension X or extension Y").
What do you think?
--
regards,
Kalle Sommer Nielsen
kalle@php.net
>> utf8_decode() and utf8_encode() are not needed and are causing more
>> problems than they solve.
>> https://wiki.php.net/rfc/remove_utf_8_decode_encode
>> Proposal
>> - Document their deprecation now
>> - Remove them in 7.2
>> I think only a few users are using them, and they shouldn't have problems
>> using mbstring/iconv/intl functions instead.
>> Any comments?
> Many people are concerned about BC. That's reasonable.
> How about documenting their deprecation now, without mentioning when they
> will be removed? They may exist for a long time, but users are better off
> using generic encoding conversion functions.
> The additional note for them might be:
> utf8_encode()/utf8_decode() are deprecated in favor of other encoding
> conversion features, e.g. iconv(), the UConverter class and
> mb_convert_encoding().
I'm afraid that's too easily overlooked. Actually, nobody who knows the
workings of utf8_(en|de)code would even have a look at its man page,
particularly if they're not working on the code that uses these functions.
Anyhow, actually I think that utf8_(en|de)code are not the real problem,
but rather the complete XML extension is. xml_set_object() causes
memory issues which have to be handled in userland[1],
xml_parse_into_struct() is more than doubtful[2] and the whole concept
of attaching callbacks manually (instead of extending a base class) is
so clumsy[3].
Furthermore the extension is basically without an active maintainer. Thies
made his last commit in 2002, and Rob made only 1 bugfix since 2010,
even though there are several unresolved issues[4].
Unless a new active maintainer steps forward, we should consider moving
ext/xml to PECL, in which case the deprecation of utf8_(en|de)code might
be unnecessary.
[1] http://news.php.net/php.internals/94998
[2] http://news.php.net/php.internals/95033
[3] http://news.php.net/php.internals/95018
[4] https://bugs.php.net/search.php?cmd=display&bug_type=All&status=Open&package_name%5B%5D=%2AXML+functions&by=Any&limit=30
--
Christoph M. Becker
> Unless a new active maintainer steps forward, we should consider moving
> ext/xml to PECL, in which case the deprecation of utf8_(en|de)code might
> be unnecessary.
I don't think that really follows at all. Until this week, I had no idea
that utf8_(en|de)code had anything at all to do with ext/xml, and I can
pretty much guarantee that the majority of people using them also have
no idea.
Whether or not we happen to be removing all the functions in an
extension, before we can move anything to PECL, we still need to
consider who is using it, and what the appropriate deprecation time
frame is.
To be honest, I don't think any of the functions in that extension can
be removed without a decent period of deprecation, because however
horrible and unmaintained they are, they are extremely widely used.
Regards,
Rowan Collins
[IMSoP]
> Hi all,
> utf8_decode() and utf8_encode() are not needed and are causing more
> problems than they solve.
> https://wiki.php.net/rfc/remove_utf_8_decode_encode
> Proposal
> - Document their deprecation now
> - Remove them in 7.2
> I think only a few users are using them, and they shouldn't have problems
> using mbstring/iconv/intl functions instead.
> Any comments?
Yes, one, sadly short:
Absolutely no go.
I do not see a remotely valid reason to remove these functions, which are
much more widely used than this RFC or thread suggests.
I am against removing them, now or in 8.x. This is the kind of thing that
makes migrations painful. They are broken in some cases? Fix them, maybe?
> I am against removing them, now or in 8.x. This is the kind of thing that
> makes migrations painful. They are broken in some cases? Fix them, maybe?
The thing that is broken about the functions is not the functionality,
but the name.
I suppose you could change them to recode Windows CP-1252, which is more
common than latin-1 and mostly compatible with it, but that still breaks
European sites using ISO 8859-15 (latin 9), and frankly would just
further reinforce the common misapprehensions around "extended ASCII"
and "decoding UTF8".
What do you think of making them aliases of the new names latin1_to_utf8
and utf8_to_latin1? We needn't even deprecate the old names, but at
least it would draw more attention to what they actually do?
Regards,
Rowan Collins
[IMSoP]
Hi all,
>> I am against removing them, now or in 8.x. This is the kind of thing that
>> makes migrations painful. They are broken in some cases? Fix them, maybe?
> The thing that is broken about the functions is not the functionality, but
> the name.
> I suppose you could change them to recode Windows CP-1252, which is more
> common than latin-1 and mostly compatible with it, but that still breaks
> European sites using ISO 8859-15 (latin 9), and frankly would just further
> reinforce the common misapprehensions around "extended ASCII" and "decoding
> UTF8".
> What do you think of making them aliases of the new names latin1_to_utf8 and
> utf8_to_latin1? We needn't even deprecate the old names, but at least it
> would draw more attention to what they actually do?
I agree that the name is problematic, not what it does.
Having aliases is a good solution. It worked well in the pgsql module. The
old names are still usable and I think almost all code uses the new names now.
I'll update the RFC to have aliases rather than removal. Since these
are XML module functions, the names would be xml_latin1_to_utf8() and
xml_utf8_to_latin1(). Suggestions for names are appreciated.
Regards,
--
Yasuo Ohgaki
yohgaki@ohgaki.net
> I'll update the RFC to have aliases rather than removal. Since these
> are XML module functions, the names would be xml_latin1_to_utf8() and
> xml_utf8_to_latin1(). Suggestions for names are appreciated.
+1, I think those are the names to go with.
--
Regards,
Mike
> Since these
> are XML module functions, the names would be xml_latin1_to_utf8() and
> xml_utf8_to_latin1(). Suggestions for names are appreciated.
From a user's point of view, these functions have nothing to do with
XML, so I'm not sure the prefix really makes sense. I had no idea until
a few days ago that they were in the same extension in the source, and I
suspect most users aren't even aware that "built-in" functions like this
are arranged in "extensions" at all.
The naming convention in CODING_STANDARDS [1] doesn't actually make
reference to extensions, only a "parent set", so I don't think "xml_" is
a mandatory or natural prefix according to that rule.
[1] https://github.com/php/php-src/blob/master/CODING_STANDARDS
As far as I can see, these functions exist because the XML parser
infrastructure needed them, and someone thought it might be handy to
expose them to users. Funnily enough, the internal versions actually
take a parameter for the target encoding, but only support US-ASCII and
8859-1: https://github.com/php/php-src/blob/master/ext/xml/xml.c#L283
If anything, they should probably have a "str_" prefix, and maybe even
be moved into the string section of the source, exposed in such a way
that the XML parser can still make use of them.
Regards,
Rowan Collins
[IMSoP]
> As far as I can see, these functions exist because the XML parser
> infrastructure needed them, and someone thought it might be handy to
> expose them to users. Funnily enough, the internal versions actually
> take a parameter for the target encoding, but only support US-ASCII and
> 8859-1: https://github.com/php/php-src/blob/master/ext/xml/xml.c#L283
> If anything, they should probably have a "str_" prefix, and maybe even
> be moved into the string section of the source, exposed in such a way
> that the XML parser can still make use of them.
Thanks for looking deeper. That makes even more sense now.
--
Regards,
Mike
>> As far as I can see, these functions exist because the XML parser
>> infrastructure needed them, and someone thought it might be handy to
>> expose them to users. Funnily enough, the internal versions actually
>> take a parameter for the target encoding, but only support US-ASCII and
>> 8859-1: https://github.com/php/php-src/blob/master/ext/xml/xml.c#L283
>> If anything, they should probably have a "str_" prefix, and maybe even
>> be moved into the string section of the source, exposed in such a way
>> that the XML parser can still make use of them.
> Thanks for looking deeper. That makes even more sense now.
The original code pre-dates the move to ext/ in 1999, where utf8_decode
is hard-coded as ISO-8859-1 but uses xml_utf8_decode internally. At that
time, of course, there was no provision for multi-byte characters, and the
decoding of a string is hard-coded in the function. If you look closer
you will see that xml_utf8_decode still expects strings of type XML_Char,
and so utf8_decode() wraps that to hide the differences.
--
Lester Caine - G8HFL
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk
>> As far as I can see, these functions exist because the XML parser
>> infrastructure needed them, and someone thought it might be handy to
>> expose them to users. Funnily enough, the internal versions actually
>> take a parameter for the target encoding, but only support US-ASCII and
>> 8859-1: https://github.com/php/php-src/blob/master/ext/xml/xml.c#L283
>> If anything, they should probably have a "str_" prefix, and maybe even
>> be moved into the string section of the source, exposed in such a way
>> that the XML parser can still make use of them.
> Thanks for looking deeper. That makes even more sense now.
+1
--
Christoph M. Becker
Hi all,
>> As far as I can see, these functions exist because the XML parser
>> infrastructure needed them, and someone thought it might be handy to
>> expose them to users. Funnily enough, the internal versions actually
>> take a parameter for the target encoding, but only support US-ASCII and
>> 8859-1: https://github.com/php/php-src/blob/master/ext/xml/xml.c#L283
>> If anything, they should probably have a "str_" prefix, and maybe even
>> be moved into the string section of the source, exposed in such a way
>> that the XML parser can still make use of them.
> Thanks for looking deeper. That makes even more sense now.
Any more comments on prefixing with "str_"?
str_latin1_to_utf8() == utf8_encode()
str_utf8_to_latin1() == utf8_decode()
I'm a little uncomfortable having special new encoding conversion
functions for ISO-8859-1 in ext/standard. However, it's better than
keeping utf8_decode()/utf8_encode() as the primary function names forever.
Although the encoding parameter is not exposed to users, the XML
module's internal code for utf8_encode()/utf8_decode() supports ISO-8859-1
and ASCII (converting chars > 127 to '?'). If this resolution is adopted,
I'll remove ASCII support and make it work only for ISO-8859-1. No
external library is used. The new functions can be defined as ext/standard
functions.
Users cannot specify the encoding, so there is no BC break in userland. The
internal APIs are exposed, so 3rd party modules may see a BC break. The
internal APIs are named xml_utf8_encode/decode(). I would not like to keep
them in ext/standard nor expose them to 3rd party module developers.
Alternatively, we may keep the XML module as it is now and add
"xml_"-prefixed functions
xml_latin1_to_utf8() == utf8_encode()
xml_utf8_to_latin1() == utf8_decode()
then encourage users to use general encoding conversion functions in
the manual. I prefer this way.
May I have voting choices?
Regards,
--
Yasuo Ohgaki
yohgaki@ohgaki.net
Hi Yasuo,
Yasuo Ohgaki wrote:
> utf8_decode() and utf8_encode() are not needed and are causing more
> problems than they solve.
> https://wiki.php.net/rfc/remove_utf_8_decode_encode
> Proposal
> - Document their deprecation now
> - Remove them in 7.2
> I think only a few users are using them, and they shouldn't have problems
> using mbstring/iconv/intl functions instead.
> Any comments?
I don't agree with this. utf8_decode() and utf8_encode() are functions which
you probably ought not to use in modern code, and the names are maybe
unhelpful (decode to what? encode from what?). But the job they do is
sometimes needed (if you're dealing with this specific legacy encoding),
and I believe they work correctly. Plus, a lot of existing code uses
them. This seems like a needless deprecation for this reason.
I would propose something else: remove them from the XML extension, and
move them somewhere more fitting, like ext/intl, ext/mbstring or maybe
ext/standard. These are generic functions which work on any text, not
just XML, and their inclusion is mutually superfluous with respect to
XML: if you're decoding XML, you don't necessarily need to convert text
to/from UTF-8, and if you're converting text to/from UTF-8, you don't
necessarily need to deal with XML. Plus, given the names alone, you'd
have no idea they're part of the XML extension.
Also, to avoid confusion, maybe they could be renamed to
iso88591_to_utf8() and utf8_to_iso88591(), with the old names kept as
aliases. I got this idea from this comment:
http://php.net/manual/en/function.utf8-encode.php#104906
Another thing to consider is that the manual perhaps ought to warn the
user that ISO-8859-1 is not Windows-1252. A lot of text on the Internet
marked as the former is actually the latter (thanks to the widespread
use of Windows), and browsers assume this. Windows-1252 contains some
extra printable characters where ISO-8859-1 has control characters, such
as the Euro sign, curly quotes, the trademark sign, and some extra
lengths of dash. So, interpreting Windows-1252 text as ISO-8859-1 will
garble such characters.
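A small sketch of that difference, assuming the mbstring extension is available (the call is guarded in case it is not). Byte 0x80 is a C1 control character in ISO-8859-1 but the Euro sign in Windows-1252, so the same input yields different UTF-8 depending on which source encoding is assumed.

```php
<?php
$bytes = "\x80 100"; // typically meant as "€ 100" by a Windows-1252 producer

$asLatin1 = null;
$asCp1252 = null;
if (function_exists('mb_convert_encoding')) {
    // Treating the input as ISO-8859-1 is what utf8_encode() effectively does
    $asLatin1 = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
    // Treating it as Windows-1252 matches what browsers assume
    $asCp1252 = mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');

    echo bin2hex($asLatin1), PHP_EOL; // begins with c280   (U+0080, invisible control)
    echo bin2hex($asCp1252), PHP_EOL; // begins with e282ac (U+20AC, Euro sign)
}
```

Both results are valid UTF-8, which is why this kind of mix-up tends to go unnoticed until the text is displayed.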
Thanks.
--
Andrea Faulds
https://ajf.me/
Hi,
This is a follow-up to what I wrote in the utf8_encode()/utf8_decode()
discussion earlier:
Andrea Faulds wrote:
> I would propose something else: remove them from the XML extension, and
> move them somewhere more fitting, like ext/intl, ext/mbstring or maybe
> ext/standard. These are generic functions which work on any text, not
> just XML, and their inclusion is mutually superfluous with respect to
> XML: if you're decoding XML, you don't necessarily need to convert text
> to/from UTF-8, and if you're converting text to/from UTF-8, you don't
> necessarily need to deal with XML. Plus, given the names alone, you'd
> have no idea they're part of the XML extension.
Since these functions are generic string functions that have no
dependency on libxml, I've written a patch to move them to ext/standard,
and simplified their code a little bit.
Pull request here: https://github.com/php/php-src/pull/2160
This doesn't currently do any function renaming or aliasing, but I
should probably do that next. Plus, the manual still needs updating.
Are there any objections to this move? There'd be no
backwards-compatibility break.
Thanks!
--
Andrea Faulds
https://ajf.me/
On 2016-10-14 at 00:42, Andrea Faulds wrote:
> Hi,
> This is a follow-up to what I wrote in the utf8_encode()/utf8_decode()
> discussion earlier:
> Andrea Faulds wrote:
>> I would propose something else: remove them from the XML extension, and
>> move them somewhere more fitting, like ext/intl, ext/mbstring or maybe
>> ext/standard. These are generic functions which work on any text, not
>> just XML, and their inclusion is mutually superfluous with respect to
>> XML: if you're decoding XML, you don't necessarily need to convert text
>> to/from UTF-8, and if you're converting text to/from UTF-8, you don't
>> necessarily need to deal with XML. Plus, given the names alone, you'd
>> have no idea they're part of the XML extension.
> Since these functions are generic string functions that have no
> dependency on libxml, I've written a patch to move them to
> ext/standard, and simplified their code a little bit.
> Pull request here: https://github.com/php/php-src/pull/2160
> This doesn't currently do any function renaming or aliasing, but I
> should probably do that next. Plus, the manual still needs updating.
> Are there any objections to this move? There'd be no
> backwards-compatibility break.
> Thanks!
I think this is a very good way forward. At the moment we are
planning a migration project going from PHP 5.x to 7.x where
the content is mostly encoded in ISO-8859-1 and in some places
UTF-8.
We use these functions to convert when needed, so removing
them is in my eyes a bad idea since it would hamper our effort
to migrate towards PHP 7. And we want to focus on one thing
at a time, meaning not mixing the PHP 7 and UTF-8 migrations. We
also cannot justify the cost of moving content to UTF-8 since
there is no added value for our end-users in it.
So please keep these functions and don't remove them!
Regards //Björn Larsson
Hi again,
I rewrote the manual entries on utf8_encode() and utf8_decode() to be
more helpful:
http://svn.php.net/viewvc?view=revision&revision=340506
And they have now been moved to ext/standard in master:
https://github.com/php/php-src/pull/2160
I hope this settles this issue mostly.
Thanks.
--
Andrea Faulds
https://ajf.me/
> Hi all,
> utf8_decode() and utf8_encode() are not needed and are causing more
> problems than they solve.
> https://wiki.php.net/rfc/remove_utf_8_decode_encode
> Proposal
> - Document their deprecation now
> - Remove them in 7.2
> I think only a few users are using them, and they shouldn't have problems
> using mbstring/iconv/intl functions instead.
> Any comments?
Thank you for the comments!
I think it would be best if we document the deprecation and leave the
functions as they are now.
I'll start the vote tomorrow. If you have comments, please send them asap.
Thank you.
Regards,
--
Yasuo Ohgaki
yohgaki@ohgaki.net
>> utf8_decode() and utf8_encode() are not needed and are causing more
>> problems than they solve.
>> https://wiki.php.net/rfc/remove_utf_8_decode_encode
>> ...
> I think it would be best if we document the deprecation and leave the
> functions as they are now.
> I'll start the vote tomorrow. If you have comments, please send them asap.
> Thank you.
The RFC title is confusing. I think it should be changed to
"Deprecate utf8_decode() and utf8_encode()".
--
Oishi Kazuo
>>> utf8_decode() and utf8_encode() are not needed and are causing more
>>> problems than they solve.
>>> https://wiki.php.net/rfc/remove_utf_8_decode_encode
>>> ...
>> I think it would be best if we document the deprecation and leave the
>> functions as they are now.
>> I'll start the vote tomorrow. If you have comments, please send them asap.
>> Thank you.
> The RFC title is confusing. I think it should be changed to
> "Deprecate utf8_decode() and utf8_encode()".
Thanks! Fixed!
Regards,
--
Yasuo Ohgaki
yohgaki@ohgaki.net
2016-09-09 7:21 GMT+02:00 Yasuo Ohgaki yohgaki@ohgaki.net:
>> Hi all,
>> utf8_decode() and utf8_encode() are not needed and are causing more
>> problems than they solve.
>> https://wiki.php.net/rfc/remove_utf_8_decode_encode
>> Proposal
>> - Document their deprecation now
>> - Remove them in 7.2
>> I think only a few users are using them, and they shouldn't have problems
>> using mbstring/iconv/intl functions instead.
>> Any comments?
> Thank you for the comments!
> I think it would be best if we document the deprecation and leave the
> functions as they are now.
> I'll start the vote tomorrow. If you have comments, please send them asap.
> Thank you.
> Regards,
> --
> Yasuo Ohgaki
> yohgaki@ohgaki.net
Do you plan to emit E_DEPRECATED or only deprecate them via documentation?
What about the alias suggestions?
Regards, Niklas
Hi Niklas,
> Do you plan to emit E_DEPRECATED or only deprecate them via documentation?
> What about the alias suggestions?
No E_DEPRECATED. Only a documented deprecation.
From the discussion, I thought aliases would bring more problems than they
solve, so I left aliasing as future scope.
If anyone insists, I don't mind having an alias voting option in the RFC.
Regards,
--
Yasuo Ohgaki
yohgaki@ohgaki.net
2016-09-09 10:31 GMT+02:00 Yasuo Ohgaki yohgaki@ohgaki.net:
> Hi Niklas,
>> Do you plan to emit E_DEPRECATED or only deprecate them via documentation?
>> What about the alias suggestions?
> No E_DEPRECATED. Only a documented deprecation.
> From the discussion, I thought aliases would bring more problems than they
> solve, so I left aliasing as future scope.
> If anyone insists, I don't mind having an alias voting option in the RFC.
I don't think we should have a vote to deprecate something in the
manual that is not actually emitting an E_DEPRECATED in php-src. Just
go ahead and change the manual; we have lots of similar things already
in there.
Another note to keep in mind: if these were to be moved to any other
extension, we would have to use our own implementation, or a library
which we can use, because if my memory serves me right, these
functions are implemented via libxml2. Although we always enable
ext/libxml, at least on Windows, this could be a minor issue if they are
moved.
I'm neutral on the renaming.
--
regards,
Kalle Sommer Nielsen
kalle@php.net
Hi Kalle and all,
2016-09-09 10:31 GMT+02:00 Yasuo Ohgaki yohgaki@ohgaki.net:
>> Hi Niklas,
>>> Do you plan to emit E_DEPRECATED or only deprecate them via documentation?
>>> What about the alias suggestions?
>> No E_DEPRECATED. Only a documented deprecation.
>> From the discussion, I thought aliases would bring more problems than they
>> solve, so I left aliasing as future scope.
>> If anyone insists, I don't mind having an alias voting option in the RFC.
> I don't think we should have a vote to deprecate something in the
> manual that is not actually emitting an E_DEPRECATED in php-src. Just
> go ahead and change the manual; we have lots of similar things already
> in there.
I'm fine with this.
I'll just update the manual in a few days.
Let me know soon if you have comments on this.
Regards,
--
Yasuo Ohgaki
yohgaki@ohgaki.net