json_encode() bug - Externals

18 years ago by Stanislav Malyshev — view source — reply

unread

Hi!

Right now, if json_encode sees wrong UTF-8 data, it just cuts the string
in the middle, no error returned, no message produced. Example:

var_dump(json_encode("ab\xE0"));
var_dump(json_encode("ab\xE0""));

Both strings get cut at "ab". I think it's not a good idea to just
silently cut the data. In fact, I think it is a bug caused by this code
in ext/json/utf8_to_utf16.c:
if (c < 0) {
return UTF8_END ? the_index : UTF8_ERROR;
}
which inherited this bug from code published on json.org. It should be:
if (c < 0) {
return (c == UTF8_END) ? the_index : UTF8_ERROR;
}
Now this is an easy fix but would lead to bad strings silently converted
to empty strings. The question is - should we have an error there? If
so, which one - E_WARNING, E_NOTICE? I'm for E_WARNING.
Also filed as bug #43941.
Any comments?

Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com

18 years ago by Nathan Rixham — view source — reply

unread

quick work around for now..
base64_decode(json_decode(json_encode(base64_encode("ab\xE0" something"))))

Stanislav Malyshev wrote:

Hi!

Right now, if json_encode sees wrong UTF-8 data, it just cuts the string
in the middle, no error returned, no message produced. Example:

var_dump(json_encode("ab\xE0"));
var_dump(json_encode("ab\xE0""));

Both strings get cut at "ab". I think it's not a good idea to just
silently cut the data. In fact, I think it is a bug caused by this code
in ext/json/utf8_to_utf16.c:
if (c < 0) {
return UTF8_END ? the_index : UTF8_ERROR;
}
which inherited this bug from code published on json.org. It should be:
if (c < 0) {
return (c == UTF8_END) ? the_index : UTF8_ERROR;
}
Now this is an easy fix but would lead to bad strings silently converted
to empty strings. The question is - should we have an error there? If
so, which one - E_WARNING, E_NOTICE? I'm for E_WARNING.
Also filed as bug #43941.
Any comments?

18 years ago by Cristian Rodriguez — view source — reply

unread

2008/1/25, Stanislav Malyshev stas@zend.com:

Now this is an easy fix but would lead to bad strings silently converted
to empty strings. The question is - should we have an error there? If
so, which one - E_WARNING, E_NOTICE? I'm for E_WARNING.

Yes , E_WARNING is the right thing to have IMHO.

--
http://www.cristianrodriguez.net

18 years ago by Pierre Joye — view source — reply

unread

2008/1/25, Stanislav Malyshev stas@zend.com:

Now this is an easy fix but would lead to bad strings silently converted
to empty strings. The question is - should we have an error there? If
so, which one - E_WARNING, E_NOTICE? I'm for E_WARNING.

Yes , E_WARNING is the right thing to have IMHO.

Stupid question, who actually checks for E_* in his code at runtime
after having called such functions? Not me and I would hate to. It
sounds to me like a perfect exception use case. As this function can
return nearly everything scalar we have (boolean, string, null,
integer,...) we can't use our normal "returns FALSE on failure".

Cheers,

Pierre
http://blog.thepimp.net | http://www.libgd.org

18 years ago by Stanislav Malyshev — view source — reply

unread

Stupid question, who actually checks for E_* in his code at runtime
after having called such functions? Not me and I would hate to. It
sounds to me like a perfect exception use case. As this function can

General policy in PHP as far as I know is that non-OO functions do not
do exceptions.

return nearly everything scalar we have (boolean, string, null,
integer,...) we can't use our normal "returns FALSE on failure".

This function can return only a string, but I'm not sure returning false
if somewhere in some value deep down your data is some bad utf-8 is
warranted. It is doable, though, but I'm afraid most of current code
never check for return of `json_encode()` so they are in for big surprises
when json_encode won't produce valid JSON.

Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com

18 years ago by Pierre Joye — view source — reply

unread

Stupid question, who actually checks for E_* in his code at runtime
after having called such functions? Not me and I would hate to. It
sounds to me like a perfect exception use case. As this function can

General policy in PHP as far as I know is that non-OO functions do not
do exceptions.

return nearly everything scalar we have (boolean, string, null,
integer,...) we can't use our normal "returns FALSE on failure".

This function can return only a string, but I'm not sure returning false
if somewhere in some value deep down your data is some bad utf-8 is
warranted. It is doable, though, but I'm afraid most of current code
never check for return of json_encode() so they are in for big surprises
when json_encode won't produce valid JSON.

Oh right, I thought about/read json_decode. But the idea is the same
yes. Not checking return values is a very bad habit, I can't count how
many times I saw codes dying miserably only because they don't check
returns values. Would it make sense to use exception for such cases?
Aka to change our policy? I'm not an excpeption fan, but these cases
are good examples of exception usages. We can invent a meta type
"error" but it would be rather confusing ;)

--
Pierre
http://blog.thepimp.net | http://www.libgd.org

18 years ago by Derick Rethans — view source — reply

unread

Hi!

Right now, if json_encode sees wrong UTF-8 data, it just cuts the string in
the middle, no error returned, no message produced. Example:

var_dump(json_encode("ab\xE0"));
var_dump(json_encode("ab\xE0""));

Both strings get cut at "ab". I think it's not a good idea to just silently
cut the data. In fact, I think it is a bug caused by this code in
ext/json/utf8_to_utf16.c:
if (c < 0) {
return UTF8_END ? the_index : UTF8_ERROR;
}
which inherited this bug from code published on json.org. It should be:
if (c < 0) {
return (c == UTF8_END) ? the_index : UTF8_ERROR;
}
Now this is an easy fix but would lead to bad strings silently converted to
empty strings. The question is - should we have an error there? If so, which
one - E_WARNING, E_NOTICE? I'm for E_WARNING.
Also filed as bug #43941.
Any comments?

Like I mentioned before (I think), it should not return an empty string
of course because programmatically it's not possible to check for this.
As most of our functions return false in those cases, so should this
function.

Derick

18 years ago by Stanislav Malyshev — view source — reply

unread

Like I mentioned before (I think), it should not return an empty string
of course because programmatically it's not possible to check for this.
As most of our functions return false in those cases, so should this
function.

AFAIR false is not valid JSON, so it would break a lot of code. Also, I
am not sure we should change json_encode to return false on whole
structure if one of the fields contains invalid utf-8.

--
Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com

18 years ago by Derick Rethans — view source — reply

unread

Like I mentioned before (I think), it should not return an empty string of
course because programmatically it's not possible to check for this. As most
of our functions return false in those cases, so should this function.

AFAIR false is not valid JSON, so it would break a lot of code. Also, I am not
sure we should change json_encode to return false on whole structure if one of
the fields contains invalid utf-8.

Of course we should make it possible to detect if there is broken data.
I think our normal mantra is "be strict with creating, be lenient while
parsing". Because we're generating data here, we should abort if we
find an error, and not return half baked results. To detect if we have
broken data, the option is to check a return value. As normally this
function returns a string, false (or null) is a good candidate for an
error-result.

regards,
Derick

--
Derick Rethans
http://derickrethans.nl | http://ezcomponents.org | http://xdebug.org

18 years ago by Richard Lynch — view source — reply

unread

Like I mentioned before (I think), it should not return an empty
string
of course because programmatically it's not possible to check for
this.
As most of our functions return false in those cases, so should this
function.

AFAIR false is not valid JSON, so it would break a lot of code. Also,
I
am not sure we should change json_encode to return false on whole
structure if one of the fields contains invalid utf-8.

I see what you are saying, but...

If you want to write good quality code, you have to do something
reasonable here.

PERHAPS:

Return false

E_WARNING

Put partially decoded value somewhere in a variable to be accessed, so
that the programmer can make an informed decision about what to do.

--
Some people have a "gift" link here.
Know what I want?
I want you to buy a CD from some indie artist.
http://cdbaby.com/from/lynch
Yeah, I get a buck. So?

18 years ago by Derick Rethans — view source — reply

unread

Like I mentioned before (I think), it should not return an empty
string of course because programmatically it's not possible to
check for this. As most of our functions return false in those
cases, so should this function.

AFAIR false is not valid JSON, so it would break a lot of code.
Also, I am not sure we should change json_encode to return false on
whole structure if one of the fields contains invalid utf-8.

[snip]

Put partially decoded value somewhere in a variable to be accessed, so
that the programmer can make an informed decision about what to do.

We're talking about encoding here though...

regards,
Derick

--
Derick Rethans
http://derickrethans.nl | http://ezcomponents.org | http://xdebug.org

17 years ago by Alan Knowles — view source — reply

unread

This should really be fixed similar to the iconv //IGNORE flag - so
that bad characters are just replaced with '?'

We use it to render spam email summaries, and dont really care if the
encoding is incorrect, just as long as it shows something.

Throwing a warning without having a fix/workaround, just reduces the
usefulness of the function.

Regards
Alan

Stanislav Malyshev wrote:

Hi!

Right now, if json_encode sees wrong UTF-8 data, it just cuts the
string in the middle, no error returned, no message produced. Example:

var_dump(json_encode("ab\xE0"));
var_dump(json_encode("ab\xE0""));

Both strings get cut at "ab". I think it's not a good idea to just
silently cut the data. In fact, I think it is a bug caused by this
code in ext/json/utf8_to_utf16.c:
if (c < 0) {
return UTF8_END ? the_index : UTF8_ERROR;
}
which inherited this bug from code published on json.org. It should be:
if (c < 0) {
return (c == UTF8_END) ? the_index : UTF8_ERROR;
}
Now this is an easy fix but would lead to bad strings silently
converted to empty strings. The question is - should we have an error
there? If so, which one - E_WARNING, E_NOTICE? I'm for E_WARNING.
Also filed as bug #43941.
Any comments?

17 years ago by Stanislav Malyshev — view source — reply

unread

Hi!

This should really be fixed similar to the iconv //IGNORE flag - so
that bad characters are just replaced with '?'

Maybe using iconv is a way to go in your scenario?

--
Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com