Hi!
Right now, if json_encode sees wrong UTF-8 data, it just cuts the string
in the middle, no error returned, no message produced. Example:
var_dump(json_encode("ab\xE0"));
var_dump(json_encode("ab\xE0""));
Both strings get cut at "ab". I think it's not a good idea to just
silently cut the data. In fact, I think it is a bug caused by this code
in ext/json/utf8_to_utf16.c:
if (c < 0) {
return UTF8_END ? the_index : UTF8_ERROR;
}
which inherited this bug from code published on json.org. It should be:
if (c < 0) {
return (c == UTF8_END) ? the_index : UTF8_ERROR;
}
Now this is an easy fix but would lead to bad strings silently converted
to empty strings. The question is - should we have an error there? If
so, which one - E_WARNING, E_NOTICE? I'm for E_WARNING.
Also filed as bug #43941.
Any comments?
Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com
quick work around for now..
base64_decode(json_decode(json_encode(base64_encode("ab\xE0" something"))))
Stanislav Malyshev wrote:
Hi!
Right now, if json_encode sees wrong UTF-8 data, it just cuts the string
in the middle, no error returned, no message produced. Example:var_dump(json_encode("ab\xE0"));
var_dump(json_encode("ab\xE0""));Both strings get cut at "ab". I think it's not a good idea to just
silently cut the data. In fact, I think it is a bug caused by this code
in ext/json/utf8_to_utf16.c:
if (c < 0) {
return UTF8_END ? the_index : UTF8_ERROR;
}
which inherited this bug from code published on json.org. It should be:
if (c < 0) {
return (c == UTF8_END) ? the_index : UTF8_ERROR;
}
Now this is an easy fix but would lead to bad strings silently converted
to empty strings. The question is - should we have an error there? If
so, which one - E_WARNING, E_NOTICE? I'm for E_WARNING.
Also filed as bug #43941.
Any comments?
2008/1/25, Stanislav Malyshev stas@zend.com:
Now this is an easy fix but would lead to bad strings silently converted
to empty strings. The question is - should we have an error there? If
so, which one - E_WARNING, E_NOTICE? I'm for E_WARNING.
Yes , E_WARNING
is the right thing to have IMHO.
2008/1/25, Stanislav Malyshev stas@zend.com:
Now this is an easy fix but would lead to bad strings silently converted
to empty strings. The question is - should we have an error there? If
so, which one - E_WARNING, E_NOTICE? I'm for E_WARNING.Yes ,
E_WARNING
is the right thing to have IMHO.
Stupid question, who actually checks for E_* in his code at runtime
after having called such functions? Not me and I would hate to. It
sounds to me like a perfect exception use case. As this function can
return nearly everything scalar we have (boolean, string, null,
integer,...) we can't use our normal "returns FALSE
on failure".
Cheers,
Stupid question, who actually checks for E_* in his code at runtime
after having called such functions? Not me and I would hate to. It
sounds to me like a perfect exception use case. As this function can
General policy in PHP as far as I know is that non-OO functions do not
do exceptions.
return nearly everything scalar we have (boolean, string, null,
integer,...) we can't use our normal "returnsFALSE
on failure".
This function can return only a string, but I'm not sure returning false
if somewhere in some value deep down your data is some bad utf-8 is
warranted. It is doable, though, but I'm afraid most of current code
never check for return of json_encode()
so they are in for big surprises
when json_encode won't produce valid JSON.
Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com
Stupid question, who actually checks for E_* in his code at runtime
after having called such functions? Not me and I would hate to. It
sounds to me like a perfect exception use case. As this function canGeneral policy in PHP as far as I know is that non-OO functions do not
do exceptions.return nearly everything scalar we have (boolean, string, null,
integer,...) we can't use our normal "returnsFALSE
on failure".This function can return only a string, but I'm not sure returning false
if somewhere in some value deep down your data is some bad utf-8 is
warranted. It is doable, though, but I'm afraid most of current code
never check for return ofjson_encode()
so they are in for big surprises
when json_encode won't produce valid JSON.
Oh right, I thought about/read json_decode. But the idea is the same
yes. Not checking return values is a very bad habit, I can't count how
many times I saw codes dying miserably only because they don't check
returns values. Would it make sense to use exception for such cases?
Aka to change our policy? I'm not an excpeption fan, but these cases
are good examples of exception usages. We can invent a meta type
"error" but it would be rather confusing ;)
--
Pierre
http://blog.thepimp.net | http://www.libgd.org
Hi!
Right now, if json_encode sees wrong UTF-8 data, it just cuts the string in
the middle, no error returned, no message produced. Example:var_dump(json_encode("ab\xE0"));
var_dump(json_encode("ab\xE0""));Both strings get cut at "ab". I think it's not a good idea to just silently
cut the data. In fact, I think it is a bug caused by this code in
ext/json/utf8_to_utf16.c:
if (c < 0) {
return UTF8_END ? the_index : UTF8_ERROR;
}
which inherited this bug from code published on json.org. It should be:
if (c < 0) {
return (c == UTF8_END) ? the_index : UTF8_ERROR;
}
Now this is an easy fix but would lead to bad strings silently converted to
empty strings. The question is - should we have an error there? If so, which
one - E_WARNING, E_NOTICE? I'm for E_WARNING.
Also filed as bug #43941.
Any comments?
Like I mentioned before (I think), it should not return an empty string
of course because programmatically it's not possible to check for this.
As most of our functions return false in those cases, so should this
function.
Derick
Like I mentioned before (I think), it should not return an empty string
of course because programmatically it's not possible to check for this.
As most of our functions return false in those cases, so should this
function.
AFAIR false is not valid JSON, so it would break a lot of code. Also, I
am not sure we should change json_encode to return false on whole
structure if one of the fields contains invalid utf-8.
--
Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com
Like I mentioned before (I think), it should not return an empty string of
course because programmatically it's not possible to check for this. As most
of our functions return false in those cases, so should this function.AFAIR false is not valid JSON, so it would break a lot of code. Also, I am not
sure we should change json_encode to return false on whole structure if one of
the fields contains invalid utf-8.
Of course we should make it possible to detect if there is broken data.
I think our normal mantra is "be strict with creating, be lenient while
parsing". Because we're generating data here, we should abort if we
find an error, and not return half baked results. To detect if we have
broken data, the option is to check a return value. As normally this
function returns a string, false (or null) is a good candidate for an
error-result.
regards,
Derick
--
Derick Rethans
http://derickrethans.nl | http://ezcomponents.org | http://xdebug.org
Like I mentioned before (I think), it should not return an empty
string
of course because programmatically it's not possible to check for
this.
As most of our functions return false in those cases, so should this
function.AFAIR false is not valid JSON, so it would break a lot of code. Also,
I
am not sure we should change json_encode to return false on whole
structure if one of the fields contains invalid utf-8.
I see what you are saying, but...
If you want to write good quality code, you have to do something
reasonable here.
PERHAPS:
Return false
E_WARNING
Put partially decoded value somewhere in a variable to be accessed, so
that the programmer can make an informed decision about what to do.
--
Some people have a "gift" link here.
Know what I want?
I want you to buy a CD from some indie artist.
http://cdbaby.com/from/lynch
Yeah, I get a buck. So?
Like I mentioned before (I think), it should not return an empty
string of course because programmatically it's not possible to
check for this. As most of our functions return false in those
cases, so should this function.AFAIR false is not valid JSON, so it would break a lot of code.
Also, I am not sure we should change json_encode to return false on
whole structure if one of the fields contains invalid utf-8.
[snip]
Put partially decoded value somewhere in a variable to be accessed, so
that the programmer can make an informed decision about what to do.
We're talking about encoding here though...
regards,
Derick
--
Derick Rethans
http://derickrethans.nl | http://ezcomponents.org | http://xdebug.org
This should really be fixed similar to the iconv //IGNORE flag - so
that bad characters are just replaced with '?'
We use it to render spam email summaries, and dont really care if the
encoding is incorrect, just as long as it shows something.
Throwing a warning without having a fix/workaround, just reduces the
usefulness of the function.
Regards
Alan
Stanislav Malyshev wrote:
Hi!
Right now, if json_encode sees wrong UTF-8 data, it just cuts the
string in the middle, no error returned, no message produced. Example:var_dump(json_encode("ab\xE0"));
var_dump(json_encode("ab\xE0""));Both strings get cut at "ab". I think it's not a good idea to just
silently cut the data. In fact, I think it is a bug caused by this
code in ext/json/utf8_to_utf16.c:
if (c < 0) {
return UTF8_END ? the_index : UTF8_ERROR;
}
which inherited this bug from code published on json.org. It should be:
if (c < 0) {
return (c == UTF8_END) ? the_index : UTF8_ERROR;
}
Now this is an easy fix but would lead to bad strings silently
converted to empty strings. The question is - should we have an error
there? If so, which one - E_WARNING, E_NOTICE? I'm for E_WARNING.
Also filed as bug #43941.
Any comments?
Hi!
This should really be fixed similar to the iconv //IGNORE flag - so
that bad characters are just replaced with '?'
Maybe using iconv is a way to go in your scenario?
--
Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com