Hi,
There are two issues (reported bugs but not really bugs) in json_decode
related to \u escape.
First one is
json_decode('{"\u0000": 1}');
reported in https://bugs.php.net/bug.php?id=68546
That code result in fatal error due to using malformed property (private
props starting with \0). I don't think that anything parsed in json_decode
should result in a fatal error. That's why I would like to introduce a new
json error called JSON_ERROR_MANGLED_PROPERTY_NAME .
Second one is
json_decode('"\ud834"');
which relusts non UTF string from JSON decoder. This is conformant to the
JSON RFC 7159 as noted in section 8.2:
However, the ABNF in this specification allows member names and
string values to contain bit sequences that cannot encode Unicode
characters; for example, "\uDEAD" (a single unpaired UTF-16
surrogate). Instances of this have been observed, for example, when
a library truncates a UTF-16 string without checking whether the
truncation split a surrogate pair. The behavior of software that
receives JSON texts containing such values is unpredictable; for
example, implementations might return different values for the length
of a string value or even suffer fatal runtime exceptions.
As the behavior is unpredictable, the current default result seems
reasonable because PHP strings are not internally unicode encode. However
there might be cases when user want to make sure that he/she gets unicode
string. In that case I would like to add an option called:
JSON_VALID_ESCAPED_UNICODE which will emit error called JSON_ERROR_UTF16
when such escape appears. I implemented this in jsond long time ago and
think that it would be useful for the json as well.
Thoughts?
I'm happy with changing constant names if someone come up with a better
names.
I would like to patch master sometimes next week if they are no objections.
Cheers
Jakub
Hi Jakub,
There are two issues (reported bugs but not really bugs) in json_decode
related to \u escape.First one is
json_decode('{"\u0000": 1}');
reported in https://bugs.php.net/bug.php?id=68546That code result in fatal error due to using malformed property (private
props starting with \0). I don't think that anything parsed in json_decode
should result in a fatal error. That's why I would like to introduce a new
json error called JSON_ERROR_MANGLED_PROPERTY_NAME .
Any invalid chars as variable/property name should be handled as invalid.
Valid variable name: '[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*'
http://php.net/manual/en/language.variables.basics.php
This violates JSON spec, but if user would like to allow invalid names. It
should be an option rather than the default. IMO.
[yohgaki@dev ~]$ php
<?php
$o = new StdClass;
$o->{123} = 11;
var_dump($o);
?>
class stdClass#1 (1) {
public $123 =>
int(11)
}
[yohgaki@dev ~]$ php
<?php
$o = new StdClass;
$o->123;
var_dump($o);
?>
PHP Parse error: syntax error, unexpected '123' (T_LNUMBER), expecting
identifier (T_STRING) or variable (T_VARIABLE) or '{' or '$' in - on line 3
Since JSON string must be UTF-8/16/32, any invalid UTF sequence
could be treated as invalid.
8.1. Character Encoding
JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default
encoding is UTF-8, and JSON texts that are encoded in UTF-8 are
interoperable in the sense that they will be read successfully by the
maximum number of implementations; there are many implementations
that cannot successfully read texts in other encodings (such as
UTF-16 and UTF-32).
Implementations MUST NOT add a byte order mark to the beginning of a
JSON text. In the interests of interoperability, implementations
that parse JSON texts MAY ignore the presence of a byte order mark
rather than treating it as an error.
https://tools.ietf.org/html/rfc7159#section-8.1
I prefer BOM as invalid sequence and raising error/return NULL.
Second one is
json_decode('"\ud834"');
which relusts non UTF string from JSON decoder. This is conformant to the
JSON RFC 7159 as noted in section 8.2:However, the ABNF in this specification allows member names and
string values to contain bit sequences that cannot encode Unicode
characters; for example, "\uDEAD" (a single unpaired UTF-16
surrogate). Instances of this have been observed, for example, when
a library truncates a UTF-16 string without checking whether the
truncation split a surrogate pair. The behavior of software that
receives JSON texts containing such values is unpredictable; for
example, implementations might return different values for the length
of a string value or even suffer fatal runtime exceptions.As the behavior is unpredictable, the current default result seems
reasonable because PHP strings are not internally unicode encode. However
there might be cases when user want to make sure that he/she gets unicode
string. In that case I would like to add an option called:
JSON_VALID_ESCAPED_UNICODE which will emit error called JSON_ERROR_UTF16
JSON_ERROR_UTF16 would be better defined as JSON_ERROR_UTF as
JSON accepts valid UTF sequence.
It's also better to reject any invalid UTF sequence, not limited to Unicode
escaped
(\uXXXX) string. If it does not validate Unicode sequence, I would add the
validation.
when such escape appears. I implemented this in jsond long time ago and
think that it would be useful for the json as well.Thoughts?
JSON does not forbid object property begins with digits. I'm not sure how
currently handled, but it should result in error like NULL. IMO.
I'm happy with changing constant names if someone come up with a better
names.I would like to patch master sometimes next week if they are no objections.
I don't object the change, but the changes is better if it is extended.
Since OWASP starts advocating Unicode escape for all names and values in
JSON, I would like to have ability to encode all chars as \uXXXX by
default.
i.e. Escape all \r, \n, a, b, c, 0, 1, 2, etc as \uXXXX by default, disable
\uXXXX
encoding as an option.
BTW, any progress on disabling automatic float conversion against float like
values? This is mandatory, IMHO.
Regards,
--
Yasuo Ohgaki
yohgaki@ohgaki.net
Hi Yasuo,
Any invalid chars as variable/property name should be handled as invalid.
Valid variable name: '[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*'
http://php.net/manual/en/language.variables.basics.phpThis violates JSON spec, but if user would like to allow invalid names. It
should be an option rather than the default. IMO.
[yohgaki@dev ~]$ php
<?php
$o = new StdClass;
$o->{123} = 11;var_dump($o);
?>class stdClass#1 (1) {
public $123 =>
int(11)
}
[yohgaki@dev ~]$ php
<?php
$o = new StdClass;
$o->123;var_dump($o);
?>PHP Parse error: syntax error, unexpected '123' (T_LNUMBER), expecting
identifier (T_STRING) or variable (T_VARIABLE) or '{' or '$' in - on line 3
As you showed in your example, these names are not invalid, you just need
to enclose them ( $o->{"123"} ). This is a basic PHP thing and JSON parser
should not worry about users that don't know that.
Since JSON string must be UTF-8/16/32, any invalid UTF sequence
could be treated as invalid.8.1. Character Encoding
JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default
encoding is UTF-8, and JSON texts that are encoded in UTF-8 are
interoperable in the sense that they will be read successfully by the
maximum number of implementations; there are many implementations
that cannot successfully read texts in other encodings (such as
UTF-16 and UTF-32).Implementations MUST NOT add a byte order mark to the beginning of a
JSON text. In the interests of interoperability, implementations
that parse JSON texts MAY ignore the presence of a byte order mark
rather than treating it as an error.
https://tools.ietf.org/html/rfc7159#section-8.1I prefer BOM as invalid sequence and raising error/return NULL.
PHP JSON parser accepts only UTF-8 and this is already correctly validated
so I don't see any issue here either.
JSON_ERROR_UTF16 would be better defined as JSON_ERROR_UTF as
JSON accepts valid UTF sequence.
The thing is that we have already JSON_ERROR_UTF8 error that is raised when
input binary string is invalid. So the JSON_ERROR_UTF16 was meant to
distinguish these two errors. I'm happy for other ideas but not sure about
JSON_ERROR_UTF as it might be confusing with JSON_ERROR_UTF8.
It's also better to reject any invalid UTF sequence, not limited to
Unicode escaped
(\uXXXX) string. If it does not validate Unicode sequence, I would add the
validation.
The single surrogate is actually the only case when it can result in
invalid unicode string.
JSON does not forbid object property begins with digits. I'm not sure how
currently handled, but it should result in error like NULL. IMO.
As noted above: see http://3v4l.org/sJo8p
Since OWASP starts advocating Unicode escape for all names and values in
JSON, I would like to have ability to encode all chars as \uXXXX by
default.
i.e. Escape all \r, \n, a, b, c, 0, 1, 2, etc as \uXXXX by default,
disable \uXXXX
encoding as an option.
I think that we are a bit late for such change as it is a bit bigger and
also a BC break which would require RFC.
BTW, any progress on disabling automatic float conversion against float
like
values? This is mandatory, IMHO.
The RFC ( https://wiki.php.net/rfc/json_numeric_as_string ) is under
discussion:
internals@lists.php.net/msg78683.html" rel="nofollow" target="_blank">https://www.mail-archive.com/internals@lists.php.net/msg78683.html
Cheers
Jakub
Hi,
Hi,
There are two issues (reported bugs but not really bugs) in json_decode
related to \u escape.First one is
json_decode('{"\u0000": 1}');
reported in https://bugs.php.net/bug.php?id=68546That code result in fatal error due to using malformed property (private
props starting with \0). I don't think that anything parsed in json_decode
should result in a fatal error. That's why I would like to introduce a new
json error called JSON_ERROR_MANGLED_PROPERTY_NAME .
I have just created a PR for that: https://github.com/php/php-src/pull/1332
. So if any objecting (e.g. error name), then shout now before I merge it
to master...
Second one is
json_decode('"\ud834"');
which relusts non UTF string from JSON decoder. This is conformant to the
JSON RFC 7159 as noted in section 8.2:However, the ABNF in this specification allows member names and
string values to contain bit sequences that cannot encode Unicode
characters; for example, "\uDEAD" (a single unpaired UTF-16
surrogate). Instances of this have been observed, for example, when
a library truncates a UTF-16 string without checking whether the
truncation split a surrogate pair. The behavior of software that
receives JSON texts containing such values is unpredictable; for
example, implementations might return different values for the length
of a string value or even suffer fatal runtime exceptions.As the behavior is unpredictable, the current default result seems
reasonable because PHP strings are not internally unicode encode. However
there might be cases when user want to make sure that he/she gets unicode
string. In that case I would like to add an option called:
JSON_VALID_ESCAPED_UNICODE which will emit error called JSON_ERROR_UTF16
when such escape appears. I implemented this in jsond long time ago and
think that it would be useful for the json as well.
I have been thinking about this a bit more and I would like to make the
error by default and not introduce a new option for that. The RFC actually
calls that behavior unpredictable and allows raising error so it's not
against the RFC. It also makes sense because other parsers (e.g. Python 2
and 3) do the same. I can't imagine anyone relaying on \uDEAD producing
invalid unicode. I think that there are much more users that actually
expects valid unicode always produced by json_decode which is not the case
at the moment. So it really does not make sense to keep it in PHP 7. If
there are no objection, I will create a PR next week.
Cheers
Jakub
Jakub Zelenka wrote:
There are two issues (reported bugs but not really bugs) in json_decode
related to \u escape.First one is
json_decode('{"\u0000": 1}');
reported in https://bugs.php.net/bug.php?id=68546That code result in fatal error due to using malformed property (private
props starting with \0). I don't think that anything parsed in json_decode
should result in a fatal error. That's why I would like to introduce a new
json error called JSON_ERROR_MANGLED_PROPERTY_NAME .I have just created a PR for that: https://github.com/php/php-src/pull/1332
.. So if any objecting (e.g. error name), then shout now before I merge it
to master...
Have you considered JSON_ERROR_INVALID_PROPERTY_NAME
instead of
JSON_ERROR_MANGLED_PROPERTY_NAME, or for brevity JSON_ERROR_PROPERTY_NAME?
If the issue is considered to be a bug, it might be reasonable to
backport the fix to PHP 5.5 and 5.6. The json-0 patch, which is
attached to the ticket, would have to be modified according to the error
constant.
--
Christoph M. Becker
Jakub Zelenka wrote:
There are two issues (reported bugs but not really bugs) in json_decode
related to \u escape.First one is
json_decode('{"\u0000": 1}');
reported in https://bugs.php.net/bug.php?id=68546That code result in fatal error due to using malformed property (private
props starting with \0). I don't think that anything parsed in
json_decode
should result in a fatal error. That's why I would like to introduce a
new
json error called JSON_ERROR_MANGLED_PROPERTY_NAME .I have just created a PR for that:
https://github.com/php/php-src/pull/1332
.. So if any objecting (e.g. error name), then shout now before I merge
it
to master...Have you considered
JSON_ERROR_INVALID_PROPERTY_NAME
instead of
JSON_ERROR_MANGLED_PROPERTY_NAME, or for brevity JSON_ERROR_PROPERTY_NAME?
JSON_ERROR_INVALID_PROPERTY_NAME
might be better. The mangled was meant to
show the real cause of the error but it might sound a bit strange for some.
I don't really mind what the name is. So if more people is for that invalid
name, then I'm happy to change it.
If the issue is considered to be a bug, it might be reasonable to
backport the fix to PHP 5.5 and 5.6. The json-0 patch, which is
attached to the ticket, would have to be modified according to the error
constant.
It's more improvement than a bug fix. It's also a very rare case so the
backport is not worth it. I will add it to jsond that will have in a few
months stable version that is a fully drop-in replacement (including
function replacement if they exists) so if anyone really wants it, then
jsond should be the choice.
Cheers
Jakub