Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:86464
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Received-SPF: pass (pb1.pair.com: domain gmail.com designates 209.85.218.42 as permitted sender)
MIME-Version: 1.0
Sender: yohgaki@gmail.com
In-Reply-To: <CAEKnhAGBK5KYhd2a53d8qx=UjLHxAK9=Hns3438Pd0aJ6PnmYg@mail.gmail.com>
References: <CAEKnhAGBK5KYhd2a53d8qx=UjLHxAK9=Hns3438Pd0aJ6PnmYg@mail.gmail.com>
Date: Mon, 1 Jun 2015 09:10:51 +0900
Message-ID: <CAGa2bXb5EGyjr1cOYsCww7_NfvuSQk6fiH65+hDeDqKeWuRfrA@mail.gmail.com>
To: Jakub Zelenka <bukka@php.net>
Cc: PHP internals list <internals@lists.php.net>
Content-Type: multipart/alternative; boundary=047d7b2e44d6032818051769ab48
Subject: Re: [PHP-DEV] JSON unicode escape issue and new constants
From: yohgaki@ohgaki.net (Yasuo Ohgaki)

--047d7b2e44d6032818051769ab48
Content-Type: text/plain; charset=UTF-8

Hi Jakub,

On Fri, May 29, 2015 at 3:53 AM, Jakub Zelenka <bukka@php.net> wrote:

> There are two issues (reported bugs but not really bugs) in json_decode
> related to \u escape.
>
> First one is
> json_decode('{"\u0000": 1}');
> reported in https://bugs.php.net/bug.php?id=68546
>
> That code result in fatal error due to using malformed property (private
> props starting with \0). I don't think that anything parsed in json_decode
> should result in a fatal error. That's why I would like to introduce a new
> json error called JSON_ERROR_MANGLED_PROPERTY_NAME .
>

Any invalid chars as variable/property name should be handled as invalid.

Valid variable name:  '[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*'
http://php.net/manual/en/language.variables.basics.php

This violates JSON spec, but if user would like to allow invalid names. It
should be an option rather than the default. IMO.

[yohgaki@dev ~]$ php
<?php
$o = new StdClass;
$o->{123} = 11;

var_dump($o);
?>

class stdClass#1 (1) {
  public $123 =>
  int(11)
}
[yohgaki@dev ~]$ php
<?php
$o = new StdClass;
$o->123;

var_dump($o);
?>

PHP Parse error:  syntax error, unexpected '123' (T_LNUMBER), expecting
identifier (T_STRING) or variable (T_VARIABLE) or '{' or '$' in - on line 3


Since JSON string must be UTF-8/16/32, any invalid UTF sequence
could be treated as invalid.

8.1.  Character Encoding

   JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32.  The default
   encoding is UTF-8, and JSON texts that are encoded in UTF-8 are
   interoperable in the sense that they will be read successfully by the
   maximum number of implementations; there are many implementations
   that cannot successfully read texts in other encodings (such as
   UTF-16 and UTF-32).

   Implementations MUST NOT add a byte order mark to the beginning of a
   JSON text.  In the interests of interoperability, implementations
   that parse JSON texts MAY ignore the presence of a byte order mark
   rather than treating it as an error.
https://tools.ietf.org/html/rfc7159#section-8.1

I prefer BOM as invalid sequence and raising error/return NULL.



>
>
> Second one is
> json_decode('"\ud834"');
> which relusts non UTF string from JSON decoder. This is conformant to the
> JSON RFC 7159 as noted in section 8.2:
>
>    However, the ABNF in this specification allows member names and
>    string values to contain bit sequences that cannot encode Unicode
>    characters; for example, "\uDEAD" (a single unpaired UTF-16
>    surrogate).  Instances of this have been observed, for example, when
>    a library truncates a UTF-16 string without checking whether the
>    truncation split a surrogate pair.  The behavior of software that
>    receives JSON texts containing such values is unpredictable; for
>    example, implementations might return different values for the length
>    of a string value or even suffer fatal runtime exceptions.
>
>
> As the behavior is unpredictable, the current default result  seems
> reasonable because PHP strings are not internally unicode encode. However
> there might be cases when user want to make sure that he/she gets unicode
> string. In that case I would like to add an option called:
> JSON_VALID_ESCAPED_UNICODE which will emit error called JSON_ERROR_UTF16
>

JSON_ERROR_UTF16 would be better defined as JSON_ERROR_UTF as
JSON accepts valid UTF sequence.

It's also better to reject any invalid UTF sequence, not limited to Unicode
escaped
(\uXXXX) string. If it does not validate Unicode sequence, I would add the
validation.


> when such escape appears. I implemented this in jsond long time ago and
> think that it would be useful for the json as well.
>
> Thoughts?
>

JSON does not forbid object property begins with digits. I'm not sure how
currently handled, but it should result in error like NULL. IMO.


>
> I'm happy with changing constant names if someone come up with a better
> names.
>
> I would like to patch master sometimes next week if they are no objections.
>

I don't object the change, but the changes is better if it is extended.

Since OWASP starts advocating Unicode escape for all names and values in
JSON, I would like to have ability to encode all chars as \uXXXX by
default.
i.e. Escape all \r, \n, a, b, c, 0, 1, 2, etc as \uXXXX by default, disable
\uXXXX
encoding as an option.

BTW, any progress on disabling automatic float conversion against float like
values? This is mandatory, IMHO.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

--047d7b2e44d6032818051769ab48--