Newsgroups: php.internals
Date: Mon, 1 Jun 2015 09:10:51 +0900
To: Jakub Zelenka
Cc: PHP internals list
Content-Type: multipart/alternative; boundary=047d7b2e44d6032818051769ab48
Subject: Re: [PHP-DEV] JSON unicode escape issue and new constants
From: yohgaki@ohgaki.net (Yasuo Ohgaki)

--047d7b2e44d6032818051769ab48
Content-Type: text/plain; charset=UTF-8

Hi Jakub,

On Fri, May 29, 2015 at 3:53 AM, Jakub Zelenka wrote:

> There are two issues (reported bugs but not really bugs) in json_decode
> related to \u escape.
>
> First one is
> json_decode('{"\u0000": 1}');
> reported in https://bugs.php.net/bug.php?id=68546
>
> That code result in fatal error due to using malformed property (private
> props starting with \0). I don't think that anything parsed in json_decode
> should result in a fatal error. That's why I would like to introduce a new
> json error called JSON_ERROR_MANGLED_PROPERTY_NAME .
>

Any character that is not valid in a variable/property name should be
treated as invalid.

Valid variable name:
'[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*'
http://php.net/manual/en/language.variables.basics.php

Rejecting such names violates the JSON spec, but if users would like to
allow invalid names, that should be an option rather than the default, IMO.

[yohgaki@dev ~]$ php
<?php
$o = new stdClass;
$o->{123} = 11;
var_dump($o);
?>
class stdClass#1 (1) {
  public $123 =>
  int(11)
}

[yohgaki@dev ~]$ php
<?php
$o = new stdClass;
$o->123;
var_dump($o);
?>
PHP Parse error: syntax error, unexpected '123' (T_LNUMBER), expecting identifier (T_STRING) or variable (T_VARIABLE) or '{' or '$' in - on line 3

Since JSON text must be encoded in UTF-8/16/32, any invalid UTF sequence
could be treated as an error.

8.1. Character Encoding

   JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default
   encoding is UTF-8, and JSON texts that are encoded in UTF-8 are
   interoperable in the sense that they will be read successfully by the
   maximum number of implementations; there are many implementations
   that cannot successfully read texts in other encodings (such as
   UTF-16 and UTF-32).

   Implementations MUST NOT add a byte order mark to the beginning of a
   JSON text.
   In the interests of interoperability, implementations that parse JSON
   texts MAY ignore the presence of a byte order mark rather than treating
   it as an error.

https://tools.ietf.org/html/rfc7159#section-8.1

I would prefer to treat a BOM as an invalid sequence and raise an error
(return NULL).

>
>
> Second one is
> json_decode('"\ud834"');
> which results in a non UTF string from JSON decoder. This is conformant to the
> JSON RFC 7159 as noted in section 8.2:
>
> However, the ABNF in this specification allows member names and
> string values to contain bit sequences that cannot encode Unicode
> characters; for example, "\uDEAD" (a single unpaired UTF-16
> surrogate). Instances of this have been observed, for example, when
> a library truncates a UTF-16 string without checking whether the
> truncation split a surrogate pair. The behavior of software that
> receives JSON texts containing such values is unpredictable; for
> example, implementations might return different values for the length
> of a string value or even suffer fatal runtime exceptions.
>
>
> As the behavior is unpredictable, the current default result seems
> reasonable because PHP strings are not internally unicode encode. However
> there might be cases when user want to make sure that he/she gets unicode
> string. In that case I would like to add an option called:
> JSON_VALID_ESCAPED_UNICODE which will emit error called JSON_ERROR_UTF16
>

JSON_ERROR_UTF16 would be better named JSON_ERROR_UTF, since JSON accepts
any valid UTF sequence. It would also be better to reject every invalid UTF
sequence, not only those in Unicode-escaped (\uXXXX) strings. If the decoder
does not currently validate Unicode sequences, I would add that validation.

> when such escape appears. I implemented this in jsond long time ago and
> think that it would be useful for the json as well.
>
> Thoughts?
>

JSON does not forbid object property names that begin with digits. I'm not
sure how these are currently handled, but IMO they should result in an
error, such as returning NULL.
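
As a rough illustration of the two checks discussed above (not the actual
json extension code), the userland sketch below tests a name against the
identifier pattern from the manual and tests a decoded string for
well-formed UTF-8. The helper names is_valid_php_name() and is_valid_utf8()
are made up for this example, and it assumes PHP 7 or later for the scalar
type declarations.

```php
<?php
// Hypothetical helper: true when $name matches the identifier pattern
// quoted from the PHP manual. The D modifier stops "$" from matching
// before a trailing newline.
function is_valid_php_name(string $name): bool
{
    return preg_match('/^[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*$/D', $name) === 1;
}

// Hypothetical helper: true when $s is well-formed UTF-8. With the "u"
// modifier PCRE validates the subject first, so preg_match() returns
// false (not 1) for malformed input, including encoded surrogates.
function is_valid_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

var_dump(is_valid_php_name('abc'));      // ordinary identifier
var_dump(is_valid_php_name('123'));      // starts with a digit
var_dump(is_valid_php_name("\0name"));   // starts with a NUL byte
var_dump(is_valid_utf8("\xED\xA0\xB4")); // U+D834 written as three bytes
```

Here '123' and a name starting with \0 fail the name check, and the bytes
"\xED\xA0\xB4" (the lone surrogate U+D834 written out as if it were UTF-8)
fail the UTF-8 check. Whether json_decode should apply such checks by
default or only as an option is the open question in this thread.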
>
> I'm happy with changing constant names if someone come up with a better
> names.
>
> I would like to patch master sometimes next week if they are no objections.
>

I don't object to the change, but it would be better if it were extended.
Since OWASP has started advocating Unicode escapes for all names and values
in JSON, I would like to have the ability to encode all characters as \uXXXX
by default, i.e. escape all of \r, \n, a, b, c, 0, 1, 2, etc. as \uXXXX by
default, with an option to disable the \uXXXX encoding.

BTW, any progress on disabling automatic float conversion of float-like
values? This is mandatory, IMHO.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

--047d7b2e44d6032818051769ab48--