Hi,
I'm proposing a small change in the behavior of json_encode(str, JSON_UNESCAPED_UNICODE)
around the issue of line terminators.
The U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR
characters are allowed unescaped in JSON strings, but not allowed unescaped
in Javascript. This is widely considered a minor wart in the JSON specification.
https://medium.com/joys-of-javascript/json-js-42a28471221d
As a result, the JSON_UNESCAPED_UNICODE flag is dangerous to use when
generating HTML. For example, this will generate a Javascript error ("Unexpected
token ILLEGAL") in the user's browser:
$x = mb_convert_encoding('&#x2028;', 'UTF-8', 'HTML-ENTITIES');
echo '<script>x = ', json_encode($x, JSON_UNESCAPED_UNICODE), ';</script>';
The proposal is for json_encode(..., JSON_UNESCAPED_UNICODE) to
escape the U+2028 and U+2029 characters as \u2028 and \u2029. A new flag,
JSON_UNESCAPED_LINE_TERMINATORS, preserves the former behavior.
It's important to note that this change only affects the non-default
JSON_UNESCAPED_UNICODE flag.
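As a rough illustration of the proposed output, here is a minimal userland sketch. The helper name json_encode_js_safe is hypothetical and not part of the patch; it only post-processes the result of the existing json_encode(), on the assumption that U+2028 and U+2029 can appear in that output only inside string literals.

```php
<?php
// Hypothetical helper (not part of the patch): approximates the proposed
// behavior on top of the existing json_encode().
function json_encode_js_safe($value, $flags = 0)
{
    $json = json_encode($value, $flags | JSON_UNESCAPED_UNICODE);
    if ($json === false) {
        return false; // encoding failed; see json_last_error()
    }
    // UTF-8 byte sequences for U+2028 and U+2029. In json_encode() output
    // these can only occur inside string literals, so replacing them with
    // their \u escapes keeps the JSON valid.
    return str_replace(
        array("\xE2\x80\xA8", "\xE2\x80\xA9"),
        array('\u2028', '\u2029'),
        $json
    );
}

$x = "line\xE2\x80\xA8separator";
echo '<script>x = ', json_encode_js_safe($x), ';</script>', "\n";
```

Other non-ASCII characters are still emitted unescaped, so the output stays compact while remaining safe to embed in a script element.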
Jakub Zelenka approves of this change, which we've discussed on GitHub
(https://github.com/php/php-src/pull/1701), but since it is a small change in
behavior, he asked me to email internals in case anyone objects.
Thanks all,
Eddie Kohler
Hi Eddie,
Eddie Kohler wrote:
> The U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR
> characters are allowed unescaped in JSON strings, but not allowed unescaped
> in Javascript. This is widely considered a minor wart in the JSON specification.
> https://medium.com/joys-of-javascript/json-js-42a28471221d
> As a result, the JSON_UNESCAPED_UNICODE flag is dangerous to use when
> generating HTML. For example, this will generate a Javascript error ("Unexpected
> token ILLEGAL") in the user's browser:
> $x = mb_convert_encoding('&#x2028;', 'UTF-8', 'HTML-ENTITIES');
> echo '<script>x = ', json_encode($x, JSON_UNESCAPED_UNICODE), ';</script>';
> The proposal is for json_encode(..., JSON_UNESCAPED_UNICODE) to
> escape the U+2028 and U+2029 characters as \u2028 and \u2029. A new flag,
> JSON_UNESCAPED_LINE_TERMINATORS, preserves the former behavior.
> It's important to note that this change only affects the non-default
> JSON_UNESCAPED_UNICODE flag.
This sounds reasonable. I'd like to ask, though, does this mean that
without that flag, U+2028 and U+2029 are always escaped?
Thanks.
Andrea Faulds
https://ajf.me/
Yes, without the JSON_UNESCAPED_UNICODE flag, all characters with
Unicode values >= 0x80 are escaped. That's the default behavior.
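A short sketch of that default, for reference. The variable names are only illustrative; the input strings are written as UTF-8 byte escapes so the snippet does not depend on the file's encoding.

```php
<?php
// Default flags: every non-ASCII character is written as a \u escape,
// so U+2028 never reaches the output unescaped.
$default = json_encode("\xC3\xA9\xE2\x80\xA8"); // U+00E9, then U+2028
echo $default, "\n"; // prints "\u00e9\u2028" (including the quotes)

// With JSON_UNESCAPED_UNICODE, U+00E9 is emitted as raw UTF-8 instead.
$unescaped = json_encode("\xC3\xA9", JSON_UNESCAPED_UNICODE);
echo $unescaped, "\n";
```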