Strings, invalid escape sequences and parse errors

10 years ago by Peter Cowburn

Happy Friday, internals!

Prior to PHP 7, any "invalid" escape sequences within strings (as far as I
can see) were ignored and the characters treated literally. For example:
"\xGG" ("broken" hex sequence) gives "\xGG", "\99" ("broken" octal
sequence) gives "\99", "\m" (not a recognised sequence at all) gives "\m"
and so on.
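
The three cases above can be checked with a short sketch. It assumes nothing beyond a stock PHP interpreter; the `$samples` array and its labels are made up for illustration. Each double-quoted string is compared with its single-quoted twin, which never interprets these sequences.

```php
<?php
// Each pair holds a double-quoted string and the same characters in single quotes.
// Single quotes leave backslashes alone (apart from \\ and \'), so equality means
// the double-quoted form was kept literally as well.
$samples = [
    'broken hex'   => ["\xGG", '\xGG'],
    'broken octal' => ["\99",  '\99'],
    'unrecognised' => ["\m",   '\m'],
];

foreach ($samples as $label => list($double, $single)) {
    printf("%-13s %-5s kept literally: %s\n", $label, $single, var_export($double === $single, true));
}
```

All three lines should report `true`, with no notice or warning of any kind.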

PHP 7 introduced a new escape sequence for unicode codepoints "\u{...}".
This deliberately breaks away from the pack and raises a Parse Error when
an escape sequence starting with "\u{" is not followed by the required
characters to make it a "valid" escape sequence (i.e. 1 to 6 hex characters
followed by a curly brace).
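
A rough way to see the difference, assuming PHP 7.0 or later: compile small snippets with `eval()` and catch the `ParseError`. The `parses()` helper is hypothetical and only exists for this sketch.

```php
<?php
// Hypothetical helper: compiles a snippet with eval() and reports whether it parsed.
function parses($code)
{
    try {
        eval($code);
        return true;
    } catch (ParseError $e) {
        return false;
    }
}

// The outer strings are single-quoted so the backslashes reach eval() intact.
$valid    = parses('$s = "\u{394}";'); // well-formed codepoint escape
$noDigits = parses('$s = "\u{}";');    // braces with nothing inside
$notHex   = parses('$s = "\u{xyz}";'); // non-hex characters inside the braces
$legacy   = parses('$s = "\xGG";');    // old-style "broken" hex escape

var_dump($valid, $noDigits, $notHex, $legacy);

// A valid \u{...} escape expands to the UTF-8 bytes of the codepoint.
var_dump("\u{394}" === "\xCE\x94");
```

Only the two malformed `\u{` snippets should fail to parse; the broken `\x` sequence compiles without complaint.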

Why does \u{} behave differently from every other escape sequence? Because
the author prefers it that way, and indeed thinks all "invalid" escape
sequences should result in the same error. [pers. comm.]

The question I'd like to bring forward is: can we either:

a) change all other "invalid" escape sequences to be a parse error [that
would mean "\m" would raise a parse error!]

b) change \u{} to behave like any other escape sequence, by not raising a
parse error and instead keeping the literal characters

or c) tell me to keep quiet and accept the oddball behaviour, having quirks
is The PHP Way after all.

Either way, I'd like to see some resolution to this sooner rather than
later as we're very late in the PHP 7.0.0 game.

Cheers, and enjoy your weekends,

Peter

10 years ago by Bishop Bettini — view source — reply

On Fri, Oct 2, 2015 at 4:18 AM, Peter Cowburn <petercowburn@gmail.com>
wrote:

a) change all other "invalid" escape sequences to be a parse error [that
would mean "\m" would raise a parse error!]

b) change \u{} to behave like any other escape sequence, by not raising a
parse error and instead keeping the literal characters

or c) tell me to keep quiet and accept the oddball behaviour, having quirks
is The PHP Way after all.

Well, I think option (a) would break parsed strings containing regex:

$subject = "there are words here";
$pattern = "/\w+/"; // problem
$rc = preg_match_all($pattern, $subject);
echo "Matches: \x1b[7m$rc\x1b[0m\n"; // not a problem

Option (b) sounds reasonable, but there's probably A Solid Reason it was
implemented that way, which if so leaves (c.ii): accepting the odd-ball
behavior....

10 years ago by Sara Golemon — view source — reply

On Fri, Oct 2, 2015 at 4:18 AM, Peter Cowburn <petercowburn@gmail.com>
wrote:

a) change all other "invalid" escape sequences to be a parse error [that
would mean "\m" would raise a parse error!]

b) change \u{} to behave like any other escape sequence, by not raising a
parse error and instead keeping the literal characters

or c) tell me to keep quiet and accept the oddball behaviour, having quirks
is The PHP Way after all.

Well, I think option (a) would break parsed strings containing regex:

Oh holy hell. I was about to point towards A because I agree with
Andrea that our invalid escape handling makes no sense, then you throw
this wrench in the gears.

While I still think that ignoring invalid sequences is bad and a
recipe for disaster (for example, in a given regex string, you have
some "escapes" passed to the engine as-is, while others like
\t\v\f\r\n do get interpolated, which is so inconsistent and entirely
php it's practically its own meme), I have to be practical about the
fact that there is a TON of existing regex out there (and no small
amount of "\u1234" sequences in JSON blobs). A ton of that existing
regex is also needlessly using double-quoted strings where
single-quotes would have worked, meaning we can't just bifurcate on
that (even though allowing invalid sequences through on single-quotes
makes some sense).
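
To make the mixed handling concrete, here is a small sketch. The pattern, subject and JSON strings are invented for illustration; the behaviour shown is that of double-quoted strings in PHP 7.

```php
<?php
// PHP rewrites \t before PCRE ever sees the pattern; \w and \d reach PCRE untouched.
$pattern = "/\w+\t\d+/";
var_dump($pattern === '/\w+' . chr(9) . '\d+/');  // bool(true)
var_dump(preg_match($pattern, "width\t42"));       // int(1)

// "\u" without an opening brace is left alone, so JSON escapes pasted into
// double-quoted PHP strings survive and are decoded later by json_decode().
$json = "\"caf\u00e9\"";
var_dump($json === '"caf\u00e9"');                 // bool(true)
var_dump(json_decode($json) === "caf\xC3\xA9");    // bool(true)
```

Writing the pattern in single quotes would hand all three escapes to PCRE unchanged, which is why so much existing double-quoted regex only works because unknown escapes are passed through.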

Ugh... No, that's too big of a change to existing scripts. Can't do
option A, much as I'd like.

Option (b) sounds reasonable, but there's probably A Solid Reason it was
implemented that way

AIUI, the "solid reason" was because it's dangerous to fail silently
where you have high confidence that something is wrong. Again, I
believe in it, but the arguments against option A illustrate why it
might not be practical. I hate to say this, but in the interest of
consistency (were 7.0 not in its final stage) I'd vote for B.

which if so leaves (c.ii): accepting the odd-ball behavior....

Given that 7.0 is in its final stage, changing this behaviour is
probably a non-starter at this point. C seems the most sane^W
pragmatic. It's not the first inconsistency PHP's picked up, it won't
be the last.

-Sara

10 years ago by Tom Worster — view source — reply

On Fri, Oct 2, 2015 at 4:18 AM, Peter Cowburn <petercowburn@gmail.com>
wrote:

a) change all other "invalid" escape sequences to be a parse error [that
would mean "\m" would raise a parse error!]

b) change \u{} to behave like any other escape sequence, by not raising a
parse error and instead keeping the literal characters

or c) tell me to keep quiet and accept the oddball behaviour, having quirks
is The PHP Way after all.

Well, I think option (a) would break parsed strings containing regex:

Oh holy hell. I was about to point towards A because I agree with
Andrea that our invalid escape handling makes no sense, then you throw
this wrench in the gears.

While I still think that ignoring invalid sequences is bad and a
recipe for disaster (for example, in a given regex string, you have
some "escapes" passed to the engine as-is, while others like
\t\v\f\r\n do get interpolated, which is so inconsistent and entirely
php it's practically its own meme), I have to be practical about the
fact that there is a TON of existing regex out there (and no small
amount of "\u1234" sequences in JSON blobs). A ton of that existing
regex is also needlessly using double-quoted strings where
single-quotes would have worked, meaning we can't just bifurcate on
that (even though allowing invalid sequences through on single-quotes
makes some sense).

Ugh... No, that's too big of a change to existing scripts. Can't do
option A, much as I'd like.

Option (b) sounds reasonable, but there's probably A Solid Reason it was
implemented that way

AIUI, the "solid reason" was because it's dangerous to fail silently
where you have high confidence that something is wrong. Again, I
believe in it, but the arguments against option A illustrate why it
might not be practical. I hate to say this, but in the interest of
consistency (were 7.0 not in its final stage) I'd vote for B.

which if so leaves (c.ii): accepting the odd-ball behavior....

Given that 7.0 is in its final stage, changing this behaviour is
probably a non-starter at this point. C seems the most sane^W
pragmatic. It's not the first inconsistency PHP's picked up, it won't
be the last.

I agree with Sara all the way except the opinion that it's too late to
fix this bug with option B, which I think is the right one.

I simply don't know if it is too late or not so I suggest Peter enter a
bug report and see what happens. If it's too late for 7.0.0 do it in
.0.1, which is ok because people will expect instability with 7.0.0.
\u{394}semver > 1 is sufficient warning, I think.

Tom

10 years ago by Andrea Faulds — view source — reply

Hi,

Tom Worster wrote:

I agree with Sara all the way except the opinion that it's too late to
fix this bug with option B, which I think is the right one.

I simply don't know if it is too late or not so I suggest Peter enter a
bug report and see what happens. If it's too late for 7.0.0 do it in
.0.1, which is ok because people will expect instability with 7.0.0.
\u{394}semver > 1 is sufficient warning, I think.

The RFC was passed with 92% voting in favour, and that RFC INCLUDED the
error behaviour explicitly.

If people really want to butcher it under the guise of "consistency" and
cause future pain when the Unicode consortium does something weird, we
should have yet another discussion and voting period. We shouldn't just
"fix" it quickly because three people on the mailing list think so.

Thanks.

--
Andrea Faulds
http://ajf.me/

10 years ago by Andrea Faulds — view source — reply

Hey Sara,

Sara Golemon wrote:

Option (b) sounds reasonable, but there's probably A Solid Reason it was
implemented that way

AIUI, the "solid reason" was because it's dangerous to fail silently
where you have high confidence that something is wrong. Again, I
believe in it, but the arguments against option A illustrate why it
might not be practical. I hate to say this, but in the interest of
consistency (were 7.0 not in its final stage) I'd vote for B.

I made \u work the way it does because I don't like repeating the
mistakes of the past. It deliberately uses {} rather than a
variable-length sequence of hex digits. Why? Because this way it's
always completely clear where the boundary is between it and any
following characters, unlike the mess which is octal and hex escapes. I
also made it an error because, well, why shouldn't it be? People don't
like it when something fails and you don't tell them. We don't even
produce an E_NOTICE or E_STRICT or something for invalid escapes, we
just pretend it's not an escape sequence. That's awful: it means code
made for a PHP version which supports these escapes will silently break
on other PHP versions. In \u's case, it would mean that if the Unicode
standard extended or shrunk the range (U+0000–U+10FFFF) in future, we
wouldn't be able to change \u's limits to match, because it would
silently break PHP code.
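
The boundary problem can be sketched as follows, assuming PHP 7.0 or later. The variable names are illustrative only. The goal in the first half is a line feed (U+000A) immediately followed by the letter "B".

```php
<?php
// \x consumes up to two hex digits, so the "B" is swallowed into the escape.
$hex     = "\xAB";    // a single byte, 0xAB
// The braces end the escape explicitly, so the "B" stays a separate character.
$unicode = "\u{A}B";  // line feed, then "B"

var_dump(strlen($hex));             // int(1)
var_dump($unicode === "\n" . 'B');  // bool(true)

// Octal escapes consume up to three digits; a fourth digit is literal text.
$octal  = "\1011";    // \101 is "A", followed by a literal "1"
$braced = "\u{41}1";  // U+0041 is "A", followed by a literal "1"
var_dump($octal === 'A1', $braced === 'A1');
```

With `\x` and octal the reader has to know the maximum digit count to see where the escape stops; with `\u{...}` the closing brace says so directly.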

I'd love it if we could also make a slash followed by an unrecognised
character always be an error, but alas, sounds like we can't do that. So
for now, we can only make new escape sequences behave sensibly.

\u is deliberately inconsistent. You can go replace it with \uXXXX if
you really want to, but PHP's users will not thank you.

Thanks.

--
Andrea Faulds
http://ajf.me/