Improving output of syntax errors

5 years ago by Rowan Tommins — view source

unread

Hi all,

During the discussion of "T_PAAMAYIM_NEKUDOTAYIM", there was broad
agreement that internal token names should not be included in errors
shown to users. I now have an implementation of this.

Note that we currently only customise the token descriptions placed into
Bison's hard-coded template, so the format remains "syntax error,
unexpected %s, expecting %s or %s or %s".

I have distinguished between two types of token:

Tokens which always represent the same text (e.g. keywords and
operators) are represented by their standard form, e.g. 'unexpected
token "static", expecting "function" ...'
Tokens which have variable content are given a user-friendly name,
shown as well as the actual text when possible, e.g. 'unexpected
identifier "foo", expecting quoted string ...'

As a special-case, quoted strings show the string's content in double
quotes, e.g. 'unexpected quoted string "foo" ...' rather than
'unexpected quoted string ""foo"" ...' or 'unexpected quoted string
"'foo'" ...'.

A "..." is also included if the text is longer than 30 bytes (where
previously it would have been silently truncated).

For example, given the following:

<<<<<<<<<<<<

The current 8.0 alpha will show this:

Parse error: syntax error, unexpected '<<' (T_SL), expecting identifier
(T_STRING) or static (T_STATIC) or namespace (T_NAMESPACE) or \
(T_NS_SEPARATOR) in filename.php on line N

The proposed patch will instead show this:

Parse error: syntax error, unexpected token "<<", expecting identifier
or "static" or "namespace" or "" in filename.php on line N

For more examples, see:
https://rwec.co.uk/x/php-parse-errors/comparison.html

The patch can be reviewed at: https://github.com/php/php-src/pull/5722

I am happy to post a small RFC if people think this requires a vote.

Any other feedback is welcome.

(As an aside, the other commonly requested change was to include column
numbers; this appears to be feasible, but definitely more complex, and
with potential performance trade-offs. I hope to re-visit this later.)

Regards,

--
Rowan Tommins (né Collins)
[IMSoP]

5 years ago by Stanislav Malyshev — view source

unread

Hi!

For more examples, see:
https://rwec.co.uk/x/php-parse-errors/comparison.html

The patch can be reviewed at: https://github.com/php/php-src/pull/5722

I think this is great, thanks for implementing it!

I am happy to post a small RFC if people think this requires a vote.

Any other feedback is welcome.

I personally don't think it requires a vote, as error messages are an
implementation detail and nobody should ever rely on it. Let's wait for
a bit though if somebody argues that it needs an RFC, but otherwise I
think we can merge it.

(As an aside, the other commonly requested change was to include column
numbers; this appears to be feasible, but definitely more complex, and
with potential performance trade-offs. I hope to re-visit this later.)

Column numbers would be awesome too.

--
Stas Malyshev
smalyshev@gmail.com

5 years ago by Joe Watkins — view source

unread

Nicely done.

No need for an RFC.

Thanks.

Hi!

For more examples, see:
https://rwec.co.uk/x/php-parse-errors/comparison.html

The patch can be reviewed at: https://github.com/php/php-src/pull/5722

I think this is great, thanks for implementing it!

I am happy to post a small RFC if people think this requires a vote.

Any other feedback is welcome.

I personally don't think it requires a vote, as error messages are an
implementation detail and nobody should ever rely on it. Let's wait for
a bit though if somebody argues that it needs an RFC, but otherwise I
think we can merge it.

(As an aside, the other commonly requested change was to include column
numbers; this appears to be feasible, but definitely more complex, and
with potential performance trade-offs. I hope to re-visit this later.)

Column numbers would be awesome too.

--
Stas Malyshev
smalyshev@gmail.com

--

To unsubscribe, visit: https://www.php.net/unsub.php

5 years ago by Claude Pache — view source

unread

As a special-case, quoted strings show the string's content in double quotes, e.g. 'unexpected quoted string "foo" ...' rather than 'unexpected quoted string ""foo"" ...' or 'unexpected quoted string "'foo'" ...'.

For me, in this case, you have dropped the wrong pair of quotes. That is, I prefer to see the quotes used in the source text (“unexpected string 'foo' ...” or “unexpected string "foo" ...”), over the quotes that make error messages consistent with other error messages. This is because I like to have hints that help me to visually identify the offending string in the source text when it is of length 0 or 1, but on other hand I don’t care about consistency across error messages.

—Claude

5 years ago by Rowan Tommins — view source

unread

Hi Claude,

As a special-case, quoted strings show the string's content in double
quotes, e.g. 'unexpected quoted string "foo" ...' rather than 'unexpected
quoted string ""foo"" ...' or 'unexpected quoted string "'foo'" ...'.

For me, in this case, you have dropped the wrong pair of quotes. That is,
I prefer to see the quotes used in the source text (“unexpected string
'foo' ...” or “unexpected string "foo" ...”), over the quotes that make
error messages consistent with other error messages. This is because I like
to have hints that help me to visually identify the offending string in the
source text when it is of length 0 or 1, but on other hand I don’t care
about consistency across error messages.

I did actually have strings matching the quote marks used in an earlier
version of the patch, but dropped it partly to simplify the implementation

because it sometimes truncates strings, it can't just use the real
quotes, but has to strip and re-add them.

The consistency should aid anyone trying to parse the error message, since
the pattern /unexpected ([^"]+) "(.*?)"/ will always give you a token name
(even if just 'token') and its content.

I can re-visit that part if you feel strongly, though.

Regards,

Rowan Tommins
[IMSoP]

5 years ago by Claude Pache — view source

unread

Le 29 juin 2020 à 10:29, Rowan Tommins rowan.collins@gmail.com a écrit :

I did actually have strings matching the quote marks used in an earlier
version of the patch, but dropped it partly to simplify the implementation

because it sometimes truncates strings, it can't just use the real
quotes, but has to strip and re-add them.

The consistency should aid anyone trying to parse the error message, since
the pattern /unexpected ([^"]+) "(.*?)"/ will always give you a token name
(even if just 'token') and its content.

I can re-visit that part if you feel strongly, though.

At this point, I don’t feel strongly, because currently, when I see
“unexpected '''' (T_CONSTANT_ENCAPSED_STRING)”,
my automatism is not looking for a string with specific quote style (which is not evident by looking at four consecutive quotes), but for a missing or misplaced punctuation mark, because, by experience, it is most of the time the root cause of an error message containing that specific token name.

Another way to give the programmer the appropriate information, is replacing “quoted string” with “single-quoted string” and “double-quoted string”.

And while I am thinking about it: a case that tends to cause divergent squint is: unexpected '"', which you changed into: unexpected token """. In that specific case, it would be nice if it the common name of the offending token was spelt out, something like: unexpected double quotation mark.

—Claude

5 years ago by Rowan Tommins — view source

unread

At this point, I don’t feel strongly, because currently, when I see
“unexpected '''' (T_CONSTANT_ENCAPSED_STRING)”,
my automatism is not looking for a string with specific quote style (which
is not evident by looking at four consecutive quotes), but for a missing or
misplaced punctuation mark, because, by experience, it is most of the time
the root cause of an error message containing that specific token name.

Yes, there's a bunch of cases where what the parser sees and expects isn't
really that helpful, and what's parsed as a zero-length string may not
actually look that way in the source. There's probably diminishing returns
trying to handle that at the message level, rather than extending the
parser itself as has been done for mis-matched brackets.

Another way to give the programmer the appropriate information, is
replacing “quoted string” with “single-quoted string” and “double-quoted
string”.

That would actually make a lot of sense. I'll have a look if it can be done
without changing the actual parser definitions (which recognise both forms
as the same token type).

And while I am thinking about it: a case that tends to cause divergent
squint is: unexpected '"', which you changed into: unexpected token """. In
that specific case, it would be nice if it the common name of the offending
token was spelt out, something like: unexpected double quotation mark.

That would be a reasonable special-case; theoretically, the same would
apply for a lone single-quote, although I think the parser happens never to
see that as a separate token.

Thanks for the feedback,

Rowan Tommins
[IMSoP]

5 years ago by Rowan Tommins — view source

unread

Another way to give the programmer the appropriate information, is
replacing “quoted string” with “single-quoted string” and
“double-quoted string”.

And while I am thinking about it: a case that tends to cause divergent
squint is: unexpected '"', which you changed into: unexpected token """

I've pushed up a new version of the patch with three new special cases:

unexpected token """ -> unexpected double-quote mark
unexpected quoted string "foo" -> unexpected single-quoted string
"foo" / unexpected double-quoted string "foo"
unexpected illegal character "_" -> unexpected character 0xNN (where _
is almost certainly a control character, and NN is instead the
hexadecimal value of the byte)

I've updated the example table with a new column showing those
additions: https://rwec.co.uk/x/php-parse-errors/comparison.html

If there's no other comments or objections, I would appreciate if
someone could merge this in so it can be tested in the next alpha :)

Regards,

--
Rowan Tommins (né Collins)
[IMSoP]