[RFC][DISCUSSION] Context Sensitive lexer

10 years ago by Marcio Almada

Hi internals,

I'd like to put the "Context Sensitive Lexer" RFC into discussion phase:

RFC: https://wiki.php.net/rfc/context_sensitive_lexer
TL;DR commit: https://github.com/marcioAlmada/php-src/commit/c01014f9
PR: https://github.com/php/php-src/pull/1054

PHP currently has ~64 globally reserved words. Not infrequently, these
reserved words end up clashing with legit alternatives to userland API
declarations. This RFC proposes minimal changes to have a context sensitive
lexer with support for semi-reserved words on PHP7 without causing
maintenance issues.

This could be especially useful to:

Reduce the surface of BC breaks whenever new keywords are introduced

Avoid restricting userland APIs, dispensing with the need for hacks like
unnecessary magic method calls or prefixed identifiers.

The patch is 98% finished and the entire test suite is passing. I'm still
adding more tests to it, but the hard part is done. So it's time to discuss!

Sincerely,
Márcio Almada

10 years ago by Nikita Popov

On Fri, Feb 20, 2015 at 8:29 AM, Marcio Almada <marcio.web2@gmail.com> wrote:

Hi internals,

I'd like to put the "Context Sensitive Lexer" RFC into discussion phase:

RFC: https://wiki.php.net/rfc/context_sensitive_lexer
TL;DR commit: https://github.com/marcioAlmada/php-src/commit/c01014f9
PR: https://github.com/php/php-src/pull/1054

PHP currently has ~64 globally reserved words. Not infrequently, these
reserved words end up clashing with legit alternatives to userland API
declarations. This RFC proposes minimal changes to have a context sensitive
lexer with support for semi-reserved words on PHP7 without causing
maintenance issues.

This could be especially useful to:

Reduce the surface of BC breaks whenever new keywords are introduced

Avoid restricting userland APIs, dispensing with the need for hacks like
unnecessary magic method calls or prefixed identifiers.

The patch is 98% finished and the entire test suite is passing. I'm still
adding more tests to it, but the hard part is done. So it's time to discuss!

Sincerely,
Márcio Almada

I think we all agree that it would be nice to not be so strict about
reserved keywords in some places. As such this RFC hinges on questions of
implementation.

The RFC uses a purely lexer-based approach, which is nice in principle,
because ext/tokenizer benefits from it as well.

The disadvantage of doing this in the lexer and in the scope that you're
proposing (i.e. including class names) is that it requires reimplementing
quite a number of parser rules via lookahead in the lexer. This means that
a) the implementation depends on a complete understanding of the PHP
syntax, otherwise we'll miss edge cases or be too strict in others and
b) may limit us in future, because we may not be able to introduce syntax
that can't be reasonably recognized with simple lexer state management or
lookahead.

To give you an example of a), your patch currently handles a single
interface name properly:

nikic@saturn:~/php-src$ sapi/cli/php -r 'class Foo implements Interface {}'
Fatal error: Interface 'Interface' not found in Command line code on line 1

but fails as soon as you implement multiple interfaces:

nikic@saturn:~/php-src$ sapi/cli/php -r 'class Foo implements Interface, Array {}'
Parse error: syntax error, unexpected 'Array' (T_ARRAY), expecting identifier (T_STRING) or namespace (T_NAMESPACE) or \ (T_NS_SEPARATOR) in Command line code on line 1

So, I'm sure this can be worked around with a couple of new lexer rules,
I'm just trying to show the systematic issues of this approach.
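
To make the failure mode above concrete, here is a minimal toy model in Python. It is not the php-src lexer and not the patch; the names `lex`, `whole_list`, `RESERVED` and `PUNCT` are hypothetical, and only the token names mirror PHP's. With `whole_list=False` the lexer un-reserves just the one word after `implements`, so `Array` after the comma still comes out as T_ARRAY. With `whole_list=True` it tracks the whole comma-separated list, which means it has re-implemented a piece of the grammar.

```python
import re

# Toy subset of reserved words; the real list is much longer.
RESERVED = {"class", "implements", "interface", "array", "extends"}
PUNCT = {"{", "}", ","}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|[{},]")


def lex(src, whole_list=False):
    """Tokenize src, un-reserving names that follow `implements`."""
    tokens, in_names = [], False
    for word in TOKEN_RE.findall(src):
        low = word.lower()
        if word in PUNCT:
            # Only a comma, and only in whole_list mode, keeps the list open.
            in_names = in_names and whole_list and word == ","
            tokens.append((word, word))
        elif in_names:
            tokens.append(("T_STRING", word))
            in_names = whole_list  # first-name-only mode stops after one name
        elif low in RESERVED:
            tokens.append(("T_" + low.upper(), word))
            in_names = low == "implements"
        else:
            tokens.append(("T_STRING", word))
    return tokens


if __name__ == "__main__":
    src = "class Foo implements Interface, Array {}"
    print(lex(src))                   # Array is still a keyword token here
    print(lex(src, whole_list=True))  # Array is emitted as an identifier
```

Every additional place where a class name may appear would need its own piece of state like `in_names`, which is the systematic cost being described.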

An example for b) is harder to come by (as I'm not terribly familiar with
what we can easily detect in the lexer and what we can't). One thing that
comes to mind is supporting a short lambda syntax like the one available in
Hack:

(ClassName $a, $b, $c, $d) ==> $a

As this has no prefixing "function" or similar, I suspect that it may be
rather hard to detect that "ClassName" is actually a class name here and
requires special treatment. Un-reserving class names now may make features
like this impossible (or unnecessarily hard) to implement in the future.
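
A rough Python sketch of why this is awkward for a lexer, assuming the Hack-style `==>` syntax quoted above (it is not PHP syntax, and `starts_short_lambda` is a hypothetical helper): to know whether the word after `(` is a parameter type, the lexer has to scan forward to the matching `)` and check what follows it, which is unbounded lookahead.

```python
import re

# Very small tokenizer: '==>', variables/identifiers, and a few punctuators.
TOKEN_RE = re.compile(r"==>|\$?[A-Za-z_]\w*|[(),]")


def starts_short_lambda(tokens, i):
    """Return True if the '(' at tokens[i] closes right before a '==>'."""
    depth = 0
    for j in range(i, len(tokens)):
        if tokens[j] == "(":
            depth += 1
        elif tokens[j] == ")":
            depth -= 1
            if depth == 0:
                # Only now, arbitrarily far from tokens[i], is the answer known.
                return j + 1 < len(tokens) and tokens[j + 1] == "==>"
    return False


if __name__ == "__main__":
    lam = TOKEN_RE.findall("(ClassName $a, $b, $c, $d) ==> $a")
    print(starts_short_lambda(lam, 0))
    plain = TOKEN_RE.findall("(ClassName)")
    print(starts_short_lambda(plain, 0))
```

Until that scan finishes, the lexer cannot tell whether `ClassName` should be treated as an un-reserved class name or as an ordinary token in a parenthesized expression.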

Due to these issues, I don't like the RFC in the current form - I think
it's too ambitious. Class names simply occur in too many and diverse places.

I would suggest going with a more limited approach instead, which targets
only method and class constant names. I.e. the label after -> and :: should
not be reserved (we already do this for ->) and the label after "function"
and "const" shouldn't be either. Of course this would also allow defining
global reserved-keyword function/const names as well, so we might want to
check their names against the list of reserved keywords. Though even that
is just a courtesy to the user, e.g. it's already possible to define and
access reserved-keyword constants using define() and constant().
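
As an illustration of how small the limited rule is, here is a toy sketch in Python. It is not the php-src implementation; `lex`, `RESERVED` and `UNRESERVING` are hypothetical names, the reserved list is a tiny subset, and operators are emitted as themselves rather than under PHP's real token names. The only context it needs is the previous token: a reserved word directly after `->`, `::`, `function` or `const` is emitted as a plain identifier.

```python
import re

# Toy subset of reserved words; the real list is much longer.
RESERVED = {"function", "const", "class", "list", "array", "new", "public"}
# Tokens after which the next label is not treated as reserved.
UNRESERVING = {"->", "::", "function", "const"}
TOKEN_RE = re.compile(r"->|::|\$?[A-Za-z_]\w*|\d+|[^\s\w]")


def lex(src):
    """Tokenize src with a one-token memory for un-reserving labels."""
    tokens, prev = [], None
    for text in TOKEN_RE.findall(src):
        low = text.lower()
        if low in RESERVED and prev not in UNRESERVING:
            tokens.append(("T_" + low.upper(), text))
        elif text[0] == "$":
            tokens.append(("T_VARIABLE", text))
        elif text[0].isalpha() or text[0] == "_":
            tokens.append(("T_STRING", text))
        else:
            tokens.append((text, text))  # operators, digits, punctuation
        prev = low
    return tokens


if __name__ == "__main__":
    print(lex("$obj->list()"))        # 'list' after -> is an identifier
    print(lex("list($a) = $b"))       # 'list' elsewhere stays a keyword
    print(lex("function new() {}"))   # 'new' after 'function' is an identifier
```

Because the decision depends only on the immediately preceding token, no grammar rules have to be mirrored in the lexer, in contrast to the class-name case.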

Nikita

10 years ago by Marcio Almada

Hi, Nikita

2015-02-20 9:26 GMT-03:00 Nikita Popov <nikita.ppv@gmail.com>:

I think we all agree that it would be nice to not be so strict about
reserved keywords in some places. As such this RFC hinges on questions of
implementation.

The RFC uses a purely lexer-based approach, which is nice in principle,
because ext/tokenizer benefits from it as well.

The disadvantage of doing this in the lexer and in the scope that you're
proposing (i.e. including class names) is that it requires reimplementing
quite a number of parser rules via lookahead in the lexer. This means that
a) the implementation depends on a complete understanding of the PHP
syntax, otherwise we'll miss edge cases or be too strict in others and
b) may limit us in future, because we may not be able to introduce syntax
that can't be reasonably recognized with simple lexer state management or
lookahead.

To give you an example of a), your patch currently handles a single
interface name properly

nikic@saturn:~/php-src$ sapi/cli/php -r 'class Foo implements Interface {}'
Fatal error: Interface 'Interface' not found in Command line code on line 1

but fails as soon as you implement multiple interfaces:

nikic@saturn:~/php-src$ sapi/cli/php -r 'class Foo implements Interface, Array {}'
Parse error: syntax error, unexpected 'Array' (T_ARRAY), expecting identifier (T_STRING) or namespace (T_NAMESPACE) or \ (T_NS_SEPARATOR) in Command line code on line 1

So, I'm sure this can be worked around with a couple of new lexer rules,
I'm just trying to show the systematic issues of this approach.

Yes, in fact this was an easy fix, thanks for pointing that out. Now, about
the major issue regarding future language changes:

An example for b) is harder to come by (as I'm not terribly familiar with
what we can easily detect in the lexer and what we can't). One thing that
comes to mind is supporting a short lambda syntax like the one available in
Hack:

(ClassName $a, $b, $c, $d) ==> $a

As this has no prefixing "function" or similar, I suspect that it may be
rather hard to detect that "ClassName" is actually a class name here and
requires special treatment. Un-reserving class names now may make features
like this impossible (or unnecessarily hard) to implement in the future.

This is a very legit problem, thanks for raising it so early.

Due to these issues, I don't like the RFC in the current form - I think
it's too ambitious. Class names simply occur in too many and diverse places.

I would suggest going with a more limited approach instead, which targets
only method and class constant names. I.e. the label after -> and :: should
not be reserved (we already do this for ->) and the label after "function"
and "const" shouldn't be either. Of course this would also allow defining
global reserved-keyword function/const names as well, so we might want to
check their names against the list of reserved keywords. Though even that
is just a courtesy to the user, e.g. it's already possible to define and
access reserved-keyword constants using define() and constant().

Nikita

We still have some time before the feature freeze. I'll give some more
thought to the argument list detection problem. If nothing good comes of it,
I'll revert the proposal to its previous version, targeting only class|object
member declaration and access.

I think we both already agree that the less ambitious proposal would have
no drawbacks and many benefits, especially now that we are discussing
necessary RFCs that could globally reserve even more words:

Cheers,
Márcio

10 years ago by Stanislav Malyshev

Hi!

RFC: https://wiki.php.net/rfc/context_sensitive_lexer
TL;DR commit: https://github.com/marcioAlmada/php-src/commit/c01014f9
PR: https://github.com/php/php-src/pull/1054

I like the idea. But we need to examine the cases carefully so we don't
block some future routes. This applies especially to things such as type
names, which we wanted to reserve.

I.e. method names resolution is probably clear, since they appear after
-> or ::, but for class names the context may be much more varied.

Stas Malyshev
smalyshev@gmail.com

10 years ago by Marcio Almada

Hi, Stas

2015-02-22 19:20 GMT-03:00 Stanislav Malyshev <smalyshev@gmail.com>:

Hi!

I like the idea. But we need to examine the cases carefully so we don't
block some future routes. This applies especially to things such as type
names, which we wanted to reserve.

I.e. method names resolution is probably clear, since they appear after
-> or ::, but for class names the context may be much more varied.

Stas Malyshev
smalyshev@gmail.com

I agree. You and Nikita are right. Doing more than that with a purely lexical
approach, without migrating to another lexer generator (which was already
attempted before) or using some form of lexer feedback (which in its current
state breaks ext/tokenizer), would be inadequate and create future issues.
I'll probably work on a more ambitious and adequate solution for PHP
7.1~7.2.

For now, as said before, I'll revert the RFC and the proposed patch to
version 0.2, targeting only class|object member declaration and access. This
is perfectly achievable, has no drawbacks and brings many benefits. The RFC
will probably be ready for discussion again in ~2 days.

Thanks,
Márcio
