Context Sensitive Language RFC - Implementation Candidate

10 years ago by Marcio Almada — view source

unread

Hi,

The Context Sensitive Lexer RFC
https://wiki.php.net/rfc/context_sensitive_lexer passed :) and by the
time of the voting phase, we decided to vote for the feature only and later
discuss quality analysis on the implementations aimed to fulfill the RFC.

First, I'd like to thank you all for that decision. I know that's an
exception on the RFC process, and I am glad we choose this path for the
following reasons:

Voting different RFCs describing the same feature with slightly
different implementations would cause us to waste many voting cycles (maybe
entire release cycles) without a guarantee of quality. The main reason to
establish an RFC process is to chase for quality - to follow all the rules
strictly, in this case, would be to contradict our main objective here.
Knowing in advance that the feature was already approved is a motivating
factor to go on and try a good number of possible implementations and
propose the best ones, instead of recursively voting until an
implementation pass.

With that said, this is the proposed pull request:

Pull Request: https://github.com/php/php-src/pull/1221
Diff: https://github.com/php/php-src/pull/1221/files
RFC: https://wiki.php.net/rfc/context_sensitive_lexer

There is sufficient description of the pull request itself. The ones that
participated in the previous discussions probably won't have trouble to
understand it, but feel free to share any doubts or suggestions here, if
necessary.

Thanks,
Márcio

10 years ago by Nikita Popov — view source

unread

On Mon, Apr 20, 2015 at 5:32 PM, Marcio Almada marcio.web2@gmail.com
wrote:

Hi,

The Context Sensitive Lexer RFC
https://wiki.php.net/rfc/context_sensitive_lexer passed :) and by the
time of the voting phase, we decided to vote for the feature only and later
discuss quality analysis on the implementations aimed to fulfill the RFC.

First, I'd like to thank you all for that decision. I know that's an
exception on the RFC process, and I am glad we choose this path for the
following reasons:

Voting different RFCs describing the same feature with slightly
different implementations would cause us to waste many voting cycles (maybe
entire release cycles) without a guarantee of quality. The main reason to
establish an RFC process is to chase for quality - to follow all the rules
strictly, in this case, would be to contradict our main objective here.

Knowing in advance that the feature was already approved is a motivating
factor to go on and try a good number of possible implementations and
propose the best ones, instead of recursively voting until an
implementation pass.

With that said, this is the proposed pull request:

Pull Request: https://github.com/php/php-src/pull/1221
Diff: https://github.com/php/php-src/pull/1221/files
RFC: https://wiki.php.net/rfc/context_sensitive_lexer

There is sufficient description of the pull request itself. The ones that
participated in the previous discussions probably won't have trouble to
understand it, but feel free to share any doubts or suggestions here, if
necessary.

Thanks,
Márcio

Sorry for late response, forgot about this RFC. I've only glanced over it,
but the patch looks okay from the technical side.

The thing that's bothering me is the fact that this patch is basically
saying: "It is no longer possible to correctly tokenize PHP without also
parsing it."

For example, if you're writing a syntax highlighter for PHP and you want
that syntax highlighter to be correct, you'll be writing not only a lexer,
but also a parser for PHP (which is significantly more complicated).
Actually it's worse than that: The approach of running a parser
concurrently with the token collection does not work for highlighting code
snippets for example, where the snippet may not form syntactically fully
valid code. Syntax highlighting only being an example, this applies to any
external tooling that's not written in PHP and does not have the benefit of
using token_get_all().

I don't know how important this is to us, but I'm somewhat vary of going
more into the C++ direction (where you essentially need a full
type-analyzer to do a parse).

This is why I still prefer the dead-simple approach of making the next
label after :: and "function" unreserved (what we do for the label after ->
already), combined with forbidding reserved names for free functions in the
compiler (similar to the blacklist we have for classes). This doesn't cover
everything (like trait adaptations), but I think it covers the 97% case
(and actually allows us to really allow all names, without exceptions).

Nikita

10 years ago by Marcio Almada — view source

unread

Hi!

Sorry for late response, forgot about this RFC. I've only glanced over it,
but the patch looks okay from the technical side.

No problem :) there are other more important issues being discussed that
should be prioritized, specially your engine exception RFC.

The thing that's bothering me is the fact that this patch is basically
saying: "It is no longer possible to correctly tokenize PHP without also
parsing it."

It won't be possible either with your idea "of making the next label after
'::' and 'function' unreserved" because it doesn't cover the entire PHP
syntax. So I took the approach that fulfills 100% support.

For example, if you're writing a syntax highlighter for PHP and you want
that syntax highlighter to be correct, you'll be writing not only a lexer,
but also a parser for PHP (which is significantly more complicated).
Actually it's worse than that: The approach of running a parser
concurrently with the token collection does not work for highlighting code
snippets for example, where the snippet may not form syntactically fully
valid code.

If we are talking about syntax highlighters only, then it's pretty easy for
an external tool to consider whatever is in front of :: and function a
name and highlight it as such, you simply use token_get_all($code)
without the proposed TOKEN_PARSE flag and transform the resulting array
accordingly by applying a simple callback to simulate a lookahead.

Syntax highlighting only being an example, this applies to any external
tooling that's not written in PHP and does not have the benefit of using
token_get_all().

With any approach taken, external tools will need to handle that the same
way. The point is, do we want "simpler" rules and pretend a small part of
PHP syntax doesn't even exists just because of these external tools?

External tools can handle that. Static analyzers inherently have much more
complex issues to deal with other than a white list of semi reserved names.

I don't know how important this is to us, but I'm somewhat vary of going
more into the C++ direction (where you essentially need a full
type-analyzer to do a parse).

Yes, indeed - and this was only possible because we have an AST now
(thanks) - but I don't have a problem with it as long as the parsing is
opt in, and that's the case here.

This is why I still prefer the dead-simple approach of making the next
label after :: and "function" unreserved (what we do for the label after ->
already), combined with forbidding reserved names for free functions in the
compiler (similar to the blacklist we have for classes). This doesn't cover
everything (like trait adaptations), but I think it covers the 97% case
(and actually allows us to really allow all names, without exceptions).

The problem with this "dead simple" approach is that it leaves syntax
behind like class const list and, like you just said, trait adaptations.
So unless a working solution that covers all syntax is given, this simpler
idea you suggested is a no go for me as it's not better than the solution
already being proposed.

Nikita

I hope we can find enough grounds to agree here ^^

Márcio