token_get_all(): additional location information, and raw tokens

9 years ago by Fred Emmott

I’m planning on adding this functionality in some form to HHVM; however, if it’s also wanted in PHP, I’d rather not add something HHVM-specific and will be happy to put up RFCs :)

Location Information
--------------------

token_get_all() returns a line number only for tokens it returns as arrays; single-character tokens come back as bare strings with no location at all. I propose adding a TOKEN_EXTENDED_LOCATION flag that would include:

- starting line and character number within that line
- ending line and character number within that line

T_ENCAPSED_AND_WHITESPACE and T_INLINE_HTML seem to be the most common cases of start line !== end line.
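
To make the proposal concrete, here is a rough userland sketch of the same data, derived from what token_get_all() already returns. The function name and the 'start'/'end' array layout are made up for illustration and are not part of PHP or HHVM; columns are counted in bytes, and 'end' is the position just past the token's last byte.

```php
<?php
// Hypothetical helper: not part of PHP or HHVM. It derives start/end
// line and column for every token from plain token_get_all() output.
function tokens_with_extended_location(string $source): array
{
    $line = 1; // 1-based line where the next token starts
    $col = 1;  // 1-based byte column where the next token starts
    $out = [];

    foreach (token_get_all($source) as $token) {
        // Tokens are either [id, text, line] arrays or bare one-character strings.
        $id = is_array($token) ? $token[0] : null;
        $text = is_array($token) ? $token[1] : $token;

        $startLine = $line;
        $startCol = $col;

        $newlines = substr_count($text, "\n");
        if ($newlines > 0) {
            // Multi-line token: the column restarts after the last newline.
            $line += $newlines;
            $col = strlen($text) - strrpos($text, "\n");
        } else {
            $col += strlen($text);
        }

        $out[] = [
            'id' => $id,
            'text' => $text,
            'start' => [$startLine, $startCol],
            'end' => [$line, $col], // exclusive: just past the last byte
        ];
    }

    return $out;
}
```

This only works because the token texts concatenate back to the original source, so it says nothing about what the lexer itself would need to track to report the same information natively.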

Raw Tokens
----------

While token_get_all() is documented as returning whatever the lexer sees, in practice third-party software frequently depends on specific output. This gives you 3 options:

1. limit changes you make to the lexer to preserve BC
2. lie about the tokens to preserve BC
3. break BC

In our experience, #3 is not practical, and #1 can lead to much more complicated solutions for problems that would be easily fixable in the lexer, so we went for #2. For example, HHVM converts:

- T_HASHBANG to T_INLINE_HTML
- T_ELSEIF to T_ELSE T_WHITESPACE T_IF

However, this means that there’s currently no way to get the real lexer tokens. I propose adding a TOKEN_RAW flag, which would explicitly allow implementation-specific tokens and make no guarantees about output stability.
For now, this would be a no-op in PHP; however, it would give you more freedom to modify the lexer in the future (in combination with #2 when the flag isn’t specified).

With thanks,

Fred

9 years ago by Sara Golemon

> T_ELSEIF to T_ELSE T_WHITESPACE T_IF

HHVM only does that when the text of T_ELSEIF is "else\s+if" which
happens because of a fugly lexer hack which.... yeah... let's not talk
about that.

-Sara

9 years ago by Derick Rethans

> I’m planning on adding this functionality in some form to HHVM;
> however, if it’s also wanted in PHP, I’d rather not add something
> HHVM-specific and will be happy to put up RFCs :)
>
> Location Information
> --------------------
>
> token_get_all() returns a line number for some tokens. I propose
> adding an additional TOKEN_EXTENDED_LOCATION flag, that would include:
>
> - starting line and character number within that line
> - ending line and character number within that line

That'd be nice to have... but I don't think the parser keeps that
information currently.

> T_ENCAPSED_AND_WHITESPACE and T_INLINE_HTML seem to be the most common
> cases of start line !== end line.

I would probably only include the ending line number if it is different?
Saves on a whole lot of memory allocations and usage... and it's trivial
to detect in consuming code.
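
As a rough illustration of how cheap that check is on the consuming side, here is a sketch. The token layout is purely an assumption: it imagines the existing [id, text, line] array gaining an optional fourth element for the ending line, which nothing in this thread has actually specified. The function name is also made up.

```php
<?php
// Assumption: with the proposed flag, an array token might be
// [id, text, startLine] plus an optional fourth element holding the
// ending line, present only when it differs. This layout is hypothetical.
function token_line_span(array $token): array
{
    $startLine = $token[2];
    // Fall back to the start line when no ending line was included.
    $endLine = $token[3] ?? $startLine;

    return [$startLine, $endLine];
}
```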

cheers,
Derick