I’m planning on adding this functionality in some form to HHVM, however if it’s also wanted in PHP, I’d rather not add something HHVM-specific and will be happy to put up RFCs :)
Location Information
————
token_get_all() returns a line number for some tokens. I propose adding an additional TOKEN_EXTENDED_LOCATION flag, that would include:
- starting line and character number within that line
- ending line and character number within that line
T_ENCAPSED_AND_WHITESPACE and T_INLINE_HTML seem to be the most common cases of start line !== end line.
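To make the proposal concrete, here is a rough userland sketch of the information such a flag might expose. TOKEN_EXTENDED_LOCATION does not exist, so the helper name `tokens_with_location()` and the returned array shape are assumptions for illustration only. Positions are derived from the token text, columns are 1-based byte offsets, the end position is exclusive, and only "\n" is treated as a line break.

```php
<?php
// Hypothetical helper: approximates what TOKEN_EXTENDED_LOCATION could
// return by walking the ordinary token_get_all() output.
function tokens_with_location(string $code): array
{
    $line = 1;
    $col = 1; // 1-based byte offset within the current line (assumption)
    $out = [];
    foreach (token_get_all($code) as $token) {
        // Single-character tokens such as ";" come back as plain strings.
        $text = is_array($token) ? $token[1] : $token;
        $id = is_array($token) ? $token[0] : null;
        $startLine = $line;
        $startCol = $col;
        $newlines = substr_count($text, "\n");
        if ($newlines > 0) {
            $line += $newlines;
            // The column restarts after the last newline inside the token.
            $col = strlen($text) - strrpos($text, "\n");
        } else {
            $col += strlen($text);
        }
        $out[] = [
            'id' => $id,
            'text' => $text,
            'start' => [$startLine, $startCol],
            'end' => [$line, $col], // exclusive: position just past the token
        ];
    }
    return $out;
}
```

With this sketch, a T_INLINE_HTML token whose text spans two lines reports a start line that differs from its end line, which is the case described above.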
Raw Tokens
————
While token_get_all() is documented as returning whatever the lexer sees, in practice third-party software frequently depends on specific output. This gives you 3 options:
1. limit changes you make to the lexer to preserve BC
2. lie about the tokens to preserve BC
3. break BC
In our experience, #3 is not practical and #1 can lead to much more complicated solutions for problems that would be easily fixable in the lexer - so we went for #2. For example, HHVM converts:
- T_HASHBANG to T_INLINE_HTML
- T_ELSEIF to T_ELSE T_WHITESPACE T_IF
However, this means that there’s not currently a way to get the real lexer tokens. I propose adding a TOKEN_RAW flag, which should explicitly allow implementation-specific tokens and no guarantees about output stability.
For now, this would be a no-op in PHP, however it would give you more freedom in modifying the lexer in the future (in combination with #2 if the flag isn’t specified).
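As a rough illustration of option #2 combined with an opt-out flag, the sketch below rewrites a raw token stream into a BC-friendly one unless the caller asks for raw output. Everything here is hypothetical: `HYPOTHETICAL_TOKEN_RAW` and `compat_tokens()` are made-up names, the flag value is arbitrary, and the raw stream is assumed to use the same [id, text, line] shape as token_get_all(), with "else if" arriving as a single T_ELSEIF token. The T_HASHBANG case is left out because PHP has no such constant.

```php
<?php
// Hypothetical flag value; TOKEN_RAW is only a proposal in this thread.
const HYPOTHETICAL_TOKEN_RAW = 1 << 10;

// $rawTokens is assumed to use the [id, text, line] shape of token_get_all().
function compat_tokens(array $rawTokens, int $flags = 0): array
{
    if ($flags & HYPOTHETICAL_TOKEN_RAW) {
        return $rawTokens; // caller opted in to unstable lexer output
    }
    $out = [];
    foreach ($rawTokens as $token) {
        // Assumed lexer quirk: "else if" arrives as one T_ELSEIF token.
        if (is_array($token) && $token[0] === T_ELSEIF
            && preg_match('/^(else)(\s+)(if)$/i', $token[1], $m)) {
            $line = $token[2];
            $out[] = [T_ELSE, $m[1], $line];
            $out[] = [T_WHITESPACE, $m[2], $line];
            // T_IF starts on a later line if the whitespace holds newlines.
            $out[] = [T_IF, $m[3], $line + substr_count($m[2], "\n")];
            continue;
        }
        $out[] = $token; // everything else passes through unchanged
    }
    return $out;
}
```

Callers that pass no flag keep seeing the three-token form, while callers that pass the raw flag accept whatever the lexer produced, with no stability guarantee.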
With thanks,
- Fred
T_ELSEIF to T_ELSE T_WHITESPACE T_IF
HHVM only does that when the text of T_ELSEIF is "else\s+if", which
happens because of a fugly lexer hack which.... yeah... let's not talk
about that.
-Sara
I’m planning on adding this functionality in some form to HHVM,
however if it’s also wanted in PHP, I’d rather not add something
HHVM-specific and will be happy to put up RFCs :)
Location Information
————
token_get_all() returns a line number for some tokens. I propose
adding an additional TOKEN_EXTENDED_LOCATION flag, that would include:
- starting line and character number within that line
- ending line and character number within that line
That'd be nice to have... but I don't think the parser keeps that
information currently.
T_ENCAPSED_AND_WHITESPACE and T_INLINE_HTML seem to be the most common
cases of start line !== end line.
I would probably only include the ending line number if it is different?
Saves on a whole lot of memory allocations and usage... and it's trivial
to detect in consuming code.
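A small sketch of how that could look from the consumer's side, assuming a purely hypothetical token shape in which the end line is an optional fourth element that is only present when it differs from the start line. Both function names are invented for illustration. A token ending in a newline is counted here as ending on the following line, which is itself an assumption about how "end line" would be defined.

```php
<?php
// Hypothetical producer: append the end line only when it differs.
// $token uses the existing [id, text, startLine] shape.
function add_end_line_if_different(array $token): array
{
    $endLine = $token[2] + substr_count($token[1], "\n");
    if ($endLine !== $token[2]) {
        $token[3] = $endLine; // extra element only for multi-line tokens
    }
    return $token;
}

// Consumer side: fall back to the start line when no end line is given.
function token_end_line(array $token): int
{
    return $token[3] ?? $token[2];
}
```

Single-line tokens keep their current three-element shape, so the extra allocation only happens for the multi-line cases.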
cheers,
Derick