[RFC] Normalize token_get_all() output (with flag)

9 years ago by Stanislav Malyshev — view source

unread

Hi!

https://wiki.php.net/rfc/token-get-always-tokens

This should be pretty non-controversial. I hope?

No way!
Just kidding :) I think it's a great idea, token_get_all() is annoying
to deal with. We'd have to fix token_name though so that this would
still work:

foreach ($tokens as $token) {
if (is_array($token)) {
echo "Line {$token[2]}: ", token_name($token[0]), "
('{$token[1]}')", PHP_EOL;
}
}

We could, of course, do something like
$token[0]<=255?$token[0]:token_name($token[0]) - but that looks hacky.

Stas Malyshev
smalyshev@gmail.com

9 years ago by Sara Golemon — view source

unread

I think it's a great idea, token_get_all() is annoying
to deal with. We'd have to fix token_name though so that this would
still work:

Ah, excellent point!

We could, of course, do something like
$token[0]<=255?$token[0]:token_name($token[0]) - but that looks hacky.

Do you mean have users of the API do that? Or have the implementation
of token_name() do that? Because the latter doesn't seem unreasonable
at all.

For example:
var_dump(token_name(ord(';'))); // string(1)";"

-Sara

9 years ago by Stanislav Malyshev — view source

unread

Hi!

We could, of course, do something like
$token[0]<=255?$token[0]:token_name($token[0]) - but that looks hacky.

Do you mean have users of the API do that? Or have the implementation
of token_name() do that? Because the latter doesn't seem unreasonable
at all.

I meant for token_name() to do that internally, so that the user can use
just token_name() instead of boilerplate code. Sorry for not being clear.

For example:
var_dump(token_name(ord(';'))); // string(1)";"

Yes, exactly.

--
Stas Malyshev
smalyshev@gmail.com

9 years ago by Sara Golemon — view source

unread

Do you mean have users of the API do that? Or have the implementation
of token_name() do that? Because the latter doesn't seem unreasonable
at all.

I meant for token_name() to do that internally, so that the user can use
just token_name() instead of boilerplate code. Sorry for not being clear.

For example:
var_dump(token_name(ord(';'))); // string(1)";"

Yes, exactly.

Awesome, have updated the RFC accordingly.

-Sara

9 years ago by Marcio Almada — view source

unread

Total +1. I wanted to propose that, without the flag, but it got too
late for the PHP7 BC breaking season. Adding a flag seems to be a
better idea :P

2016-01-04 18:56 GMT-04:00 Sara Golemon pollita@php.net:

I swear, 2016 isn't going to be "An RFC per day" year, but...

https://wiki.php.net/rfc/token-get-always-tokens

This should be pretty non-controversial. I hope?

-Sara

9 years ago by Nikita Popov — view source

unread

I swear, 2016 isn't going to be "An RFC per day" year, but...

https://wiki.php.net/rfc/token-get-always-tokens

This should be pretty non-controversial. I hope?

+1

Would be nice if someone could come up with a more explicit name for the
flag. TOKEN_FULL is not obvious, at least to me. TOKEN_ALWAYS_ARRAY?

I'd also like to have a flag TOKEN_NO_LINENOS with deduplication of token
arrays, but that's a separate matter...

Nikita

9 years ago by Sara Golemon — view source

unread

Would be nice if someone could come up with a more explicit name for the
flag. TOKEN_FULL is not obvious, at least to me. TOKEN_ALWAYS_ARRAY?

Yeah, I'm not a huge fan of the name either, but I couldn't come up
with anything better at the time.

Maybe TOKEN_ASSOC? Since it provides associative array elements (as
opposed to the current indexed array behavior)

I'd also like to have a flag TOKEN_NO_LINENOS with deduplication of token
arrays, but that's a separate matter...

Not sure what you're suggesting here. Can you elaborate?

-Sara

9 years ago by Nikita Popov — view source

unread

Would be nice if someone could come up with a more explicit name for the
flag. TOKEN_FULL is not obvious, at least to me. TOKEN_ALWAYS_ARRAY?

Yeah, I'm not a huge fan of the name either, but I couldn't come up
with anything better at the time.

Maybe TOKEN_ASSOC? Since it provides associative array elements (as
opposed to the current indexed array behavior)

I like that one.

I'd also like to have a flag TOKEN_NO_LINENOS with deduplication of token
arrays, but that's a separate matter...

Not sure what you're suggesting here. Can you elaborate?

Basically: token_get_all() is rather slow. I think it says something that
getting the tokens of a script is about as slow as lexing it, parsing it
into an internal AST and constructing an object-based userland AST for it.
If you use token_get_all() in a matter that only requires one lookahead
token at a time, you don't really care about how nice the token format is,
you're only interested in it being efficient. I was hoping that we can
optimize it by dropping the line numbers (which is the most volatile part
of the structure) and try to reuse the same array for tokens which have the
same ID and content (but likely different lineno). It's very likely that a
script contains the T_WHITESPACE( ) token more than one and similarly
labels and variables tend to repeat, etc. No idea if that would actually
work/help, just an idea.

Nikita

9 years ago by Sara Golemon — view source

unread

Basically: token_get_all() is rather slow. I think it says something that
getting the tokens of a script is about as slow as lexing it, parsing it
into an internal AST and constructing an object-based userland AST for it.
If you use token_get_all() in a matter that only requires one lookahead
token at a time, you don't really care about how nice the token format is,
you're only interested in it being efficient. I was hoping that we can
optimize it by dropping the line numbers (which is the most volatile part of
the structure) and try to reuse the same array for tokens which have the
same ID and content (but likely different lineno). It's very likely that a
script contains the T_WHITESPACE( ) token more than one and similarly labels
and variables tend to repeat, etc. No idea if that would actually work/help,
just an idea.

Ah, I see what you mean. That'll take some thinking, and is well
outside the scope of this RFC, but I'll give it some thought and maybe
you or I or someone can gist something up later maybe...

-Sara

9 years ago by Sara Golemon — view source

unread

Maybe TOKEN_ASSOC? Since it provides associative array elements (as
opposed to the current indexed array behavior)

I like that one.

Have updated the RFC accordingly.
I've also renamed the 'token' field in the subarray to 'id' because
everytime I write an example, I end up finding myself with expressions
like token_name($token['token']) because it makes perfect sense to
call your subarray $token (e.g. foreach(token_get_all(...) as $token))
so... too many tokens, amirite?

-Sara

9 years ago by Sara Golemon — view source

unread

https://wiki.php.net/rfc/token-get-always-tokens

A suggestion from a co-worker who's worried about seeing patterns like:

case ($t['token']) {
case T_PAAMAYIM_NEKUDOTAYIM:
// do something
break;
case ord(';'):
// do something else
break;
}

I see three options to remediate this:

We could offer constants like: T_SEMICOLON (or T_ASCII_SEMICOLON)
defined to that character's ordinal value. (using unicode names)
Make the token field be the character value (for single-character
tokens) instead of the ordinal.
Add an additional field such that you have Array ( ['id'] => 59,
['token'] => ';', ['text'] => ';', ['line'] => 1 ) 'id' would always
be integer, while 'token' would be int or single-char.

I don't like 3 as it's wasteful and needlessly duplicative. 1 feels a
bit over-engineered and actually hurts readability. 2 feels like a
nice compromise and matches how our rules end up looking in the parser
anyway (since C treats single characters as ordinals already).

-Sara

9 years ago by Sara Golemon — view source

unread

https://wiki.php.net/rfc/token-get-always-tokens

A suggestion from a co-worker who's worried about seeing patterns like:

case ($t['token']) {
case T_PAAMAYIM_NEKUDOTAYIM:
// do something
break;
case ord(';'):
// do something else
break;
}

I see three options to remediate this:

We could offer constants like: T_SEMICOLON (or T_ASCII_SEMICOLON)
defined to that character's ordinal value. (using unicode names)

Make the token field be the character value (for single-character
tokens) instead of the ordinal.

Add an additional field such that you have Array ( ['id'] => 59,
['token'] => ';', ['text'] => ';', ['line'] => 1 ) 'id' would always
be integer, while 'token' would be int or single-char.

I don't like 3 as it's wasteful and needlessly duplicative. 1 feels a
bit over-engineered and actually hurts readability. 2 feels like a
nice compromise and matches how our rules end up looking in the parser
anyway (since C treats single characters as ordinals already).

Never really got a response on this one, and I'm personally leaning
further towards Option 2; So unless someone has a better idea, I'll
update the RFC in a few days.

-Sara

9 years ago by Stanislav Malyshev — view source

unread

Hi!

A suggestion from a co-worker who's worried about seeing patterns like:

case ($t['token']) {
case T_PAAMAYIM_NEKUDOTAYIM:
// do something
break;
case ord(';'):
// do something else
break;
}

What's wrong with this pattern? Looks pretty fine to me, clear what's
going on.

Make the token field be the character value (for single-character
tokens) instead of the ordinal.

That would work too, but the side effect would be mixing types. It
probably not too much trouble since tokens are numbered > 255, but it
may have some unexpected effects that most of the tokens would match by
== 0. Not sure if that's a problem.

Add an additional field such that you have Array ( ['id'] => 59,
['token'] => ';', ['text'] => ';', ['line'] => 1 ) 'id' would always
be integer, while 'token' would be int or single-char.

I don't like 3 as it's wasteful and needlessly duplicative. 1 feels a
bit over-engineered and actually hurts readability. 2 feels like a

Agree on both 1 and 3.

Stas Malyshev
smalyshev@gmail.com

9 years ago by Andrea Faulds — view source

unread

Hi Sara,

Sara Golemon wrote:

I swear, 2016 isn't going to be "An RFC per day" year, but...

https://wiki.php.net/rfc/token-get-always-tokens

This should be pretty non-controversial. I hope?

This is more of a side-note, but maybe it's worth bringing up. Since
token_get_all gives an array with subarrays of a regular structure,
might it be worthwhile returning an array of objects instead? It would
probably reduce memory usage (assuming they're objects of a Token class
or something, not StdClass), but I don't know if it's that useful.

Thoughts?

Andrea Faulds
https://ajf.me/

9 years ago by Sara Golemon — view source

unread

This is more of a side-note, but maybe it's worth bringing up. Since
token_get_all gives an array with subarrays of a regular structure, might it
be worthwhile returning an array of objects instead? It would probably
reduce memory usage (assuming they're objects of a Token class or something,
not StdClass), but I don't know if it's that useful.

Is the internal memory usage of (new stdClass) actually less than
(array())? I know array changed a lot with PHP7, and that the kind of
array we're talking about here is a mixed (which would be the least
efficient), but I don't have the numbers to hand.

To your question, I'm cool either way. Though if memory/efficiency is
a concern, returning an iterator would be even better, especially on
large files. Though I think that's beyond the scope of this
particular RFC.

Maybe I should keep up my "1 RFC per day" run and suggest
TOKEN_ITERATOR as yet another flag.... Maybe not. My inbox hurts.

-Sara

9 years ago by Andrea Faulds — view source

unread

Hi Sara,

Sara Golemon wrote:

This is more of a side-note, but maybe it's worth bringing up. Since
token_get_all gives an array with subarrays of a regular structure, might it
be worthwhile returning an array of objects instead? It would probably
reduce memory usage (assuming they're objects of a Token class or something,
not StdClass), but I don't know if it's that useful.

Is the internal memory usage of (new stdClass) actually less than
(array())? I know array changed a lot with PHP7, and that the kind of
array we're talking about here is a mixed (which would be the least
efficient), but I don't have the numbers to hand.

No, a StdClass would consume more memory than an array, because it has a
hashtable plus object overhead.

But having a specific token class would much consume less, since there's
no hashtable overhead for defined properties:

https://gist.github.com/nikic/5015323

I believe it's even better in PHP 7.

To your question, I'm cool either way. Though if memory/efficiency is
a concern, returning an iterator would be even better, especially on
large files. Though I think that's beyond the scope of this
particular RFC.

Memory probably doesn't matter, this was mostly a "what if" suggestion.

Maybe I should keep up my "1 RFC per day" run and suggest
TOKEN_ITERATOR as yet another flag.... Maybe not. My inbox hurts.

I know first-hand what making too many RFCs can do to you, so I wouldn't
recommend it.

Thanks! :)

Andrea Faulds
https://ajf.me/

9 years ago by me@rouvenwessling.de — view source

unread

This is more of a side-note, but maybe it's worth bringing up. Since token_get_all gives an array with subarrays of a regular structure, might it be worthwhile returning an array of objects instead? It would probably reduce memory usage (assuming they're objects of a Token class or something, not StdClass), but I don't know if it's that useful.

Thoughts?

I liked the idea so I went ahead and tested it. Code is here: https://github.com/php/php-src/pull/1727

Doing some incredibly simplistic testing using a modified version of Nikita’s PHP-Parser. It saved ~10% in memory (just parsing the parser itself and immediately discarding the result) and speed up within the margin of error. The proposal from Sara would probably take slightly more memory than the current code since there would be more arrays allocated.

In terms of performance that doesn’t seem like a big win. But objects are niece for another reason: IDEs (and potentially other tools) know about them and will warn you about typos in properties.

Best regards
Rouven

9 years ago by Sara Golemon — view source

unread

This is more of a side-note, but maybe it's worth bringing up. Since token_get_all gives an array with subarrays of a regular structure, might it be worthwhile returning an array of objects instead? It would probably reduce memory usage (assuming they're objects of a Token class or something, not StdClass), but I don't know if it's that useful.

I liked the idea so I went ahead and tested it. Code is here: https://github.com/php/php-src/pull/1727

Doing some incredibly simplistic testing using a modified version of Nikita’s PHP-Parser.
It saved ~10% in memory (just parsing the parser itself and immediately discarding the
result) and speed up within the margin of error. The proposal from Sara would probably
take slightly more memory than the current code since there would be more arrays allocated.

Nice! I'll say one thing though: If we're going to introduce a class,
I'm tempted to leave token_get_all alone (or at least limit its
changes to what's in the current porposal), and have the PhpToken
class implement its own static method to do the same thing. This
avoid overloading token_get_all() with a bunch of behaviors and allows
us to have some broader choice of actions with PhpToken::scan(). i.e.
An iteration version which would probably have nice perf wins, or a
callback version, etc... This also minimizes the bike-shedding risk
while we try out ideas.

How does this sound?

Keep the current RFC basically as is. It's a minor addition to
token_get_all() which can be slotted into existing code to improve
readability, but offers little advantage beyond that.
Make a new extension to prototype this PhpToken class outside of
the php-src tree, add all the extra goodies we want (array of token
return, iterator of token return, extra info, limited info, etc...)
When this extension is good and solid, propose merging into core.

I had reserved the github.com/phplang project for the specification,
but didn't end up using it. We can share a repo there if you'd like.

-Sara

9 years ago by Stanislav Malyshev — view source

unread

Hi!

Nice! I'll say one thing though: If we're going to introduce a class,
I'm tempted to leave token_get_all alone (or at least limit its
changes to what's in the current porposal), and have the PhpToken
class implement its own static method to do the same thing. This

I like this idea better. It is also conceptually simpler - you just have
one class to deal with.

How does this sound?

Keep the current RFC basically as is. It's a minor addition to
token_get_all() which can be slotted into existing code to improve
readability, but offers little advantage beyond that.

Make a new extension to prototype this PhpToken class outside of
the php-src tree, add all the extra goodies we want (array of token
return, iterator of token return, extra info, limited info, etc...)

When this extension is good and solid, propose merging into core.

Sounds great! I wonder if it'd be possible to make a real streaming
iterator? That would be so nice.

--
Stas Malyshev
smalyshev@gmail.com

9 years ago by Sara Golemon — view source

unread

How does this sound?

Keep the current RFC basically as is. It's a minor addition to
token_get_all() which can be slotted into existing code to improve
readability, but offers little advantage beyond that.

Make a new extension to prototype this PhpToken class outside of
the php-src tree, add all the extra goodies we want (array of token
return, iterator of token return, extra info, limited info, etc...)

When this extension is good and solid, propose merging into core.

Sounds great! I wonder if it'd be possible to make a real streaming
iterator? That would be so nice.

I can't imagine why not. The scanner lends itself perfectly to the
Iterator model and would end up making a much bigger dent in the
memory consumption than dithering between arrays and classes (though
"why not both" applies here). :)

Bouncing between several tasks atm, but I'll try to get something
started and invite some folks to the repo.

-Sara

9 years ago by Andrea Faulds — view source

unread

Hi Sara,

Sara Golemon wrote:

How does this sound?

Keep the current RFC basically as is. It's a minor addition to
token_get_all() which can be slotted into existing code to improve
readability, but offers little advantage beyond that.

Make a new extension to prototype this PhpToken class outside of
the php-src tree, add all the extra goodies we want (array of token
return, iterator of token return, extra info, limited info, etc...)

When this extension is good and solid, propose merging into core.

I had reserved the github.com/phplang project for the specification,
but didn't end up using it. We can share a repo there if you'd like.

Like Stas, I think this sounds good. I might even look over what's
created, if I find the time.

Regarding arrays versus iterators, you can create the former from the
latter, but not the other way around. I'd go for just having an iterator
and no array.. It maps better to how PHP actually works, anyway.

But enough discussion here, I shall await your git repository...

Thanks!

Andrea Faulds
https://ajf.me/

9 years ago by Sara Golemon — view source

unread

Make a new extension to prototype this PhpToken class outside of
the php-src tree, add all the extra goodies we want (array of token
return, iterator of token return, extra info, limited info, etc...)

When this extension is good and solid, propose merging into core.

I had reserved the github.com/phplang project for the specification,
but didn't end up using it. We can share a repo there if you'd like.

Like Stas, I think this sounds good. I might even look over what's created,
if I find the time.

Regarding arrays versus iterators, you can create the former from the
latter, but not the other way around. I'd go for just having an iterator and
no array.. It maps better to how PHP actually works, anyway.

But enough discussion here, I shall await your git repository...

Okay, i've got a starting point up at
https://github.com/phplang/php-token . It's basically just
token_get_all() and token_name() as they currently stand (I didn't
even bother overriding the behavior in token_name() yet). I did
change some of the codeflow layout a bit (and add support for scanning
files in the process), but that's it.

Unfortunately... I realized a few issues with my plans to add support
for callbacks and iterators... The tokenizer uses a global state that
is REALLY easy to invalidate, so once we allow userspace code to
resume, we're going to have to be careful about how we
preserve/restore/guard the tokenizer's state. This is usually not an
issue since there are no scanner callbacks in PHP currently.

I've added TazeTSchnitzel(Andrea), smalyshev(Stas), and nikic(Nikita)
to the ACL so far. Poke me here or on twitter with your github id if
you're interested in contributing...

-Sara

9 years ago by Stig Bakken — view source

unread

Okay, i've got a starting point up at
https://github.com/phplang/php-token . It's basically just
token_get_all() and token_name() as they currently stand (I didn't
even bother overriding the behavior in token_name() yet). I did
change some of the codeflow layout a bit (and add support for scanning
files in the process), but that's it.

I really like this, funny really that it took until now for someone to get
enough of the old interface :)

But rather than overloading the existing token_get_all() function with
flags that change its behavior, what about just adding a new function, say
token_get_all_assoc(), and give it a TOKEN_LINENUMS flag if you want to
make line numbers optional for performance?

At least in my experience, code becomes more readable with new
method/function names than with a bunch of optional flags in cases like
this.

Stig

9 years ago by Sara Golemon — view source

unread

Okay, i've got a starting point up at
https://github.com/phplang/php-token . It's basically just
token_get_all() and token_name() as they currently stand (I didn't
even bother overriding the behavior in token_name() yet). I did
change some of the codeflow layout a bit (and add support for scanning
files in the process), but that's it.

I really like this, funny really that it took until now for someone to get
enough of the old interface :)

But rather than overloading the existing token_get_all() function with flags
that change its behavior, what about just adding a new function, say
token_get_all_assoc(), and give it a TOKEN_LINENUMS flag if you want to make
line numbers optional for performance?

Um.... this extension is a new interface...

-Sara

9 years ago by Stig Bakken — view source

unread

On Sun, Jan 17, 2016 at 3:34 AM, Stig Bakken stig.bakken@gmail.com
wrote:

Okay, i've got a starting point up at
https://github.com/phplang/php-token . It's basically just
token_get_all() and token_name() as they currently stand (I didn't
even bother overriding the behavior in token_name() yet). I did
change some of the codeflow layout a bit (and add support for scanning
files in the process), but that's it.

I really like this, funny really that it took until now for someone to
get
enough of the old interface :)

But rather than overloading the existing token_get_all() function with
flags
that change its behavior, what about just adding a new function, say
token_get_all_assoc(), and give it a TOKEN_LINENUMS flag if you want to
make
line numbers optional for performance?

Um.... this extension is a new interface...

Not really, it's a new implementation. The old extension is the interface.
But I'll stop being pedantic ;-)

Stig

[RFC] Normalize token_get_all() output (with flag)

We could, of course, do something like $token[0]<=255?$token[0]:token_name($token[0]) - but that looks hacky.

Agree on both 1 and 3.

Thoughts?

Thanks! :)

Thanks!

We could, of course, do something like
$token[0]<=255?$token[0]:token_name($token[0]) - but that looks hacky.