Hi internals,
I'd like to add a new TOKEN_AS_OBJECT flag to token_get_all()
, which
returns an array of PhpToken objects, rather than the mix of plain strings
and arrays we currently have. The PhpToken class is defined as:
class PhpToken {
public $type;
public $text;
public $line;
}
This has been previously suggested and implemented by Rouven Weßling 1,
I've just ported this feature to master and optimized the implementation
2.
Apart from being a nicer interface than the current format, an additional
advantage is that the TOKEN_AS_OBJECT mode uses about 1.8x less memory and
is 1.4x times faster.
Are there any objections to this change?
Regards,
Nikita
On Thu, 23 Mar 2017 18:16:31 +0100, Nikita Popov nikita.ppv@gmail.com
wrote:
Hi internals,
I'd like to add a new TOKEN_AS_OBJECT flag to
token_get_all()
, which
returns an array of PhpToken objects, rather than the mix of plain
strings
and arrays we currently have. The PhpToken class is defined as:class PhpToken {
public $type;
public $text;
public $line;
}This has been previously suggested and implemented by Rouven Weßling 1,
I've just ported this feature to master and optimized the implementation
2.Apart from being a nicer interface than the current format, an additional
advantage is that the TOKEN_AS_OBJECT mode uses about 1.8x less memory
and
is 1.4x times faster.Are there any objections to this change?
Regards,
Nikita
Regarding memory - would it be possible to return iterator instead of
array?
Regards,
Jan Tvrdik
On Thu, 23 Mar 2017 18:16:31 +0100, Nikita Popov nikita.ppv@gmail.com
I'd like to add a new TOKEN_AS_OBJECT flag to
token_get_all()
, which
returns an array of PhpToken objects, rather than the mix of plain strings
and arrays we currently have. The PhpToken class is defined as:class PhpToken {
public $type;
public $text;
public $line;
}This has been previously suggested and implemented by Rouven Weßling [1],
I've just ported this feature to master and optimized the implementation
[2].
Yeah, IIRC the more recent discussion ended at "Oh, and this! And
this! And this!" which eventually went nowhere (largely my ADD). IMO
there's no harm in adding this, and the class format seems entirely
reasonable.
If I may bikeshed a TINY bit, I'd ask that all tokens return as
objects, rather than char|PhpToken similar to the current char|array
format we have. (Maybe that's in the PR, I haven't looked at either)
Regarding memory - would it be possible to return iterator instead of array?
That has some hairy edges to it since the lexer isn't actually
reentrant within a given thread. Image the following:
foreach (token_get_all($code, TOKEN_AS_ITERATOR) as $token) {
$moretokens = token_get_all($morecode, $whateverflags); // <--- Here
be broken dreams and crying unicorns
}
Worse still if you have parallel iterators.
That's probably fixable, but it's a much heavier refactor and one that
should probably have an RFC.
-Sara
On Thu, 23 Mar 2017 18:16:31 +0100, Nikita Popov nikita.ppv@gmail.com
I'd like to add a new TOKEN_AS_OBJECT flag to
token_get_all()
, which
returns an array of PhpToken objects, rather than the mix of plain
strings
and arrays we currently have. The PhpToken class is defined as:class PhpToken {
public $type;
public $text;
public $line;
}This has been previously suggested and implemented by Rouven Weßling [1],
I've just ported this feature to master and optimized the implementation
[2].Yeah, IIRC the more recent discussion ended at "Oh, and this! And
this! And this!" which eventually went nowhere (largely my ADD). IMO
there's no harm in adding this, and the class format seems entirely
reasonable.If I may bikeshed a TINY bit, I'd ask that all tokens return as
objects, rather than char|PhpToken similar to the current char|array
format we have. (Maybe that's in the PR, I haven't looked at either)
That's how it's currently implemented, yes. The output is PhpToken[].
Regarding memory - would it be possible to return iterator instead of
array?That has some hairy edges to it since the lexer isn't actually
reentrant within a given thread. Image the following:foreach (token_get_all($code, TOKEN_AS_ITERATOR) as $token) {
$moretokens = token_get_all($morecode, $whateverflags); // <--- Here
be broken dreams and crying unicorns
}Worse still if you have parallel iterators.
That's probably fixable, but it's a much heavier refactor and one that
should probably have an RFC.
The lexer is reentrant as long as the lexer state is saved and restored
every time the iterator is advanced -- we already need to support reentry
because, for various unpleasant reasons, it is actually possible that a
compilation is triggered while another is still in progress...
But yes, this is definitely a much more complicated change, and one I would
not consider to be particularly useful. In my experience most interesting
uses of token_get_all()
are not compatible with a simple iterator, or would
eschew it because the iterator would in all likelihood perform worse.
Nikita
Am 24.03.2017 um 00:33 schrieb Sara Golemon:
If I may bikeshed a TINY bit, I'd ask that all tokens return as
objects, rather than char|PhpToken similar to the current char|array
format we have.
Yes, please. That would basically/probably/hopefully make php-token-stream
superfluous.
Hi Nikita,
Nikita Popov wrote:
I'd like to add a new TOKEN_AS_OBJECT flag to
token_get_all()
, which
returns an array of PhpToken objects, rather than the mix of plain strings
and arrays we currently have. The PhpToken class is defined as:class PhpToken {
public $type;
public $text;
public $line;
}
Rather than adding a flag to token_get_all()
to return objects, you
could potentially instead make an equivalent static method on PhpToken
(PhpToken::getAll() perhaps). That would avoid mixing “object-oriented”
and “procedural” styles, though I don't know if it matters. It seems
cleaner to me.
Thanks!
--
Andrea Faulds
https://ajf.me/
Em 24 de mar de 2017 12:24 PM, "Andrea Faulds" ajf@ajf.me escreveu:
Hi Nikita,
Nikita Popov wrote:
I'd like to add a new TOKEN_AS_OBJECT flag to token_get_all()
, which
returns an array of PhpToken objects, rather than the mix of plain strings
and arrays we currently have. The PhpToken class is defined as:class PhpToken {
public $type;
public $text;
public $line;
}
Rather than adding a flag to token_get_all()
to return objects, you could
potentially instead make an equivalent static method on PhpToken
(PhpToken::getAll() perhaps). That would avoid mixing “object-oriented” and
“procedural” styles, though I don't know if it matters. It seems cleaner to
me.
I liked your suggestion. Maybe just change to something like
SplTokenizer::parse(): SplToken[]
Thanks!
--
Andrea Faulds
https://ajf.me/
Hi Nikita,
Nikita Popov wrote:
I'd like to add a new TOKEN_AS_OBJECT flag to
token_get_all()
, which
returns an array of PhpToken objects, rather than the mix of plain strings
and arrays we currently have. The PhpToken class is defined as:class PhpToken {
public $type;
public $text;
public $line;
}Rather than adding a flag to
token_get_all()
to return objects, you could
potentially instead make an equivalent static method on PhpToken
(PhpToken::getAll() perhaps). That would avoid mixing “object-oriented” and
“procedural” styles, though I don't know if it matters. It seems cleaner to
me.Thanks!
--
Andrea Faulds
https://ajf.me/--
I prefer a distinct function/method from token_get_all
. I don't see
the value in having a return value that differs so much based on a
passed flag. I don't care so much about a function vs static method; I
just oppose the additional parameter to token_get_all
.
I like the idea very much overall aside from that detail.
Hi Nikita,
Nikita Popov wrote:
I'd like to add a new TOKEN_AS_OBJECT flag to
token_get_all()
, which
returns an array of PhpToken objects, rather than the mix of plain
strings
and arrays we currently have. The PhpToken class is defined as:class PhpToken {
public $type;
public $text;
public $line;
}Rather than adding a flag to
token_get_all()
to return objects, you could
potentially instead make an equivalent static method on PhpToken
(PhpToken::getAll() perhaps). That would avoid mixing “object-oriented”
and
“procedural” styles, though I don't know if it matters. It seems cleaner
to
me.Thanks!
--
Andrea Faulds
https://ajf.me/--
I prefer a distinct function/method from
token_get_all
. I don't see
the value in having a return value that differs so much based on a
passed flag. I don't care so much about a function vs static method; I
just oppose the additional parameter totoken_get_all
.I like the idea very much overall aside from that detail.
token_get_all()
already has a $flags parameter controlling its output.
Currently only the TOKEN_PARSE
flag is supported, which changes the
interpretation of certain tokens. In a previous thread on this topic Sara
suggested another flag (something like TOKEN_ALWAYS_ARRAY) which makes the
output of token_get_all()
consistently use arrays. I think handling this as
a flag is entirely reasonable, and by analogy, handling the object mode as
a flag is also reasonable. In the end, these are just minor variations on
the output format (minor relative to the very complex primary functionality
of the function: lexing). Another precedent is the json_decode()
option to
use arrays or objects in the return value.
Regards,
Nikita
Hi Nikita,
Nikita Popov wrote:
I'd like to add a new TOKEN_AS_OBJECT flag to
token_get_all()
, which
returns an array of PhpToken objects, rather than the mix of plain
strings
and arrays we currently have. The PhpToken class is defined as:class PhpToken {
public $type;
public $text;
public $line;
}Rather than adding a flag to
token_get_all()
to return objects, you
could potentially instead make an equivalent static method on PhpToken
(PhpToken::getAll() perhaps). That would avoid mixing “object-oriented”
and “procedural” styles, though I don't know if it matters. It seems
cleaner to me.Thanks!
I am also against adding another flag because it violates the single
responsibility principle. However, adding a getAll
method to the
PhpToken
data class also seems very wrong, because it violates the
single responsibility once again.
The real and proper solution is to have an actual PHP Lexer that is
capable of creating a token stream.
class Lexer {
public function __construct(string $source);
public static function fromFile(string $path): self;
public function tokenize(): TokenStream;
}
final class TokenStream implements IteratorAggregate {
public function getIterator(): Generator<Token>
public function toArray(): Token[]
}
final class Token {
// Ideally this would be an enum, but...
public const OPEN_TAG;
public const CLOSE_TAG;
// ...
// I hope these are not mutable!
public int $category; // type
public string $lexeme; // text
public int $line;
public int $column; // we don't have that :(
/** @see token_name */
public function getCategoryName(): string;
// We could add `is*` methods for the various categories here.
public function isOpenTag(): bool;
public function isCloseTag(): bool;
// ...
}
With this in place, we're ready for the future. The TokenStream
can
easily use a generator over the internal array. Or we offer the
toArray
method only at the beginning. Users will have to call it, but
that overhead is tiny, considering that we can extend upon it in the
future without introducing any kind of breaking change.
Obviously this could go into the namespace PHP\Parser
, or we prefix
everything with PHP
(I'd be for the former).
On a side note, going for Php
instead of PHP
is inconsistent with
the naming that is currently dominating the PHP core:
https://secure.php.net/manual/en/indexes.functions.php
There are already some things that violate it, e.g. Phar
instead of
PHAR
, but most things keep acronyms intact (DOM
, XML
, PDO
, ...).
Introducing even more inconsistency to PHP seems like a very bad idea to
me. I know that this is considered bikeshedding by many people, but
consistency is very important and we are already lacking it on too many
levels as it is.
--
Richard "Fleshgrinder" Fussenegger
Thanks Fleshgrinder!
You expressed exactly what I thought, but was too busy... cough... lazy to
put in an email.
Cheers,
Hi Nikita,
Nikita Popov wrote:
I'd like to add a new TOKEN_AS_OBJECT flag to
token_get_all()
, which
returns an array of PhpToken objects, rather than the mix of plain
strings
and arrays we currently have. The PhpToken class is defined as:class PhpToken {
public $type;
public $text;
public $line;
}Rather than adding a flag to
token_get_all()
to return objects, you
could potentially instead make an equivalent static method on PhpToken
(PhpToken::getAll() perhaps). That would avoid mixing “object-oriented”
and “procedural” styles, though I don't know if it matters. It seems
cleaner to me.Thanks!
I am also against adding another flag because it violates the single
responsibility principle. However, adding agetAll
method to the
PhpToken
data class also seems very wrong, because it violates the
single responsibility once again.The real and proper solution is to have an actual PHP Lexer that is
capable of creating a token stream.class Lexer { public function __construct(string $source); public static function fromFile(string $path): self; public function tokenize(): TokenStream; } final class TokenStream implements IteratorAggregate { public function getIterator(): Generator<Token> public function toArray(): Token[] } final class Token { // Ideally this would be an enum, but... public const OPEN_TAG; public const CLOSE_TAG; // ... // I hope these are not mutable! public int $category; // type public string $lexeme; // text public int $line; public int $column; // we don't have that :( /** @see token_name */ public function getCategoryName(): string; // We could add `is*` methods for the various categories here. public function isOpenTag(): bool; public function isCloseTag(): bool; // ... }
With this in place, we're ready for the future. The
TokenStream
can
easily use a generator over the internal array. Or we offer the
toArray
method only at the beginning. Users will have to call it, but
that overhead is tiny, considering that we can extend upon it in the
future without introducing any kind of breaking change.Obviously this could go into the namespace
PHP\Parser
, or we prefix
everything withPHP
(I'd be for the former).On a side note, going for
Php
instead ofPHP
is inconsistent with
the naming that is currently dominating the PHP core:https://secure.php.net/manual/en/indexes.functions.php
There are already some things that violate it, e.g.
Phar
instead of
PHAR
, but most things keep acronyms intact (DOM
,XML
,PDO
, ...).
Introducing even more inconsistency to PHP seems like a very bad idea to
me. I know that this is considered bikeshedding by many people, but
consistency is very important and we are already lacking it on too many
levels as it is.--
Richard "Fleshgrinder" Fussenegger--
--
Guilherme Blanco
Senior Technical Architect at Huge Inc.
>
> > Hi Nikita,
> >
> > Nikita Popov wrote:
> >
> >> I'd like to add a new TOKEN_AS_OBJECT flag to `token_get_all()`, which
> >> returns an array of PhpToken objects, rather than the mix of plain
> >> strings
> >> and arrays we currently have. The PhpToken class is defined as:
> >>
> >> class PhpToken {
> >> public $type;
> >> public $text;
> >> public $line;
> >> }
> >
> > Rather than adding a flag to `token_get_all()` to return objects, you
> > could potentially instead make an equivalent static method on PhpToken
> > (PhpToken::getAll() perhaps). That would avoid mixing “object-oriented”
> > and “procedural” styles, though I don't know if it matters. It seems
> > cleaner to me.
> >
> > Thanks!
> >
>
> I am also against adding another flag because it violates the single
> responsibility principle. However, adding a `getAll` method to the
> `PhpToken` data class also seems very wrong, because it violates the
> single responsibility once again.
>
> The real and proper solution is to have an actual PHP Lexer that is
> capable of creating a token stream.
>
> class Lexer {
> public function __construct(string $source);
> public static function fromFile(string $path): self;
> public function tokenize(): TokenStream;
> }
>
If PhpToken::getAll() violates the SRP, then Lexer::fromFile() is just the
same sort of violation.
Regards, Niklas
> final class TokenStream implements IteratorAggregate {
> public function getIterator(): Generator
> public function toArray(): Token[]
> }
>
> final class Token {
> // Ideally this would be an enum, but...
> public const OPEN_TAG;
> public const CLOSE_TAG;
> // ...
>
> // I hope these are not mutable!
> public int $category; // type
> public string $lexeme; // text
> public int $line;
> public int $column; // we don't have that :(
>
> /** @see token_name */
> public function getCategoryName(): string;
>
> // We could add `is*` methods for the various categories here.
> public function isOpenTag(): bool;
> public function isCloseTag(): bool;
> // ...
> }
>
> With this in place, we're ready for the future. The `TokenStream` can
> easily use a generator over the internal array. Or we offer the
> `toArray` method only at the beginning. Users will have to call it, but
> that overhead is tiny, considering that we can extend upon it in the
> future without introducing any kind of breaking change.
>
> Obviously this could go into the namespace `PHP\Parser`, or we prefix
> everything with `PHP` (I'd be for the former).
>
> On a side note, going for `Php` instead of `PHP` is inconsistent with
> the naming that is currently dominating the PHP core:
>
> https://secure.php.net/manual/en/indexes.functions.php
>
> There are already some things that violate it, e.g. `Phar` instead of
> `PHAR`, but most things keep acronyms intact (`DOM`, `XML`, `PDO`, ...).
> Introducing even more inconsistency to PHP seems like a very bad idea to
> me. I know that this is considered bikeshedding by many people, but
> consistency is very important and we are already lacking it on too many
> levels as it is.
>
> --
> Richard "Fleshgrinder" Fussenegger
If PhpToken::getAll() violates the SRP, then Lexer::fromFile() is just the
same sort of violation.Regards, Niklas
This is kind of true in PHP, because we are lacking proper abstractions
for path and file handling. It is most certainly not true in many other
languages. Usually such named constructors or factory methods are just
short-hands to trigger a series of other behavior that abides the SRP.
throws InvalidPathException
throws IOException
public static function fromFile(string $path): self {
return new static(new File(new Path($path))->getContents());
}
There is not logic at all inside this method, it is only there for
convenience to spare you from writing this over and over and over and
over ... again. In the PHP case the whole thing directly explodes in
complexity:
public static function fromFile(string $path): self {
// Already questionable if we can actually pass that string
// to file_get_contents since it might be a remote file, or
// complete garbage that does not even remotely resemble a
// path ... who knows ...
$contents = file_get_contents($path);
if ($contents === false) {
// Let's just hope there is no other error in here from
// a previous error ...
$error = `error_get_last()`;
// ...
}
return new static($contents);
}
I shortened the whole thing because it is to cumbersome to write all of
this without an IDE. So yes it violates the SRP in classic PHP, but no
it does not necessarily violate SRP per se.
Also note that adding Token::getAll
is more than weird from a logical
point of view too. Since Token
represents a single thing, whereas
getAll
by logic belongs to some kind of collection of things.
Note that if we go further with the whole thing that we wouldn't even
want to deal with a string in the constructor, but rather a Reader
,
BufferedReader
, or SeekableReader
. A File
instance most certainly
would comply with that:
final class Lexer {
public function __construct(SeekableReader $source);
public static function fromFile(string $path): self {
return new static(new File(new Path($path)));
}
}
Now in case we want to work with a string directly, e.g. unit tests, we
would need a seekable reader that encapsulates that string, e.g. a Cursor
.
new Lexer(new Cursor('<?php echo "foo";'));
This usecase might not be as common as the one where we want to tokenize
a file's content, thus, we do not add a short-hand for it. However, in
case it would be so common that it makes sense to add, we should add it.
final class Lexer {
// Same as before...
public static function fromString(string $source): self {
new static(new Cursor($source));
}
}
Once again, no logic, no branching, no hardcoded strategies. It does one
thing and one thing only: sparing us from re-typing the same construct
over and over again. DRY is one of the most important principles of them
all after all.
Another question to ask before adding such methods lightly is whether we
can justify the additional dependency. Depending on File
and Path
is
most certainly never a problem, after all, we need the file system at
all times. The Cursor
on the other hand might be debatable, as it
might reside in a special component that is not necessarily available
just because the Lexer
is.
I could go on and on and on now but I'm getting kind of off topic now. ?
--
Richard "Fleshgrinder" Fussenegger