[RFC][Discussion] Parser extension API

10 years ago by Leigh — view source — reply

unread

On 17 February 2015 at 11:46, Alexander Lisachenko
lisachenko.it@gmail.com wrote:

Hello, internals!

I want to introduce a RFC for providing a userland API for accessing an
Abstract Syntax Tree of the source code and to provide userland parser
hooks for source code modification:
https://wiki.php.net/rfc/parser-extension-api

Thanks!

Re: Userland representation of AST.

I use Nikita's PHP-Parser for static analysis already, so my first
reaction is that I like the idea of being able to obtain the canonical
AST from the engine, that is automatically updated as new features are
added.

Does the AST provided by the internal parser provide all of the
information required to be able to turn it back into source code?

Re: Extending the parser from PHP userland.

This I don't like so much.

To be honest I hoped this would be an API to extend the parser from
PHP extensions :) I think letting userland fiddle directly with the
compilation process is just asking for trouble.

10 years ago by Alexander Lisachenko — view source — reply

unread

2015-02-17 15:06 GMT+03:00 Leigh leight@gmail.com:

Does the AST provided by the internal parser provide all of the
information required to be able to turn it back into source code?

Yes, this should be possible, via zend_emit_op*() and zend_compile*()

2015-02-17 15:06 GMT+03:00 Leigh leight@gmail.com:

Re: Extending the parser from PHP userland.

This I don't like so much.

To be honest I hoped this would be an API to extend the parser from
PHP extensions :) I think letting userland fiddle directly with the
compilation process is just asking for trouble.

Expected reaction ) However, it would be nice to have such API on userland
too (on engine level this can be a hook too). This will give an instrument
for building custom things on user side, but will keep PHP engine clear
from a lot of stuff, which can be implemented by users. Nice example is
Design-By-Contract validation, it can be easily done via parser hooks.

10 years ago by Benjamin Eberlei — view source — reply

unread

On Tue, Feb 17, 2015 at 12:46 PM, Alexander Lisachenko <
lisachenko.it@gmail.com> wrote:

Hello, internals!

I want to introduce a RFC for providing a userland API for accessing an
Abstract Syntax Tree of the source code and to provide userland parser
hooks for source code modification:
https://wiki.php.net/rfc/parser-extension-api

The visitor API is the essential part here and you left it out.
How does this work with Engine Extensions, are they considered for all
require/include's after the registrtion? How does this work with opcache?

Thanks!

10 years ago by Alexander Lisachenko — view source — reply

unread

2015-02-17 15:09 GMT+03:00 Benjamin Eberlei kontakt@beberlei.de:

The visitor API is the essential part here and you left it out.

Yes, I decided not to put Visitor in the RFC (this was added as open
question to the RFC). But this part can be discussed in the case of general
acceptance.

2015-02-17 15:09 GMT+03:00 Benjamin Eberlei kontakt@beberlei.de:

How does this work with Engine Extensions, are they considered for all
require/include's after the registrtion? How does this work with opcache?

Yes, parser extensions will be called for all require/include/evals after
registration. This part is transparent for opcache, because opcache just
stores an opcodes for the file. AST is parsed only once for each file, then
hooks can transform the AST and after that compiler will produce a final
opcodes that can be stored for that file in the opcache. So, I expect no
impact on opcache logic.

10 years ago by Benjamin Eberlei — view source — reply

unread

On Tue, Feb 17, 2015 at 1:22 PM, Alexander Lisachenko <
lisachenko.it@gmail.com> wrote:

2015-02-17 15:09 GMT+03:00 Benjamin Eberlei kontakt@beberlei.de:

The visitor API is the essential part here and you left it out.

Yes, I decided not to put Visitor in the RFC (this was added as open
question to the RFC). But this part can be discussed in the case of general
acceptance.

2015-02-17 15:09 GMT+03:00 Benjamin Eberlei kontakt@beberlei.de:

How does this work with Engine Extensions, are they considered for all
require/include's after the registrtion? How does this work with opcache?

Yes, parser extensions will be called for all require/include/evals after
registration. This part is transparent for opcache, because opcache just
stores an opcodes for the file. AST is parsed only once for each file, then
hooks can transform the AST and after that compiler will produce a final
opcodes that can be stored for that file in the opcache. So, I expect no
impact on opcache logic.

Well not really, depending on the path towards a require a parser extension
is already registered or not. Or you have conditional registration of
extensions. So you could load a file with an extension registered, but it
still serves the old opcodes for that file.

10 years ago by Alexander Lisachenko — view source — reply

unread

2015-02-17 15:29 GMT+03:00 Benjamin Eberlei kontakt@beberlei.de:

Well not really, depending on the path towards a require a parser
extension is already registered or not. Or you have conditional
registration of extensions. So you could load a file with an extension
registered, but it still serves the old opcodes for that file.

Technically, this will be possible, however, Parser Extensions should be
registered ASAP during bootstrap process, near spl_autoload_register(),
stream_wrapper_register(), stream_filter_register(), etc..
So, no conditional code hooks should be used for that. But this can be the
consistent with stream wrappers, filters and autoloaders. Developer can
decide what he need to enable some features.

One more possible way for this RFC is to remove these register/unregister
methods from the Php\Parser\Engine class and add an option to the php.ini
with list of classes to load. This option can be adjusted then per
project/directory. Is this more suitable?

Thanks!

10 years ago by Rowan Collins — view source — reply

unread

Alexander Lisachenko wrote on 17/02/2015 12:41:

2015-02-17 15:29 GMT+03:00 Benjamin Eberlei kontakt@beberlei.de:

Well not really, depending on the path towards a require a parser
extension is already registered or not. Or you have conditional
registration of extensions. So you could load a file with an extension
registered, but it still serves the old opcodes for that file.

Technically, this will be possible, however, Parser Extensions should be
registered ASAP during bootstrap process, near spl_autoload_register(),
stream_wrapper_register(), stream_filter_register(), etc..

I think "should" is OK for extensions, but not for userland hooks. There
needs to be a very firm definition of the right and wrong way to
implement these hooks, which can at the very least warn the user when
they are slipping into undefined behaviour.

Note that spl_autoload_register doesn't affect OpCache because it only
changes the mapping of a class to a parsed file (keyed on the file
system path to the source), not the compiled contents of that file.

If AST hooks can be registered on a global scope at any time, then
compiling the same file can lead to different op codes at different
times (e.g. page loads which take a different path through the code),
which would mean the OpCache would need extra information in its keys to
cache each possible version.

One more possible way for this RFC is to remove these register/unregister
methods from the Php\Parser\Engine class and add an option to the php.ini
with list of classes to load. This option can be adjusted then per
project/directory. Is this more suitable?

The other alternative is to go to the other extreme, and have the
extensions scoped to a particular file, more like Perl pragmas. Dare I
say we could use the declare() syntax that everyone seems to have an
irrational hatred of?

declare(pragma=Example\DbcParserExtension);

Regards,

Rowan Collins
[IMSoP]

10 years ago by Alexander Lisachenko — view source — reply

unread

2015-02-17 16:35 GMT+03:00 Rowan Collins rowan.collins@gmail.com:

I think "should" is OK for extensions, but not for userland hooks. There
needs to be a very firm definition of the right and wrong way to implement
these hooks, which can at the very least warn the user when they are
slipping into undefined behaviour.

Thanks for this point. Added this question to the list of open issues in
RFC. My vision is to update a documentation with clear description, why
parser extensions should be registered statically without conditions.
Parsing process and limitations should be also described in the RFC.

2015-02-17 16:35 GMT+03:00 Rowan Collins rowan.collins@gmail.com:

The other alternative is to go to the other extreme, and have the
extensions scoped to a particular file, more like Perl pragmas. Dare I say
we could use the declare() syntax that everyone seems to have an irrational
hatred of?

declare(pragma=Example\DbcParserExtension);

Ouch ) Don't like this way at all, because extensions should process all
files, even if it's not marked with some hints. What I want to see is PHP
analog of Java bytecode compilers, they can transform an original source
code into specific tokens and provide custom extensions on top of that. So,
I want to choose between php.ini option and explicit methods. Let's see
another opinions.

Thanks!

10 years ago by Rowan Collins — view source — reply

unread

Alexander Lisachenko wrote on 17/02/2015 14:21:

2015-02-17 16:35 GMT+03:00 Rowan Collins <rowan.collins@gmail.com
mailto:rowan.collins@gmail.com>:
The other alternative is to go to the other extreme, and have the
extensions scoped to a particular file, more like Perl pragmas.
Dare I say we could use the declare() syntax that everyone seems
to have an irrational hatred of?

declare(pragma=Example\DbcParserExtension);
Ouch ) Don't like this way at all, because extensions should process
all files, even if it's not marked with some hints. What I want to see
is PHP analog of Java bytecode compilers, they can transform an
original source code into specific tokens and provide custom
extensions on top of that. So, I want to choose between php.ini option
and explicit methods. Let's see another opinions.

It's interesting that you used the word "extension" there. If the hook
can only be registered for in php.ini (and it would need to be
PHP_INI_SYSTEM if you want to avoid OpCache needing to vary its cache on
different settings) then these are basically extensions which happen to
be written in PHP - you install them globally, they execute
unconditionally, and they do things which can't be done any other way.

In that case, it would maybe make more sense to make this part of the
"simplified extension API" discussed in a previous thread, rather than
exposing it to userland at all. The only abilities you'd lose are ones
which should never be used anyway, e.g. statically registering an
extension, then dynamically changing its behaviour at runtime by
altering global/static variables.

I think the same argument applies to per-file pragmas over per-install
settings, as to Dependency Injection over global state - you should be
able to reason about each module of code in isolation, reuse it in
different combinations, etc, etc. What if you want to use two different
DbC extensions on the same server, and they collide with each other?
What would the error look like if you failed to install an extension
that was required to process your source code as intended? A declare
line (or Perl's use, or Python's import) gives a clear point where that
requirement is defined.

Regards,

Rowan Collins
[IMSoP]

10 years ago by Leigh — view source — reply

unread

On 17 February 2015 at 12:22, Alexander Lisachenko
lisachenko.it@gmail.com wrote:

Yes, parser extensions will be called for all require/include/evals after
registration. This part is transparent for opcache, because opcache just
stores an opcodes for the file. AST is parsed only once for each file, then
hooks can transform the AST and after that compiler will produce a final
opcodes that can be stored for that file in the opcache. So, I expect no
impact on opcache logic.

So: Internal Parse AST -> Extension Parse AST -> Generate OpCodes -> OpCache?

So if the parser extension does anything dynamic, then OpCache will
have the wrong version cached.

10 years ago by Alexander Lisachenko — view source — reply

unread

2015-02-17 15:48 GMT+03:00 Leigh leight@gmail.com:

So: Internal Parse AST -> Extension Parse AST -> Generate OpCodes ->
OpCache?

Yes, it's correct flow.

2015-02-17 15:48 GMT+03:00 Leigh leight@gmail.com:

So if the parser extension does anything dynamic, then OpCache will
have the wrong version cached.

This restriction is intended by design, no dynamic changes are allowed.
However, if you need, then you can disable an opcode cacher and do anything
you want.

Your example is actual for production mode of PHP too, when
opcache.validate_timestamps=0 or apc.stat=off. Parsing will be performed
only once, then you need to clear cache manually to reload changes from the
file system. So, parser extension works transparently and used only when
needed to process an AST.

10 years ago by Lars Strojny — view source — reply

unread

Hi Alexander,

Hello, internals!

I want to introduce a RFC for providing a userland API for accessing an
Abstract Syntax Tree of the source code and to provide userland parser
hooks for source code modification:
https://wiki.php.net/rfc/parser-extension-api

Looks cool and I could see a couple of interesting possibilities arising. One thing: any particular reason ExtensionInterface is static? I could see a couple of benefits having extensions carry state and registerExtension() taking an instance of ExtensionInterface, not a class name. What do you think?

cu,
Lars

10 years ago by Alexander Lisachenko — view source — reply

unread

Hello, Lars!

2015-02-17 22:09 GMT+03:00 Lars Strojny lars@strojny.net:

Looks cool and I could see a couple of interesting possibilities arising.
One thing: any particular reason ExtensionInterface is static? I could see
a couple of benefits having extensions carry state and registerExtension()
taking an instance of ExtensionInterface, not a class name. What do you
think?

Thanks for the positive feedback! My first thought was to use instances for
extensions: https://gist.github.com/lisachenko/ffcfdec4c46e01864b33.
However, I have a doubts that this will introduce additional possibilities
for side-effects, when extensions can be configured
differently/loaded/unloaded. In this thread some people noticed this fact
too and suggested to add a restrictions for preventing conditional parsing
of the source code.

So, static method and static registration was chosen to reduce a number of
possible conditional extensions. This has some disadvantages (testing,
configuration, etc), but should give better experience with parsing
behavior.

10 years ago by Dmitry Stogov — view source — reply

unread

Hi,

On Tue, Feb 17, 2015 at 2:46 PM, Alexander Lisachenko <
lisachenko.it@gmail.com> wrote:

Hello, internals!

I want to introduce a RFC for providing a userland API for accessing an
Abstract Syntax Tree of the source code and to provide userland parser
hooks for source code modification:
https://wiki.php.net/rfc/parser-extension-api

Thanks!

The first part, describing https://github.com/nikic/php-ast , looks fine.
I see no problems including this extension into PHP-7.0 core distribution.
May be even making it required (like ext/reflection).

Nikita, what do you think?

The second part looks very interesting, however it has some uncovered
questions.

API for AST modification
AST validation (someone may insert "break" node in "parameter-list").

If you have enough experience, I would suggest you to try writing an
external extension that would implement this idea.
If you'll need some modification in PHP core (e.g adding callbacks), we may
consider including them in PHP-7.0.

Thanks. Dmitry.

10 years ago by Nikita Popov — view source — reply

unread

Hi,

On Tue, Feb 17, 2015 at 2:46 PM, Alexander Lisachenko <
lisachenko.it@gmail.com> wrote:

Hello, internals!

I want to introduce a RFC for providing a userland API for accessing an
Abstract Syntax Tree of the source code and to provide userland parser
hooks for source code modification:
https://wiki.php.net/rfc/parser-extension-api

Thanks!

The first part, describing https://github.com/nikic/php-ast , looks fine.
I see no problems including this extension into PHP-7.0 core distribution.
May be even making it required (like ext/reflection).

Nikita, what do you think?

The second part looks very interesting, however it has some uncovered
questions.

API for AST modification

AST validation (someone may insert "break" node in "parameter-list").

If you have enough experience, I would suggest you to try writing an
external extension that would implement this idea.
If you'll need some modification in PHP core (e.g adding callbacks), we
may consider including them in PHP-7.0.

Thanks. Dmitry.

I agree with Dmitry.

Exporting the AST to userland in PHP 7.0 would be nice. With Dmitry's work
on assertions we even have a pretty printer for it, which allows us to
convert the AST back to PHP code.

The bits about intercepting compilation, modifying the AST and importing it
back into the compiler are a LOT more complicated. Apart from many open
design questions, this will also require quite a bit of implementation
effort because we must be very careful with the many assumptions the
compiler makes about the structure of the AST (otherwise -> segfaults). I'd
prefer to delay this to a later PHP version (there's no problem to add it
in a minor), preferably with a PoC extension being available beforehand.

Alexander, I would recommend you to split this into two RFCs, one dealing
only with AST export (and maybe pretty printing) and the second one with
the compilation hooks. There's probably a few questions about the export
API that should be discussed that will be forgotten if everyone focuses on
the more complicated, and probably not yet relevant, question of
compilation hooks.

Nikita

10 years ago by Cesar Rodas — view source — reply

unread

Hi,

On Tue, Feb 17, 2015 at 2:46 PM, Alexander Lisachenko <
lisachenko.it@gmail.com> wrote:

Hello, internals!

I want to introduce a RFC for providing a userland API for accessing an
Abstract Syntax Tree of the source code and to provide userland parser
hooks for source code modification:
https://wiki.php.net/rfc/parser-extension-api

Thanks!

The first part, describing https://github.com/nikic/php-ast , looks fine.
I see no problems including this extension into PHP-7.0 core distribution.
May be even making it required (like ext/reflection).

Nikita, what do you think?

The second part looks very interesting, however it has some uncovered
questions.

API for AST modification

AST validation (someone may insert "break" node in "parameter-list").

If you have enough experience, I would suggest you to try writing an
external extension that would implement this idea.
If you'll need some modification in PHP core (e.g adding callbacks), we
may consider including them in PHP-7.0.

Thanks. Dmitry.

I agree with Dmitry.

Exporting the AST to userland in PHP 7.0 would be nice. With Dmitry's work
on assertions we even have a pretty printer for it, which allows us to
convert the AST back to PHP code.

In this matter, how would it be? It'd be awesome if function expects an
AST tree object instead of a value:

function doQuery($table, *$where) {
   $where = convert_ast_toSqlWhere($where);
}

doQuery("foobar", $col1 = "something" AND $col2 == $col3);

Or at least ast(<expr>), although both would be nice to have.

The bits about intercepting compilation, modifying the AST and importing it
back into the compiler are a LOT more complicated. Apart from many open
design questions, this will also require quite a bit of implementation
effort because we must be very careful with the many assumptions the
compiler makes about the structure of the AST (otherwise -> segfaults). I'd
prefer to delay this to a later PHP version (there's no problem to add it
in a minor), preferably with a PoC extension being available beforehand.

Alexander, I would recommend you to split this into two RFCs, one dealing
only with AST export (and maybe pretty printing) and the second one with
the compilation hooks. There's probably a few questions about the export
API that should be discussed that will be forgotten if everyone focuses on
the more complicated, and probably not yet relevant, question of
compilation hooks.

Nikita

--
César D. Rodas
Open Source developer
+595-983-161124
PGP: F9ED A265 A3AB C8A1 D145 7368 158A 0336 C707 0AA6

10 years ago by Nikita Popov — view source — reply

unread

Hi,

On Tue, Feb 17, 2015 at 2:46 PM, Alexander Lisachenko <
lisachenko.it@gmail.com> wrote:

Hello, internals!

I want to introduce a RFC for providing a userland API for accessing an
Abstract Syntax Tree of the source code and to provide userland parser
hooks for source code modification:
https://wiki.php.net/rfc/parser-extension-api

Thanks!

The first part, describing https://github.com/nikic/php-ast , looks
fine.
I see no problems including this extension into PHP-7.0 core
distribution.
May be even making it required (like ext/reflection).

Nikita, what do you think?

The second part looks very interesting, however it has some uncovered
questions.

API for AST modification

AST validation (someone may insert "break" node in "parameter-list").

If you have enough experience, I would suggest you to try writing an
external extension that would implement this idea.
If you'll need some modification in PHP core (e.g adding callbacks), we
may consider including them in PHP-7.0.

Thanks. Dmitry.

I agree with Dmitry.

Exporting the AST to userland in PHP 7.0 would be nice. With Dmitry's
work
on assertions we even have a pretty printer for it, which allows us to
convert the AST back to PHP code.

In this matter, how would it be? It'd be awesome if function expects an
AST tree object instead of a value:
function doQuery($table, *$where) {
   $where = convert_ast_toSqlWhere($where);
}

doQuery("foobar", $col1 = "something" AND $col2 == $col3);
Or at least ast(<expr>), although both would be nice to have.

The problem here is that in most cases we do not actually know what
function will be called at compile time. A very simple example would be to
replace doQuery() with $db->query() and already we don't know what it is
that we're actually calling and whether or not it needs an AST. However
this does not apply to just methods, it's an issue with nearly all calls.

As such PHP cannot know at compile time whether it should issue a normal
call with evaluated arguments or whether it needs to preserve the AST and
pass that. Just keeping the AST around for all calls would be too expensive
(it's a pretty big memory hog).

So rather than having this as a modifier in the function signature, it
would have to be a modifier in the call syntax. E.g. rust uses foo!()
syntax for macros.

Nikita

10 years ago by Jared Williams — view source — reply

unread

Hi,

On Tue, Feb 17, 2015 at 2:46 PM, Alexander Lisachenko <
lisachenko.it@gmail.com> wrote:

Hello, internals!

I want to introduce a RFC for providing a userland API for accessing an
Abstract Syntax Tree of the source code and to provide userland parser
hooks for source code modification:
https://wiki.php.net/rfc/parser-extension-api

Thanks!

The first part, describing https://github.com/nikic/php-ast , looks
fine.
I see no problems including this extension into PHP-7.0 core
distribution.
May be even making it required (like ext/reflection).

Nikita, what do you think?

The second part looks very interesting, however it has some uncovered
questions.

API for AST modification

AST validation (someone may insert "break" node in "parameter-list").

If you have enough experience, I would suggest you to try writing an
external extension that would implement this idea.
If you'll need some modification in PHP core (e.g adding callbacks), we
may consider including them in PHP-7.0.

Thanks. Dmitry.

I agree with Dmitry.

Exporting the AST to userland in PHP 7.0 would be nice. With Dmitry's
work
on assertions we even have a pretty printer for it, which allows us to
convert the AST back to PHP code.

In this matter, how would it be? It'd be awesome if function expects an
AST tree object instead of a value:
function doQuery($table, *$where) {
   $where = convert_ast_toSqlWhere($where);
}

doQuery("foobar", $col1 = "something" AND $col2 == $col3);
Or at least ast(<expr>), although both would be nice to have.
The problem here is that in most cases we do not actually know what
function will be called at compile time. A very simple example would be to
replace doQuery() with $db->query() and already we don't know what it is
that we're actually calling and whether or not it needs an AST. However
this does not apply to just methods, it's an issue with nearly all calls.

As such PHP cannot know at compile time whether it should issue a normal
call with evaluated arguments or whether it needs to preserve the AST and
pass that. Just keeping the AST around for all calls would be too expensive
(it's a pretty big memory hog).

So rather than having this as a modifier in the function signature, it
would have to be a modifier in the call syntax. E.g. rust uses foo!()
syntax for macros.

Nikita

Wouldn't the trick be to have $col1 be an object instance that uses
operator overloading? So instead of solving the expression, operators
would return the AST.

Was possible with the operator extension

Some old experimental code from 2006

echo query($connection,
array($catalog->Person->id, $catalog->Titles->title. ' ' .
$catalog->Person->name),
new SqlJoin(SqlJoin::TYPE_INNER,
$catalog->Person,
$catalog->Titles,
$catalog->Person->titleId == $catalog->Titles->id),
$catalog->Person->id == 1);

outputs

SELECT Person.id, Titles.title || ' ' || Person.name FROM Person INNER
JOIN Titles ON Person.titleId = Titles.id WHERE Person.id = 1

Jared

10 years ago by Alexander Lisachenko — view source — reply

unread

2015-02-18 17:59 GMT+03:00 Nikita Popov nikita.ppv@gmail.com:

Alexander, I would recommend you to split this into two RFCs, one dealing
only with AST export (and maybe pretty printing) and the second one with
the compilation hooks. There's probably a few questions about the export
API that should be discussed that will be forgotten if everyone focuses on
the more complicated, and probably not yet relevant, question of
compilation hooks.

Hello, Nikita! Thanks for you thoughts!

Sounds reasonable for me, because of short timeframe for PHP7. Let's go
with the first part of RFC, because second part is really cumbersome and
can be applied later, eg. in 7.1 I will split this RFC into two parts and
mark first part for 7.0.

About first part, do you agree with proposed classes and namespace
Php\Parser? I want to propose to use classes for parser API instead of
pure functions because it can give more usable OO-API for future needs.

There is one more notice regarding to the zval nodes, they should be also
exported as objects, not as values. I think we should export coherent nodes
for everything.

10 years ago by Nikita Popov — view source — reply

unread

On Wed, Feb 18, 2015 at 4:22 PM, Alexander Lisachenko <
lisachenko.it@gmail.com> wrote:

2015-02-18 17:59 GMT+03:00 Nikita Popov nikita.ppv@gmail.com:

Alexander, I would recommend you to split this into two RFCs, one dealing
only with AST export (and maybe pretty printing) and the second one with
the compilation hooks. There's probably a few questions about the export
API that should be discussed that will be forgotten if everyone focuses on
the more complicated, and probably not yet relevant, question of
compilation hooks.

Hello, Nikita! Thanks for you thoughts!

Sounds reasonable for me, because of short timeframe for PHP7. Let's go
with the first part of RFC, because second part is really cumbersome and
can be applied later, eg. in 7.1 I will split this RFC into two parts and
mark first part for 7.0.

Great!

About first part, do you agree with proposed classes and namespace

Php\Parser?

Yeah, I'm okay with that namespace. I prefer php\ast as being more
distinctive from existing projects, but that's not really important.

I want to propose to use classes for parser API instead of pure functions
because it can give more usable OO-API for future needs.

I'm okay with having stuff like ->getKindName() on the nodes, however I'd
still keep around the free-standing functions, because they can be used
without a node (i.e. you only need the kind AST_FOO and not an instantiated
node).

I also don't see why we need ParserEngine::parse() instead of just a
(namespaced) parse() function. Not a fan of unnecessary wrapper classes.

There is one more notice regarding to the zval nodes, they should be also
exported as objects, not as values. I think we should export coherent nodes
for everything.

What's the advantage of this? Most leaf nodes will be zval nodes and I feel
it would be rather inconvenient if they'd all have extra wrappers. Is there
some specific problem with including the values directly?

Nikita

10 years ago by Dmitry Stogov — view source — reply

unread

I think the AST API shouldn't use "public" properties.
Using it, we will have to construct the whole tree of objects, duplicating
information from AST.
I would propose SimpleXML approach instead - construct object only for
node(s) we currently access.

So at first we will construct just single object referring to AST root.
Then traversing it we will create and destroy objects for necessary nodes.

To access children I would propose to implementing ArrayAccess interface.

$ast = \php\ast\parse($string);
foreach ($ast as $child) {
echo "\t" . $child->getKindName() . "\n";
foreach ($child as $grandchild) {
echo "\t" . $child->getKindName() . "\n";

}
}

Thanks. Dmitry.

On Wed, Feb 18, 2015 at 4:22 PM, Alexander Lisachenko <
lisachenko.it@gmail.com> wrote:

2015-02-18 17:59 GMT+03:00 Nikita Popov nikita.ppv@gmail.com:

Alexander, I would recommend you to split this into two RFCs, one
dealing only with AST export (and maybe pretty printing) and the second one
with the compilation hooks. There's probably a few questions about the
export API that should be discussed that will be forgotten if everyone
focuses on the more complicated, and probably not yet relevant, question of
compilation hooks.

Hello, Nikita! Thanks for you thoughts!

Sounds reasonable for me, because of short timeframe for PHP7. Let's go
with the first part of RFC, because second part is really cumbersome and
can be applied later, eg. in 7.1 I will split this RFC into two parts and
mark first part for 7.0.

Great!

About first part, do you agree with proposed classes and namespace

Php\Parser?

Yeah, I'm okay with that namespace. I prefer php\ast as being more
distinctive from existing projects, but that's not really important.

I want to propose to use classes for parser API instead of pure functions
because it can give more usable OO-API for future needs.

I'm okay with having stuff like ->getKindName() on the nodes, however I'd
still keep around the free-standing functions, because they can be used
without a node (i.e. you only need the kind AST_FOO and not an instantiated
node).

I also don't see why we need ParserEngine::parse() instead of just a
(namespaced) parse() function. Not a fan of unnecessary wrapper classes.

There is one more notice regarding to the zval nodes, they should be also
exported as objects, not as values. I think we should export coherent nodes
for everything.

What's the advantage of this? Most leaf nodes will be zval nodes and I
feel it would be rather inconvenient if they'd all have extra wrappers. Is
there some specific problem with including the values directly?

Nikita

10 years ago by Alexander Lisachenko — view source — reply

unread

2015-02-18 22:22 GMT+03:00 Dmitry Stogov dmitry@zend.com:

I think the AST API shouldn't use "public" properties.
Using it, we will have to construct the whole tree of objects, duplicating
information from AST.
I would propose SimpleXML approach instead - construct object only for
node(s) we currently access.

What about Generators? They can be used nicely for handling AST data. And
each node can be constructed as needed during the traversal. Probably this
way is better than ArrayAccess

10 years ago by Alexander Lisachenko — view source — reply

unread

I want to attach a link to the Python AST documentation:
https://docs.python.org/3.2/library/ast.html
What I like there is iteration of nodes with generators (yields).
NodeVisitor that can be extended and used as coroutine to simplify logic of
traversal.

I want to keep Node class as simple DTO with public properties, but only
children field will be a generator.

What do you think?

2015-02-18 22:34 GMT+03:00 Alexander Lisachenko lisachenko.it@gmail.com:

2015-02-18 22:22 GMT+03:00 Dmitry Stogov dmitry@zend.com:

I think the AST API shouldn't use "public" properties.
Using it, we will have to construct the whole tree of objects,
duplicating information from AST.
I would propose SimpleXML approach instead - construct object only for
node(s) we currently access.

What about Generators? They can be used nicely for handling AST data. And
each node can be constructed as needed during the traversal. Probably this
way is better than ArrayAccess

10 years ago by Dmitry Stogov — view source — reply

unread

On Wed, Feb 18, 2015 at 10:34 PM, Alexander Lisachenko <
lisachenko.it@gmail.com> wrote:

2015-02-18 22:22 GMT+03:00 Dmitry Stogov dmitry@zend.com:

I think the AST API shouldn't use "public" properties.
Using it, we will have to construct the whole tree of objects,
duplicating information from AST.
I would propose SimpleXML approach instead - construct object only for
node(s) we currently access.

What about Generators? They can be used nicely for handling AST data. And
each node can be constructed as needed during the traversal. Probably this
way is better than ArrayAccess

No. Generators allows only sequential access, while ArrayAccess - random.

Thanks. Dmitry.

10 years ago by Nikita Popov — view source — reply

unread

I think the AST API shouldn't use "public" properties.
Using it, we will have to construct the whole tree of objects, duplicating
information from AST.
I would propose SimpleXML approach instead - construct object only for
node(s) we currently access.

So at first we will construct just single object referring to AST root.
Then traversing it we will create and destroy objects for necessary nodes.

To access children I would propose to implementing ArrayAccess interface.

$ast = \php\ast\parse($string);
foreach ($ast as $child) {
echo "\t" . $child->getKindName() . "\n";
foreach ($child as $grandchild) {
echo "\t" . $child->getKindName() . "\n";

}
}

Thanks. Dmitry.

I've considered using this approach, but decided against it because it
introduces a lot of "magic" for unclear gains. Lazily creating the objects
means we have to keep around the entire AST arena while at least one node
referencing it is alive. So if you're analyzing a large project with a few
thousand files and keep around a few nodes in some cases, you end up not
just with the memory usage of those nodes, but with the memory usage of the
entire ASTs. Furthermore in many cases you'll just traverse the entire AST,
in which case you'll need to instantiate all nodes anyway and the lazy
instantiation will only hurt.

In any case, performance of constructing a full AST is pretty good - I
don't remember the exact numbers, but ast\parse_code() is only slightly
slower than doing token_get_all() on the same code.

Nikita

10 years ago by Dmitry Stogov — view source — reply

unread

I think the AST API shouldn't use "public" properties.
Using it, we will have to construct the whole tree of objects,
duplicating information from AST.
I would propose SimpleXML approach instead - construct object only for
node(s) we currently access.

So at first we will construct just single object referring to AST root.
Then traversing it we will create and destroy objects for necessary nodes.

To access children I would propose to implementing ArrayAccess interface.

$ast = \php\ast\parse($string);
foreach ($ast as $child) {
echo "\t" . $child->getKindName() . "\n";
foreach ($child as $grandchild) {
echo "\t" . $child->getKindName() . "\n";

}
}

Thanks. Dmitry.

I've considered using this approach, but decided against it because it
introduces a lot of "magic" for unclear gains. Lazily creating the objects
means we have to keep around the entire AST arena while at least one node
referencing it is alive. So if you're analyzing a large project with a few
thousand files and keep around a few nodes in some cases, you end up not
just with the memory usage of those nodes, but with the memory usage of the
entire ASTs. Furthermore in many cases you'll just traverse the entire AST,
in which case you'll need to instantiate all nodes anyway and the lazy
instantiation will only hurt.

In any case, performance of constructing a full AST is pretty good - I
don't remember the exact numbers, but ast\parse_code() is only slightly
slower than doing token_get_all() on the same code.

I think the whole memory usage should be less, because at each point you'll
most probably keep only few AST objects, but I understood your arguments.
They may make sense, but it's hard to say what is better without comparison
of implementations.

Thanks. Dmitry.

Nikita

10 years ago by Sara Golemon — view source — reply

unread

I think the AST API shouldn't use "public" properties.
Using it, we will have to construct the whole tree of objects, duplicating
information from AST.
I would propose SimpleXML approach instead - construct object only for
node(s) we currently access.

So at first we will construct just single object referring to AST root.
Then traversing it we will create and destroy objects for necessary nodes.

I'm not sure if you've seen my astkit extension, but it does this.

https://github.com/sgolemon/astkit

-Sara

10 years ago by Dmitry Stogov — view source — reply

unread

I think the AST API shouldn't use "public" properties.
Using it, we will have to construct the whole tree of objects,
duplicating
information from AST.
I would propose SimpleXML approach instead - construct object only for
node(s) we currently access.

So at first we will construct just single object referring to AST root.
Then traversing it we will create and destroy objects for necessary
nodes.

I'm not sure if you've seen my astkit extension, but it does this.

https://github.com/sgolemon/astkit

I didn't see it before, and took just a quick look, but I like it.
Any special reason, why you didn't implement ArrayAccess interface for
children access?
For me it would look natural.

Thanks. Dmitry.

-Sara

10 years ago by Alexander Lisachenko — view source — reply

unread

Good morning!

I have cleaned https://wiki.php.net/rfc/parser-extension-api and restricted
it's scope only to the parsing API. Extension API can be implemented later
on top of
https://github.com/php/php-src/commit/1010b0ea4f4b9f96ae744f04c1191ac228580e48
and current implementation, because it requires a lot of discussion and can
not be implemented right now.

I want to move this RFC to the vote phase soon to decide if it will be
included into PHP7.0 or not.

So let's discuss open questions before moving it to the vote phase (from SO
chat)

Should each node type be represented as personal class?
There are two possible ways: single node class for everything (current
proposal) and separate class for every node kind. I have made a quick
research of AST API in several languages and noticed, that most of them
generates AST class nodes automatically. E.g. UnaryOperationNode,
StatementNode... This way is cool, but it requires a lot of classes to be
loaded and available as API. Pros: SRP of each node class, AST validators
(can be described with XML and DTD), more clear code analysis (checks with
$node instaceof StatementNode), typehints for concrete nodes for visitors.
However, PHP is dynamic language and we can use only one class for
everything, adjusting kind property for each node. Pros: single class,
easier to maintain and to use. Cons: bad UX (how to validate nodes, how to
determine number of available children, how to check is node using flags or
not, etc)

Where metadata should be stored (flags, names of kind nodes, relation
between node types)? This information will be needed later for validation
of AST

Nikita have some thoughts for the future :) So he asked about the storage
of metadata to validate an AST and to perform some analysis on it. Metadata
should include the following: name of each node kind (this can be just a
class name of node or constants in the class), node restrictions (which
kind of node types can be used as children for concrete node; number of
children nodes), node flag names (it's PUBLIC, PROTECTED, PRIVATE, etc)

2015-02-19 10:22 GMT+03:00 Dmitry Stogov dmitry@zend.com:

I think the AST API shouldn't use "public" properties.
Using it, we will have to construct the whole tree of objects,
duplicating
information from AST.
I would propose SimpleXML approach instead - construct object only for
node(s) we currently access.

So at first we will construct just single object referring to AST root.
Then traversing it we will create and destroy objects for necessary
nodes.

I'm not sure if you've seen my astkit extension, but it does this.

https://github.com/sgolemon/astkit

I didn't see it before, and took just a quick look, but I like it.
Any special reason, why you didn't implement ArrayAccess interface for
children access?
For me it would look natural.

Thanks. Dmitry.

-Sara

10 years ago by Leigh — view source — reply

unread

Good morning!

I have cleaned https://wiki.php.net/rfc/parser-extension-api and restricted
it's scope only to the parsing API. Extension API can be implemented later

+1

on top of
https://github.com/php/php-src/commit/1010b0ea4f4b9f96ae744f04c1191ac228580e48
and current implementation, because it requires a lot of discussion and can
not be implemented right now.

I had no idea that zend_ast_process was such a recent addition, and
part of your proposal. I've actually started using it already
completely independently in one of my extensions!

Should each node type be represented as personal class?
There are two possible ways: single node class for everything (current
proposal) and separate class for every node kind. I have made a quick
research of AST API in several languages and noticed, that most of them
generates AST class nodes automatically. E.g. UnaryOperationNode,
StatementNode... This way is cool, but it requires a lot of classes to be
loaded and available as API. Pros: SRP of each node class, AST validators
(can be described with XML and DTD), more clear code analysis (checks with
$node instaceof StatementNode), typehints for concrete nodes for visitors.
However, PHP is dynamic language and we can use only one class for
everything, adjusting kind property for each node. Pros: single class,
easier to maintain and to use. Cons: bad UX (how to validate nodes, how to
determine number of available children, how to check is node using flags or
not, etc)

I think we need to at least represent all of the current node structures.

A common base class, and then classes to represent lists, zvals and
decls that extend from this base.

Where metadata should be stored (flags, names of kind nodes, relation
between node types)? This information will be needed later for validation
of AST

Nikita have some thoughts for the future :) So he asked about the storage
of metadata to validate an AST and to perform some analysis on it. Metadata
should include the following: name of each node kind (this can be just a
class name of node or constants in the class), node restrictions (which
kind of node types can be used as children for concrete node; number of
children nodes), node flag names (it's PUBLIC, PROTECTED, PRIVATE, etc)

Thinking for the future is fine, but do we need this metadata for the
current proposal? Is the AST returned by the parser in a read-only
state, or can users create their own nodes + children and get a pretty
printed output? If it's the latter then we obviously need to know the
restrictions.

I think we need a mechanism that keeps names/numbers in sync
automatically, maybe we can use some macros to automatically generate
enums and userland facing details at the same time, so we don't have
to keep several places in sync if/when new AST nodes are added.

9 years ago by Alexander Lisachenko — view source — reply

unread

Hello, internals!

I'd like to bring this topic to the discussion again, because now we have a
shiny PHP7 engine and keep moving. PHP grammar becomes more complex and
internal AST representation isn't available for the userland developers, so
all static analysis tools suffering from the lack of native AST API for
that. Only possible way for that is to perform tokenization of source code
and then manually reconstruct an AST (thanks to the PHP-Parser and Nikita
for doing this job for us, developers)

Several days ago, Dmitry published a RFC for native attributes, which can
be a great tool for building more complex stuff on top of this metadata.
However, all attribute expressions will be stored as an AST nodes, so we
need an AST API again to analyse, parse or compile AST back into the source
code for evaling, etc.

It would be nice to push php-ast extension (or similar one) into the core,
providing an API via static class, for example "Php\Parser".

2015-03-03 19:12 GMT+03:00 Leigh leight@gmail.com:

On 3 March 2015 at 11:56, Alexander Lisachenko lisachenko.it@gmail.com
wrote:

Good morning!

I have cleaned https://wiki.php.net/rfc/parser-extension-api and
restricted
it's scope only to the parsing API. Extension API can be implemented
later

+1

on top of

https://github.com/php/php-src/commit/1010b0ea4f4b9f96ae744f04c1191ac228580e48

and current implementation, because it requires a lot of discussion and
can
not be implemented right now.

I had no idea that zend_ast_process was such a recent addition, and
part of your proposal. I've actually started using it already
completely independently in one of my extensions!

Should each node type be represented as personal class?
There are two possible ways: single node class for everything (current
proposal) and separate class for every node kind. I have made a quick
research of AST API in several languages and noticed, that most of them
generates AST class nodes automatically. E.g. UnaryOperationNode,
StatementNode... This way is cool, but it requires a lot of classes to be
loaded and available as API. Pros: SRP of each node class, AST validators
(can be described with XML and DTD), more clear code analysis (checks
with
$node instaceof StatementNode), typehints for concrete nodes for
visitors.
However, PHP is dynamic language and we can use only one class for
everything, adjusting kind property for each node. Pros: single class,
easier to maintain and to use. Cons: bad UX (how to validate nodes, how
to
determine number of available children, how to check is node using flags
or
not, etc)

I think we need to at least represent all of the current node structures.

A common base class, and then classes to represent lists, zvals and
decls that extend from this base.

Where metadata should be stored (flags, names of kind nodes, relation
between node types)? This information will be needed later for validation
of AST

Nikita have some thoughts for the future :) So he asked about the storage
of metadata to validate an AST and to perform some analysis on it.
Metadata
should include the following: name of each node kind (this can be just a
class name of node or constants in the class), node restrictions (which
kind of node types can be used as children for concrete node; number of
children nodes), node flag names (it's PUBLIC, PROTECTED, PRIVATE, etc)

Thinking for the future is fine, but do we need this metadata for the
current proposal? Is the AST returned by the parser in a read-only
state, or can users create their own nodes + children and get a pretty
printed output? If it's the latter then we obviously need to know the
restrictions.

I think we need a mechanism that keeps names/numbers in sync
automatically, maybe we can use some macros to automatically generate
enums and userland facing details at the same time, so we don't have
to keep several places in sync if/when new AST nodes are added.

9 years ago by Nikita Popov — view source — reply

unread

On Tue, Apr 26, 2016 at 9:42 AM, Alexander Lisachenko <
lisachenko.it@gmail.com> wrote:

Hello, internals!

I'd like to bring this topic to the discussion again, because now we have
a shiny PHP7 engine and keep moving. PHP grammar becomes more complex and
internal AST representation isn't available for the userland developers, so
all static analysis tools suffering from the lack of native AST API for
that. Only possible way for that is to perform tokenization of source code
and then manually reconstruct an AST (thanks to the PHP-Parser and Nikita
for doing this job for us, developers)

Several days ago, Dmitry published a RFC for native attributes, which can
be a great tool for building more complex stuff on top of this metadata.
However, all attribute expressions will be stored as an AST nodes, so we
need an AST API again to analyse, parse or compile AST back into the source
code for evaling, etc.

It would be nice to push php-ast extension (or similar one) into the core,
providing an API via static class, for example "Php\Parser".

As an update here, I plan to create an RFC for bundling the php-ast
extension (maybe with minor modifications) with php-src. This was planned
for 7.1 anyway, and with the annotations RFC under discussion, this seems
like a good time.

However, I will limit this RFC to the current state of the extension, i.e.
I will not include the ability to inject the AST back into the PHP compiler
-- as already mentioned this is very complicated (due to validation
concerns) and should be left for future scope. Some concerns like pretty
printing (compiling AST back to PHP code) will also be left for userland
(e.g. tpunt has a library for this), as doing this in core (on the exported
AST) is unnecessarily complicated, fragile and not sufficiently flexible
(e.g. with regard to formatting).

I'll try to post an RFC for review this week.

Regards,
Nikita

9 years ago by Alexander Lisachenko — view source — reply

unread

2016-04-26 11:15 GMT+03:00 Nikita Popov nikita.ppv@gmail.com:

As an update here, I plan to create an RFC for bundling the php-ast
extension (maybe with minor modifications) with php-src. This was planned
for 7.1 anyway, and with the annotations RFC under discussion, this seems
like a good time.

...

I'll try to post an RFC for review this week.

Hello, Nikita!

It's a great news! Having this stuff in the 7.1 core is awesome!

But I have several petitions for this API:

Could it avoid using of raw functions for parsing API, like
ast\parse_file or ast\parse_code
Could you please put all the parsing stuff into the Parser class and
put it into the Php\ namespace, as well with node classes?

9 years ago by Fleshgrinder — view source — reply

unread

Hello, Nikita!

It's a great news! Having this stuff in the 7.1 core is awesome!

But I have several petitions for this API:

Could it avoid using of raw functions for parsing API, like
ast\parse_file or ast\parse_code

Could you please put all the parsing stuff into the Parser class and
put it into the Php\ namespace, as well with node classes?

The current coding standard still prescribes to do so. However, these
requests keep popping up and I think that this should be discussed at
some point. However, putting everything into a static class just to
mimic some OO that you might like makes absolutely no sense. We have
multi-paradigm support for a reason: best tool for the job?

That being said, the current implementation violates more rules of the
coding standard (camelCase vs. snake_case).

Currently we have the following:

namespace ast {

parse_file()

parse_code()

get_kind_name()

kind_uses_flags()

class Node {
  public $kind;
  public $flags;
  public $lineno;
  public $children;
}

}

namespace ast\Node {

class Decl extends ast\Node {
  public $endLineno;
  public $name;
  public $docComment;
}

}

From a pure coding standards view and current conventions in core at
least the following has to change:

ast_parse_file()

ast_parse_code()

ast_get_kind_name() // ast_kind_name?

ast_kind_uses_flags() // ast_kind_has_flags?

class AstNode {
public $kind;
public $flags;
public $lineno;
public $children;
}

class AstDecl extends AstNode {
public $end_lineno;
public $name;
public $doc_comment;
}

Of course usage of the \Php\Ast or \PHP\AST or \Php\AST or \PHP\Ast
namespaces is possible but you already see from the four variations that
this opens a huge can of worms that would require discussion.

Personally I would design the API as follows, given that it is not a
goal to have an anemic model in order to support direct changes of the
public properties without any kind of validation. If that is an intended
functionality things might need to be implemented a bit differently (and
the anemic DTO approach could even make sense). Also note that I am
modelling it in an OO way because it fits the actual data structure (tree).

enum AstKind {

/*API = [ ORDINAL, NAME ]*/

FUNCTION = [ 66, 'AST_FUNC_DECL' ];

CLASS = [ 67, 'AST_CLASS' ];

// ...

/** @see \AstKind::getName() */
public function __toString(): string;

/** bitfield */
public function getFlags(): int;

public function getName(): string;

public function getOrdinal(): int;

public function hasFlags(): bool;

}

final class AstNode implements Countable, Traversable {

private AstKind $kind;

private int $line_number;

private array<AstNode> $children = [];

public function getKind(): AstKind;

public function getLineNumber(): int;

}

final class AstDeclarationNode extends AstNode {

private string $doc_comment;

private int $end_line_number;

private string $name;

/** @see \AstDeclarationNode::getName() */
public function __toString(): string;

public function getDocComment(): string;

public function getEndLineNumber(): int;

public function getName(): string;

}

final class Ast implements Countable, Traversable {

private AstNode $root;

private function __construct();

public static function fromFile(string $path): Ast;

public static function fromCode(
  string $code,
  string $filename = 'string code'
): Ast;

/** @see \Ast::getDump() */
public function __toString(): string;

public function getDump(bool $line_numbers = false): string;

}

I do not think that it is necessary to use the exact same constant names
as are used in the C source (e.g. AST_FUNC_DECL vs. AST_CLASS). The
userland API could be more consistent here. You might disagree thought
but in the end only the ordinal must be an absolute perfect match.

--
Richard "Fleshgrinder" Fussenegger

10 years ago by Alexander Lisachenko — view source — reply

unread

I'm okay with having stuff like ->getKindName() on the nodes, however I'd
still keep around the free-standing functions, because they can be used
without a node (i.e. you only need the kind AST_FOO and not an instantiated
node).

I think that getting name of kind is useless without value of kind.
According to the RFC, there is only one way from user land to get a kind of
node: ask a parser for an AST, that will contain nodes. Is there any reason
to do getKindName(1), where 1 - is just integer value. I can't see how it
can be useful, this information is fully related to the node.

I also don't see why we need ParserEngine::parse() instead of just a

(namespaced) parse() function. Not a fan of unnecessary wrapper classes.

I think that PHP is ugly enough :) We should try to use namespaces and
classes with methods to provide a better experience for user land
developers. Classes allow for logical composition of common functionality,
provides autocompletion in IDE and can be considered as step to
standardization of core API. (Today I discussed this question with
colleagues that will be great to have all system classes in Php namespace
for PHP>=7.0)

I know, that on engine level, classes aren't native, but you should think
about how this feature will be used by developers. Because, creation of API
is one-time process, but imagine how many times it will be used by
developers. I think, it will be more polite to provide OO-code for modern
API.

What's the advantage of this? Most leaf nodes will be zval nodes and I feel

it would be rather inconvenient if they'd all have extra wrappers. Is there
some specific problem with including the values directly?

Yes, reason here is the same as for token_get_all(). I hear so many WTFs
when value can be either simple character or array. Much harder to work
with inconsistent data and many libraries performs additional steps to
provide better usability (see
https://github.com/Andrewsville/PHP-Token-Reflection/blob/1c5ff91ee6877d1fb0f60fab3e3fc562bdb02684/TokenReflection/Stream/StreamBase.php#L100-L108
or
https://github.com/lisachenko/go-aop-php/blob/3c381e243367096d47dc9676b3ad7445d7720ba9/src/Instrument/Transformer/MagicConstantTransformer.php#L117
)

So, each element should be represented as common node, this will simplify
usage of API.