Moving to an AST-based parsing/compilation process

12 years ago by Sean Coates — view source

Some people asked me what the advantages of using an AST-based
parsing/compilation process are, so I put together a few quick notes
in an RFC:

https://wiki.php.net/rfc/ast_based_parsing_compilation_process

It would be nice to get a few comments from other core devs on this.

Pardon my obviously amateur question, but would you build an AST-based compiler/parser to generate the same (minus the ones you intend to eliminate) opcodes to run on the VM in the same way as the current compiler does?

Would tools like XDebug, APC, Zend PHP encoder (or whatever that's called this week), etc. be compatible out of the box, or would changes need to be made to them?

Also, here is a potential starting point if you're less allergic to OCaml than I am: https://github.com/facebook/pfff

S

12 years ago by Andrew Faulds — view source

Pardon my obviously amateur question, but would you build an AST-based compiler/parser to generate the same (minus the ones you intend to eliminate) opcodes to run on the VM in the same way as the current compiler does?
Sure. We're changing the route we go down to produce those opcodes, but
not the opcodes themselves.
Would tools like XDebug, APC, Zend PHP encoder (or whatever that's called this week), etc. be compatible out of the box, or would changes need to be made to them?
Compatible, I think. I don't know whether some of them might need changes
in the process, but I would expect no big changes.

--
Andrew Faulds
http://ajf.me/

12 years ago by Adam Jon Richardson — view source

Hey folks!

Some people asked me what the advantages of using an AST-based
parsing/compilation process are, so I put together a few quick notes
in an RFC:

https://wiki.php.net/rfc/ast_based_parsing_compilation_process

It would be nice to get a few comments from other core devs on this.

I'm not a core developer, but I do have some concerns about this type
of approach:

As noted in the RFC, most languages do use an Abstract Syntax Tree
(AST), however, as is also noted in the RFC, PHP opcodes are
regenerated by request, which makes PHP very unique amongst languages,
so there is perhaps a reason to be different here.

The disadvantages of the AST approach are noted as the potential for
increased resource requirements. When viewed in the RFC, the brevity
of the section and the visual weight of its contents perhaps
understate just how big a deal this could be.

PHP as a web technology runs on a myriad of servers and processes a
huge number of requests every second across the world. Adding even a
couple of cycles to every request is a very big deal in the scheme of
things, especially when we live in an age where many other industries
are making great efforts to reduce the resources required for
goods/services.

There was some mention of caching to alleviate the potential issues,
and this could bring the cost down, perhaps even saving cycles in the
long run. Or, perhaps some brilliant work on the processing could
yield significant resource savings compared to the single-pass
approach.

My point is that I'm all for improving the PHP internals so as to
facilitate future work on the core. However, these considerations must
be carefully weighed against the resource footprint PHP now has, and
the hope of continuing to make reasonable strides to reduce that
footprint. Asking a few core developers to use more resources to
handle hacks, quirks, and decoupling technical issues CAN be the
preferred alternative if there are real savings in server resources
used worldwide.

That's not to say this area of work should be avoided. Rather, I am
saying that I hope any work in this area would give the potential for
additional resource usage very serious consideration.

Adam

12 years ago by Andrew Faulds — view source

I'm not a core developer, but I do have some concerns about this type
of approach:

As noted in the RFC, most languages do use an Abstract Syntax Tree
(AST), however, as is also noted in the RFC, PHP opcodes are
regenerated by request, which makes PHP very unique amongst languages,
so there is perhaps a reason to be different here.
Python also regenerates opcodes on every request if you run it under CGI.
However, it caches them, just as PHP will be able to once APC is mature
enough to become the default.

The disadvantages of the AST approach are noted as the potential for
increased resource requirements. When viewed in the RFC, the brevity
of the section and the visual weight of its contents perhaps
understates just how much of a big deal this could be.

PHP as a web technology is run on a myriad of servers and processes a
huge amount of requests every second across the world. Adding even a
couple cycles to every request is a very big deal in the scheme of
things, especially when we live in an age where many other industries
are making great efforts to reduce resources required for
goods/services.
APC will make things faster, though, you're missing that. And
optimisations, which an AST would help, would make it even faster.

There was some mention of caching to alleviate the potential issues,
and this could bring the cost down, perhaps even saving cycles in the
long run. Or, perhaps some brilliant work on the processing could
yield significant resource savings compared to the single-pass
approach.

My point is that I'm all for improving the PHP internals so as to
facilitate future work on the core. However, these considerations must
be carefully weighed against the resource footprint PHP now has, and
the hope of continuing to make reasonable strides to reduce that
footprint. Asking a few core developers to use more resources to
handle hacks, quirks, and decoupling technical issues CAN be the
preferred alternative if there are real savings in server resources
used worldwide.

That's not to say this area of work should be avoided. Rather, I am
saying that I hope any work in this area would give the potential for
additional resource usage very serious consideration.

Adam

--
Andrew Faulds
http://ajf.me/

12 years ago by Adam Jon Richardson — view source

APC will make things faster, though, you're missing that. And optimisations,
which an AST would help, would make it even faster.

Respectfully, I didn't miss that, and I alluded to that potential in
my response (did you read all of my response?). As should be obvious
from my post, if this approach led to increased performance, I would
be all for it.

However, as Knuth has said, "It is often a mistake to make a priori
judgments about what parts of a program are really critical, since the
universal experience of programmers who have been using measurement
tools has been that their intuitive guesses fail."

So, I'm cautious about the premise that this will/could lead to an
overall improvement in performance compared to the current
implementation, a caution that the RFC spoke to very clearly.

My point is that, if there is an increase in the resources required,
that is a really big deal.

Adam

12 years ago by Levi Morrison — view source

However, as Knuth has said, "It is often a mistake to make a priori
judgments about what parts of a program are really critical, since the
universal experience of programmers who have been using measurement
tools has been that their intuitive guesses fail."

So, I'm cautious about the premise that this will/could lead to an
overall improvement in performance compared to the current
implementation, a caution that the RFC spoke to very clearly.

This is probably why the section in the RFC is so small . . . :)

It seems to me to be a very wise decision to use an AST but good
decisions carried out poorly can be more harmful than helpful. We need
to do this correctly. I hope this discussion will be helpful in this
regard.

12 years ago by Adam Jon Richardson — view source

This is probably why the section in the RFC is so small . . . :)

The section covering the potential for potential optimizations isn't so small :)

12 years ago by Adam Jon Richardson — view source

This is probably why the section in the RFC is so small . . . :)

The section covering the potential for potential optimizations isn't so small :)

And, apparently, "potential" was the focal point of my response
(sometimes I type faster than I can proof, sorry :)

12 years ago by Ivan Enderlin @ Hoa — view source

Hi,

Hey folks!

Some people asked me what the advantages of using an AST-based
parsing/compilation process are, so I put together a few quick notes
in an RFC:

https://wiki.php.net/rfc/ast_based_parsing_compilation_process

It would be nice to get a few comments from other core devs on this.
From my personal point of view, I see more advantages than
disadvantages in using an AST.

It is a more consistent and clever way to manipulate a language. We can
have more optimization passes, algorithms and heuristics. We have
already spoken about consistency in the PHP syntax, but it will also be
easier for contributors to suggest new patches concerning syntax
and syntactic sugar. An AST will also facilitate opcode caching (even a
trivial approach at first would be promising, I think), and again, it
will encourage contributors to propose new patches for that. Moreover,
there are a lot of well-known algorithms and heuristics in the
literature to compensate for memory overhead. I am thinking of lazy
compilation, which avoids building the whole AST.
Even if the first impression is that an AST brings some issues, it
could very quickly become a benefit for PHP in terms of code
consistency, performance and memory usage.

On the other hand, it requires a lot of work. I can help if needed.

Cheers.

--
Ivan Enderlin
Developer of Hoa
http://hoa.42/ or http://hoa-project.net/

PhD. student at DISC/Femto-ST (Vesontio) and INRIA (Cassis)
http://disc.univ-fcomte.fr/ and http://www.inria.fr/

Member of HTML and WebApps Working Group of W3C
http://w3.org/

12 years ago by Morgan L. Owens — view source

Hey folks!

Some people asked me what the advantages of using an AST-based
parsing/compilation process are, so I put together a few quick notes
in an RFC:

https://wiki.php.net/rfc/ast_based_parsing_compilation_process

It would be nice to get a few comments from other core devs on this.

Nikita

I'm not a core dev, but I would like to add to the notes above that
"third parties", such as myself, who want to do things with PHP source
other than run it through a PHP interpreter would also appreciate such a
separation of concerns.

To date, I've been basing work, which exposes syntactic structure, on
phc's maketea grammar (Phalanger's is more up to date, but also more
complicated what with its provenance and the Linq and generics and all),
but it's reverse-engineered and certainly wrong (oh, that reminds
me...); the existing grammar is unsuitable because no-one wants to see
that.

Something authoritative that by definition tracks the current version
would be more reassuring as regards accuracy and compatibility (and be
more likely to result in something that deserves to be let out into the
world with confidence).

12 years ago by Andrew Faulds — view source

I'm not a core dev, but I would like to add to the notes above that
"third parties", such as myself, who want to do things with PHP source
other than run it through a PHP interpreter would also appreciate such
a separation of concerns.

To date, I've been basing work, which exposes syntactic structure, on
phc's maketea grammar (Phalanger's is more up to date, but also more
complicated what with its provenance and the Linq and generics and
all), but it's reverse-engineered and certainly wrong (oh, that
reminds me...); the existing grammar is unsuitable because no-one
wants to see that.

Something authoritative that by definition tracks the current
version would be more reassuring as regards accuracy and compatibility
(and be more likely to result in something that deserves to be let out
into the world with confidence).

To add to your point:

If we make it produce an AST, I wonder if we could possibly expose this
through PHP, perhaps with some sort of extension. Then parsers and such
for PHP could simply ask PHP to do the parsing for them, and then do
analysis - no more duplicating official PHP grammar.

I'm just speculating here, but this would be pretty cool if we could do it.
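
As a rough analogy for what exposing the AST to scripts could look like, Python already ships its own parser as the standard `ast` module, so analysis tools never need a second grammar. The sketch below is Python, not PHP, and says nothing about how a PHP extension would actually be designed; the sample source and the helper name `called_function_names` are made up for illustration.

```python
import ast

# Hypothetical sample program to analyse; any valid source text would do.
SOURCE = """
def greet(name):
    print("Hello, " + name)

greet(input())
"""

def called_function_names(source):
    """Return the sorted, de-duplicated names of plainly named functions called in `source`."""
    tree = ast.parse(source)  # the language's own parser does all the parsing
    names = set()
    for node in ast.walk(tree):
        # Only direct calls such as f(...) are collected; method calls are skipped.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            names.add(node.func.id)
    return sorted(names)

print(called_function_names(SOURCE))
```

The analysis code never touches tokens or grammar rules; it only walks the tree handed back by the reference parser, which is the "no more duplicating the official grammar" property described above.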

--
Andrew Faulds
http://ajf.me/

12 years ago by Ivan Enderlin @ Hoa — view source

I'm not a core dev, but I would like to add to the notes above that
"third parties", such as myself, who want to do things with PHP
source other than run it through a PHP interpreter would also
appreciate such a separation of concerns.

To date, I've been basing work, which exposes syntactic structure, on
phc's maketea grammar (Phalanger's is more up to date, but also more
complicated what with its provenance and the Linq and generics and
all), but it's reverse-engineered and certainly wrong (oh, that
reminds me...); the existing grammar is unsuitable because no-one
wants to see that.

Something authoritative that by definition tracks the current
version would be more reassuring as regards accuracy and
compatibility (and be more likely to result in something that
deserves to be let out into the world with confidence).

To add to your point:

If we make it produce an AST, I wonder if we could possibly expose
this through PHP, perhaps with some sort of extension. Then parsers
and such for PHP could simply ask PHP to do the parsing for them, and
then do analysis - no more duplicating official PHP grammar.

I'm just speculating here, but this would be pretty cool if we could
do it.
+1. It would be very useful for static analysis, testing, building
control-flow graphs, etc.

--
Ivan Enderlin
Developer of Hoa
http://hoa.42/ or http://hoa-project.net/

PhD. student at DISC/Femto-ST (Vesontio) and INRIA (Cassis)
http://disc.univ-fcomte.fr/ and http://www.inria.fr/

Member of HTML and WebApps Working Group of W3C
http://w3.org/

12 years ago by Stas Malyshev — view source

Hi!

To date, I've been basing work, which exposes syntactic structure, on
phc's maketea grammar (Phalanger's is more up to date, but also more
complicated what with its provenance and the Linq and generics and all),
but it's reverse-engineered and certainly wrong (oh, that reminds
me...); the existing grammar is unsuitable because no-one wants to see
that.

Well, now if you start to implement yet another AST grammar, it would be
"wrong" too, at least for a substantial time until the kinks are worked
out - just because it's a different approach which probably would work
differently in some corner cases.
So what we're getting on the plus side is a more academically nice parser
with potential optimizations, of which nobody knows if they'd have any
real effect and all indications point to the possibility they won't, and
we have some benefits for third parties doing some (unknown to us) work
on PHP.
On the minus side we have major disruption of the code base, virtually
certain BC problems and stability problems, a slower compiler and no real
benefit for the average PHP user.

I'm not sure the positives outweigh the negatives in this equation.
It'd be nice to support third-party work but I'd propose to start with
writing the actual parser (e.g. as an extension or third-party library)
and see if we can make it as fast and 100% compliant, and if it turns out
good then we could talk about replacing the current parser with it. In the
meantime you could also use it as a base for your work.

--
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227

12 years ago by Morgan L. Owens — view source

Hi!

... and no real
benefit for average PHP user.

Well, apart from perhaps leaving them with a simpler language that
doesn't have the inconsistencies and corner cases that currently exist
(and documented ad nauseam) not because of any design decision but
"because the parser is written that way".

12 years ago by Stas Malyshev — view source

Hi!

Well, apart from perhaps leaving them with a simpler language that
doesn't have the inconsistencies and corner cases that currently exist
(and documented ad nauseam) not because of any design decision but
"because the parser is written that way".

If you think writing a new parser gets rid of all corner cases you are in
for a big surprise. An AST is not magic and a parser will always be written
exactly the way it is written - so if somebody doesn't implement a certain
feature in a consistent way, it won't be implemented in a consistent way,
AST or not.
And it's a bit late to take design decisions on existing PHP language,
it seems to me.

Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227

12 years ago by Andrew Faulds — view source

Stas Malyshev smalyshev@sugarcrm.com wrote:

Hi!

Well, apart from perhaps leaving them with a simpler language that
doesn't have the inconsistencies and corner cases that currently exist
(and documented ad nauseam) not because of any design decision but
"because the parser is written that way".

If you think writing new parser gets rid of all corner cases you are in
for a big surprise. AST is not magic and parser will always be written
exactly the way it is written - so if somebody won't implement certain
feature in a consistent way, it won't be implemented in consistent way,
AST or not.
An AST allows much deeper analysis of the syntax after parsing (i.e. after the tokens have been parsed into an AST), though. This means you can be far more flexible about a lot of things and greatly reduce magic corner cases, such as executing a closure from a dereferenced array that is a static member of a class (something there is no good reason you can't do; it is just a limitation of the current parser).
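
To make the "closure in a dereferenced array that is a static member" case concrete, here is a small sketch using Python's `ast` module as a stand-in. It only shows that, once an expression is a tree, a call applied to an index applied to a member access is ordinary nesting of the same node kinds. The names `Registry` and `handlers` are hypothetical, and how a PHP AST would represent this is an assumption, not something stated in the RFC.

```python
import ast

# Python analogue of calling a closure stored in an array that is a class member.
expr = ast.parse("Registry.handlers[0]()", mode="eval").body

def describe(node):
    """Render the nesting of call / index / member nodes as a readable string."""
    if isinstance(node, ast.Call):
        return "call(" + describe(node.func) + ")"
    if isinstance(node, ast.Subscript):
        return "index(" + describe(node.value) + ")"
    if isinstance(node, ast.Attribute):
        return "member(" + describe(node.value) + "." + node.attr + ")"
    if isinstance(node, ast.Name):
        return node.id
    raise TypeError("unexpected node: " + type(node).__name__)

print(describe(expr))  # call(index(member(Registry.handlers)))
```

Because each node simply holds another expression as its child, no special grammar rule is needed for this particular combination; any later pass can handle it by recursion.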
And it's a bit late to take design decisions on existing PHP language,
it seems to me.
What?

Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227

--
Sent from my Android phone with K-9 Mail.
Andrew Faulds
http://ajf.me/

12 years ago by Anthony Ferrara — view source

Stas,

On Thu, Sep 6, 2012 at 5:25 AM, Stas Malyshev smalyshev@sugarcrm.com wrote:

Hi!

Well, apart from perhaps leaving them with a simpler language that
doesn't have the inconsistencies and corner cases that currently exist
(and documented ad nauseam) not because of any design decision but
"because the parser is written that way".

If you think writing new parser gets rid of all corner cases you are in
for a big surprise. AST is not magic and parser will always be written
exactly the way it is written - so if somebody won't implement certain
feature in a consistent way, it won't be implemented in consistent way,
AST or not.

Actually, that's not true. Right now, the parser is parsing both syntax and
a good bit of grammar. That's why we have so many reserved words. The
compiler step implements some of the grammar, but the parser takes care of
a significant amount of it.

With a move to an AST based parsing, the parser can be greatly simplified,
with a very significant reduction in reserved words. This has a few
benefits:

1. A reduced number of first-class tokens makes parsing the syntax
   potentially much more efficient. This is at the expense of a more
   complicated compiling step (building and processing the AST).
2. It also removes the need for the parser to worry about precedence. It's
   parsing for syntax only, and then lets the AST compiler step worry about
   operator precedence...
3. It provides the ability for the grammar to be extended without modifying
   the syntax. That means that PECL extensions could theoretically add
   compiler steps to not only extend functionality, but grammar as well. For
   example, it may be possible to add language rules (such as an inline
   keyword for functions, or pre-processor macros) that allow for extension of
   the language without modifying the parser (I say may, because it depends
   strongly on the design of the parser and AST).
4. Since the parser doesn't directly make opcodes, it would mean that
   syntax errors (parse errors) would be able to be 100% recoverable. Compiler
   errors would be just as difficult to recover from though.
5. It opens the door to leveraging third-party systems. For example, the
   Zend VM could hypothetically be replaced by an LLVM-based VM. That would
   allow for JIT-based PHP code. Note that this isn't HipHop (which is a
   limited subset of PHP), but full PHP running on a JIT VM. This could be
   implemented as a PECL extension, utilizing the core parser and runtime
   environment, just swapping out the executor step... Obviously this would
   not be trivial to build, but right now if you wanted to build it you'd
   need to fork PHP to do it (hence why the existing compilers for PHP all
   use a different parser).
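
The idea of adding compiler steps between parsing and code generation can be sketched with Python's `ast` module, used here purely as a stand-in for a hypothetical PHP pipeline. The pass name `FoldConstants` and the helper `compile_with_pass` are invented for this example, and the pass is deliberately tiny. (CPython's own compiler already folds constants like these, so the pass is illustrative rather than useful.)

```python
import ast
import operator

# Operators this toy pass knows how to fold.
_FOLDABLE = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}

class FoldConstants(ast.NodeTransformer):
    """Replace arithmetic on two numeric literals with the computed literal."""

    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold children first so nested expressions collapse
        fold = _FOLDABLE.get(type(node.op))
        left, right = node.left, node.right
        if (fold is not None
                and isinstance(left, ast.Constant) and isinstance(right, ast.Constant)
                and isinstance(left.value, (int, float))
                and isinstance(right.value, (int, float))):
            folded = ast.Constant(value=fold(left.value, right.value))
            return ast.copy_location(folded, node)
        return node

def compile_with_pass(source):
    tree = ast.parse(source)                    # step 1: parse for syntax only
    tree = FoldConstants().visit(tree)          # step 2: a pluggable pass over the tree
    ast.fix_missing_locations(tree)
    code = compile(tree, "<example>", "exec")   # step 3: generate executable code
    return tree, code

tree, code = compile_with_pass("seconds_per_day = 60 * 60 * 24")
namespace = {}
exec(code, namespace)
print(namespace["seconds_per_day"])  # 86400
```

The parser and the code generator are untouched; the extra behaviour lives entirely in the middle step, which is the kind of extension point the list above is speculating about.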

And it's a bit late to take design decisions on existing PHP language,
it seems to me.

It will never be easier to do than today. As time goes on, the language
will continue to grow, and the syntax and grammar will only get more
complicated from here out. So the easiest time to do it will be now...

Anthony

12 years ago by Dmitry Stogov — view source

Hi Nikita,

Personally, I don't see any reason to build an AST. As you mentioned
yourself, it will be slower and will require more memory. On the other
hand, an AST by itself would allow only very basic optimizations.
Most of them can easily be done at the VM opcode level as well.

Also, as it's not an easy task, the old "ugly hacks" will be replaced
with new mistakes, which would require new "hacks" in the future :)

The only real advantage could be the ability to expose the AST to PHP
scripts, but only a few people may need it.

Thanks. Dmitry.

Hey folks!

Some people asked me what the advantages of using an AST-based
parsing/compilation process are, so I put together a few quick notes
in an RFC:

https://wiki.php.net/rfc/ast_based_parsing_compilation_process

It would be nice to get a few comments from other core devs on this.

Nikita

12 years ago by Ivan Enderlin @ Hoa — view source

Hi Dmitry,

Hi Nikita,

Personally, I don't see any reason to build AST. As you mentioned
yourself, it will be slower and will require more memory. On the other
hand AST itself would allow to perform only very basic optimizations.
Most of them can be easily done on VM opcode level as well.
The lexing and parsing processes will not be slower than they are now, and
the construction of an AST is a new process. Well, as usual, a new process
requires new resources. But if we look further, it will certainly be a
nice tool for better opcode caching, it will remove a lot of
hacks, it will allow third-party tools working on safety, security,
quality etc. to go deeper at low cost, which is very important for the PHP
community, and it will facilitate future work (implementations, features
etc.) for PHP… Moreover, might compiler-compilers offer better lexing
and parsing than what we have now?

Someone said that an AST is more “academic”. Yes, it is, and that is why we
can benefit from existing research in this area to avoid memory
overhead (among other things). We are not the first ones facing this
problem. It requires some research before starting development.
Let's try it as a POC and we will quickly see whether this is the wrong way or not.

Cheers.

Also, as it's not an easy task, the old "ugly hacks" will be replaced
with new mistakes, which would require new "hacks" in the future :)

The only real advantage could be an ability to expose AST to PHP
scripts, but only few people may need it.

Thanks. Dmitry.

Hey folks!

Some people asked me what the advantages of using an AST-based
parsing/compilation process are, so I put together a few quick notes
in an RFC:

https://wiki.php.net/rfc/ast_based_parsing_compilation_process

It would be nice to get a few comments from other core devs on this.

Nikita

--
Ivan Enderlin
Developer of Hoa
http://hoa.42/ or http://hoa-project.net/

PhD. student at DISC/Femto-ST (Vesontio) and INRIA (Cassis)
http://disc.univ-fcomte.fr/ and http://www.inria.fr/

Member of HTML and WebApps Working Group of W3C
http://w3.org/

12 years ago by Leigh — view source

Hi

The lexing and parsing processes will not be slower than actual, and the
construction of an AST is a new process. Well, as usual, new process
requires new resources. But if we look further, it will certainly be a nice
tool to perform better opcode caching, it will remove a lot of hacks, it
will allow third-part tools working on safeness, security, quality etc. to
go deeper at low-costs which is very important for PHP community, it will
facilitate future works

I think this is a good point to focus on. There have been a lot of comments
about the increased resource cost of the parse. What is being forgotten is
that you can offset this against all of the gains you will see as a result.

Maybe the upfront cost of a parse goes up, but once it is parsed and the
opcodes are cached, you won't have this cost again until you change the
script. Then you have all of the benefits for every subsequent request.

Increased cost once, benefits every time.
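
A minimal sketch of that amortisation argument, in Python rather than PHP: the class `CompileCache` below is a toy stand-in for an opcode cache, keyed on a hash of the source text. It is an assumption made for illustration only and does not describe how APC or any real PHP opcode cache is implemented.

```python
import hashlib

class CompileCache:
    """Toy stand-in for an opcode cache: compile each distinct source text only once."""

    def __init__(self):
        self._cache = {}
        self.compilations = 0  # how many times the parse + compile cost was paid

    def load(self, source, filename="<script>"):
        key = hashlib.sha256(source.encode("utf-8")).hexdigest()
        code = self._cache.get(key)
        if code is None:
            # Cache miss: pay the full parse + compile cost.
            code = compile(source, filename, "exec")
            self._cache[key] = code
            self.compilations += 1
        return code

cache = CompileCache()
script = "result = sum(range(10))"

# Simulate 1000 requests for an unchanged script.
for _ in range(1000):
    scope = {}
    exec(cache.load(script), scope)

print(cache.compilations)  # 1
```

However expensive the compile step becomes, it is paid once per change to the script, not once per request. Whether the cached result is any better for having come from an AST is a separate question that this sketch does not address.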

12 years ago by Stas Malyshev — view source

Hi!

The lexing and parsing processes will not be slower than actual, and the
construction of an AST is a new process. Well, as usual, new process
requires new resources. But if we look further, it will certainly be a nice
tool to perform better opcode caching, it will remove a lot of hacks, it
will allow third-part tools working on safeness, security, quality etc. to
go deeper at low-costs which is very important for PHP community, it will
facilitate future works

I don't see how it would lead to better opcode caching. As for
third-party tools, I do not see why third-party tools need PHP to change
the parser. If PHP's parser is not good enough for those tools, they can
have their own parser.

Maybe the upfront cost of a parse goes up, but once it is parsed and the
opcodes are cached, you won't have this cost again until you change the
script. Then you have all of the benefits for every subsequent request.

So far we have seen neither any of these benefits, nor any
explanation of what exactly these benefits would be, nor any proof they
would actually benefit anybody. I would seriously propose that people
interested in this project just take it up and see if it's
beneficial or not. Just talking about what might happen on the list
would achieve nothing.

Increased cost once, benefits every time.

For now it looks like quite the reverse - increased performance and
stability costs for everybody, dubious benefits for a small group.

Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227

12 years ago by Ivan Enderlin @ Hoa — view source

Hi!

The lexing and parsing processes will not be slower than actual, and the
construction of an AST is a new process. Well, as usual, new process
requires new resources. But if we look further, it will certainly be a nice
tool to perform better opcode caching, it will remove a lot of hacks, it
will allow third-part tools working on safeness, security, quality etc. to
go deeper at low-costs which is very important for PHP community, it will
facilitate future works
I don't see how it would lead to better opcode caching.
JIT, lazy parsing, lazy opcode generation, caching heuristics,
optimisation of opcode generation…
PHP is interpreted, and this is why it is able to offer highly dynamic
constructs and execution. PHP scales incredibly well. But we can do
better, maybe by mixing caching and interpretation in a better way. I don't
know. I said we are certainly not the first ones facing this “problem”.
We have to search the literature to see whether such solutions exist.

As for
third-party tools, I do not see why third-party tools need PHP to change
the parser. If PHP's parser is not good enough for those tools, they can
have their own parser.
Not if we expose the AST directly to PHP userland (maybe through
a specific configuration: --enable-user-ast or something like that).

Maybe the upfront cost of a parse goes up, but once it is parsed and the
opcodes are cached, you won't have this cost again until you change the
script. Then you have all of the benefits for every subsequent request.
So far we have not seen not only any of these benefits, but any
explanation of what exactly these benefits would be and any proof they
would actually benefit anybody.
Because it is a very hard topic, and so it is hard to explain quickly.

I seriously would propose people
interested in this project just take up this project and see if it's
beneficial or not. Just talking about what might happen on the list
would achieve nothing.
Exactly what I have proposed: “Let's try as a POC and we will quickly
see if this is a wrong way or not”.

Cheers.

--
Ivan Enderlin
Developer of Hoa
http://hoa.42/ or http://hoa-project.net/

PhD. student at DISC/Femto-ST (Vesontio) and INRIA (Cassis)
http://disc.univ-fcomte.fr/ and http://www.inria.fr/

Member of HTML and WebApps Working Group of W3C
http://w3.org/

12 years ago by Sebastian Bergmann — view source

As for third-party tools, I do not see why third-party tools need PHP
to change the parser. If PHP's parser is not good enough for those tools,
they can have their own parser.

Nikita is doing an amazing job with PHP_Parser, which is such a
third-party tool. However, it will always lag behind the canonical
parser. And it will (probably) never match 100% the behavior of the
canonical parser.

This is why, from my perspective of someone who is interested in static
analysis and quality assurance, I think that it would be a tremendous
boost for the PHP platform if we had a state-of-the-art parser for the
reference implementation of our programming language.

12 years ago by Stas Malyshev — view source


Hi!

Nikita is doing an amazing job with PHP_Parser, which is such a
third-party tool. However, it will always lag behind the canonical
parser. And it will (probably) never match 100% the behavior of the
canonical parser.

Wait, so what about the arguments that it will be amazingly easy to
implement new features in this parser, which should solve the problem of
the lag? By the time the old and clunky parser is released, it is
certainly possible to do the same with the new, much less complex and
much easier to work with parser. So were these arguments not true? Or am
I missing some important reason why a parser that is much less complex
and easier to add things to can't do what the old one can do?

And if it's impossible to match behavior of the existing parser - do I
get it right that the proposed idea is to actually break real code that
people run now in PHP because it was too hard to parse for people that
write add-on tools to PHP? Somehow it does not sound like a good idea.
If we have doubt that we can match the existing PHP behavior then the
idea of changing parser becomes even less appealing, because for 99% of
PHP users it would be pure downside without upsides. The users don't
care if the parser is complex, they care if their code runs.

This is why, from my perspective of someone who is interested in static
analysis and quality assurance, I think that it would be a tremendous
boost for the PHP platform if we had a state-of-the-art parser for the
reference implementation of our programming language.

I think that the benefit of it for the regular PHP user is unclear (and
would not be clear until the real benefit of such a parser, such as the
promised optimizations, etc., is demonstrated) but the harm from
breaking existing code is obvious. What would happen is that
people would just avoid running such a PHP version. And what use would
it be then to have excellent tools for PHP that people don't use because
it can't run their code?

Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227

12 years ago by Ferenc Kovacs — view source


On Thu, Sep 6, 2012 at 11:10 AM, Stas Malyshev smalyshev@sugarcrm.com wrote:

Hi!

Nikita is doing an amazing job with PHP_Parser, which is such a
third-party tool. However, it will always lag behind the canonical
parser. And it will (probably) never match 100% the behavior of the
canonical parser.

Wait, so what about the arguments that it will be amazingly easy to
implement new features in this parser, which should solve the problem of
the lag? By the time the old and clunky parser is released, it is
certainly possible to do the same with the new, much less complex and
much easier to work with parser. So were these arguments not true? Or am
I missing some important reason why a parser that is much less complex
and easier to add things to can't do what the old one can do?

And if it's impossible to match behavior of the existing parser - do I
get it right that the proposed idea is to actually break real code that
people run now in PHP because it was too hard to parse for people that
write add-on tools to PHP? Somehow it does not sound like a good idea.
If we have doubt that we can match the existing PHP behavior then the
idea of changing parser becomes even less appealing, because for 99% of
PHP users it would be pure downside without upsides. The users don't
care if the parser is complex, they care if their code runs.

This is why, from my perspective of someone who is interested in static
analysis and quality assurance, I think that it would be a tremendous
boost for the PHP platform if we had a state-of-the-art parser for the
reference implementation of our programming language.

I think that the benefit of it for the regular PHP user is unclear (and
would not be clear until the real benefit of such a parser, such as the
promised optimizations, etc., is demonstrated) but the harm from
breaking existing code is obvious. What would happen is that
people would just avoid running such a PHP version. And what use would
it be then to have excellent tools for PHP that people don't use because
it can't run their code?

Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227

--

I propose putting together an RFC and a PoC ASAP, and then we can talk
about facts instead of opinions and biased views on the topic.

--
Ferenc Kovács
@Tyr43l - http://tyrael.hu

12 years ago by Sebastian Bergmann — view source


I propose putting together an RFC and a PoC ASAP, and then we can talk
about facts instead of opinions and biased views on the topic.

That would be the reasonable thing to do, yes. But that is easy for me
to say, as I will not be able to help in the process apart from
testing. It is up to Nikita (and those who want to help him) to decide
whether they want to invest their time and effort into writing a new
parser for PHP when it is unclear whether it will be accepted.

12 years ago by Ferenc Kovacs — view source


On Thu, Sep 6, 2012 at 2:53 PM, Sebastian Bergmann sebastian@php.net wrote:

I propose putting together an RFC and a PoC ASAP, and then we can talk
about facts instead of opinions and biased views on the topic.

That would be the reasonable thing to do, yes. But that is easy for me
to say, as I will not be able to help in the process apart from
testing. It is up to Nikita (and those who want to help him) to decide
whether they want to invest their time and effort into writing a new
parser for PHP when it is unclear whether it will be accepted.

What I meant to say is that currently we are arguing on feelings and
beliefs, and I can't see how we could reach an agreement on whether the
pros outweigh the cons.
That can only happen after we have something on our hands to actually pick
apart and measure.
Until then there is no point in making assumptions and trying to convince
the other side to change their opinions.

--
Ferenc Kovács
@Tyr43l - http://tyrael.hu

12 years ago by Nikita Popov — view source


And if it's impossible to match behavior of the existing parser - do I
get it right that the proposed idea is to actually break real code that
people run now in PHP because it was too hard to parse for people that
write add-on tools to PHP? Somehow it does not sound like a good idea.
If we have doubt that we can match the existing PHP behavior then the
idea of changing parser becomes even less appealing, because for 99% of
PHP users it would be pure downside without upsides. The users don't
care if the parser is complex, they care if their code runs.

The whole thing obviously only makes sense if we can retain full
syntax compatibility. If we have to break syntax, then this is a no-go
anyway. But I'm fairly certain that we indeed can keep full
compatibility. I implemented an AST-emitting PHP parser in PHP and am
fairly sure that it is compatible with any PHP 5.4 source code. I
don't see issues from this side.

Nikita

12 years ago by Dmitry Stogov — view source


Hi Dmitry,

Hi Nikita,

Personally, I don't see any reason to build an AST. As you mentioned
yourself, it will be slower and will require more memory. On the other
hand, the AST itself would allow only very basic optimizations to be
performed. Most of them can easily be done at the VM opcode level as well.
The lexing and parsing processes will not be slower than the current
ones, and the construction of an AST is a new process. Well, as usual, a
new process requires new resources. But if we look further, it will
certainly be a nice tool for performing better opcode caching, it will
remove a lot of hacks, it will allow third-party tools working on safety,
security, quality etc. to go deeper at low cost, which is very important
for the PHP community, and it will facilitate future work
(implementations, features etc.) for PHP… Moreover, might compiler
compilers offer better lexing and parsing processes than the current ones?
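
To make the "very basic optimizations" on an AST concrete, here is a minimal sketch of constant folding. It uses Python's standard `ast` module purely as a stand-in, since no PHP AST exists at this point in the thread. The names `ConstantFolder` and `fold_constants` are hypothetical and are not part of any proposal made here.

```python
import ast
import operator

# Operators this sketch knows how to fold; everything else is left alone.
_FOLDABLE = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
}


class ConstantFolder(ast.NodeTransformer):
    """Replace `<number> op <number>` subtrees with a single constant node."""

    def visit_BinOp(self, node):
        # Fold the children first so nested expressions collapse bottom-up.
        self.generic_visit(node)
        op = _FOLDABLE.get(type(node.op))
        if (
            op is not None
            and isinstance(node.left, ast.Constant)
            and isinstance(node.right, ast.Constant)
            and isinstance(node.left.value, (int, float))
            and isinstance(node.right.value, (int, float))
        ):
            folded = ast.Constant(value=op(node.left.value, node.right.value))
            return ast.copy_location(folded, node)
        return node


def fold_constants(source):
    """Parse one expression, fold constant arithmetic, return the new source."""
    tree = ast.parse(source, mode="eval")
    tree = ast.fix_missing_locations(ConstantFolder().visit(tree))
    return ast.unparse(tree)  # ast.unparse needs Python 3.9 or later
```

Calling `fold_constants("x + 2 * 3")` rewrites the constant subtree and leaves the variable alone. Whether the same rewrite is easier on a tree or on opcodes is exactly the point under dispute, and this sketch does not settle it.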

A few years ago we replaced flex with re2c, and got some speedup because
the re2c-generated scanner used mmap() and didn't check for end of buffer,
but after a while we realized that when the script size is a multiple of
PAGE_SIZE the scanner just crashes, so we had to add hacks to work around
it :)
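
One plausible reading of that crash, sketched as arithmetic: when a file is mapped with mmap(), POSIX zero-fills the rest of the last partial page, so a scanner that skips the end-of-buffer check can still stop on a NUL byte. When the size is an exact multiple of the page size there is no partial page and no such byte inside the mapping. The function names and the 4096-byte default are assumptions for illustration, not taken from the Zend sources.

```python
def trailing_zero_bytes(file_size, page_size=4096):
    """Bytes between end-of-file and the end of the last mapped page."""
    # Python's % always returns a non-negative result for a positive modulus,
    # so this is the distance up to the next page boundary (0 if already on one).
    return (-file_size) % page_size


def sentinel_available(file_size, page_size=4096):
    """True if at least one zero-filled byte follows the file in its mapping."""
    return trailing_zero_bytes(file_size, page_size) > 0
```

Under these assumptions `sentinel_available(8192)` is false, which matches the case described above where the scanner runs off the end of the mapping.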

Someone said that an AST is more “academic”. Yes, it is, and that is why
we can benefit from research already done in this area to avoid memory
overhead (among other things). We are not the first ones facing this
problem. It requires some research before starting to develop this.
Let's try it as a POC and we will quickly see whether this is the wrong
way or not.

Of course you can try it, and in case it's better, faster and/or clearer
it may be accepted. Currently, PHP allows overriding of the
"zend_compile" routine, so as a first step you may even implement it as
a PHP extension without ZE modification.

Thanks. Dmitry.

Cheers.

Also, as it's not an easy task, the old "ugly hacks" will be replaced
with new mistakes, which would require new "hacks" in the future :)

The only real advantage could be an ability to expose AST to PHP
scripts, but only few people may need it.

Thanks. Dmitry.

Hey folks!

Some people asked me what the advantages of using an AST-based
parsing/compilation process are, so I put together a few quick notes
in an RFC:

https://wiki.php.net/rfc/ast_based_parsing_compilation_process

It would be nice to get a few comments from other core devs on this.

Nikita

12 years ago by Sebastian Bergmann — view source


The only real advantage could be an ability to expose AST to PHP scripts,
but only few people may need it.

Everyone working on static analysis tools for PHP code needs access to
the (canonical) AST. While the number of people behind this "everyone"
will be small (hopefully it will grow), the tools they create based
on the AST will be valuable for every PHP developer.
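
As an illustration of why analysis tools want the canonical AST rather than text matching, here is a minimal sketch that again uses Python's standard `ast` module as a stand-in for a PHP AST that does not exist yet. The helper `find_calls` and the `SAMPLE` source are hypothetical.

```python
import ast


def find_calls(source, function_name):
    """Return the line numbers where `function_name(...)` is called directly."""
    tree = ast.parse(source)
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == function_name
    )


# A text search for "eval(" would match three lines here; only one is a call.
SAMPLE = '''\
def run(user_input):
    # eval("in a comment") is not a call
    label = "eval(in a string)"
    return eval(user_input)
'''
```

The tree produced by the language's own parser already knows which occurrences are comments or string literals, so the tool does not have to re-implement those rules and cannot drift from them.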

12 years ago by Florian Anderiasch — view source


The only real advantage could be an ability to expose AST to PHP scripts,
but only few people may need it.

Everyone working on static analysis tools for PHP code needs access to
the (canonical) AST. While the number of people behind this "everyone"
will be small (hopefully it will grow), the tools they create based
on the AST will be valuable for every PHP developer.

I fully agree with Sebastian here: nearly all the methods used in the
past to get some meaningful analysis done relied on third-party tools,
were immensely prone to breakage, or both.

I've used phc up to 5.2 without problems, but after that I didn't really
keep up trying yet another completely different method. This is
such a basic task for static analysis that something not in the core will
always be a second-class citizen.

And yes, the people directly benefitting from this and not indirectly
from the tools produced will probably be quite happy.

So unless something is getting slower, +1.

Greetings,
Florian

12 years ago by johannes@schlueters.de — view source


The only real advantage could be an ability to expose AST to PHP scripts,
but only few people may need it.

Everyone working on static analysis tools for PHP code needs access to
the (canonical) AST. While the number of people behind this "everyone"
will be small (hopefully it will grow), the tools they create based
on the AST will be valuable for every PHP developer.

As soon as those tools should be version independent they need their own
parser anyhow.

Aside from that: I don't see the reason for this discussion now. People
can go and implement it; until this is done it's hard to argue about
performance etc., as that can hardly be predicted. If people have a need
for this they are free to go ahead. If people don't have a need for this
they won't. (A more relevant question might be "Who wants to invest time
in this?"; it certainly can be a fun project ...)

johannes

For now it looks like quite the reverse - increased performance and
stability costs for everybody, dubious benefits for a small group.