Hello internals:
I had a look at the Zend Engine to understand some
details about its internal design with respect
to its opcodes and machine model.
I would like to ask for your comments on whether the
following sounds wrong or misinterpreted to you:
So, the basic design of the Zend Engine is a
stack-based interpreter for a fixed-length
instruction set (76 bytes on a 32-bit architecture),
where the instruction encoding
is much more complex than, for instance, that of the
JVM, Python, or Smalltalk.
Even though the source code is compiled to a linearized
instruction stream, an instruction contains more than just an opcode and
operands.
The version I looked at had some 136 opcodes encoded
in one byte, but the rest of the instruction has
many similarities with an AST representation.
Instructions encode:
- a line number
- a function pointer to the actual handler which is
  used to execute it
- two operands, which encode constant values,
  object references, jump addresses,
  or pointers to other functions
- 64 bits for an extended operand value
- a field for results, which is used for some
  operations' return values.
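The field list above can be pictured as a C struct. The following is only a rough sketch under stated assumptions: every name in it (sketch_op, sketch_operand, sketch_op_at, and so on) is hypothetical, the layout is simplified, and it is not the real struct _zend_op, so its size will not match the 76 bytes mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; NOT the real Zend Engine definitions. */
typedef enum { OPERAND_UNUSED, OPERAND_CONST, OPERAND_TMP, OPERAND_VAR } operand_kind;

typedef struct {
    operand_kind kind;          /* how to interpret the payload below */
    union {
        double   constant;      /* an inline constant value */
        uint32_t slot;          /* index of a temporary or a variable */
        uint32_t jump_target;   /* index of another instruction */
    } u;
} sketch_operand;

struct sketch_op;
typedef int (*sketch_handler)(const struct sketch_op *op);

typedef struct sketch_op {
    sketch_handler handler;     /* function that executes this instruction */
    sketch_operand result;      /* where the result is stored */
    sketch_operand op1;
    sketch_operand op2;
    uint64_t extended_value;    /* extra operand bits */
    uint32_t lineno;            /* source line, e.g. for error messages */
    uint8_t  opcode;            /* one byte is enough for some 136 opcodes */
} sketch_op;

/* Fixed-length encoding: instruction `index` sits at a fixed byte offset. */
static size_t sketch_op_offset(size_t index) {
    return index * sizeof(sketch_op);
}

/* Fetch an instruction by index; no decoding step is needed. */
static const sketch_op *sketch_op_at(const sketch_op *code, size_t count, size_t index) {
    assert(index < count);
    return &code[index];
}
```

Because every instruction has the same size, a jump target can simply be an index into the instruction array, which is one practical consequence of the fixed-length design.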
However, it's not a simple, single-stack model;
it uses several purpose-specific stacks.
What I am not so sure about is especially the
semantics of the result field and the pointer
to the other function (op_array).
Would be grateful if someone could comment on that.
I am also not really sure, given this complexity,
whether it is not actually some kind of abstract syntax
tree instead of an instruction set like Java
bytecode. That's not a technical problem, but merely
an academic question to categorize/characterize PHP.
All comments are welcome.
Many thanks
Stefan
--
Stefan Marr
Software Languages Lab
Former Programming Technology Lab
Vrije Universiteit Brussel
Pleinlaan 2 / B-1050 Brussels / Belgium
http://prog.vub.ac.be/~smarr
Phone: +32 2 629 3956
Fax: +32 2 629 3525
Hi Stefan,
Hello internals:
I had a look at the Zend Engine to understand some
details about its internal design with respect
to its opcodes and machine model.
To start with, the best reference about the Zend engine that I know of
is a presentation by Andy Wharmby at IBM:
www.zapt.info/PHPOpcodes_Sep2008.odp. It should answer a lot of your
questions.
Would like to ask you for some comments if the
following sounds wrong or misinterpreted to you:
So, the basic design of the Zend Engine is a
a stack-based interpreter for a fixed length
No, it's a register-based interpreter. There is a stack, but that's used
for calling functions only. The operands to the opcodes are pointed to
by the opcodes in the case of compiled variables, or in symbol tables
otherwise. That's as close to a register machine as we can get, I
think, but it's not very close to a stack machine. In a stack-based VM,
the operands to an opcode would be implicit, with add for example
using the top two stack operands, and that's not the case at all.
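To make the contrast concrete, here is a minimal C sketch of the two styles of "add". All names are invented for illustration, and none of this is Zend Engine code: the stack version takes its operands implicitly from the top of a stack, while the register version names its sources and destination in the instruction.

```c
#include <assert.h>

/* Stack machine: ADD carries no operands; it pops two values and pushes the sum. */
typedef struct { long slots[16]; int top; } value_stack;

static void stack_push(value_stack *s, long v) {
    assert(s->top < 16);
    s->slots[s->top++] = v;
}

static long stack_pop(value_stack *s) {
    assert(s->top > 0);
    return s->slots[--s->top];
}

static void stack_add(value_stack *s) {
    long rhs = stack_pop(s);
    long lhs = stack_pop(s);
    stack_push(s, lhs + rhs);
}

/* Register machine: ADD names its two sources and its destination explicitly. */
typedef struct { int dst, src1, src2; } reg_add;

static void register_add(long *regs, reg_add op) {
    regs[op.dst] = regs[op.src1] + regs[op.src2];
}
```

In the stack version the instruction stream needs separate push instructions before the add; in the register version a single three-address instruction does the whole job, at the cost of a wider encoding.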
instruction set (76byte on a 32bit architecture),
Andy's presentation says 96 bytes, but that might be on 64-bit. I presume
this means sizeof(struct _zend_op)?
where the instruction encoding
is much more complex then for instance for the
JVM, Python, or Smalltalk.
Yes, definitely.
Even so, the source code is compiled to a linearized
instruction stream, the instruction itself contain not just opcode and
operands.
The version I looked at had some 136 opcodes encoded
in one byte, but the rest of the instruction has
many similarities with a AST representation.
Are you referring to the IS_TMP_VAR type of a znode?
Instructions encode:
- a function pointer to the actual handler which is
used to execute it
The type of interpreter dispatch used can be chosen at configure-time
using the --with-vm-kind flag. The call-based interpreter is the
default. I've heard the others are buggy, but I'm not certain where I
heard that.
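For readers unfamiliar with call-based dispatch, the idea is roughly as follows: each instruction stores a pointer to its handler function, and the main loop just calls that pointer. This is a minimal sketch with hypothetical names (vm_t, op_t, handle_add, run), not the engine's actual executor.

```c
#include <assert.h>
#include <stddef.h>

struct vm;
typedef int (*handler_fn)(struct vm *vm);   /* returns nonzero to stop */

/* Each instruction carries its own handler plus register indices. */
typedef struct { handler_fn handler; int dst, a, b; } op_t;

typedef struct vm {
    const op_t *ip;      /* current instruction */
    long regs[8];
} vm_t;

static int handle_add(vm_t *vm) {
    const op_t *op = vm->ip;
    vm->regs[op->dst] = vm->regs[op->a] + vm->regs[op->b];
    vm->ip++;            /* advance to the next instruction */
    return 0;
}

static int handle_mul(vm_t *vm) {
    const op_t *op = vm->ip;
    vm->regs[op->dst] = vm->regs[op->a] * vm->regs[op->b];
    vm->ip++;
    return 0;
}

static int handle_halt(vm_t *vm) {
    (void)vm;
    return 1;
}

/* The dispatch loop: no switch on the opcode, just an indirect call. */
static void run(vm_t *vm, const op_t *code) {
    assert(code != NULL);
    vm->ip = code;
    while (vm->ip->handler(vm) == 0) {
        /* each handler advances vm->ip itself */
    }
}
```

The alternatives to this scheme are typically a big switch statement over the opcode or computed gotos; the call-based form trades some call overhead for simpler, separately compiled handlers.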
However, its not a simple, single stack model,
but uses several purpose-specific stacks.
How so?
What I am not so sure about is especially the
semantics of the result field and the pointer
to the other function (op_array).
Would be grateful if someone could comment on that.
I'm not sure what's confusing about the result field? It points to a
zval, same as op1 and op2.
I think that op_array is used to attach extra information to the
opcode by special extensions. I can't think of an example off the top
of my head.
I am also not really sure with these complexity,
whether is not actually some kind of abstract syntax
tree instead of a instruction set like Java
bytecode. Thats not a technical problem, but merely
an academic question to categorize/characterize PHP.
I think the result field of a znode can make it seem like that, but I
would characterize it as you have before: an instruction set just like
Java bytecode. Way more complicated, obviously, but I don't think it's
very close to an AST. Certainly the interpreter does not really
resemble an AST walker.
I hope I answered what you were looking for. I'm not certain about a
few of my answers, since I've really avoided the interpreter in my
work, but I think most of it is OK.
Best of luck,
Paul
--
Paul Biggar
paul.biggar@gmail.com
Hi Paul:
To start with, the best reference about the Zend engine that I know of
is a presentation by Andy Wharmby at IBM:
www.zapt.info/PHPOpcodes_Sep2008.odp. It should answer a lot of your
questions.
Thanks a lot, I was not aware of that one. And, well, it helps to read
and understand the code.
So, the basic design of the Zend Engine is a
a stack-based interpreter for a fixed length
No, its a register based interpreter. There is a stack, but thats used
for calling functions only. The operands to the opcodes are pointed to
by the opcodes in the case of compiled variables, or in symbol tables
otherwise.
That's as close to a register machine as we can get I
think, but its not very close to a stack machine. In a stack-based VM,
the operands to an opcode would be implicit, with add for example
using the top two stack operands, and thats not the case at all.
The encoding of constants or addresses in symbol tables alone does not
disqualify it as a stack-based machine model per se.
However, since there seem to be no traditional instructions which rely
on a stack, I agree that it's not a stack machine.
struct _zend_execute_data also makes it look a bit like a stack,
especially with its struct _zend_execute_data *prev_execute_data.
But knowing that union _temp_variable *Ts; is addressed directly,
it looks more like a CISC-like register-memory model with an
"infinite" number of registers.
instruction set (76byte on a 32bit architecture),
Andy's presentation says 96 bytes, but that might be 64 bit. I presume
this means sizeof(struct _zend_op)?
Yes, it gives 76 bytes on my OS X, but that's a detail which just illustrates
the significance of the different approaches. As another example,
Self has a real bytecode set. Each instruction is encoded in just 8 bits, but
that encoding is not optimized for interpretation.
the rest of the instruction has
many similarities with a AST representation.
Are you referring to the IS_TMP_VAR type of a znode?
Actually, I was more concerned about the op_array,
and whether there is any place in the interpreter where it is used
directly, i.e., by using it in a C function call as an argument and
thus using the implicit C stack. If this were used to initiate
interpretation of the op_array, I think it would resemble a
tree walker. But I have not found anything hinting at that; in particular,
the global data structures do not support such a thing, from what I can
tell by reading the code.
I am just cautious; for instance, the Lua implementation provides
some interesting mechanisms in this direction.
However, its not a simple, single stack model,
but uses several purpose-specific stacks.
How so?
Ah, thanks, you are right, I was looking at the wrong struct definition
(_zend_compiler_globals). Indeed, _zend_executor_globals defines only
an argument stack, an argument type stack, and struct
_zend_execute_data *current_execute_data (which also is a stack).
What I am not so sure about is especially the
semantics of the result field and the pointer
to the other function (op_array).
Would be grateful if someone could comment on that.
I'm not sure whats confusing about the result field? It points to a
zval, same as op1 and op2.
Ah, well, ok, now I see how it is meant. In the assumption of a stack
model, it does not make much sense, but in a register-memory model, it is
just specifying the location for the result, sure.
I think that op_array is used to attach extra information to the
opcode by special extensions. I can't think of an example off the top
of my head.
Well, I was a bit imprecise here: it's part of _znode, i.e., operands and
result, but that does not pose any misunderstandings for me anymore.
I am also not really sure with these complexity,
whether is not actually some kind of abstract syntax
tree instead of a instruction set like Java
bytecode. Thats not a technical problem, but merely
an academic question to categorize/characterize PHP.
I think the result field of a znode can make it seem like that, but I
would characterize it as you have before. An instruction set just like
Java bytecode. Way more complicated, obviously, but I dont think its
very close to an AST. Certainly the interpreter does not really
resemble an AST walker.
Sometimes, it would be really interesting to know
where some of the ideas used are coming from
and what the reasoning was. I tend to think that it's rather unlikely
that they are pulled out of thin air. Some parts of the model remind me
of CISC instruction sets... 3-address form, register-memory model...
I hope I answered what you were looking for. I'm not certain about a
few of my answers, since I've really avoided the interpreter in my
work, but I think most of it is OK.
Your answers were really helpful, guiding the code reading.
Thanks a lot
Stefan
Hi Stefan,
Sometimes, it would be really interesting to know
where some of the used ideas are coming from
and what the reasoning was. I tend to think that its rather unlikely that
they are pulled out of thin air. Some parts of the model remind me of CISC
instruction sets... 3-address form, register-memory model...
I think they are pulled out of thin air. More specifically, I think
there are optimizations heaped upon optimizations heaped upon an
initial implementation. It seems that each new release of PHP has a
small speed improvement based on some optimization performed, but that
there has been no major rearchitecture since the addition of a
bytecode based interpreter in PHP 4. I do not know how that was
designed though, maybe others do?
One thing I do find interesting is that the register machine nature of
PHP comes from an optimization called "compiled variables". CVs point
to symbol-table entries, but without them, I'm not sure whether we
would still call PHP a register machine. Any thoughts?
Thanks,
Paul
--
Paul Biggar
paul.biggar@gmail.com
I think they are pulled out of thin air. More specifically, I think
there are optimizations heaped upon optimizations heaped upon an
initial implementation. It seems that each new release of PHP has a
small speed improvement based on some optimization performed, but that
there has been no major rearchitecture since the addition of a
bytecode based interpreter in PHP 4.
Well, sure, but that's the usual evolution. Not a problem specific to PHP.
I was more curious about the first design.
One thing I do find interesting is that the register machine nature of
PHP comes from an optimization called "compiled variables". CVs point
to symbol-table entries, but without them, I'm not sure whether we
would still call PHP a register machine. Any thoughts?
Well, actually, I would include the temp vars also as a reason to name
it a register-memory machine model. They are accessed by using an explicit
name, i.e., an index into the "Ts" array. Thus, it is definitely not an
implicit stack.
The only question would be whether it is useful to go further and interpret
this structure as an infinite number of registers, which would be
equivalent to memory.
Then it could be considered to be a memory-to-memory architecture.
But usually these kinds of architectures have the property of only one
type of address, which does not hold for PHP.
Best regards
Stefan
Hi Stefan,
On Sat, Aug 15, 2009 at 8:52 PM, Stefan Marr <php@stefan-marr.de> wrote:
Sometimes, it would be really interesting to know
where some of the used ideas are coming from
and what the reasoning was. I tend to think that its rather unlikely
that they are pulled out of thin air. Some parts of the model remind me
of CISC instruction sets... 3-address form, register-memory model...
I think they are pulled out of thin air.
At some point, it was asked what was the "original" of this model.
I'd have to hazard the guess that it was Ze'ev and Andi's model in PHP
3 and then re-worked (possibly completely) in PHP 4 that supplanted
Rasmus' hack-y version.
Considering they did it for a college project and had no intention of
it actually replacing the PHP engine at the time, it has held up
pretty well :-)
--
Some people ask for gifts here.
I just want you to buy an Indie CD for yourself:
http://cdbaby.com/search/from/lynch
Hi!
So, the basic design of the Zend Engine is a
a stack-based interpreter for a fixed length
instruction set (76byte on a 32bit architecture),
Not exactly stack-based, it's more register-based. Number of registers
is not limited, even though most of them aren't used simultaneously.
Instructions encode:
- a line number
- a function pointer to the actual handler which is
  used to execute it
- two operands, which encode constant values,
  object references, jump addresses,
  or pointer to other functions
- 64 bit for an extended operand value
- a field for results, which is use for some
  operations return values.
The basic model is that each operation works on 2 operands and
(optionally) returns a result. Operands can be either constants, temp
variables or in-memory variables, or sometimes a number which is used as
a jump point, etc.
This model can be extended for some opcodes by using either the extended
operand or an additional opcode, if the operation's semantics do not fit in
one opcode (e.g. the opcode generated by $a->b["x"] would have 4 operands:
$a, "b", "x" and how the expression is used, i.e. read/write/test, etc.)
However, its not a simple, single stack model,
but uses several purpose-specific stacks.
Stacks indeed are used for function calls, but this is just an
implementation detail.
What I am not so sure about is especially the
semantics of the result field and the pointer
to the other function (op_array).
The result field holds the result of an operation, so if you have
$a = $b + $c, then the ADD opcode, which adds the contents of $b and $c,
uses the "result" field to store the value, which is then used by the
ASSIGN opcode to assign the result to $a.
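The $a = $b + $c example can be sketched in C as two instructions that communicate through a temporary. This is only an illustration under stated assumptions: the types and names (frame_t, op_t, resolve, execute) are made up, values are plain longs rather than zvals, and the operand layout of ASSIGN (op1 as target, op2 as value) is an assumption of this sketch.

```c
#include <assert.h>

/* Hypothetical operand kinds: a named variable slot or a temporary slot. */
typedef enum { KIND_CV, KIND_TMP } kind_t;
typedef struct { kind_t kind; int index; } operand_t;

typedef enum { OP_ADD, OP_ASSIGN } opcode_t;
typedef struct { opcode_t opcode; operand_t result, op1, op2; } op_t;

typedef struct { long cvs[4]; long tmps[4]; } frame_t;

/* Map an operand to the storage location it names. */
static long *resolve(frame_t *f, operand_t o) {
    assert(o.index >= 0 && o.index < 4);
    return o.kind == KIND_CV ? &f->cvs[o.index] : &f->tmps[o.index];
}

static void execute(frame_t *f, const op_t *ops, int count) {
    for (int i = 0; i < count; i++) {
        const op_t *op = &ops[i];
        switch (op->opcode) {
        case OP_ADD:
            /* result names the location that receives the sum */
            *resolve(f, op->result) = *resolve(f, op->op1) + *resolve(f, op->op2);
            break;
        case OP_ASSIGN:
            /* assumption of this sketch: op1 is the target, op2 the value */
            *resolve(f, op->op1) = *resolve(f, op->op2);
            break;
        }
    }
}
```

With $a, $b and $c in variable slots 0, 1 and 2, the statement becomes ADD (result = temporary 0, op1 = $b, op2 = $c) followed by ASSIGN (op1 = $a, op2 = temporary 0).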
As for op_array, I assume you are referring to op_array field in znode
union. I don't think this one is used by the engine at runtime, it's a
compiler utility field.
I am also not really sure with these complexity,
whether is not actually some kind of abstract syntax
tree instead of a instruction set like Java
bytecode. Thats not a technical problem, but merely
an academic question to categorize/characterize PHP.
I think it's more like bytecode, indeed. Even though the instructions
are pretty high-level so with some effort you probably could build a
syntax tree out of it.
Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com