Hello all,
I am working on ARM server performance tuning, and I have been playing
around a bit with the various executor modes and zend_vm_gen.php.
As it turns out (scroll down for numbers), the GOTO executor is much
faster than the default CALL executor on ARM: 4.479 s versus 7.449 s in
total on Zend/bench.php. perf attributes part of this to fewer branch
mispredictions (1.05% versus 4.51% of all branches), but there are
probably other factors at play here as well.
My question is whether we could parametrize this in the build system,
for instance by adding alternate files zend_vm_opcodes-goto.h and
zend_vm_execute-goto.h to the tree, and selecting those when targeting
ARM (and perhaps other archs that may prefer GOTO over CALL as well). Or
is there a better way of including/selecting alternate executors?
Also, while playing around, I noticed that building the executor without
specialization is broken: there are erroneous FREE_OP2() calls left
behind in the handlers for 'break' and 'continue'. If nobody objects, I
will remove them (zend_vm_def.h, lines 3302 and 3314).
Regards,
Ard.

ARM Cortex-A15 @ 1.7 GHz with default executor (specialized CALL)
=================================================================

simple             0.358
simplecall         0.396
simpleucall        0.419
simpleudcall       0.458
mandel             0.839
mandel2            1.038
ackermann(7)       0.400
ary(50000)         0.096
ary2(50000)        0.087
ary3(2000)         0.490
fibo(30)           1.157
hash1(50000)       0.135
hash2(500)         0.096
heapsort(20000)    0.266
matrix(20)         0.309
nestedloop(12)     0.499
sieve(30)          0.363
strcat(200000)     0.046
Total              7.449

Performance counter stats for 'php Zend/bench.php':

    7444.535230 task-clock         #    0.983 CPUs utilized
            103 context-switches   #    0.014 K/sec
              9 cpu-migrations     #    0.001 K/sec
           5963 page-faults        #    0.801 K/sec
    12728701964 cycles             #    1.710 GHz
    13603248229 instructions       #    1.07  insns per cycle
     2633774500 branches           #  353.786 M/sec
      118799433 branch-misses      #    4.51% of all branches

    7.570311211 seconds time elapsed

ARM Cortex-A15 @ 1.7 GHz with specialized GOTO executor
=======================================================

simple             0.185
simplecall         0.295
simpleucall        0.249
simpleudcall       0.257
mandel             0.349
mandel2            0.529
ackermann(7)       0.252
ary(50000)         0.061
ary2(50000)        0.060
ary3(2000)         0.393
fibo(30)           0.798
hash1(50000)       0.092
hash2(500)         0.079
heapsort(20000)    0.195
matrix(20)         0.206
nestedloop(12)     0.214
sieve(30)          0.241
strcat(200000)     0.025
Total              4.479

Performance counter stats for '~/php Zend/bench.php':

    4468.040559 task-clock         #    0.983 CPUs utilized
             79 context-switches   #    0.018 K/sec
              9 cpu-migrations     #    0.002 K/sec
           5062 page-faults        #    0.001 M/sec
     7561345552 cycles             #    1.692 GHz
    11297962039 instructions       #    1.49  insns per cycle
     2121936756 branches           #  474.914 M/sec
       22190686 branch-misses      #    1.05% of all branches

    4.545350085 seconds time elapsed
The GOTO executor is faster on x86 as well.
This can be shown with synthetic benchmarks; on real-life applications,
however, it makes no significant difference (sometimes it is even
slower).
I'm not sure how we should generate (and test) zend_vm_execute-goto.h.
Probably the only good option is to generate all the different executors
at once, and maybe even link them all together and select one at run time.
The FREE_OP2() calls in BRK/CONT may be removed.
Thanks. Dmitry.
On Thu, Jun 20, 2013 at 5:55 PM, Ard Biesheuvel
ard.biesheuvel@linaro.org wrote: