Hello all,
As I indicated yesterday, I am looking into PHP performance when running
on ARM, and while doing that, I ran into the following code in
Zend/zend_operators.h.
For 32-bit x86, we have the following assembler code to do a 'fast'
increment (%0 is a zval*):
incl (%0) @ increment value at address %0
jno 0f @ jump to label 0 if no overflow
movl $0x0, (%0) @ else store 0 (low word of the double) at %0
movl $0x41e00000, 0x4(%0) @ store high word of (double)(LONG_MAX+1) at %0+4
movb $0x2,0xc(%0) @ set field %0 + 0xc to 2
0:
This code makes a fair number of assumptions:
- zval.lval is the first member of a zval struct
- zval.dval is the second member, at offset 4
- zval.type lives at offset 0xc
- IS_DOUBLE == 0x2
For 64-bit x86, we have a similar snippet:
incq (%0) @ increment value at address %0
jno 0f @ jump to label 0 if no overflow
movl $0x0, (%0) @ else store 0 (low word of the double) at %0
movl $0x43e00000, 0x4(%0) @ store high word of (double)(LONG_MAX+1) at %0+4
movb $0x2,0x14(%0) @ set field %0 + 0x14 to 2
0:
Only, as expected, (double)LONG_MAX+1.0 has a different value (since
LONG_MAX on 64-bit differs from its 32-bit value). But also note that
while the hard-coded offset of the type field (0x14) is likely correct,
the dval offset is 0x4, just as in the previous example, even though a
long is 8 bytes on 64-bit. In other words, this code is broken.
Another thing to note is that branch predictors typically predict
forward branches as not taken, and it is up to the compiler to arrange
the branch optimally; since we branch on no overflow, the code above
basically amounts to using
if (EXPECT(overflow)) {
    /* juggle doubles */
} else {
    lval++;
}
The fast_add and fast_sub functions also appear to be broken. They fall
back to double arithmetic if the integer instructions overflow:
fildl (%1) @ load operand 1
fildl (%2) @ load operand 2
faddp %%st, %%st(1) @ perform the addition
movb $0x2,0xc(%0) @ set type of result to IS_DOUBLE
fstpl (%0) @ store the double result to (%0), i.e. over lval!!
Note that this code is only executed if we have already established that
the integer addition will overflow, and so we know lval cannot hold the
result. Still, we store the result in lval, and nothing is stored to the
dval field in this case.
I ran some tests on both x86 and ARM, and the inline assembler does not
result in a noticeable speed-up, as far as I can tell. Adding to that
the fact that they are all broken for x86, I propose we get rid of them
and just use the C versions in all cases. (I will not be adding ARM
versions of the inline assembler, as the performance gain is not obvious.)
Regards,
Ard.
Replying to self:
The code is not as broken as I thought. Apologies for the noise.
(I misread zend.h where lval and dval are actually part of a union type)
However, there is still a lot of hard-coding that I think should be
cleaned up, and the gain is not obvious.
--
Ard.