Hello,
Again, apologies for prematurely declaring someone else's code 'crap'.
There are no bugs in the inline x86 assembler in Zend/zend_operators.h as
far as I can tell, just two kinds of issues that I still think should be
addressed.
First of all, from a maintenance point of view, having field offsets (like
the offset of zval.type) and constants (like $0x2 for IS_DOUBLE) hard-coded
inside the instructions is a bad idea.
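To illustrate the kind of thing I mean (a rough sketch, not necessarily what
the attached patch does; it assumes the Zend headers for zval and IS_DOUBLE,
<stddef.h> for offsetof, and op being a zval pointer), such values can be
passed in as immediate operands so the instruction text contains no magic
numbers:

/* Sketch only: the 'c' operand modifier prints an immediate without the
 * '$', so the offset of zval.type and the IS_DOUBLE constant come from
 * the headers instead of being hard-coded in the template. */
__asm__ volatile(
        "movb $%c1, %c2(%0)"   /* op->type = IS_DOUBLE */
        :
        : "r" (op), "i" (IS_DOUBLE), "i" (offsetof(zval, type))
        : "memory");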
The other issue is the branching and the floating point instructions.
The inline assembler addresses the common case, but also adds a bunch of
instructions that address the corner case, and some branches to jump
over them. As I indicated in my previous email, branching is relatively
costly on a modern CPU with deep pipelines, and having a bunch of FPU
instructions in there that hardly ever get executed doesn't help either.
The primary reason for having inline assembler at all is the ability to
detect overflow. This mainly applies to multiplication: in that case,
detecting overflow in C code is much harder than reading a condition flag
in the CPU (hence the various accelerated implementations in
zend_signed_multiply.h). However, detecting overflow in
addition/subtraction implemented in C is much easier, as the code in
zend_operators.h proves: it is just a matter of checking the sign bits, or
doing a simple compare with LONG_MIN/LONG_MAX.
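For addition, for example, the check only needs the sign bits of the
operands and the result; a minimal sketch in plain C (not the actual Zend
code):

/* Sketch: signed addition overflows only when both operands have the same
 * sign and the result has the opposite sign. Unsigned arithmetic is used
 * so the addition itself is well defined. */
static int add_overflows(long a, long b, long *result)
{
    unsigned long r = (unsigned long) a + (unsigned long) b;

    *result = (long) r;
    return ((~(a ^ b) & (a ^ (long) r)) < 0);
}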
Therefore, I would be interested in finding out which benchmark was used
to make the case for having these accelerated implementations in the
first place. The differences in performance between various
implementations are very small in the tests I have done.
As for the code style/maintainability, I propose to apply the attached
patch. The performance is on par, as far as I can tell, but it is
arguably better code. I will also hook in the ARM versions once I can show
that they actually improve performance.
Regards,
Ard.
Before
$ time php -r 'for ($i = 0; $i < 0x7fffffff; $i++);'
real 0m56.910s
user 0m56.876s
sys 0m0.008s
$ time php -r 'for ($i = 0x7fffffff; $i >= 0; $i--);'
real 1m34.576s
user 1m34.518s
sys 0m0.020s
$ time php -r 'for ($i = 0; $i < 0x7fffffff; $i += 3);'
real 0m21.494s
user 0m21.473s
sys 0m0.008s
$ time php -r 'for ($i = 0x7fffffff; $i >= 0; $i -= 3);'
real 0m19.879s
user 0m19.865s
sys 0m0.004s
After
$ time php -r 'for ($i = 0; $i < 0x7fffffff; $i++);'
real 0m56.687s
user 0m56.656s
sys 0m0.004s
$ time php -r 'for ($i = 0x7fffffff; $i >= 0; $i--);'
real 1m28.124s
user 1m28.082s
sys 0m0.004s
$ time php -r 'for ($i = 0; $i < 0x7fffffff; $i += 3);'
real 0m20.561s
user 0m20.545s
sys 0m0.004s
$ time php -r 'for ($i = 0x7fffffff; $i >= 0; $i -= 3);'
real 0m20.524s
user 0m20.509s
sys 0m0.004s
Hi,
Any update on this?
--
Ferenc Kovács
@Tyr43l - http://tyrael.hu
Hi Ard,
Actually, with your patch fast_increment_function() is going to be
compiled into something like this:
incl (%ecx)
seto %al
test %al, %al
jnz .FLOAT
.END:
...
.FLOAT:
movl $0x0, (%ecx)
movl $0x41e00000, 0x4(%ecx)
movb $0x2, 0xc(%ecx)
jmp .END
while before the patch it would be:
incl (%ecx)
jno .END
.FLOAT:
movl $0x0, (%ecx)
movl $0x41e00000, 0x4(%ecx)
movb $0x2, 0xc(%ecx)
.END:
...
So the only advantage of your code is that it avoids a statically
mispredicted branch, at the cost of two additional CPU instructions.
However, the CPU branch predictor should make this advantage unimportant.
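Roughly, the difference comes from how the overflow flag leaves the asm
block. A simplified sketch of the two styles (32-bit x86, GCC extended
asm/asm goto, a plain long pointer instead of the real zval handling;
neither is the literal Zend code):

/* Patched style (sketch): seto copies the overflow flag into a C
 * variable, and the compiler adds its own test/jnz to branch on it. */
static void inc_seto_style(long *p)
{
    char overflow;

    __asm__("incl %0\n\t"
            "seto %1"
            : "+m" (*p), "=q" (overflow)
            :
            : "cc");
    if (overflow) {
        /* convert the operand to double here */
    }
}

/* Original style (sketch, using asm goto): the conditional jump stays
 * inside the asm block, so the fast path is just incl + jno. */
static void inc_jno_style(long *p)
{
    __asm__ goto("incl (%0)\n\t"
                 "jno %l[no_overflow]"
                 : /* no outputs allowed with asm goto */
                 : "r" (p)
                 : "cc", "memory"
                 : no_overflow);
    /* convert the operand to double here */
no_overflow:
    ;
}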
Thanks. Dmitry.
Hi Dmitry,
The main problem I have with this code is that most of it (the double
handling) is outside the hot path, and that it is riddled with
hardcoded constants, struct offsets, etc. However, if it works, then I
am not necessarily in favour of making changes to it.
So can you explain a little bit which benchmarks you used to prove
that the inline assembly is faster than C? Especially in the
increment/decrement cases, there is no real overflow detection
necessary other than comparing with LONG_MIN/LONG_MAX, so I would
expect the compiler to generate fairly optimal code in these cases.
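For increment, for instance, a plain-C version along these lines (a sketch,
not the actual Zend code) needs only a single compare on the fast path:

#include <limits.h>

/* Sketch: the only long value that overflows on increment is LONG_MAX,
 * so one compare is enough to detect it. */
static void fast_increment_sketch(long *lval)
{
    if (*lval == LONG_MAX) {
        /* overflow: the operand becomes (double) LONG_MAX + 1 */
        return;
    }
    (*lval)++;
}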
I am not trying to challenge these decisions, mind you. I am trying to
decide whether ARM will require similar handling as x86 to obtain
optimal performance.
Thanks,
Ard.
I can't remember if I did any special benchmarks except for bench.php when
I introduced the fast math functions.
At that time, I rearranged the code to allow inlining of the most probable
paths and added assembler code to catch overflow (C can't do it in an
optimal way). As I remember, bench.php showed some visible improvement.
Even increment/decrement saves one CPU instruction on the fast path:
incl (%ecx)
jno FAST_PATH
...
FAST_PATH:
instead of
cmpl $0x7fffffff, (%ecx)
je SLOW_PATH
incl (%ecx)
FAST_PATH:
However, I'm not sure if this saved instruction makes any visible speed
difference by itself.
Thanks. Dmitry.
OK, in that case I would still recommend using fewer hardcoded numbers
and offsets: please refer to the applied patch, it makes things a bit
more robust.
However, as I don't expect any speedup for ARM in this area, I will not
pursue this any further.
Are there any other optimizations involving inline assembler that you
implemented for PHP? I would like to review them to see if I need to add
an ARM version.
Cheers,
Ard.
On Mon, Feb 4, 2013 at 8:59 PM, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
OK, in that case I would still recommend using fewer hardcoded numbers and
offsets: please refer to the applied patch, it makes things a bit more
robust.
I completely agree with you and I'll try to fix it when I have time.
However, as I don't expect any speedup for ARM in this area, I will not
pursue this any further.
I'm not familiar with the ARM architecture, so I don't know whether inline
assembler can improve performance there.
Are there any other optimizations involving inline assembler that you
implemented for PHP? I would like to review them to see if I need to add an
ARM version.
You may take a look at zend_alloc.c. It uses special x86 instructions
for bit scanning and overflow checking during address calculation.
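The bit-scanning part, for example, is the kind of operation where a single
instruction replaces a loop; a hypothetical illustration (not the actual
allocator code):

/* Hypothetical sketch: BSR returns the index of the highest set bit,
 * i.e. floor(log2(size)), e.g. for mapping an allocation size to a
 * power-of-two bucket. Undefined if size == 0. A portable alternative
 * would be GCC's __builtin_clzl(). */
static unsigned int high_bit_index(unsigned long size)
{
    unsigned long index;

    __asm__("bsr %1, %0" : "=r" (index) : "rm" (size) : "cc");
    return (unsigned int) index;
}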
Thanks. Dmitry.