[PATCH] arithmetic speedup

14 years ago by Dmitry Stogov

Hi,

The attached patch improves the speed of numeric operations by inlining the
most probable paths directly into the executor. It also optimizes some
operations for x86 CPUs using assembler.

bench.php gets more than a 10% speedup (2.5 sec instead of 2.9 sec).
Real-life applications are not affected. All the PHPT tests pass.

I'm going to commit the patch next week if there are no objections.
Any related ideas are welcome.

Thanks. Dmitry.
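
[Editor's sketch, not taken from the patch] The idea of "inlining the most probable paths" can be illustrated roughly as follows: the common case (two integers, no overflow) is handled by a small inline function, and everything else falls through to an out-of-line generic routine. The `value` type and the names `add_fast` and `add_slow` are hypothetical simplifications, not the engine's real structures.

```c
#include <limits.h>

/* Hypothetical, simplified value type; not the engine's real zval. */
typedef enum { T_LONG, T_DOUBLE } val_type;
typedef struct {
    val_type type;
    union { long l; double d; } u;
} value;

/* Slow generic path, kept out of line: converts both operands to double. */
static void add_slow(value *res, const value *a, const value *b) {
    double x = (a->type == T_LONG) ? (double)a->u.l : a->u.d;
    double y = (b->type == T_LONG) ? (double)b->u.l : b->u.d;
    res->type = T_DOUBLE;
    res->u.d = x + y;
}

/* Fast path, meant to be inlined at the call site:
 * two longs whose sum does not overflow. */
static inline void add_fast(value *res, const value *a, const value *b) {
    if (a->type == T_LONG && b->type == T_LONG) {
        long x = a->u.l, y = b->u.l;
        /* Check for overflow before adding, so no undefined behavior occurs. */
        int overflow = (y > 0 && x > LONG_MAX - y) ||
                       (y < 0 && x < LONG_MIN - y);
        if (!overflow) {
            res->type = T_LONG;
            res->u.l = x + y;
            return;
        }
    }
    add_slow(res, a, b); /* rare cases: doubles, mixed types, overflow */
}
```

In this sketch the call to `add_slow` is only reached in the uncommon cases, so the usual integer addition pays no function-call cost.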

14 years ago by Pierre Joye

hi Dmitry,

Nice improvements, thanks :)

Any reason not to have done the changes for windows as well?

What about putting the asm code in an external file so it can be used by
more compilers? (Some have issues with inline asm, like VC in x64 mode;
others may as well, AFAIR.)

Cheers,

--
Pierre

@pierrejoye | http://blog.thepimp.net | http://www.libgd.org

14 years ago by Dmitry Stogov

Hi Pierre,

> hi Dmitry,
>
> Nice improvements, thanks :)
>
> Any reason not to have done the changes for windows as well?

Sorry, I'm not an expert in MS VC inline assembler.
As I remember, in VC6 it was poor and didn't allow complicated things.
If someone can add support for VC, that would be great.

> What about putting the asm code in an external file so it can be used by
> more compilers? (Some have issues with inline asm, like VC in x64 mode;
> others may as well, AFAIR.)

The main idea of the patch is inlining, and I don't know how I can inline
from an external file.

In simple cases the function call, parameter passing, prologue, and epilogue
add more overhead than the operation itself. So inlining is responsible
for 90% of the speedup, while the asm optimization accounts for only 10%.

Thanks. Dmitry.

14 years ago by Pierre Joye

As this patch is GCC-only (and Linux, I suppose), I would suggest
simply committing it to trunk right now. Other platforms can implement it
later (I will do it for Windows in the next few weeks).

Cheers,

Pierre

@pierrejoye | http://blog.thepimp.net | http://www.libgd.org

14 years ago by Dmitry Stogov

Hi Pierre,

> As this patch is GCC-only (and Linux, I suppose),

The inlining must be done on all platforms so it should affect Windows too.

> I would suggest
> simply committing it to trunk right now. Other platforms can implement it
> later (I will do it for Windows in the next few weeks).

I'll commit on Monday.

Thanks. Dmitry.

14 years ago by Pierre Joye

On Fri, May 20, 2011 at 12:01 PM, Dmitry Stogov <dmitry@zend.com> wrote:

>> As this patch is GCC-only (and Linux, I suppose),
>
> The inlining must be done on all platforms so it should affect Windows too.

It could be both inlined and externalized; the only issue is with compilers
and their inline assemblers. Will test.

>> I would suggest
>> simply committing it to trunk right now. Other platforms can implement it
>> later (I will do it for Windows in the next few weeks).
>
> I'll commit on Monday.

Thanks!

--
Pierre

@pierrejoye | http://blog.thepimp.net | http://www.libgd.org

14 years ago by jvlad

"Dmitry Stogov" dmitry@zend.com wrote in message
news:4DD63C03.3090401@zend.com...

> The main idea of the patch is inlining, and I don't know how I can inline
> from an external file.

Via #include and macros, as many others do.
See longlong.h in GNU libgcrypt-1.4.6/libgcrypt-1.4.6/mpi/

just my 2c
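
[Editor's sketch, not from the thread or from libgcrypt] The "#include and macros" approach might look roughly like this: a header defines one macro, picking an inline-asm body for GCC-style compilers on x86 and a plain C body elsewhere. Because the macro is expanded at each use site, the code is inlined even though it lives in a separate file. The header name `fast_add.h` and the macro `FAST_ADD` are hypothetical, and the asm is an untuned illustration of overflow detection via the CPU overflow flag.

```c
/* Contents of a hypothetical header, e.g. fast_add.h. */
#include <limits.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
/* GCC-style inline asm: add, then copy the CPU overflow flag into ovf.
 * res is only meaningful when ovf ends up 0. */
#define FAST_ADD(a, b, res, ovf)                      \
    do {                                              \
        long r_ = (a);                                \
        unsigned char o_;                             \
        __asm__("add %2, %0\n\t"                      \
                "seto %1"                             \
                : "+r"(r_), "=&q"(o_)                 \
                : "r"((long)(b))                      \
                : "cc");                              \
        (res) = r_;                                   \
        (ovf) = o_;                                   \
    } while (0)
#else
/* Portable fallback: test for overflow before adding.
 * res is set to 0 on overflow and is only meaningful when ovf is 0. */
#define FAST_ADD(a, b, res, ovf)                      \
    do {                                              \
        long a_ = (a), b_ = (b);                      \
        (ovf) = (b_ > 0 && a_ > LONG_MAX - b_) ||     \
                (b_ < 0 && a_ < LONG_MIN - b_);       \
        (res) = (ovf) ? 0 : a_ + b_;                  \
    } while (0)
#endif
```

A caller would write something like `FAST_ADD(x, y, sum, overflowed);` and take the slow path when `overflowed` is non-zero. Supporting another compiler then means adding one more `#elif` branch to the header rather than touching the callers.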

14 years ago by Sebastian Bergmann

> bench.php gets more than a 10% speedup (2.5 sec instead of 2.9 sec).
> Real-life applications are not affected. All the PHPT tests pass.

I chatted with Kore Nordmann, the creator of Image_3D (raytracer written
in PHP) and ezcGraph (chart component in the Zeta Components library)
last night. His code will definitely benefit from these improvements.

Another performance improvement with regard to math functionality in
PHP could be compiling math functions such as abs() into specialized
opcodes, thus avoiding the function call overhead that is otherwise
incurred. Kore mentioned, for example, that Xdebug and KCacheGrind
currently show that most time is spent in several hundred thousand calls
to abs() while running the component's test suite.

--
Sebastian Bergmann Co-Founder and Principal Consultant
http://sebastian-bergmann.de/ http://thePHP.cc/

14 years ago by Martynas Venckus

What platform was that on? GCC already inlines its builtins by
default (even at -O0). E.g., abs() generates the following code:

   movl    -4(%rbp), %eax
   movl    %eax, %edx
   sarl    $31, %edx
   movl    %edx, %eax
   xorl    -4(%rbp), %eax
   subl    %edx, %eax
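
[Editor's sketch] For readers who don't read x86 assembly, the sequence above corresponds roughly to the branchless C below. The function name `abs_branchless` is made up for illustration. It assumes that right-shifting a negative int is an arithmetic shift, which is implementation-defined in C but is what GCC does, and like abs() it is not defined for INT_MIN.

```c
#include <limits.h>

/* Branchless absolute value, mirroring the sarl/xorl/subl sequence. */
static int abs_branchless(int x) {
    /* sarl $31: mask is 0 for non-negative x, -1 (all ones) for negative x
     * (assumes an arithmetic right shift, as GCC provides). */
    int mask = x >> (sizeof(int) * CHAR_BIT - 1);
    /* xorl, subl: flips the bits and adds 1 when x is negative. */
    return (x ^ mask) - mask;
}
```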

I think it's wrong to do the md (machine-dependent) inlines in PHP itself
for a couple of reasons:

   - There's a huge list of such md functions; esp. the fp ones
     would show bigger benefits. However, inlining each one of them is
     infeasible.
   - The inlines are platform-dependent, so this would only
     benefit a few platforms.
   - It is generally the wrong level to do such optimizations--this
     is a compiler's job.

If your profiling shows that there's a function worth an md inline and the
compiler doesn't already do that, submit a bug to your compiler
vendor. (-;

14 years ago by Stas Malyshev

Hi!

> What platform was that on? GCC already inlines its builtins by
> default (even at -O0). E.g., abs() generates the following code:

As I understand it, Sebastian wasn't talking about inlining C abs(). He was
talking about converting PHP abs() (which is a function call right now,
with all the overhead that this implies) to an opcode, which makes it
somewhat cheaper to call, since the engine will be handling it without
the overhead associated with a PHP function call.
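
[Editor's sketch, not engine code] The difference between a function call and a dedicated opcode can be shown with a toy interpreter. All names here (`OP_CALL`, `OP_ABS`, `fn_table`, `execute`) are hypothetical, and the lookup by name is a deliberate simplification of what a real engine does when it calls a function.

```c
#include <stdlib.h>
#include <string.h>

typedef long (*native_fn)(long);
typedef struct { const char *name; native_fn fn; } fn_entry;

static long native_abs(long x) { return labs(x); }

/* Toy function table: a generic call has to go through it. */
static const fn_entry fn_table[] = { { "abs", native_abs } };

typedef enum { OP_CALL, OP_ABS } opcode;
typedef struct { opcode op; const char *fname; } instr;

/* Execute one instruction on a single argument. */
static long execute(instr in, long arg) {
    switch (in.op) {
    case OP_ABS:
        /* Specialized opcode: handled directly in the executor. */
        return arg < 0 ? -arg : arg;
    case OP_CALL:
        /* Generic call: look up the function, then call it indirectly. */
        for (size_t i = 0; i < sizeof fn_table / sizeof fn_table[0]; i++)
            if (strcmp(fn_table[i].name, in.fname) == 0)
                return fn_table[i].fn(arg);
        break;
    }
    return 0; /* unknown function: this sketch simply returns 0 */
}
```

Both instructions compute the same result; the `OP_ABS` case just skips the lookup and the indirect call, which is the kind of saving being discussed.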

--
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227

14 years ago by Martynas Venckus

That probably makes sense.

The original diff had optimizations using x86 assembler; this misled me. (-;