Hi Guys,
My name is Bogdan Andone and I work for Intel in the area of SW performance analysis and optimizations.
We would like to actively contribute to the Zend PHP project and to get involved in finding new performance improvement opportunities based on available and/or new hardware features.
I am still in the source code digesting phase, but I had a look at the fast_memcpy() implementation in the opcache extension, which uses SSE intrinsics.
If I am not wrong, fast_memcpy() is not currently used, as I didn't find the "-msse4.2" gcc flag in the Makefile. I assume you probably didn't see any performance benefit, so you preserved the generic memcpy() usage.
I would like to propose a slightly different implementation which uses _mm_store_si128() instead of _mm_stream_si128(). This ensures that the copied memory is preserved in the data cache, which is not bad, as the interpreter will start to use this data without the need to go back to memory one more time. _mm_stream_si128() in the current implementation is intended for stores where we want to avoid reading data into the cache and polluting it; in the opcache scenario it seems that preserving the data in cache has a positive impact.
Running php-cgi -T10000 on WordPress4.1/index.php I see a ~1% performance increase for the new version of fast_memcpy() compared with the generic memcpy(). Same result using a full load test with http_load on an 18-core Haswell EP.
Here is the proposed pull request: https://github.com/php/php-src/pull/1446
Related to the SW prefetching instructions in fast_memcpy()... they are not really useful in this place. Their benefit is almost negligible, as the address requested for prefetch will be needed at the next iteration (a few cycles later), while the time needed to get data from RAM is usually >100 cycles. Nevertheless... they don't hurt, and it seems they still have a very small benefit, so I preserved the original instruction and added a new prefetch request for the destination pointer.
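To make this concrete, here is a minimal sketch of the loop shape I mean; the authoritative version is in the PR below. The 16-bytes-per-iteration layout, the prefetch distance, and the prefetch hints are simplified assumptions here (the real code copies in larger chunks), and dest/src are assumed 16-byte aligned with size a nonzero multiple of 16, which opcache can guarantee at this call site:

#include <stddef.h>     /* size_t */
#include <emmintrin.h>  /* SSE2: __m128i, _mm_load_si128, _mm_store_si128 */
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_* */

static void fast_memcpy_store(void *dest, const void *src, size_t size)
{
	__m128i *dqdest = (__m128i *)dest;
	const __m128i *dqsrc = (const __m128i *)src;
	const __m128i *end = (const __m128i *)((const char *)src + size);

	do {
		/* Prefetch one iteration ahead on both streams; as noted
		 * above, the benefit is small because the data is needed
		 * only a few cycles later anyway. */
		_mm_prefetch((const char *)(dqsrc + 1), _MM_HINT_NTA);
		_mm_prefetch((const char *)(dqdest + 1), _MM_HINT_T0);

		/* Aligned 16-byte load, then a regular (cached) store:
		 * unlike _mm_stream_si128(), _mm_store_si128() leaves the
		 * copied data hot in the cache for the interpreter. */
		__m128i xmm0 = _mm_load_si128(dqsrc);
		_mm_store_si128(dqdest, xmm0);
		dqsrc++;
		dqdest++;
	} while (dqsrc != end);
}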
Hope it helps,
Bogdan
Hi Andone,
I'm not sure why nobody has replied to you yet, we've all looked at the
PR and spent a lot of the day yesterday discussing it.
I've CC'd Dmitry, he doesn't always read internals, so this should get
his attention.
Lastly, very cool ... I look forward to some more cleverness ...
Cheers
Joe
On Wed, Jul 29, 2015 at 3:22 PM, Andone, Bogdan bogdan.andone@intel.com
wrote:
[snip]
Hey:
> Hi Andone,
> I'm not sure why nobody has replied to you yet, we've all looked at the
> PR and spent a lot of the day yesterday discussing it.
> I've CC'd Dmitry, he doesn't always read internals, so this should get
> his attention.
Sorry for the late response; Dmitry is on vacation now, so he will
probably not be able to reply soon.
Anyway, is the performance improvement seen consistently?
Have you tested it with a profiling tool? Is IR reduced, or are cache misses reduced?
thanks
> > Related to the SW prefetching instructions in fast_memcpy()... they are
> > not really useful in this place. Their benefit is almost negligible, as
> > the address requested for prefetch will be needed at the next iteration
> > (a few cycles later), while the time needed to get data from RAM is
> > usually >100 cycles.
Then maybe we don't need this in fast_memcpy? I mean, it may be used
widely if it is proven to be faster, but that would be out of this
context.
thanks
--
Xinchen Hui
@Laruence
http://www.laruence.com/
Hi Bogdan,
> -----Original Message-----
> From: Andone, Bogdan [mailto:bogdan.andone@intel.com]
> Sent: Wednesday, July 29, 2015 4:22 PM
> To: internals@lists.php.net
> Subject: [PHP-DEV] Introduction and some opcache SSE related stuff
>
> If I am not wrong, fast_memcpy() is not currently used, as I didn't find
> the "-msse4.2" gcc flag in the Makefile.
[snip]
AFAIR we always rely on the standard features, thus SSE2 in this particular
case, for better compatibility. IMHO using newer things should be done more
carefully. Having more stats wouldn't be bad; from what I see, at least per
http://store.steampowered.com/hwsurvey, it's still not safe to just switch
away from SSE2. Maybe introducing some flexible solution, like compile-time
switches for people who want to exhaust the features of modern hardware, or
specific features available from vendors, could be an approach. But that is
of course a project-level decision.
Regards
Anatol
Hi Bogdan,
On Wed, Jul 29, 2015 at 5:22 PM, Andone, Bogdan bogdan.andone@intel.com
wrote:
> I am still in the source code digesting phase, but I had a look at the
> fast_memcpy() implementation in the opcache extension, which uses SSE
> intrinsics. If I am not wrong, fast_memcpy() is not currently used, as I
> didn't find the "-msse4.2" gcc flag in the Makefile. I assume you probably
> didn't see any performance benefit, so you preserved the generic memcpy()
> usage.
This is not SSE4.2, this is SSE2.
Any x86_64 target implements SSE2, so it's enabled by default on x86_64
systems (at least on Linux).
It may also be enabled on x86 targets by adding the "-msse2" option.
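For example (an illustrative standalone check, not code from the tree): gcc and clang predefine __SSE2__ whenever the SSE2 intrinsics are usable, so a guard can stay portable across both cases:

#include <stdio.h>

int main(void)
{
#if defined(__SSE2__)
	/* Taken with no extra flags on x86_64 gcc/clang; on 32-bit x86
	 * it requires -msse2 on the command line. */
	puts("SSE2 enabled: <emmintrin.h> intrinsics may be used");
#else
	puts("SSE2 not enabled: fall back to plain memcpy()");
#endif
	return 0;
}

Compiling it with "gcc -m32 check.c" and then "gcc -m32 -msse2 check.c" shows the difference.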
> I would like to propose a slightly different implementation which uses
> _mm_store_si128() instead of _mm_stream_si128(). This ensures that the
> copied memory is preserved in the data cache, which is not bad, as the
> interpreter will start to use this data without the need to go back to
> memory one more time.
_mm_stream_si128() was used on purpose, to avoid CPU cache pollution,
because data copied from SHM to process memory is not necessarily used
before eviction.
By the way, I'm not completely sure. Maybe _mm_store_si128() can provide a
better result.
> Running php-cgi -T10000 on WordPress4.1/index.php I see a ~1% performance
> increase for the new version of fast_memcpy() compared with the generic
> memcpy(). Same result using a full load test with http_load on an 18-core
> Haswell EP.
1% is a really big improvement.
I'll be able to check this only next week (when back from vacation).
> Related to the SW prefetching instructions in fast_memcpy()... they are
> not really useful in this place. Their benefit is almost negligible, as
> the address requested for prefetch will be needed at the next iteration
> (a few cycles later), while the time needed to get data from RAM is
> usually >100 cycles.
I also didn't see a significant difference from software prefetching.
Thanks. Dmitry.
Hi Dmitry, Bogdan,
----- Original Message -----
From: "Dmitry Stogov"
Sent: Thursday, July 30, 2015
> Hi Bogdan,
> This is not SSE4.2, this is SSE2.
> Any x86_64 target implements SSE2, so it's enabled by default on x86_64
> systems (at least on Linux).
> It may also be enabled on x86 targets by adding the "-msse2" option.
Right, I was gonna say, I think that was a mistake, and all x86_64 should be
using it at least...
Of course, using anything newer that needs special options is nearly
useless, since I guess the vast majority aren't building themselves, but
using lowest-common-denominator repos. I had been wondering about speeding
up some other things, maybe taking advantage of SSE4.x (string stuff, I
don't know), but... like I said. Runtime checks would be awesome, but
except for recent GCC, the intrinsics aren't available unless the
corresponding SSE option is enabled (lame!). So it requires a separate
compilation unit. :-/
Of course, I guess if the intrinsic maps simply to the instruction, one
could just do it with inline asm, if one wanted to do runtime CPU checking.
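Something like the following sketch, assuming a GCC new enough (4.8+) to have __builtin_cpu_supports(); copy_sse42() is a hypothetical variant that would still live in its own compilation unit built with -msse4.2:

#include <stddef.h>
#include <string.h>

/* Hypothetical SSE4.2 variant, defined in a separate file compiled
 * with -msse4.2 so the intrinsics are available there. */
void copy_sse42(void *dest, const void *src, size_t n);

static void copy_generic(void *dest, const void *src, size_t n)
{
	memcpy(dest, src, n);
}

/* Resolved once at startup and cached in a function pointer. */
static void (*copy_impl)(void *, const void *, size_t) = copy_generic;

void copy_init(void)
{
	/* __builtin_cpu_init() must run before __builtin_cpu_supports()
	 * when called from constructors; the check compiles down to a
	 * cached CPUID-based lookup. */
	__builtin_cpu_init();
	if (__builtin_cpu_supports("sse4.2")) {
		copy_impl = copy_sse42;
	}
}

void copy(void *dest, const void *src, size_t n)
{
	copy_impl(dest, src, n);
}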
> _mm_stream_si128() was used on purpose, to avoid CPU cache pollution,
> because data copied from SHM to process memory is not necessarily used
> before eviction.
> By the way, I'm not completely sure. Maybe _mm_store_si128() can provide
> a better result.
Interesting (that _stream was used on purpose). :-)
> > Running php-cgi -T10000 on WordPress4.1/index.php I see a ~1% performance
> > increase for the new version of fast_memcpy() compared with the generic
> > memcpy().
> 1% is a really big improvement.
> I'll be able to check this only next week (when back from vacation).
Well, he talks like he was comparing to generic memcpy(), so...? But I'm
not sure how that would have been accomplished.
BTW guys, I was wondering before: why is fast_memcpy() only in this opcache
area? For the prefetch and/or cache pollution reasons?
Because shouldn't the library functions in glibc, etc. already be using
versions optimized for the CPU at runtime? So is generic memcpy() already
"fast"? (Other than the overhead of a function call.)
> > Related to the SW prefetching instructions in fast_memcpy()... they are
> > not really useful in this place.
> I also didn't see a significant difference from software prefetching.
So how about prefetching "further"/more iterations ahead...?
- Matt
> BTW guys, I was wondering before: why is fast_memcpy() only in this opcache
> area? For the prefetch and/or cache pollution reasons?
Just because in this place we may copy big blocks, and we also may align
them properly, to use compact and fast inlined code.
> Because shouldn't the library functions in glibc, etc. already be using
> versions optimized for the CPU at runtime? So is generic memcpy() already
> "fast"? (Other than the overhead of a function call.)
glibc already uses an optimized memcpy(), but it is a universal function
that has to check for different conditions, like the alignment of source
and destination, and the length.
> So how about prefetching "further"/more iterations ahead...?
I tried, but didn't see a difference either.
Thanks. Dmitry.
> > Running php-cgi -T10000 on WordPress4.1/index.php I see a ~1% performance
> > increase for the new version of fast_memcpy() compared with the generic
> > memcpy().
> 1% is a really big improvement.
> Just because in this place we may copy big blocks, and we also may align
> them properly, to use compact and fast inlined code.
Yeah... in fact all my numbers are against the current fast_memcpy() implementation, not against the generic memcpy(). Sorry for the misleading information... :-/ I was playing in my corner with some SSE4.2 experiments and I wasn't aware that SSE2 is enabled by default without any need for a compiler switch.
Coming back to the issue, and trying to also answer laruence's request for more numbers:
I am running php-cgi -T10000 on a Haswell with a 45MB L3 cache.
The improvement is visible in scenarios where the amount of data loaded via opcache is significant while the real execution time is not so big; this is the case for real-life scenarios:
- WordPress 4.1 & MediaWiki 1.24: ~1% performance increase
- Drupal 7.36: ~0.6% performance increase
- The improvement is not visible on synthetic benchmarks (mandelbrot, micro_bench, …) which load a small amount of bytecode and are computation-intensive.
The explanation lies in data cache misses. I did a deeper analysis on WordPress 4.1 using the perf tool:
- _mm_stream based implementation: ~3x10^-4 misses/instruction => 1.023 instructions/cycle
- _mm_store based implementation: ~9x10^-6 misses/instruction (33x less) => 1.035 instructions/cycle
So the overall performance gain is fully explained by the increase in instructions/cycle due to lower cache misses; copying the opcache data acts as a kind of "software prefetcher" for further execution. This phenomenon is most visible on processors with big caches. If I go to a smaller L3 cache (45MB -> 6.75MB), the 1% WordPress gain becomes a 0.6% gain (as the cache's capacity to keep "prefetched" opcache data without polluting the execution path becomes smaller).
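For reference, numbers of this kind can be collected with perf stat; the event selection below is illustrative, not necessarily the exact counters used above:

# Counts over the whole 10000-iteration run; misses/instruction and
# instructions/cycle (IPC) follow directly from the three counters.
perf stat -e cache-misses,instructions,cycles \
    php-cgi -T10000 WordPress4.1/index.php > /dev/null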
Coming back to generic memcpy(): the fast_memcpy() implementation seems to execute very slightly fewer instructions (it is hard to measure the real IR delta due to run-to-run variation). Averaging a couple of measurements to absorb run-to-run effects, I see a ~0.2% performance increase in favor of fast_memcpy() w/ _mm_store; it is the same increase I see for the implementation w/ SW prefetchers compared with no SW prefetch in place. So the gain we see might be explained by the fact that memcpy() does not use SW prefetching - just a guess...
Kind Regards,
Bogdan
I can confirm a minor but consistent speed-up on applications with a big
code base.
The PR was merged into master.
Thanks. Dmitry.
On Fri, Jul 31, 2015 at 5:26 PM, Andone, Bogdan bogdan.andone@intel.com
wrote:
[snip]