Newsgroups: php.internals
Subject: Introduction and some opcache SSE related stuff
From: bogdan.andone@intel.com ("Andone, Bogdan")
To: internals@lists.php.net
Date: Wed, 29 Jul 2015 14:22:29 +0000
Message-ID: <0ABC26E371A76440A370CFC5EB1056CC2F6C9AE9@IRSMSX106.ger.corp.intel.com>

Hi Guys,

My name is Bogdan Andone and I work for Intel in the area of software performance analysis and optimization. We would like to contribute actively to the Zend/PHP project and get involved in finding new performance improvement opportunities based on available and/or new hardware features.

I am still in the source-code digesting phase, but I had a look at the fast_memcpy() implementation in the opcache extension, which uses SSE intrinsics.

If I am not wrong, the fast_memcpy() function is not currently used, as I didn't find the "-msse4.2" gcc flag in the Makefile. I assume you probably didn't see any performance benefit, so you kept the generic memcpy() usage.

I would like to propose a slightly different implementation which uses _mm_store_si128() instead of _mm_stream_si128(). This keeps the copied memory in the data cache, which is useful here, as the interpreter will start using this data right away without having to go back to memory one more time.
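Just to make the idea concrete, here is a rough sketch of the kind of loop I have in mind (not the exact code from the pull request; the name fast_memcpy_store is only for illustration). It assumes 16-byte-aligned buffers and a size that is a nonzero multiple of 64 bytes, which is what the existing SSE loop seems to rely on as well:

#include <emmintrin.h>  /* SSE2: __m128i, _mm_load_si128, _mm_store_si128 */
#include <stddef.h>     /* size_t */

/* Copies 'size' bytes in 64-byte chunks. Assumes 16-byte-aligned
 * buffers and a size that is a nonzero multiple of 64. */
static void fast_memcpy_store(void *dest, const void *src, size_t size)
{
    __m128i *dqdest = (__m128i *)dest;
    const __m128i *dqsrc = (const __m128i *)src;
    const __m128i *end = (const __m128i *)((const char *)src + size);

    do {
        __m128i xmm0 = _mm_load_si128(dqsrc + 0);
        __m128i xmm1 = _mm_load_si128(dqsrc + 1);
        __m128i xmm2 = _mm_load_si128(dqsrc + 2);
        __m128i xmm3 = _mm_load_si128(dqsrc + 3);
        dqsrc += 4;
        /* _mm_store_si128() keeps the copied lines in the data cache;
         * _mm_stream_si128() would bypass it with non-temporal stores. */
        _mm_store_si128(dqdest + 0, xmm0);
        _mm_store_si128(dqdest + 1, xmm1);
        _mm_store_si128(dqdest + 2, xmm2);
        _mm_store_si128(dqdest + 3, xmm3);
        dqdest += 4;
    } while (dqsrc != end);
}

The only functional difference from the current loop is the regular stores instead of the non-temporal ones; since the interpreter reads the copied data almost immediately, keeping it cache-resident is exactly what we want here.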
_mm_stream_si128() in the current implementation is intended for stores where we want to avoid pulling data into the cache and the resulting cache pollution; in the opcache scenario it seems that keeping the data in the cache has a positive impact.

Running php-cgi -T10000 on WordPress 4.1's index.php I see a ~1% performance increase for the new version of fast_memcpy() compared with the generic memcpy(). I get the same result with a full load test using http_load on an 18-core Haswell-EP system.

Here is the proposed pull request: https://github.com/php/php-src/pull/1446

Related to the SW prefetching instructions in fast_memcpy()... they are not really useful in this place. Their benefit is almost negligible, as the address requested for prefetch will be needed at the very next iteration (a few cycles later), while fetching data from RAM usually takes more than 100 cycles. Nevertheless... they don't hurt, and they still seem to have a very small benefit, so I preserved the original instruction and added a new prefetch request for the destination pointer.

Hope it helps,
Bogdan
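P.S. To illustrate the prefetching point: below is a rough sketch of one 64-byte iteration of the copy loop with the source prefetch kept and a destination prefetch added. The helper name and the exact hint constants are only for illustration and are not necessarily what the pull request uses.

#include <emmintrin.h>  /* __m128i, _mm_load_si128, _mm_store_si128 */
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0, _MM_HINT_NTA */

/* One 64-byte iteration with both prefetch hints. The prefetched lines
 * are needed on the very next iteration, only a few cycles later, so a
 * >100-cycle DRAM access is not really hidden; the hints are cheap,
 * though, so they are kept. */
static void copy_64_bytes(__m128i **pdest, const __m128i **psrc)
{
    const __m128i *dqsrc = *psrc;
    __m128i *dqdest = *pdest;

    _mm_prefetch((const char *)(dqsrc + 4), _MM_HINT_NTA);  /* keep a prefetch for the next source chunk */
    _mm_prefetch((const char *)(dqdest + 4), _MM_HINT_T0);  /* new: prefetch the next destination chunk */

    _mm_store_si128(dqdest + 0, _mm_load_si128(dqsrc + 0));
    _mm_store_si128(dqdest + 1, _mm_load_si128(dqsrc + 1));
    _mm_store_si128(dqdest + 2, _mm_load_si128(dqsrc + 2));
    _mm_store_si128(dqdest + 3, _mm_load_si128(dqsrc + 3));

    *psrc = dqsrc + 4;
    *pdest = dqdest + 4;
}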