Newsgroups: php.internals
Subject: Re: [PHP-DEV] Introduction and some opcache SSE related stuff
From: dmitry@zend.com (Dmitry Stogov)
To: "Andone, Bogdan"
Cc: Matt Wilmas, PHP Internals
Date: Mon, 3 Aug 2015 15:11:06 +0300

I can confirm a minor but consistent speed-up on applications with a big
code base. The PR was merged into master.

Thanks. Dmitry.

On Fri, Jul 31, 2015 at 5:26 PM, Andone, Bogdan wrote:
> >>>> Running php-cgi -T10000 on WordPress4.1/index.php I see a ~1% performance
> >>>> increase for the new version of fast_memcpy() compared with the generic
> >>>> memcpy(). Same result using a full load test with http_load on an 18-core
> >>>> Haswell EP.
> >>>
> >>> 1% is a really big improvement.
> >>> I'll be able to check this only next week (when I'm back from vacation).
> >>
> >> Well, he talks like he was comparing to *generic* memcpy(), so...? But I'm
> >> not sure how that would have been accomplished.
> >>
> >> BTW guys, I was wondering before why fast_memcpy() is used only in this
> >> opcache area?
> >> For the prefetch and/or cache pollution reasons?
> >
> > Just because, in this place, we may copy big blocks, and we may also
> > align them properly, to use compact and fast inlined code.
>
> Yeah... in fact all my numbers are against the current fast_memcpy()
> implementation, not against generic memcpy(). Sorry for the misleading
> information... :-/ I was playing in my corner with some SSE4.2 experiments
> and I wasn't aware that SSE2 is enabled by default, without any need for a
> compiler switch.
>
> Coming back to the issue, and also trying to answer laruence's request for
> more numbers:
>
> I am running php-cgi -T10000 on a Haswell with a 45MB L3 cache.
> The improvement is visible in scenarios where the amount of data loaded
> via opcache is significant while the actual execution time is not that
> large; this is the case for real-life scenarios:
> - WordPress 4.1 & MediaWiki 1.24: ~1% performance increase
> - Drupal 7.36: ~0.6% performance increase
> - The improvement is not visible on synthetic benchmarks (mandelbrot,
> micro_bench, ...) which load a small amount of bytecode and are
> compute-intensive.
>
> The explanation lies in data cache misses. I did a deeper analysis on
> WordPress 4.1 using the perf tool:
> - _mm_stream based implementation: ~3x10^-4 misses/instruction => 1.023
> instructions/cycle
> - _mm_store based implementation: ~9x10^-6 misses/instruction (33x less)
> => 1.035 instructions/cycle
>
> So the overall performance gain is fully explained by the increase in
> instructions/cycle due to fewer cache misses; copying the opcache data acts
> as a kind of "software prefetch" for the execution that follows. This
> phenomenon is most visible on processors with big caches. If I go to a
> smaller L3 cache (45MB -> 6.75MB), the 1% WordPress gain becomes a 0.6%
> gain (as the cache's capacity to keep the "prefetched" opcache data without
> polluting the execution path becomes smaller).
>
> Coming back to generic memcpy(): the fast_memcpy() implementation seems to
> be very slightly smaller in terms of executed instructions (the real
> instruction counts are hard to measure due to run-to-run variation). Doing
> a couple of measurements to absorb the run-to-run effect, I see a ~0.2%
> performance increase in favor of fast_memcpy() with _mm_store; it is the
> same increase I see for the implementation with SW prefetchers compared
> with the case of no SW prefetch in place. So the gain we see might be
> explained by the fact that memcpy() does not use SW prefetching - just a
> guess...
>
> Kind Regards,
> Bogdan
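
[Editor's note: for readers following the _mm_stream vs. _mm_store comparison
above, here is a minimal sketch of the two SSE2 copy strategies being
discussed. It is an illustration, not the actual fast_memcpy() code from the
PR; the function names copy_stream/copy_store are made up, and it assumes
16-byte-aligned source and destination and a size that is a multiple of 64
bytes, which opcache can arrange for the blocks it copies out of shared
memory. The only difference between the two loops is whether the stores go
through the cache or bypass it.]

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

/* Non-temporal variant: _mm_stream_si128 bypasses the cache, so the
 * copied bytecode is NOT cache-hot when the VM starts executing it. */
static void copy_stream(void *dst, const void *src, size_t size)
{
	__m128i *d = (__m128i*)dst;
	const __m128i *s = (const __m128i*)src;

	for (; size >= 64; size -= 64, d += 4, s += 4) {
		_mm_stream_si128(d + 0, _mm_load_si128(s + 0));
		_mm_stream_si128(d + 1, _mm_load_si128(s + 1));
		_mm_stream_si128(d + 2, _mm_load_si128(s + 2));
		_mm_stream_si128(d + 3, _mm_load_si128(s + 3));
	}
}

/* Regular-store variant: _mm_store_si128 writes through the cache, so the
 * copy effectively prefetches the opcodes for the execution that follows,
 * which is where the measured reduction in cache misses comes from. */
static void copy_store(void *dst, const void *src, size_t size)
{
	__m128i *d = (__m128i*)dst;
	const __m128i *s = (const __m128i*)src;

	for (; size >= 64; size -= 64, d += 4, s += 4) {
		_mm_store_si128(d + 0, _mm_load_si128(s + 0));
		_mm_store_si128(d + 1, _mm_load_si128(s + 1));
		_mm_store_si128(d + 2, _mm_load_si128(s + 2));
		_mm_store_si128(d + 3, _mm_load_si128(s + 3));
	}
}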