Newsgroups: php.internals
Subject: Re: [PHP-DEV] Introduction and some opcache SSE related stuff
From: dmitry@zend.com (Dmitry Stogov)
To: "Andone, Bogdan"
Cc: Matt Wilmas, PHP Internals
Date: Mon, 3 Aug 2015 15:11:06 +0300

I can confirm a minor but consistent speed-up on applications with a big
code base. The PR was merged into master.

Thanks. Dmitry.

On Fri, Jul 31, 2015 at 5:26 PM, Andone, Bogdan wrote:
> >>>> Running php-cgi -T10000 on WordPress4.1/index.php I see a ~1% performance
> >>>> increase for the new version of fast_memcpy() compared with the generic
> >>>> memcpy(). Same result using a full load test with http_load on an 18-core
> >>>> Haswell EP.
> >>>
> >>> 1% is a really big improvement.
> >>> I'll be able to check this only next week (when I'm back from vacation).
> >>
> >> Well, he talks like he was comparing to *generic* memcpy(), so...? But I'm
> >> not sure how that would have been accomplished.
> >>
> >> BTW guys, I was wondering before why fast_memcpy() is used only in this
> >> opcache area?
> >> For the prefetch and/or cache pollution reasons?
> >
> > Just because, in this place, we may copy big blocks, and we may also
> > align them properly, to use compact and fast inlined code.
>
> Yeah... in fact all my numbers are against the current fast_memcpy()
> implementation, not against generic memcpy(). Sorry for the misleading
> information... :-/ I was playing in my corner with some SSE4.2 experiments
> and I wasn't aware that SSE2 is enabled by default, without any need for a
> compiler switch.
>
> Coming back to the issue, and also trying to answer laruence's request for
> more numbers:
>
> I am running php-cgi -T10000 on a Haswell with a 45MB L3 cache.
> The improvement is visible in scenarios where the amount of data loaded
> via opcache is significant while the actual execution time is not that
> large; this is the case for real-life scenarios:
> - WordPress 4.1 & MediaWiki 1.24: ~1% performance increase
> - Drupal 7.36: ~0.6% performance increase
> - The improvement is not visible on synthetic benchmarks (mandelbrot,
> micro_bench, ...) which load a small amount of bytecode and are
> compute-intensive.
>
> The explanation lies in data cache misses. I did a deeper analysis on
> WordPress 4.1 using the perf tool:
> - _mm_stream based implementation: ~3x10^-4 misses/instruction => 1.023
> instructions/cycle
> - _mm_store based implementation: ~9x10^-6 misses/instruction (33x less)
> => 1.035 instructions/cycle
>
> So the overall performance gain is fully explained by the increase in
> instructions/cycle due to fewer cache misses; copying the opcache data acts
> as a kind of "software prefetch" for the execution that follows. This
> phenomenon is most visible on processors with big caches. If I go to a
> smaller L3 cache (45MB -> 6.75MB), the 1% WordPress gain becomes a 0.6%
> gain (as the cache's capacity to keep the "prefetched" opcache data without
> polluting the execution path becomes smaller).
>
> Coming back to generic memcpy(): the fast_memcpy() implementation seems to
> be very slightly smaller in terms of executed instructions (the real
> instruction counts are hard to measure due to run-to-run variation). Doing
> a couple of measurements to absorb the run-to-run effect, I see a ~0.2%
> performance increase in favor of fast_memcpy() with _mm_store; it is the
> same increase I see for the implementation with SW prefetchers compared
> with the case of no SW prefetch in place. So the gain we see might be
> explained by the fact that memcpy() does not use SW prefetching - just a
> guess...
>
> Kind Regards,
> Bogdan
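
[Editor's note: for readers following the _mm_stream vs. _mm_store comparison
above, here is a minimal sketch of the two SSE2 copy strategies being
discussed. It is an illustration, not the actual fast_memcpy() code from the
PR; the function names copy_stream/copy_store are made up, and it assumes
16-byte-aligned source and destination and a size that is a multiple of 64
bytes, which opcache can arrange for the blocks it copies out of shared
memory. The only difference between the two loops is whether the stores go
through the cache or bypass it.]

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

/* Non-temporal variant: _mm_stream_si128 bypasses the cache, so the
 * copied bytecode is NOT cache-hot when the VM starts executing it. */
static void copy_stream(void *dst, const void *src, size_t size)
{
	__m128i *d = (__m128i*)dst;
	const __m128i *s = (const __m128i*)src;

	for (; size >= 64; size -= 64, d += 4, s += 4) {
		_mm_stream_si128(d + 0, _mm_load_si128(s + 0));
		_mm_stream_si128(d + 1, _mm_load_si128(s + 1));
		_mm_stream_si128(d + 2, _mm_load_si128(s + 2));
		_mm_stream_si128(d + 3, _mm_load_si128(s + 3));
	}
}

/* Regular-store variant: _mm_store_si128 writes through the cache, so the
 * copy effectively prefetches the opcodes for the execution that follows,
 * which is where the measured reduction in cache misses comes from. */
static void copy_store(void *dst, const void *src, size_t size)
{
	__m128i *d = (__m128i*)dst;
	const __m128i *s = (const __m128i*)src;

	for (; size >= 64; size -= 64, d += 4, s += 4) {
		_mm_store_si128(d + 0, _mm_load_si128(s + 0));
		_mm_store_si128(d + 1, _mm_load_si128(s + 1));
		_mm_store_si128(d + 2, _mm_load_si128(s + 2));
		_mm_store_si128(d + 3, _mm_load_si128(s + 3));
	}
}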