Newsgroups: php.internals
Subject: Introduction and some opcache SSE related stuff
From: bogdan.andone@intel.com ("Andone, Bogdan")
To: internals@lists.php.net
Date: Wed, 29 Jul 2015 14:22:29 +0000
Message-ID: <0ABC26E371A76440A370CFC5EB1056CC2F6C9AE9@IRSMSX106.ger.corp.intel.com>

Hi Guys,

My name is Bogdan Andone and I work for Intel in the area of software performance analysis and optimization. We would like to contribute actively to the Zend/PHP project and get involved in finding new performance improvement opportunities based on available and/or new hardware features.

I am still in the source-code digesting phase, but I had a look at the fast_memcpy() implementation in the opcache extension, which uses SSE intrinsics.

If I am not wrong, the fast_memcpy() function is not currently used, as I didn't find the "-msse4.2" gcc flag in the Makefile. I assume you probably didn't see any performance benefit, so you kept the generic memcpy() usage.

I would like to propose a slightly different implementation which uses _mm_store_si128() instead of _mm_stream_si128(). This keeps the copied memory in the data cache, which is useful here, as the interpreter will start using this data right away without having to go back to memory one more time.
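Just to make the idea concrete, here is a rough sketch of the kind of loop I have in mind (not the exact code from the pull request; the name fast_memcpy_store is only for illustration). It assumes 16-byte-aligned buffers and a size that is a nonzero multiple of 64 bytes, which is what the existing SSE loop seems to rely on as well:

#include <emmintrin.h>  /* SSE2: __m128i, _mm_load_si128, _mm_store_si128 */
#include <stddef.h>     /* size_t */

/* Copies 'size' bytes in 64-byte chunks. Assumes 16-byte-aligned
 * buffers and a size that is a nonzero multiple of 64. */
static void fast_memcpy_store(void *dest, const void *src, size_t size)
{
    __m128i *dqdest = (__m128i *)dest;
    const __m128i *dqsrc = (const __m128i *)src;
    const __m128i *end = (const __m128i *)((const char *)src + size);

    do {
        __m128i xmm0 = _mm_load_si128(dqsrc + 0);
        __m128i xmm1 = _mm_load_si128(dqsrc + 1);
        __m128i xmm2 = _mm_load_si128(dqsrc + 2);
        __m128i xmm3 = _mm_load_si128(dqsrc + 3);
        dqsrc += 4;
        /* _mm_store_si128() keeps the copied lines in the data cache;
         * _mm_stream_si128() would bypass it with non-temporal stores. */
        _mm_store_si128(dqdest + 0, xmm0);
        _mm_store_si128(dqdest + 1, xmm1);
        _mm_store_si128(dqdest + 2, xmm2);
        _mm_store_si128(dqdest + 3, xmm3);
        dqdest += 4;
    } while (dqsrc != end);
}

The only functional difference from the current loop is the regular stores instead of the non-temporal ones; since the interpreter reads the copied data almost immediately, keeping it cache-resident is exactly what we want here.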
_mm_stream_si128() in the current implementation is intended for stores where we want to avoid pulling data into the cache and the resulting cache pollution; in the opcache scenario it seems that keeping the data in the cache has a positive impact.

Running php-cgi -T10000 on WordPress 4.1's index.php I see a ~1% performance increase for the new version of fast_memcpy() compared with the generic memcpy(). I get the same result with a full load test using http_load on an 18-core Haswell-EP system.

Here is the proposed pull request: https://github.com/php/php-src/pull/1446

Related to the SW prefetching instructions in fast_memcpy()... they are not really useful in this place. Their benefit is almost negligible, as the address requested for prefetch will be needed at the very next iteration (a few cycles later), while fetching data from RAM usually takes more than 100 cycles. Nevertheless... they don't hurt, and they still seem to have a very small benefit, so I preserved the original instruction and added a new prefetch request for the destination pointer.

Hope it helps,
Bogdan
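P.S. To illustrate the prefetching point: below is a rough sketch of one 64-byte iteration of the copy loop with the source prefetch kept and a destination prefetch added. The helper name and the exact hint constants are only for illustration and are not necessarily what the pull request uses.

#include <emmintrin.h>  /* __m128i, _mm_load_si128, _mm_store_si128 */
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0, _MM_HINT_NTA */

/* One 64-byte iteration with both prefetch hints. The prefetched lines
 * are needed on the very next iteration, only a few cycles later, so a
 * >100-cycle DRAM access is not really hidden; the hints are cheap,
 * though, so they are kept. */
static void copy_64_bytes(__m128i **pdest, const __m128i **psrc)
{
    const __m128i *dqsrc = *psrc;
    __m128i *dqdest = *pdest;

    _mm_prefetch((const char *)(dqsrc + 4), _MM_HINT_NTA);  /* keep a prefetch for the next source chunk */
    _mm_prefetch((const char *)(dqdest + 4), _MM_HINT_T0);  /* new: prefetch the next destination chunk */

    _mm_store_si128(dqdest + 0, _mm_load_si128(dqsrc + 0));
    _mm_store_si128(dqdest + 1, _mm_load_si128(dqsrc + 1));
    _mm_store_si128(dqdest + 2, _mm_load_si128(dqsrc + 2));
    _mm_store_si128(dqdest + 3, _mm_load_si128(dqsrc + 3));

    *psrc = dqsrc + 4;
    *pdest = dqdest + 4;
}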