Newsgroups: php.internals
Subject: Re: [PHP-DEV] Introduction and some opcache SSE related stuff
From: dmitry@zend.com (Dmitry Stogov)
To: Matt Wilmas
Cc: Bogdan Andone, PHP Internals
Date: Fri, 31 Jul 2015 06:49:47 +0300

On Jul 31, 2015 2:12 AM, "Matt Wilmas" wrote:
>
> Hi Dmitry, Bogdan,
>
>
> ----- Original Message -----
> From: "Dmitry Stogov"
> Sent: Thursday, July 30, 2015
>
>> Hi Bogdan,
>>
>> On Wed, Jul 29, 2015 at 5:22 PM, Andone, Bogdan wrote:
>>
>>> Hi Guys,
>>>
>>> My name is Bogdan Andone and I work for Intel in the area of SW
>>> performance analysis and optimizations.
>>> We would like to actively contribute to the Zend PHP project and to
>>> involve ourselves in finding new performance improvement opportunities
>>> based on available and/or new hardware features.
>>> I am still in the source code digesting phase, but I had a look at the
>>> fast_memcpy() implementation in the opcache extension, which uses SSE
>>> intrinsics.
>>>
>>> If I am not wrong, the fast_memcpy() function is not currently used, as
>>> I didn't find the "-msse4.2" gcc flag in the Makefile. I assume you
>>> probably didn't see any performance benefit, so you preserved the
>>> generic memcpy() usage.
>>
>> This is not SSE4.2, this is SSE2.
>> Any x86_64 target implements SSE2, so it's enabled by default on x86_64
>> systems (at least on Linux).
>> It may also be enabled on x86 targets by adding the "-msse2" option.
>
> Right, I was gonna say, I think that was a mistake, and all x86_64 should
> be using it at least...
>
> Of course, using anything newer that needs special options is nearly
> useless, since I guess the vast majority aren't building themselves, but
> using lowest-common-denominator repos. I had been wondering about speeding
> up some other things, maybe taking advantage of SSE4.x (string stuff, I
> don't know), but... like I said. Runtime checks would be awesome, but
> except for recent GCC, the intrinsics aren't available unless the
> corresponding SSE option is enabled (lame!). So it requires a separate
> compilation unit. :-/
>
> Of course, I guess if the intrinsic maps simply to the instruction, you
> could just do it with inline asm, if you wanted to do runtime CPU checking.
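
For illustration, a minimal sketch of the runtime-dispatch idea discussed
here: the SSE4.2 path lives in its own compilation unit built with
-msse4.2, and a function pointer is bound once at startup after a CPU
feature check. The copy_generic()/copy_sse42()/copy_init() names are
hypothetical, not taken from php-src.

#include <stddef.h>
#include <string.h>

/* Defined in a separate file compiled with -msse4.2 (hypothetical). */
void *copy_sse42(void *dst, const void *src, size_t n);

/* Fallback built with the default (lowest-common-denominator) flags. */
static void *copy_generic(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

static void *(*copy_impl)(void *, const void *, size_t) = copy_generic;

void copy_init(void)
{
#if defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 8))
	/* __builtin_cpu_supports() exists since GCC 4.8; older compilers
	 * would have to issue the CPUID instruction themselves (<cpuid.h>). */
	if (__builtin_cpu_supports("sse4.2")) {
		copy_impl = copy_sse42;
	}
#endif
}

void *copy(void *dst, const void *src, size_t n)
{
	return copy_impl(dst, src, n);
}
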
>>> I would like to propose a slightly different implementation which uses
>>> _mm_store_si128() instead of _mm_stream_si128(). This ensures that the
>>> copied memory is preserved in the data cache, which is not bad as the
>>> interpreter will start to use this data without the need to go back one
>>> more time to memory. _mm_stream_si128() in the current implementation is
>>> intended to be used for stores where we want to avoid reading data into
>>> the cache and causing cache pollution; in the opcache scenario it seems
>>> that preserving the data in the cache has a positive impact.
>>
>> _mm_stream_si128() was used on purpose, to avoid CPU cache pollution,
>> because data copied from SHM to process memory is not necessarily used
>> before eviction.
>> By the way, I'm not completely sure. Maybe _mm_store_si128() can provide
>> a better result.
>
> Interesting (that _stream was used on purpose). :-)
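
For reference, a simplified sketch of the loop being compared (close in
spirit to, but not copied verbatim from, the php-src fast_memcpy(); it
assumes 16-byte-aligned pointers and a size that is a multiple of 64, as
opcache can guarantee for these copies):

#include <emmintrin.h>
#include <stddef.h>

static void fast_memcpy_cached(void *dest, const void *src, size_t size)
{
	__m128i *dqdest = (__m128i*)dest;
	const __m128i *dqsrc = (const __m128i*)src;
	const __m128i *end = (const __m128i*)((const char*)src + size);

	do {
		__m128i xmm0, xmm1, xmm2, xmm3;

		/* Hint the next 64-byte block (see the prefetch discussion below). */
		_mm_prefetch((const char*)(dqsrc + 4), _MM_HINT_NTA);

		xmm0 = _mm_load_si128(dqsrc + 0);
		xmm1 = _mm_load_si128(dqsrc + 1);
		xmm2 = _mm_load_si128(dqsrc + 2);
		xmm3 = _mm_load_si128(dqsrc + 3);
		dqsrc += 4;

		/* Proposed variant: _mm_store_si128() leaves the copied lines in
		 * the cache, so the interpreter can reuse them without another
		 * trip to RAM.  The existing code uses _mm_stream_si128() here
		 * instead, which bypasses the cache to avoid polluting it. */
		_mm_store_si128(dqdest + 0, xmm0);
		_mm_store_si128(dqdest + 1, xmm1);
		_mm_store_si128(dqdest + 2, xmm2);
		_mm_store_si128(dqdest + 3, xmm3);
		dqdest += 4;
	} while (dqsrc != end);
}
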
>>> Running php-cgi -T10000 on WordPress 4.1/index.php I see a ~1%
>>> performance increase for the new version of fast_memcpy() compared with
>>> the generic memcpy(). Same result using a full load test with http_load
>>> on an 18-core Haswell EP.
>>
>> 1% is a really big improvement.
>> I'll be able to check this only next week (when back from vacation).
>
> Well, he talks like he was comparing to *generic* memcpy(), so...? But I'm
> not sure how that would have been accomplished.
>
> BTW guys, I was wondering before why fast_memcpy() is used only in this
> opcache area? For the prefetch and/or cache pollution reasons?

Just because in this place we may copy big blocks, and we may also align
them properly, to use compact and fast inlined code.

> Because shouldn't the library functions in glibc, etc. already be using
> versions optimized for the CPU at runtime? So is generic memcpy() already
> "fast"? (Other than overhead for a function call.)

glibc already uses an optimized memcpy(), but it is a universal function
that has to check for different conditions, like the alignment of source
and destination and the length.

>>> Here is the proposed pull request:
>>> https://github.com/php/php-src/pull/1446
>>>
>>> Related to the SW prefetching instructions in fast_memcpy()... they are
>>> not really useful in this place. Their benefit is almost negligible, as
>>> the address requested for prefetch will be needed at the next iteration
>>> (a few cycles later), while the time needed to get data from RAM is
>>> usually >100 cycles. Nevertheless... they don't hurt, and it seems they
>>> still have a very small benefit, so I preserved the original instruction
>>> and added a new prefetch request for the destination pointer.
>>
>> I also didn't see a significant difference from software prefetching.
>
> So how about prefetching "further"/more iterations ahead...?

I tried, but didn't see a difference either.

Thanks. Dmitry.

>> Thanks. Dmitry.
>>
>>> Hope it helps,
>>> Bogdan
>
> - Matt
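
As an illustration of the "prefetch further ahead" experiment mentioned
above, here is a variant of the same loop where the prefetch targets data
several iterations away rather than the very next block. The distance of
8 iterations (512 bytes) is an arbitrary value for the sketch, not a
measured optimum, and the function name is hypothetical.

#include <emmintrin.h>
#include <stddef.h>

#define PREFETCH_AHEAD 8  /* iterations ahead; each iteration copies 64 bytes */

static void fast_memcpy_prefetch_ahead(void *dest, const void *src, size_t size)
{
	__m128i *dqdest = (__m128i*)dest;
	const __m128i *dqsrc = (const __m128i*)src;
	const __m128i *end = (const __m128i*)((const char*)src + size);

	do {
		__m128i xmm0, xmm1, xmm2, xmm3;

		/* Request a block 8 iterations away; by the time the loop reaches
		 * it, the >100-cycle RAM access may already have completed.
		 * Prefetching past the end of the buffer is harmless, since
		 * prefetch hints never fault. */
		_mm_prefetch((const char*)(dqsrc + 4 * PREFETCH_AHEAD), _MM_HINT_NTA);

		xmm0 = _mm_load_si128(dqsrc + 0);
		xmm1 = _mm_load_si128(dqsrc + 1);
		xmm2 = _mm_load_si128(dqsrc + 2);
		xmm3 = _mm_load_si128(dqsrc + 3);
		dqsrc += 4;
		_mm_stream_si128(dqdest + 0, xmm0);
		_mm_stream_si128(dqdest + 1, xmm1);
		_mm_stream_si128(dqdest + 2, xmm2);
		_mm_stream_si128(dqdest + 3, xmm3);
		dqdest += 4;
	} while (dqsrc != end);
}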