Newsgroups: php.internals
Date: Fri, 29 Apr 2016 09:54:36 +0000
To: Matt Wilmas, 'Nikita Popov'
CC: internals@lists.php.net
Message-ID: <0ABC26E371A76440A370CFC5EB1056CC40F683BE@IRSMSX106.ger.corp.intel.com>
Subject: RE: [PHP-DEV] Lazy keys comparison during hash lookups
From: bogdan.andone@intel.com ("Andone, Bogdan")

> -----Original Message-----
> From: Matt Wilmas [mailto:php_lists@realplain.com]
> Sent: Wednesday, April 27, 2016 5:44 PM
> To: Andone, Bogdan; 'Nikita Popov'
> Cc: internals@lists.php.net
> Subject: Re: [PHP-DEV] Lazy keys comparison during hash lookups
>
> Hi Bogdan, all,
>
> ----- Original Message -----
> From: "Andone, Bogdan"
> Sent: Monday, April 25, 2016
>
> >> -----Original Message-----
> >> From: Andone, Bogdan [mailto:bogdan.andone@intel.com]
> >> Sent: Friday, March 18, 2016 11:58 AM
> >> To: 'Nikita Popov'
> >> Cc: internals@lists.php.net
> >> Subject: RE: [PHP-DEV] Lazy keys comparison during hash lookups
> >>
> >> > -----Original Message-----
> >> > From: Andone, Bogdan [mailto:bogdan.andone@intel.com]
> >> > Sent: Wednesday, March 09, 2016 9:33 AM
> >> > To: 'Nikita Popov'
> >> > Cc: internals@lists.php.net
> >> > Subject: RE: [PHP-DEV] Lazy keys comparison during hash lookups
> >> >
> >> > > -----Original Message-----
> >> > > From: Andone, Bogdan [mailto:bogdan.andone@intel.com]
> >> > > Sent: Tuesday, March 08, 2016 4:19 PM
> >> > > To: 'Nikita Popov'
> >> > > Cc: internals@lists.php.net
> >> > > Subject: RE: [PHP-DEV] Lazy keys comparison during hash lookups
> >> > >
> >> > > > From: Nikita Popov [mailto:nikita.ppv@gmail.com]
> >> > > > Sent: Tuesday, March 08, 2016 3:43 PM
> >> > > > To:
> >> > > > Andone, Bogdan
> >> > > > Cc: internals@lists.php.net
> >> > > > Subject: Re: [PHP-DEV] Lazy keys comparison during hash lookups
> >> > > >
> >> > > > On Tue, Mar 8, 2016 at 2:18 PM, Nikita Popov wrote:
> >> > > > >> On Tue, Mar 8, 2016 at 2:01 PM, Andone, Bogdan wrote:
> >> > > > >> Hi Guys,
> >> > > > >>
> >> > > > >> I would like to propose a small change to the DJBX33A hash
> >> > > > >> function algorithm which will make the key matching
> >> > > > >> validations in hash lookup functions easier.
> >> > > > >>
> >> > > > >> The change addresses the modulo-8 trailing bytes of the key.
> >> > > > >> For these bytes we can use an 8-bit shift instead of a 5-bit
> >> > > > >> shift; we also need to replace the ADD by XOR, in order to
> >> > > > >> avoid byte-level overflows. This change ensures the uniqueness
> >> > > > >> of the hash function transformation for the trailing bytes:
> >> > > > >> supposing two strings have the same partial hash value for the
> >> > > > >> first Nx8 bytes, different combinations of trailing characters
> >> > > > >> (with the same tail size) will always generate different keys.
> >> > > > >> We have the following consequences:
> >> > > > >> If two strings have:
> >> > > > >> - the same hash value,
> >> > > > >> - the same length,
> >> > > > >> - the same bytes in the first Nx8 positions,
> >> > > > >> then they are equal, and the trailing bytes can be skipped
> >> > > > >> during comparison.
> >> > > > >>
> >> > > > >> There is a visible performance gain if we apply this approach,
> >> > > > >> as we can use a lightweight memcmp() implementation based on
> >> > > > >> long comparisons and completely free of the complexity
> >> > > > >> incurred by trailing bytes. For Mediawiki I have a 1.7%
> >> > > > >> performance gain while Wordpress reports a 1.2% speedup on
> >> > > > >> Haswell-EP.
> >> > > > >>
> >> > > > >> Let's take a small example:
> >> > > > >> Suppose we have a key="this_is_a_key_value".
> >> > > > >> The hash function for the first N x 8 bytes is computed in the
> >> > > > >> original way; suppose "this_is_a_key_va" (16 bytes) returns a
> >> > > > >> partial hash value h1; the final hash value will then be
> >> > > > >> computed by the following sequence:
> >> > > > >> h = ((h1<<8) ^ h1) ^ 'l';
> >> > > > >> h = ((h<<8) ^ h) ^ 'u';
> >> > > > >> h = ((h<<8) ^ h) ^ 'e';
> >> > > > >> or, in only one operation:
> >> > > > >> h = (h1<<24) ^ (h1<<16) ^ (h1<<8) ^ h1 ^ ('l'<<16) ^ ('u'<<8) ^ ('l'^'u'^'e')
> >> > > > >> We can see that ht = ('l'<<16) ^ ('u'<<8) ^ ('l'^'u'^'e')
> >> > > > >> cannot be obtained from any other 3-character tail. The
> >> > > > >> statement is not true if we use ADD instead of XOR, as
> >> > > > >> extended ASCII characters might generate overflows affecting
> >> > > > >> the LSB of the higher byte in the hash value.
> >> > > > >>
> >> > > > >> I pushed a pull request here:
> >> > > > >> https://github.com/php/php-src/pull/1793. Unfortunately it
> >> > > > >> does not pass the travis tests because "htmlspecialchars etc
> >> > > > >> use a generated table that assumes the current hash function"
> >> > > > >> as noticed by Nikita.
> >> > > > >>
> >> > > > >> Let me know your thoughts on this idea.
> >> > > > >
> >> > > > > Hey Bogdan,
> >> > > > > This looks like an interesting idea! I'm somewhat apprehensive
> >> > > > > about coupling this to a change of the hash function, for two
> >> > > > > reasons:
> >> > > > > a) This will make it more problematic if we want to change the
> >> > > > > hash function in the future, e.g. if we want to switch to
> >> > > > > SipHash.
> >> > > > > b) The quality of the new hash distribution is not immediately
> >> > > > > clear, but likely non-trivially weaker.
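[Editor's illustration - a minimal C sketch of the tail scheme quoted above, not code from the actual patch: classic DJBX33A over the leading N x 8 bytes, then the 8-bit-shift/XOR step for the len % 8 trailing bytes. The function name and the plain byte-at-a-time prefix loop are illustrative (zend's real function processes the prefix in unrolled 8-byte groups).]

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the proposed hashing: DJBX33A (h = h*33 + c) for the leading
 * N x 8 bytes, then the proposed 8-bit shift + XOR step for the trailing
 * bytes.  Because XOR has no byte-level carries, two distinct same-length
 * tails always contribute distinct values to the final hash. */
static uint64_t djbx33a_xor_tail(const char *str, size_t len)
{
    uint64_t h = 5381;
    size_t i = 0, full = len - (len & 7);

    for (; i < full; i++)          /* leading N x 8 bytes, original way */
        h = ((h << 5) + h) + (unsigned char)str[i];
    for (; i < len; i++)           /* trailing bytes: shift by 8 and XOR */
        h = ((h << 8) ^ h) ^ (unsigned char)str[i];
    return h;
}
```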
> >> > > > > So I'm wondering if we can keep the concept of using a
> >> > > > > zend_ulong aligned memcmp while leaving the hash function
> >> > > > > alone: The zend_string allocation policy already allocates the
> >> > > > > string data aligned and padded to zend_ulong boundaries. If we
> >> > > > > were to additionally explicitly zero out the last byte (to
> >> > > > > avoid valgrind warnings) we should be able to compare the
> >> > > > > character data of two zend_strings using a zend_ulong memcmp.
> >> > > > > This would have the additional benefit that it works for
> >> > > > > normal string comparisons (unrelated to hashtables) as well.
> >> > > > > On the other hand, this is only possible for zend_string to
> >> > > > > zend_string comparisons, not for comparisons with static
> >> > > > > strings.
> >> > > > > Regards,
> >> > > > s/zero out the last byte/zero out the last zend_ulong
> >> > > > I'd like to add another issue with relying on the hash for this,
> >> > > > which I just remembered: We currently always set the top bit of
> >> > > > the hash for strings (see
> >> > > > http://lxr.php.net/xref/PHP_TRUNK/Zend/zend_string.h#351), in
> >> > > > order to ensure that hashes are never zero. This makes the hash
> >> > > > non-unique.
> >> > > > Nikita
> >> > > Yeah... I missed the top bit set, but I think it could be solved
> >> > > somehow.
> >> > Actually, setting the top bit is not an issue. The uniqueness of the
> >> > transformation has to be ensured for the trailing bytes only, which
> >> > are shifted left at most 7 (on 64 bits) / 3 (on 32 bits) positions;
> >> > so they will never reach the top byte, and setting the top bit does
> >> > no harm as far as the trailing bytes are concerned.
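[Editor's illustration - Nikita's suggestion above amounts to something like the following sketch. Assumptions: both buffers are padded to an 8-byte multiple and the padding bytes are zeroed; `ulong_memeq` is a hypothetical name, not the zend API.]

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Long-wise equality check over zero-padded buffers: compare whole 8-byte
 * words, so no per-byte tail handling is needed.  memcpy is used for the
 * word loads to stay free of strict-aliasing and alignment issues; the
 * compiler turns it into a plain load. */
static int ulong_memeq(const char *a, const char *b, size_t len)
{
    size_t words = (len + 7) / 8;  /* padded length, in 8-byte words */
    for (size_t i = 0; i < words; i++) {
        uint64_t wa, wb;
        memcpy(&wa, a + i * 8, 8);
        memcpy(&wb, b + i * 8, 8);
        if (wa != wb)
            return 0;
    }
    return 1;
}
```

Note the correctness of this sketch depends entirely on the zeroed padding: a stray byte after the logical end of either string would cause a false mismatch, which is exactly why the POC below has to chase down every place that truncates a string.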
> >> >
> >> > > Imho the strongest reason against the patch is the dependency
> >> > > between the hash function and the key validation method, but I
> >> > > have the impression that this would not be the dirtiest hack ever
> >> > > deployed in PHP; and I don't know what degree of compromise is
> >> > > acceptable for a 1% speedup :-)
> >> > Regarding the concerns about the quality of the hash function and
> >> > the intention to replace it with SipHash or something else, I am
> >> > afraid it will not bring much in terms of collision reduction; the
> >> > issue here is the fact that hash bucket selection uses only the last
> >> > few bits of the hash value. Not sure how much SipHash could help
> >> > here...
> >> > Just to defend the technical feasibility of this POC, I think that
> >> > an injective transformation for the last trailing bytes could be
> >> > deployed on top of theoretically any hash function.
> >> >
> >> > Thanks,
> >> > Bogdan
> >> >
> >> > > On the other hand, your idea to zero the last zend_ulong in the
> >> > > zend_string looks nice. There could be an additional potential
> >> > > gain also from a lightweight implementation of memcmp(). I will
> >> > > give it a try to see if it gets better results than the current
> >> > > proposal.
> >> > > I assume all zend_string allocation functions are located in
> >> > > zend_string.h, correct?
> >> > >
> >> > > Thanks,
> >> > > Bogdan
> >> I worked a little bit on adapting the zend string allocation
> >> functions to ensure long-size alignment for both Zend and standard
> >> allocation, and to zero out the end of the strings.
> >> The sad surprise was to discover that there are more than 100 other
> >> places, mainly in extensions, where zend strings are truncated by
> >> directly inserting '\0' terminators in the middle of strings; all
> >> these places need to be revised to zero out the entire memory word
> >> enclosing the end of the string.
> >> It will probably take a while to revise most of these places before
> >> having a correct real-life run that can validate the POC.
> >> Optimistically speaking, supposing the POC shows motivating results,
> >> we will probably need to tighten the Zend interface with respect to
> >> string truncation - at least a getter/setter pair for ZSTR_LEN where
> >> the setter also keeps the string termination synchronized.
> >> I suppose it would be a significant effort for the community and I am
> >> not sure how much willingness there will be for it. On the other
> >> side, the originally proposed solution does not have all these
> >> drawbacks and it offers a 1% speedup via a localized change which
> >> could easily be reverted in case we get better results from better
> >> string alignment in the future.
> >>
> >> Thanks,
> >> Bogdan
> > I finalized the POC for ensuring string granularity to zend_long
> > size; you can find it here:
> > https://gist.github.com/bogdanandone/cfde764f64137b57ce2a6f4a9b52e757
> > Unfortunately, the effort to zero out the end of strings generates an
> > overhead higher than the gain brought by the simplified memcpy() &
> > memcmp() implementations - I defined a simpler zend_memcpy_aligned()
> > function used in zend_string.h and zend_ulong_memeq() used in
> > zend_hash.c, inherited from my original patch.
> > Overall I see ~1% loss on Wordpress on my platform. I didn't spend
> > time fine-tuning each place needing changes but I don't expect a real
> > difference.
> > Probably this is not the way to go, so I'll finish with a last buzz
> > for the original patch proposal :)
>
> I thought, "Oh crap" when I saw your message, after you went to all
> that trouble! I think there's a simple way to check the trailing bytes,
> but I thought Valgrind would complain about it. However, just about 18
> hours before your mail, I found out that's not the case, so it seems
> fine.
:-O I'll come back to this last...
>
> First, I wanted to thank you for the original idea -- because that led
> me to look into PHP 7 hash stuff, since I thought I wasn't familiar
> with enough of it. Turned out I knew more than I realized, but I hope
> now my knowledge is much more thorough... Just because of your e-mail.
> :^)
>
> At first, I thought of, "Hey, SSE compares!" (starting with length,
> into the string, at once). Ignoring even more alignment/trailing
> issues, I tested standalone (using 16-1xx chars, Haswell). But
> _mm_cmpeq_epi8 + _mm_movemask_epi8 were exactly the same as 64-bit
> scalar (don't think I tried ints/-m32). The pmovmskb latency I guess
> kills any gain from checking 16-at-a-time.
>
> I guess for longer strings, one could do 32 (64?) byte chunks, with 2
> (4?) cmpeq's + por + movemask to get closer to what memcmp() would
> do...?
>
> Yeah, checking memcmp() standalone, the call overhead is high for
> common, short strings. I'm guessing that's where most of the 1% came
> from?

Your feeling is right. My solution targets scenarios with predominantly
short keys.
What I observed on real-life workloads like Wordpress is that hash table
keys are usually short strings: I profiled WP one year ago and, if I
remember correctly, ~80% of keys have sizes < 32 bytes, with a few
isolated cases of keys with sizes > 200 bytes.
For short strings, the time memcmp() spends dealing with trailing bytes
is much higher than the two or three loop iterations comparing 8-byte
chunks. This also has consequences at the branch prediction level, as
trailing bytes involve more conditional branches, which are more often
mispredicted than loop-related branches.
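[Editor's illustration - the 16-byte SSE compare Matt describes (cmpeq + movemask) looks roughly like this sketch. Requires x86-64 with SSE2; the function name is made up.]

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Compare one 16-byte chunk of two strings at once: _mm_cmpeq_epi8 sets
 * each equal byte lane to 0xFF, _mm_movemask_epi8 packs the lane top bits
 * into a 16-bit mask; all-equal yields 0xFFFF.  As noted above, the
 * pmovmskb latency tends to eat the gain versus a 64-bit scalar compare
 * for short keys. */
static int sse16_chunk_eq(const char *a, const char *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    return _mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) == 0xFFFF;
}
```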
My previous impression was that the facts above explain the 1% gain but,
after the painful experiment of trying to round up the string sizes, I
should say that part of the gain could also be explained by the fact
that the lazy comparison involves fewer data accesses, with some
reduction of data cache misses.
My initial thought was also to enable an SSE-based implementation but,
as you also observed, it is not worth it for short strings.
>
> Then, after learning about some other hash stuff, I came up with some
> ideas for a *new* HashTable implementation in PHP! :^) I hope to start
> and have a working version shortly (see how that works, before further
> improvements). I think I'll send a new message about it first (I don't
> think anyone else is planning any major changes, though).
>
> Now, about your original "lazy" idea, specifically changing the hash
> function to xor the trailing bytes with the hash value. I think
> there's a major problem: Consider a "specially" (easily!) crafted
> string where the trailing bytes can clear/set almost all hash bits. So
> almost the entire hash value can be set by those trailing bytes. :-O
> This may not be a problem if using a seeded hash function; not sure.

Not sure I understood your point.
For any string there is always a combination of trailing bytes that
clears/sets all hash bytes; this combination will be unique for any
string. But I don't see any problem if this is not a common situation
that will generate collisions. Your statement applies to the current
hash function as well, I think. An example would probably help...
>
> Finally, the [branchless] way to check trailing bytes instead? A few
> weeks ago, I thought of using xor to compare, then shift out the
> unwanted bytes (left/right for little/big endian, I guess). But
> Valgrind will complain about the uninitialized parts, I thought. Until
> I came across some info [1] Sat.
> night suggesting that Valgrind tracks values on a *bit* level. So I
> had to try it, outside of PHP, and it worked fine!
>
> [1] https://www.usenix.org/legacy/publications/library/proceedings/usenix05/tech/general/full_papers/seward/seward_html/usenix2005.html
>
> Here's what I thought, which could even be run unconditionally,
> without checking for trailing bytes, since there's always another
> long-sized chunk:
>
> mov (str1), %r
> xor (str2), %r
> and $7, %ecx   # % 8
> shl $3, %ecx   # * 8
> xor %any, %any # clear flags, for free, in case %cl is 0
> shl %cl, %r
> jne ...
>

The idea is nice but I think it needs additional tuning.
I assume %ecx keeps the string size, right? If so, supposing the
trailing byte count is 3, the code will shift out those 3 bytes and
return the comparison result for the other 5 bytes, which is not
correct; I think more instructions will be needed to get the right
shift.
Also, it seems that your proposal works only for the trailing bytes.
You cannot apply the same steps to the full chunks; there, again, only
the number of trailing bytes would be compared in each chunk. Not to
mention that the code should still preserve the chunk index from one
iteration to another, decrement it, and so on. In short, you still have
to deal with two parts of the implementation - one for entire chunks
and one for trailing bytes - plus conditional logic for jumping from
one to the other.

> Should only cost the latency of the variable shift (slower than AMD
> :-P).
>
> But the compilers aren't smart enough (?) to use the shift result for
> the jump, and put an extra "test %r, %r". Well, it will probably fuse
> with the jump, so no extra micro-ops anyway. :-)
>
> And my "ideal" instructions cause false errors with Valgrind anyway,
> when the shift moves uninitialized bits to the carry flag. (%eflags is
> tracked as a whole.) Adding a free "clear carry" doesn't help...
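[Editor's illustration - the xor-then-shift trailing-chunk check, with the shift count adjusted the way Bogdan's objection implies: discard the bytes *past* the end of the string, not the valid ones. A little-endian C sketch with hypothetical names; assumes both buffers are readable and zero-padded to the next 8-byte boundary, as with zend's allocation padding.]

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Compare the last (possibly partial) 8-byte chunk of two strings.
 * On little-endian, the valid trailing bytes sit in the low-order end of
 * the loaded word, so shifting the xor LEFT by (8 - len % 8) * 8 bits
 * discards exactly the bytes beyond the string. */
static int tail_chunk_eq(const char *a, const char *b, size_t len)
{
    size_t last = len & ~(size_t)7;   /* start of the trailing chunk */
    uint64_t wa, wb;

    memcpy(&wa, a + last, 8);
    memcpy(&wb, b + last, 8);
    /* (-len & 7) == (8 - len % 8) % 8, so the shift is 0 (not 64, which
     * would be undefined) when len is a multiple of 8 and the whole word
     * is compared in that case. */
    unsigned shift = (unsigned)((-len & 7) * 8);
    return ((wa ^ wb) << shift) == 0;
}
```

Whether Valgrind stays quiet here depends on it tracking definedness at bit level through the shift, which is exactly the behavior Matt reports testing.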
>
> So I guess a check for trailing bytes is needed in C. But same idea,
> with the full background. ;-)

Why bother about the carry flag? I assume Valgrind cries because the
code accesses bytes that are not allocated and, as you say, some sanity
allocation overhead is needed - probably not at the complexity I've
tried to deploy. By the way, the statement that there's always another
long-sized chunk is not true in the cases where strings are allocated
not with the zend allocator but with the system malloc().

Thanks,
Bogdan

>
> Thoughts?
>
> > Thanks,
> > Bogdan
>
> - Matt