Newsgroups: php.internals
Date: Fri, 29 Apr 2016 09:54:36 +0000
To: Matt Wilmas, 'Nikita Popov'
CC: internals@lists.php.net
Message-ID: <0ABC26E371A76440A370CFC5EB1056CC40F683BE@IRSMSX106.ger.corp.intel.com>
Subject: RE: [PHP-DEV] Lazy keys comparison during hash lookups
From: bogdan.andone@intel.com ("Andone, Bogdan")

> -----Original Message-----
> From: Matt Wilmas [mailto:php_lists@realplain.com]
> Sent: Wednesday, April 27, 2016 5:44 PM
> To: Andone, Bogdan; 'Nikita Popov'
> Cc: internals@lists.php.net
> Subject: Re: [PHP-DEV] Lazy keys comparison during hash lookups
>
> Hi Bogdan, all,
>
> ----- Original Message -----
> From: "Andone, Bogdan"
> Sent: Monday, April 25, 2016
>
> >> -----Original Message-----
> >> From: Andone, Bogdan [mailto:bogdan.andone@intel.com]
> >> Sent: Friday, March 18, 2016 11:58 AM
> >> To: 'Nikita Popov'
> >> Cc: internals@lists.php.net
> >> Subject: RE: [PHP-DEV] Lazy keys comparison during hash lookups
> >>
> >> > -----Original Message-----
> >> > From: Andone, Bogdan [mailto:bogdan.andone@intel.com]
> >> > Sent: Wednesday, March 09, 2016 9:33 AM
> >> > To: 'Nikita Popov'
> >> > Cc: internals@lists.php.net
> >> > Subject: RE: [PHP-DEV] Lazy keys comparison during hash lookups
> >> >
> >> > > -----Original Message-----
> >> > > From: Andone, Bogdan [mailto:bogdan.andone@intel.com]
> >> > > Sent: Tuesday, March 08, 2016 4:19 PM
> >> > > To: 'Nikita Popov'
> >> > > Cc: internals@lists.php.net
> >> > > Subject: RE: [PHP-DEV] Lazy keys comparison during hash lookups
> >> > >
> >> > > > From: Nikita Popov [mailto:nikita.ppv@gmail.com]
> >> > > > Sent: Tuesday, March 08, 2016 3:43 PM
> >> > > > To:
> >> > > > Andone, Bogdan
> >> > > > Cc: internals@lists.php.net
> >> > > > Subject: Re: [PHP-DEV] Lazy keys comparison during hash lookups
> >> > > >
> >> > > > On Tue, Mar 8, 2016 at 2:18 PM, Nikita Popov wrote:
> >> > > > >> On Tue, Mar 8, 2016 at 2:01 PM, Andone, Bogdan wrote:
> >> > > > >> Hi Guys,
> >> > > > >>
> >> > > > >> I would like to propose a small change to the DJBX33A hash
> >> > > > >> function algorithm which will make the key matching
> >> > > > >> validations in hash lookup functions easier.
> >> > > > >>
> >> > > > >> The change addresses the modulo-8 trailing bytes of the key.
> >> > > > >> For these bytes we can use an 8-bit shift instead of a 5-bit
> >> > > > >> shift; we also need to replace the ADD by XOR, in order to
> >> > > > >> avoid byte-level overflows. This change ensures the uniqueness
> >> > > > >> of the hash function transformation for the trailing bytes:
> >> > > > >> supposing two strings have the same partial hash value for the
> >> > > > >> first Nx8 bytes, different combinations of trailing characters
> >> > > > >> (with the same tail size) will always generate different keys.
> >> > > > >> We have the following consequences:
> >> > > > >> If two strings have:
> >> > > > >> - the same hash value,
> >> > > > >> - the same length,
> >> > > > >> - the same bytes in the first Nx8 positions,
> >> > > > >> then they are equal, and the trailing bytes can be skipped
> >> > > > >> during comparison.
> >> > > > >>
> >> > > > >> There is a visible performance gain if we apply this approach,
> >> > > > >> as we can use a lightweight memcmp() implementation based on
> >> > > > >> long comparisons and completely free of the complexity
> >> > > > >> incurred by trailing bytes. For Mediawiki I have a 1.7%
> >> > > > >> performance gain while Wordpress reports a 1.2% speedup on
> >> > > > >> Haswell-EP.
> >> > > > >>
> >> > > > >> Let's take a small example:
> >> > > > >> Suppose we have a key="this_is_a_key_value".
> >> > > > >> The hash function for the first N x 8 bytes is computed in the
> >> > > > >> original way; suppose "this_is_a_key_va" (16 bytes) returns a
> >> > > > >> partial hash value h1; the final hash value will then be
> >> > > > >> computed by the following sequence:
> >> > > > >> h = ((h1<<8) ^ h1) ^ 'l';
> >> > > > >> h = ((h<<8) ^ h) ^ 'u';
> >> > > > >> h = ((h<<8) ^ h) ^ 'e';
> >> > > > >> or, in only one operation:
> >> > > > >> h = (h1<<24) ^ (h1<<16) ^ (h1<<8) ^ h1 ^ ('l'<<16) ^ ('u'<<8) ^ ('l'^'u'^'e')
> >> > > > >> We can see that ht = ('l'<<16) ^ ('u'<<8) ^ ('l'^'u'^'e')
> >> > > > >> cannot be obtained from any other 3-character tail. The
> >> > > > >> statement is not true if we use ADD instead of XOR, as
> >> > > > >> extended ASCII characters might generate overflows affecting
> >> > > > >> the LSB of the higher byte in the hash value.
> >> > > > >>
> >> > > > >> I pushed a pull request here:
> >> > > > >> https://github.com/php/php-src/pull/1793. Unfortunately it
> >> > > > >> does not pass the travis tests because "htmlspecialchars etc
> >> > > > >> use a generated table that assumes the current hash function"
> >> > > > >> as noticed by Nikita.
> >> > > > >>
> >> > > > >> Let me know your thoughts on this idea.
> >> > > > >
> >> > > > > Hey Bogdan,
> >> > > > > This looks like an interesting idea! I'm somewhat apprehensive
> >> > > > > about coupling this to a change of the hash function, for two
> >> > > > > reasons:
> >> > > > > a) This will make it more problematic if we want to change the
> >> > > > > hash function in the future, e.g. if we want to switch to
> >> > > > > SipHash.
> >> > > > > b) The quality of the new hash distribution is not immediately
> >> > > > > clear, but likely non-trivially weaker.
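[Editor's illustration - a minimal C sketch of the tail scheme quoted above, not code from the actual patch: classic DJBX33A over the leading N x 8 bytes, then the 8-bit-shift/XOR step for the len % 8 trailing bytes. The function name and the plain byte-at-a-time prefix loop are illustrative (zend's real function processes the prefix in unrolled 8-byte groups).]

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the proposed hashing: DJBX33A (h = h*33 + c) for the leading
 * N x 8 bytes, then the proposed 8-bit shift + XOR step for the trailing
 * bytes.  Because XOR has no byte-level carries, two distinct same-length
 * tails always contribute distinct values to the final hash. */
static uint64_t djbx33a_xor_tail(const char *str, size_t len)
{
    uint64_t h = 5381;
    size_t i = 0, full = len - (len & 7);

    for (; i < full; i++)          /* leading N x 8 bytes, original way */
        h = ((h << 5) + h) + (unsigned char)str[i];
    for (; i < len; i++)           /* trailing bytes: shift by 8 and XOR */
        h = ((h << 8) ^ h) ^ (unsigned char)str[i];
    return h;
}
```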
> >> > > > > So I'm wondering if we can keep the concept of using a
> >> > > > > zend_ulong aligned memcmp while leaving the hash function
> >> > > > > alone: The zend_string allocation policy already allocates the
> >> > > > > string data aligned and padded to zend_ulong boundaries. If we
> >> > > > > were to additionally explicitly zero out the last byte (to
> >> > > > > avoid valgrind warnings) we should be able to compare the
> >> > > > > character data of two zend_strings using a zend_ulong memcmp.
> >> > > > > This would have the additional benefit that it works for
> >> > > > > normal string comparisons (unrelated to hashtables) as well.
> >> > > > > On the other hand, this is only possible for zend_string to
> >> > > > > zend_string comparisons, not for comparisons with static
> >> > > > > strings.
> >> > > > > Regards,
> >> > > > s/zero out the last byte/zero out the last zend_ulong
> >> > > > I'd like to add another issue with relying on the hash for this,
> >> > > > which I just remembered: We currently always set the top bit of
> >> > > > the hash for strings (see
> >> > > > http://lxr.php.net/xref/PHP_TRUNK/Zend/zend_string.h#351), in
> >> > > > order to ensure that hashes are never zero. This makes the hash
> >> > > > non-unique.
> >> > > > Nikita
> >> > > Yeah... I missed the top bit set, but I think it could be solved
> >> > > somehow.
> >> > Actually, setting the top bit is not an issue. The uniqueness of the
> >> > transformation has to be ensured for the trailing bytes only, which
> >> > are shifted left at most 7 (on 64 bits) / 3 (on 32 bits) positions;
> >> > so they will never reach the top byte, and setting the top bit does
> >> > no harm as far as the trailing bytes are concerned.
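[Editor's illustration - Nikita's suggestion above amounts to something like the following sketch. Assumptions: both buffers are padded to an 8-byte multiple and the padding bytes are zeroed; `ulong_memeq` is a hypothetical name, not the zend API.]

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Long-wise equality check over zero-padded buffers: compare whole 8-byte
 * words, so no per-byte tail handling is needed.  memcpy is used for the
 * word loads to stay free of strict-aliasing and alignment issues; the
 * compiler turns it into a plain load. */
static int ulong_memeq(const char *a, const char *b, size_t len)
{
    size_t words = (len + 7) / 8;  /* padded length, in 8-byte words */
    for (size_t i = 0; i < words; i++) {
        uint64_t wa, wb;
        memcpy(&wa, a + i * 8, 8);
        memcpy(&wb, b + i * 8, 8);
        if (wa != wb)
            return 0;
    }
    return 1;
}
```

Note the correctness of this sketch depends entirely on the zeroed padding: a stray byte after the logical end of either string would cause a false mismatch, which is exactly why the POC below has to chase down every place that truncates a string.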
> >> >
> >> > > Imho the strongest reason against the patch is the dependency
> >> > > between the hash function and the key validation method, but I
> >> > > have the impression that this would not be the dirtiest hack ever
> >> > > deployed in PHP; and I don't know what degree of compromise is
> >> > > acceptable for a 1% speedup :-)
> >> > Regarding the concerns about the quality of the hash function and
> >> > the intention to replace it with SipHash or something else, I am
> >> > afraid it will not bring much in terms of collision reduction; the
> >> > issue here is the fact that hash bucket selection uses only the last
> >> > few bits of the hash value. Not sure how much SipHash could help
> >> > here...
> >> > Just to defend the technical feasibility of this POC, I think that
> >> > an injective transformation for the last trailing bytes could be
> >> > deployed on top of theoretically any hash function.
> >> >
> >> > Thanks,
> >> > Bogdan
> >> >
> >> > > On the other hand, your idea to zero the last zend_ulong in the
> >> > > zend_string looks nice. There could be an additional potential
> >> > > gain also from a lightweight implementation of memcmp(). I will
> >> > > give it a try to see if it gets better results than the current
> >> > > proposal.
> >> > > I assume all zend_string allocation functions are located in
> >> > > zend_string.h, correct?
> >> > >
> >> > > Thanks,
> >> > > Bogdan
> >> I worked a little bit on adapting the zend string allocation
> >> functions to ensure long-size alignment for both Zend and standard
> >> allocation, and to zero out the end of the strings.
> >> The sad surprise was to discover that there are more than 100 other
> >> places, mainly in extensions, where zend strings are truncated by
> >> directly inserting '\0' terminators in the middle of strings; all
> >> these places need to be revised to zero out the entire memory word
> >> enclosing the end of the string.
> >> It will probably take a while to revise most of these places before
> >> having a correct real-life run that can validate the POC.
> >> Optimistically speaking, supposing the POC shows motivating results,
> >> we will probably need to tighten the Zend interface with respect to
> >> string truncation - at least a getter/setter pair for ZSTR_LEN where
> >> the setter also keeps the string termination synchronized.
> >> I suppose it would be a significant effort for the community and I am
> >> not sure how much willingness there will be for it. On the other
> >> side, the originally proposed solution does not have all these
> >> drawbacks and it offers a 1% speedup via a localized change which
> >> could easily be reverted in case we get better results from better
> >> string alignment in the future.
> >>
> >> Thanks,
> >> Bogdan
> > I finalized the POC for ensuring string granularity to zend_long
> > size; you can find it here:
> > https://gist.github.com/bogdanandone/cfde764f64137b57ce2a6f4a9b52e757
> > Unfortunately, the effort to zero out the end of strings generates an
> > overhead higher than the gain brought by the simplified memcpy() &
> > memcmp() implementations - I defined a simpler zend_memcpy_aligned()
> > function used in zend_string.h and zend_ulong_memeq() used in
> > zend_hash.c, inherited from my original patch.
> > Overall I see ~1% loss on Wordpress on my platform. I didn't spend
> > time fine-tuning each place needing changes but I don't expect a real
> > difference.
> > Probably this is not the way to go, so I'll finish with a last buzz
> > for the original patch proposal :)
>
> I thought, "Oh crap" when I saw your message, after you went to all
> that trouble! I think there's a simple way to check the trailing bytes,
> but I thought Valgrind would complain about it. However, just about 18
> hours before your mail, I found out that's not the case, so it seems
> fine.
:-O I'll come back to this last...
>
> First, I wanted to thank you for the original idea -- because that led
> me to look into PHP 7 hash stuff, since I thought I wasn't familiar
> with enough of it. Turned out I knew more than I realized, but I hope
> now my knowledge is much more thorough... Just because of your e-mail.
> :^)
>
> At first, I thought of, "Hey, SSE compares!" (starting with length,
> into the string, at once). Ignoring even more alignment/trailing
> issues, I tested standalone (using 16-1xx chars, Haswell). But
> _mm_cmpeq_epi8 + _mm_movemask_epi8 were exactly the same as 64-bit
> scalar (don't think I tried ints/-m32). The pmovmskb latency I guess
> kills any gain from checking 16-at-a-time.
>
> I guess for longer strings, one could do 32 (64?) byte chunks, with 2
> (4?) cmpeq's + por + movemask to get closer to what memcmp() would
> do...?
>
> Yeah, checking memcmp() standalone, the call overhead is high for
> common, short strings. I'm guessing that's where most of the 1% came
> from?

Your feeling is right. My solution targets scenarios with predominantly
short keys.
What I observed on real-life workloads like Wordpress is that hash table
keys are usually short strings: I profiled WP one year ago and, if I
remember correctly, ~80% of keys have sizes < 32 bytes, with a few
isolated cases of keys with sizes > 200 bytes.
For short strings, the time memcmp() spends dealing with trailing bytes
is much higher than the two or three loop iterations comparing 8-byte
chunks. This also has consequences at the branch prediction level, as
trailing bytes involve more conditional branches, which are more often
mispredicted than loop-related branches.
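[Editor's illustration - the 16-byte SSE compare Matt describes (cmpeq + movemask) looks roughly like this sketch. Requires x86-64 with SSE2; the function name is made up.]

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Compare one 16-byte chunk of two strings at once: _mm_cmpeq_epi8 sets
 * each equal byte lane to 0xFF, _mm_movemask_epi8 packs the lane top bits
 * into a 16-bit mask; all-equal yields 0xFFFF.  As noted above, the
 * pmovmskb latency tends to eat the gain versus a 64-bit scalar compare
 * for short keys. */
static int sse16_chunk_eq(const char *a, const char *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    return _mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) == 0xFFFF;
}
```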
My previous impression was that the facts above explain the 1% gain but,
after the painful experiment of trying to round up the string sizes, I
should say that part of the gain could also be explained by the fact
that the lazy comparison involves fewer data accesses, with some
reduction of data cache misses.
My initial thought was also to enable an SSE-based implementation but,
as you also observed, it is not worth it for short strings.
>
> Then, after learning about some other hash stuff, I came up with some
> ideas for a *new* HashTable implementation in PHP! :^) I hope to start
> and have a working version shortly (see how that works, before further
> improvements). I think I'll send a new message about it first (I don't
> think anyone else is planning any major changes, though).
>
> Now, about your original "lazy" idea, specifically changing the hash
> function to xor the trailing bytes with the hash value. I think
> there's a major problem: Consider a "specially" (easily!) crafted
> string where the trailing bytes can clear/set almost all hash bits. So
> almost the entire hash value can be set by those trailing bytes. :-O
> This may not be a problem if using a seeded hash function; not sure.

Not sure I understood your point.
For any string there is always a combination of trailing bytes that
clears/sets all hash bytes; this combination will be unique for any
string. But I don't see any problem if this is not a common situation
that will generate collisions. Your statement applies to the current
hash function as well, I think. An example would probably help...
>
> Finally, the [branchless] way to check trailing bytes instead? A few
> weeks ago, I thought of using xor to compare, then shift out the
> unwanted bytes (left/right for little/big endian, I guess). But
> Valgrind will complain about the uninitialized parts, I thought. Until
> I came across some info [1] Sat.
> night suggesting that Valgrind tracks values on a *bit* level. So I
> had to try it, outside of PHP, and it worked fine!
>
> [1] https://www.usenix.org/legacy/publications/library/proceedings/usenix05/tech/general/full_papers/seward/seward_html/usenix2005.html
>
> Here's what I thought, which could even be run unconditionally,
> without checking for trailing bytes, since there's always another
> long-sized chunk:
>
> mov (str1), %r
> xor (str2), %r
> and $7, %ecx   # % 8
> shl $3, %ecx   # * 8
> xor %any, %any # clear flags, for free, in case %cl is 0
> shl %cl, %r
> jne ...
>

The idea is nice but I think it needs additional tuning.
I assume %ecx keeps the string size, right? If so, supposing the
trailing byte count is 3, the code will shift out those 3 bytes and
return the comparison result for the other 5 bytes, which is not
correct; I think more instructions will be needed to get the right
shift.
Also, it seems that your proposal works only for the trailing bytes.
You cannot apply the same steps to the full chunks; there, again, only
the number of trailing bytes would be compared in each chunk. Not to
mention that the code should still preserve the chunk index from one
iteration to another, decrement it, and so on. In short, you still have
to deal with two parts of the implementation - one for entire chunks
and one for trailing bytes - plus conditional logic for jumping from
one to the other.

> Should only cost the latency of the variable shift (slower than AMD
> :-P).
>
> But the compilers aren't smart enough (?) to use the shift result for
> the jump, and put an extra "test %r, %r". Well, it will probably fuse
> with the jump, so no extra micro-ops anyway. :-)
>
> And my "ideal" instructions cause false errors with Valgrind anyway,
> when the shift moves uninitialized bits to the carry flag. (%eflags is
> tracked as a whole.) Adding a free "clear carry" doesn't help...
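[Editor's illustration - the xor-then-shift trailing-chunk check, with the shift count adjusted the way Bogdan's objection implies: discard the bytes *past* the end of the string, not the valid ones. A little-endian C sketch with hypothetical names; assumes both buffers are readable and zero-padded to the next 8-byte boundary, as with zend's allocation padding.]

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Compare the last (possibly partial) 8-byte chunk of two strings.
 * On little-endian, the valid trailing bytes sit in the low-order end of
 * the loaded word, so shifting the xor LEFT by (8 - len % 8) * 8 bits
 * discards exactly the bytes beyond the string. */
static int tail_chunk_eq(const char *a, const char *b, size_t len)
{
    size_t last = len & ~(size_t)7;   /* start of the trailing chunk */
    uint64_t wa, wb;

    memcpy(&wa, a + last, 8);
    memcpy(&wb, b + last, 8);
    /* (-len & 7) == (8 - len % 8) % 8, so the shift is 0 (not 64, which
     * would be undefined) when len is a multiple of 8 and the whole word
     * is compared in that case. */
    unsigned shift = (unsigned)((-len & 7) * 8);
    return ((wa ^ wb) << shift) == 0;
}
```

Whether Valgrind stays quiet here depends on it tracking definedness at bit level through the shift, which is exactly the behavior Matt reports testing.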
>
> So I guess a check for trailing bytes is needed in C. But same idea,
> with the full background. ;-)

Why bother about the carry flag? I assume Valgrind cries because the
code accesses bytes that are not allocated and, as you say, some sanity
allocation overhead is needed - probably not at the complexity I've
tried to deploy. By the way, the statement that there's always another
long-sized chunk is not true in the cases where strings are allocated
not with the zend allocator but with the system malloc().

Thanks,
Bogdan

>
> Thoughts?
>
> > Thanks,
> > Bogdan
>
> - Matt