Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:91543
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Received-SPF: pass (pb1.pair.com: domain gmail.com designates 209.85.161.181 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <CAF+90c_4=pu-GNQnyqG9A-v6myp8tUP6suKEUzVy3dDLDOEuZQ@mail.gmail.com>
References: <0ABC26E371A76440A370CFC5EB1056CC40F0BA1F@irsmsx105.ger.corp.intel.com>
	<CAF+90c_4=pu-GNQnyqG9A-v6myp8tUP6suKEUzVy3dDLDOEuZQ@mail.gmail.com>
Date: Tue, 8 Mar 2016 14:43:25 +0100
Message-ID: <CAF+90c-9yB=SLdN258xLkPmMiW_4EQ4mROk-Pzid+RjV1ffjJw@mail.gmail.com>
To: "Andone, Bogdan" <bogdan.andone@intel.com>
Cc: "internals@lists.php.net" <internals@lists.php.net>
Content-Type: multipart/alternative; boundary=001a114084a2fe620e052d89c314
Subject: Re: [PHP-DEV] Lazy keys comparison during hash lookups
From: nikita.ppv@gmail.com (Nikita Popov)

--001a114084a2fe620e052d89c314
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

On Tue, Mar 8, 2016 at 2:18 PM, Nikita Popov <nikita.ppv@gmail.com> wrote:

> On Tue, Mar 8, 2016 at 2:01 PM, Andone, Bogdan <bogdan.andone@intel.com>
> wrote:
>
>> Hi Guys,
>>
>> I would like to propose a small change to the DJBX33A hash function
>> algorithm that simplifies key match validation in the hash lookup
>> functions.
>>
>> The change addresses the trailing bytes of the key (length modulo 8). For
>> these bytes we can use an 8-bit shift instead of a 5-bit shift; we also
>> need to replace the ADD with XOR, in order to avoid byte-level overflows.
>> This change ensures the uniqueness of the hash function transformation
>> for the trailing bytes: supposing two strings have the same partial hash
>> value for the first N x 8 bytes, different combinations of trailing
>> characters (with the same tail size) will always generate different
>> hash values.
>> We have the following consequences:
>> If two strings have:
>> - same hash value,
>> - same length,
>> - same bytes for the first N x 8 positions,
>> then they are equal, and the trailing bytes can be skipped during
>> comparison.
>>
>> There is a visible performance gain if we apply this approach, as we can
>> use a lightweight memcmp() implementation based on comparing longs,
>> completely free of the complexity incurred by trailing bytes. For
>> MediaWiki I measured a 1.7% performance gain, while WordPress shows a
>> 1.2% speedup on Haswell-EP.
>>
>> Let's take a small example:
>> Suppose we have key = "this_is_a_key_value".
>> The hash for the first N x 8 bytes is computed in the original way;
>> suppose "this_is_a_key_va" (16 bytes) returns a partial hash value h1.
>> The final hash value is then computed by the following sequence:
>> h = ((h1<<8) ^ h1) ^ 'l';
>> h = ((h<<8) ^ h) ^ 'u';
>> h = ((h<<8) ^ h) ^ 'e';
>> or, in only one operation:
>> h = (h1<<24) ^ (h1<<16) ^ (h1<<8) ^ h1 ^ ('l'<<16) ^ ('u'<<8) ^
>> ('l'^'u'^'e')
>> We can see that ht = ('l'<<16) ^ ('u'<<8) ^ ('l'^'u'^'e') cannot be
>> obtained from any other 3-character tail. The statement is not true if
>> we use ADD instead of XOR, as extended ASCII characters might generate
>> overflows affecting the LSB of the higher byte in the hash value.
>>
>> I pushed a pull request here: https://github.com/php/php-src/pull/1793.
>> Unfortunately it does not pass the Travis tests because "htmlspecialchars
>> etc use a generated table that assumes the current hash function", as
>> noticed by Nikita.
>>
>> Let me know your thoughts on this idea.
>>
>
> Hey Bogdan,
>
> This looks like an interesting idea! I'm somewhat apprehensive about
> coupling this to a change of the hash function, for two reasons:
> a) This will make it more problematic if we want to change the hash
> function in the future, e.g. if we want to switch to SipHash.
> b) The quality of the new hash distribution is not immediately clear, but
> likely non-trivially weaker.
>
> So I'm wondering if we can keep the concept of using a zend_ulong aligned
> memcmp while leaving the hash function alone: The zend_string allocation
> policy already allocates the string data aligned and padded to zend_ulong
> boundaries. If we were to additionally explicitly zero out the last byte
> (to avoid valgrind warnings) we should be able to compare the character
> data of two zend_strings using a zend_ulong memcmp. This would have the
> additional benefit that it works for normal string comparisons (unrelated
> to hashtables) as well. On the other hand, this is only possible for
> zend_string to zend_string comparisons, not for comparisons with static
> strings.
>

s/zero out the last byte/zero out the last zend_ulong
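
As a rough illustration of the word-wise comparison described in the quoted
text (with the correction above applied), here is a minimal sketch. It does
not use the real zend_string API: word_t, padded_str, padded_str_new and
padded_str_equal are hypothetical names, word_t stands in for zend_ulong,
and calloc() stands in for explicitly zeroing the last word of the padded
allocation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned long word_t; /* assumption: stand-in for zend_ulong */

/* Hypothetical string type: storage is word-aligned, rounded up to a
 * multiple of sizeof(word_t), and every padding byte is zero. */
typedef struct {
    size_t len;    /* length in bytes, excluding padding */
    word_t *words; /* padded, zero-filled character data */
} padded_str;

/* Number of words needed to hold len bytes. */
static size_t word_count(size_t len) {
    return (len + sizeof(word_t) - 1) / sizeof(word_t);
}

static padded_str padded_str_new(const char *s) {
    padded_str p;
    size_t n;
    p.len = strlen(s);
    n = word_count(p.len);
    /* calloc zero-fills the whole buffer, so the final partial word has
     * deterministic padding. */
    p.words = calloc(n ? n : 1, sizeof(word_t));
    assert(p.words != NULL);
    memcpy(p.words, s, p.len);
    return p;
}

/* Compare whole words only; there is no byte-wise tail loop because the
 * padding of both operands is known to be zero. */
static bool padded_str_equal(const padded_str *a, const padded_str *b) {
    size_t i, n;
    if (a->len != b->len) {
        return false;
    }
    n = word_count(a->len);
    for (i = 0; i < n; i++) {
        if (a->words[i] != b->words[i]) {
            return false;
        }
    }
    return true;
}
```

Two padded_str values built from "this_is_a_key_value" compare equal, while
changing only the last byte makes the final word differ. The sketch only
works when both operands carry the zeroed padding, which matches the
limitation mentioned in the quoted text for comparisons with static strings.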

I'd like to add another issue with relying on the hash for this which I
just remembered: We currently always set the top bit of the hash for
strings (see http://lxr.php.net/xref/PHP_TRUNK/Zend/zend_string.h#351), in
order to ensure that hashes are never zero. This makes the hash non-unique.
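
To make this concrete, here is a small sketch, assuming a 64-bit hash.
mix_tail and finalize_hash are hypothetical helpers, not functions from
php-src: mix_tail applies the tail step from the quoted proposal, and
finalize_hash models only the "set the top bit" step, not the actual hash
function in zend_string.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOP_BIT (UINT64_C(1) << 63)

/* Tail step as written in the quoted proposal: h = ((h << 8) ^ h) ^ c,
 * applied once per trailing byte. */
static uint64_t mix_tail(uint64_t h, const unsigned char *tail, size_t n) {
    size_t i;
    for (i = 0; i < n; i++) {
        h = ((h << 8) ^ h) ^ tail[i];
    }
    return h;
}

/* Assumption: models only the forced top bit that keeps hashes non-zero. */
static uint64_t finalize_hash(uint64_t raw) {
    return raw | TOP_BIT;
}

/* Returns 1 if two distinct raw values map to the same final hash. */
static int collide_after_finalize(uint64_t a, uint64_t b) {
    assert(a != b);
    return finalize_hash(a) == finalize_hash(b);
}
```

For any raw value x with the top bit clear, collide_after_finalize(x,
x | TOP_BIT) returns 1: the OR discards one bit of information, so the
final hash alone cannot distinguish the two raw values.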

Nikita

--001a114084a2fe620e052d89c314--