Newsgroups: php.internals Path: news.php.net Xref: news.php.net php.internals:91543 Return-Path: Mailing-List: contact internals-help@lists.php.net; run by ezmlm Delivered-To: mailing list internals@lists.php.net Received: (qmail 31861 invoked from network); 8 Mar 2016 13:43:30 -0000 Received: from unknown (HELO lists.php.net) (127.0.0.1) by localhost with SMTP; 8 Mar 2016 13:43:30 -0000 Authentication-Results: pb1.pair.com smtp.mail=nikita.ppv@gmail.com; spf=pass; sender-id=pass Authentication-Results: pb1.pair.com header.from=nikita.ppv@gmail.com; sender-id=pass Received-SPF: pass (pb1.pair.com: domain gmail.com designates 209.85.161.181 as permitted sender) X-PHP-List-Original-Sender: nikita.ppv@gmail.com X-Host-Fingerprint: 209.85.161.181 mail-yw0-f181.google.com Received: from [209.85.161.181] ([209.85.161.181:34840] helo=mail-yw0-f181.google.com) by pb1.pair.com (ecelerity 2.1.1.9-wez r(12769M)) with ESMTP id 2D/04-03587-007DED65 for ; Tue, 08 Mar 2016 08:43:29 -0500 Received: by mail-yw0-f181.google.com with SMTP id g127so11656598ywf.2 for ; Tue, 08 Mar 2016 05:43:28 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc; bh=xG8hgAtp2PjRpLP+sun+pZWS+CQJuqJYZvlJNx0IM9E=; b=vp7Wd/8v+3bb6ZEeQZeN4bdcKln6Y6qoVfZxRrP6aKbSpsYoiJaSeY/m4XgahyBhKZ BBtrhfCyofjplwOA5rdZlir6HC+1FEFPDkQvjNPiUoXq2PnMTHPhZXc7xAZefhhuzx6Y vRO/KhNL6IzbDmxKw7sDOlpZ/ZsH+N/2aZgU8hyOjMBVX7Q/kD74Hduo8QWXudxjHiDE o+BlZiyHGE3LLzD1quTK+W9XvbC6+bWtt5XbTOsxh9XP7z4HC0T8Z/5jHvcQVn2gS5g/ yissZHfIZ1Ar9Nz1z628wqgoeQppF2q8OwT0y5Jed21CPoD1ql4PrTLZVU8PnjY0f2Ee s9bg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:date :message-id:subject:from:to:cc; bh=xG8hgAtp2PjRpLP+sun+pZWS+CQJuqJYZvlJNx0IM9E=; b=L0tGnl+yfdsJQcraeTjSvfssvZHZjQXWWp79fE/M3R43F8kPbRUtY3p+dhrjlCHCsN 71PMcGkEh5aNdYoZCkDyTweESpI37Cw/aIqTVNp8WkEG8iMvDusizYT7FBlST+zPxS9T Sjtpek26GmEXSyaTO1GpRsCuoR1ahm1I90/P2RvfelN+r0ILxr7BcX33m1le3aRzPhoe tuEHdqwUuyQEbTxDx/+bzJniuE6t31dOBgu4MRnnQUqjj2LISS0elkjR6kArdvQJMHG+ oQyX7kZuVIssPS3H8bupgrAb7UE4wR0pZ6D09rGRuXh7aAX+NHGvEZHIxkP2ZORVhfqF keHg== X-Gm-Message-State: AD7BkJJgOAxZQj+6bn9uy/ykujFQ+Ro2/HVjy+aqTqyoE0RXxFx1Jptrbgh4HWHvYI5Lccy9Tc/gkJbJFMhf+Q== MIME-Version: 1.0 X-Received: by 10.129.48.77 with SMTP id w74mr10015633yww.147.1457444605548; Tue, 08 Mar 2016 05:43:25 -0800 (PST) Received: by 10.129.148.70 with HTTP; Tue, 8 Mar 2016 05:43:25 -0800 (PST) In-Reply-To: References: <0ABC26E371A76440A370CFC5EB1056CC40F0BA1F@irsmsx105.ger.corp.intel.com> Date: Tue, 8 Mar 2016 14:43:25 +0100 Message-ID: To: "Andone, Bogdan" Cc: "internals@lists.php.net" Content-Type: multipart/alternative; boundary=001a114084a2fe620e052d89c314 Subject: Re: [PHP-DEV] Lazy keys comparison during hash lookups From: nikita.ppv@gmail.com (Nikita Popov) --001a114084a2fe620e052d89c314 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable On Tue, Mar 8, 2016 at 2:18 PM, Nikita Popov wrote: > On Tue, Mar 8, 2016 at 2:01 PM, Andone, Bogdan > wrote: > >> Hi Guys, >> >> I would like to propose a small change into the DJBX33A hash function >> algorithm which will make easier the key matching validations in hash >> lookup functions. >> >> The change addresses the modulo 8 tailing bytes of the key. For these >> bytes we can use an 8 bit shift instead of a 5 bit shift; we also need t= o >> replace the ADD by XOR, in order to avoid byte level overflows. This cha= nge >> ensures the uniqueness of the hash function transformation for the taili= ng >> bytes: supposing two strings have same partial hash value for the first = Nx8 >> bytes, different combinations of tailing characters (with the same tail >> size) will always generate different keys. >> We have the following consequences: >> If two strings have: >> - same hash value, >> - same length, >> - same bytes for the first Nx8 positions, >> then they are equal, and the tailing bytes can be skipped during >> comparison. >> >> There is a visible performance gain if we apply this approach as we can >> use a lightweight memcmp() implementation based on longs comparison and >> completely free of the complexity incurred by tailing bytes. For Mediawi= ki >> I have a 1.7% performance gain while Wordpress reports 1.2% speedup on >> Haswell-EP. >> >> Let=E2=80=99s take a small example: >> Suppose we have a key=3D=E2=80=9Dthis_is_a_key_value=E2=80=9D. >> The hash function for the first N x 8 byes are computed in the original >> way; suppose =E2=80=9Cthis_is_a_key_va=E2=80=9D (16bytes) will return a = partial hash value >> h1; the final hash value will be computed by the following sequence: >> h =3D ((h1<<8) ^ h1) ^ =E2=80=98l=E2=80=99; >> h =3D ((h<<8) ^ h) ^ =E2=80=98u=E2=80=99; >> h =3D ((h<<8) ^ h) ^ =E2=80=98e=E2=80=99; >> or, in only one operation: >> h =3D (h1<<24) ^ (h1<<16) ^ (h1<<8) ^ h1 ^ (=E2=80=98l=E2=80=99<<16) ^ (= (=E2=80=98l=E2=80=99^=E2=80=98u=E2=80=99)<<8) ^ >> (=E2=80=98l=E2=80=99^=E2=80=99u=E2=80=99^=E2=80=98e=E2=80=99) >> We can see that ht=3D(=E2=80=98l=E2=80=99<<16) ^ ((=E2=80=98l=E2=80=99^= =E2=80=98u=E2=80=99)<<8) ^ (=E2=80=98l=E2=80=99^=E2=80=99u=E2=80=99^=E2=80= =98e=E2=80=99) cannot be >> obtained by any other 3 characters long tail. The statement is not true = if >> we use ADD instead of XOR, as extended ASCII characters might generate >> overflows affecting the LSB of the higher byte in the hash value. >> >> I pushed a pull request here: https://github.com/php/php-src/pull/1793. >> Unfortunately it does not pass the travis tests because =E2=80=9Chtmlspe= cialchars >> etc use a generated table that assumes the current hash function=E2=80= =9D as >> noticed by Nikita. >> >> Let me know your thoughts on this idea. >> > > Hey Bogdan, > > This looks like an interesting idea! I'm somewhat apprehensive about > coupling this to a change of the hash function, for two reasons: > a) This will make it more problematic if we want to change the hash > function in the future, e.g. if we want to switch to SipHash. > b) The quality of the new hash distribution is not immediately clear, but > likely non-trivially weaker. > > So I'm wondering if we can keep the concept of using a zend_ulong aligned > memcmp while leaving the hash function alone: The zend_string allocation > policy already allocates the string data aligned and padded to zend_ulong > boundaries. If we were to additionally explicitly zero out the last byte > (to avoid valgrind warnings) we should be able to compare the character > data of two zend_strings using a zend_ulong memcmp. This would have the > additional benefit that it works for normal string comparisons (unrelated > to hashtables) as well. On the other hand, this is only possible for > zend_string to zend_string comparisons, not for comparisons with static > strings. > s/zero out the last byte/zero out the last zend_ulong I'd like to add another issue with relying on the hash for this which I just remembered: We currently always set the top bit of the hash for strings (see http://lxr.php.net/xref/PHP_TRUNK/Zend/zend_string.h#351), in order to ensure that hashes are never zero. This makes the hash non-unique. Nikita --001a114084a2fe620e052d89c314--