Newsgroups: php.internals Path: news.php.net Xref: news.php.net php.internals:91542 Return-Path: Mailing-List: contact internals-help@lists.php.net; run by ezmlm Delivered-To: mailing list internals@lists.php.net Received: (qmail 28629 invoked from network); 8 Mar 2016 13:18:25 -0000 Received: from unknown (HELO lists.php.net) (127.0.0.1) by localhost with SMTP; 8 Mar 2016 13:18:25 -0000 Authentication-Results: pb1.pair.com smtp.mail=nikita.ppv@gmail.com; spf=pass; sender-id=pass Authentication-Results: pb1.pair.com header.from=nikita.ppv@gmail.com; sender-id=pass Received-SPF: pass (pb1.pair.com: domain gmail.com designates 209.85.161.181 as permitted sender) X-PHP-List-Original-Sender: nikita.ppv@gmail.com X-Host-Fingerprint: 209.85.161.181 mail-yw0-f181.google.com Received: from [209.85.161.181] ([209.85.161.181:34976] helo=mail-yw0-f181.google.com) by pb1.pair.com (ecelerity 2.1.1.9-wez r(12769M)) with ESMTP id C1/83-03587-021DED65 for ; Tue, 08 Mar 2016 08:18:24 -0500 Received: by mail-yw0-f181.google.com with SMTP id g127so11107115ywf.2 for ; Tue, 08 Mar 2016 05:18:24 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc; bh=uPwpCEjE/eXIr+1cxLjkDya+EKtfyeWpU1E8x9BMhk4=; b=mGFuV8yOmf8CKTjIJvMIr3DGx5AGQMHYRpMzf/aTHCwCU7n8oQCb2Q2ONu/9nc0D8p k4Y2elfLm9b/T5UlJJUKSIWhg2BK/wETXPyYewbuLbECMaKtrgYU/sZ8PWWvtfl3PSwV N/GCJHw48dQ63ok6eToXY8u5qH/Fq0m8CwnoGP6zcLpiA6cIbEiai4jHwLLlTPDeFqrC BrbKNmrXme5DFZOG1/QBiMcoe8ZPtBKsfYCM2Fp+Po9Ja8FIg0m6LPzizH2W40Fc6p2E uJRKdmzuzqv8QeQPd9zzyodQHlJUDFCGWMqpjCxB1cz611JtZRE68dZ+nNy4Q9k/mUfy hr7A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:date :message-id:subject:from:to:cc; bh=uPwpCEjE/eXIr+1cxLjkDya+EKtfyeWpU1E8x9BMhk4=; b=Q1pU2dheifPovq4Uv9d3+dfFqcIO+Y85XiQVYnhv18Jo/ZKrx75qYymZCi3A6vA7sF BoDpv6DwZ2dqBYrJ17MV1uDi1o/WUXAiHqBf1jiyesGDO9srYfUS3jSJ0JscMS4w7/U8 u+URdpUOJpss6svlXvxkXBKbfdnDne7B7e/39PEgPCe5x/VCxGbNYgq1BujVcpeOLqnH tktBpSpLAkKw8cpxRInx6OJJQLf+sB0rvx+Sj+LlKzFO6NJHKOeahfne6X3Y0jlC0gzh hUWVJO6nC0BMTD27R3zO4O4e6ndBqPpwioIf65gIqaZXOAsEYi0KcYVYbZ0T9ORCyhgF CPMA== X-Gm-Message-State: AD7BkJKK9K2tz4+bPJIzCjjHFPprHl72w6BnJRNAZ12hPyoLlqs8PVlzJxISwkNvrrgKyZHiz3sx6XzIFN2rCA== MIME-Version: 1.0 X-Received: by 10.129.50.134 with SMTP id y128mr11070679ywy.305.1457443101716; Tue, 08 Mar 2016 05:18:21 -0800 (PST) Received: by 10.129.148.70 with HTTP; Tue, 8 Mar 2016 05:18:21 -0800 (PST) In-Reply-To: <0ABC26E371A76440A370CFC5EB1056CC40F0BA1F@irsmsx105.ger.corp.intel.com> References: <0ABC26E371A76440A370CFC5EB1056CC40F0BA1F@irsmsx105.ger.corp.intel.com> Date: Tue, 8 Mar 2016 14:18:21 +0100 Message-ID: To: "Andone, Bogdan" Cc: "internals@lists.php.net" Content-Type: multipart/alternative; boundary=001a11409d145bd7c2052d896a7c Subject: Re: [PHP-DEV] Lazy keys comparison during hash lookups From: nikita.ppv@gmail.com (Nikita Popov) --001a11409d145bd7c2052d896a7c Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable On Tue, Mar 8, 2016 at 2:01 PM, Andone, Bogdan wrote: > Hi Guys, > > I would like to propose a small change into the DJBX33A hash function > algorithm which will make easier the key matching validations in hash > lookup functions. > > The change addresses the modulo 8 tailing bytes of the key. For these > bytes we can use an 8 bit shift instead of a 5 bit shift; we also need to > replace the ADD by XOR, in order to avoid byte level overflows. This chan= ge > ensures the uniqueness of the hash function transformation for the tailin= g > bytes: supposing two strings have same partial hash value for the first N= x8 > bytes, different combinations of tailing characters (with the same tail > size) will always generate different keys. > We have the following consequences: > If two strings have: > - same hash value, > - same length, > - same bytes for the first Nx8 positions, > then they are equal, and the tailing bytes can be skipped during > comparison. > > There is a visible performance gain if we apply this approach as we can > use a lightweight memcmp() implementation based on longs comparison and > completely free of the complexity incurred by tailing bytes. For Mediawik= i > I have a 1.7% performance gain while Wordpress reports 1.2% speedup on > Haswell-EP. > > Let=E2=80=99s take a small example: > Suppose we have a key=3D=E2=80=9Dthis_is_a_key_value=E2=80=9D. > The hash function for the first N x 8 byes are computed in the original > way; suppose =E2=80=9Cthis_is_a_key_va=E2=80=9D (16bytes) will return a p= artial hash value > h1; the final hash value will be computed by the following sequence: > h =3D ((h1<<8) ^ h1) ^ =E2=80=98l=E2=80=99; > h =3D ((h<<8) ^ h) ^ =E2=80=98u=E2=80=99; > h =3D ((h<<8) ^ h) ^ =E2=80=98e=E2=80=99; > or, in only one operation: > h =3D (h1<<24) ^ (h1<<16) ^ (h1<<8) ^ h1 ^ (=E2=80=98l=E2=80=99<<16) ^ ((= =E2=80=98l=E2=80=99^=E2=80=98u=E2=80=99)<<8) ^ > (=E2=80=98l=E2=80=99^=E2=80=99u=E2=80=99^=E2=80=98e=E2=80=99) > We can see that ht=3D(=E2=80=98l=E2=80=99<<16) ^ ((=E2=80=98l=E2=80=99^= =E2=80=98u=E2=80=99)<<8) ^ (=E2=80=98l=E2=80=99^=E2=80=99u=E2=80=99^=E2=80= =98e=E2=80=99) cannot be > obtained by any other 3 characters long tail. The statement is not true i= f > we use ADD instead of XOR, as extended ASCII characters might generate > overflows affecting the LSB of the higher byte in the hash value. > > I pushed a pull request here: https://github.com/php/php-src/pull/1793. > Unfortunately it does not pass the travis tests because =E2=80=9Chtmlspec= ialchars > etc use a generated table that assumes the current hash function=E2=80=9D= as > noticed by Nikita. > > Let me know your thoughts on this idea. > Hey Bogdan, This looks like an interesting idea! I'm somewhat apprehensive about coupling this to a change of the hash function, for two reasons: a) This will make it more problematic if we want to change the hash function in the future, e.g. if we want to switch to SipHash. b) The quality of the new hash distribution is not immediately clear, but likely non-trivially weaker. So I'm wondering if we can keep the concept of using a zend_ulong aligned memcmp while leaving the hash function alone: The zend_string allocation policy already allocates the string data aligned and padded to zend_ulong boundaries. If we were to additionally explicitly zero out the last byte (to avoid valgrind warnings) we should be able to compare the character data of two zend_strings using a zend_ulong memcmp. This would have the additional benefit that it works for normal string comparisons (unrelated to hashtables) as well. On the other hand, this is only possible for zend_string to zend_string comparisons, not for comparisons with static strings. Regards, Nikita --001a11409d145bd7c2052d896a7c--