Newsgroups: php.internals Path: news.php.net Xref: news.php.net php.internals:92917 Return-Path: Mailing-List: contact internals-help@lists.php.net; run by ezmlm Delivered-To: mailing list internals@lists.php.net Received: (qmail 72066 invoked from network); 29 Apr 2016 07:50:53 -0000 Received: from unknown (HELO lists.php.net) (127.0.0.1) by localhost with SMTP; 29 Apr 2016 07:50:53 -0000 Authentication-Results: pb1.pair.com header.from=nikita.ppv@gmail.com; sender-id=pass Authentication-Results: pb1.pair.com smtp.mail=nikita.ppv@gmail.com; spf=pass; sender-id=pass Received-SPF: pass (pb1.pair.com: domain gmail.com designates 209.85.161.175 as permitted sender) X-PHP-List-Original-Sender: nikita.ppv@gmail.com X-Host-Fingerprint: 209.85.161.175 mail-yw0-f175.google.com Received: from [209.85.161.175] ([209.85.161.175:36649] helo=mail-yw0-f175.google.com) by pb1.pair.com (ecelerity 2.1.1.9-wez r(12769M)) with ESMTP id E8/21-60902-B5213275 for ; Fri, 29 Apr 2016 03:50:52 -0400 Received: by mail-yw0-f175.google.com with SMTP id o66so171092837ywc.3 for ; Fri, 29 Apr 2016 00:50:51 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc; bh=ph2w8W0GvIHMh1/+5z2CtrXOkcsVAnQunsEuwsX636Q=; b=cCYn9OwwgC3YoZI9hrlzcwkbdqHRU6UGH79lqVsVqCOsAJfIaEGuaHd1KkDDkBxPQE 6H9t6J7wNL5l+VVJ+mn4YNjqITMi3l2rMqDbBRhxh+u+mN6myb/1fJzyLo1VM/9jeFtR FPBZW/gDxqgAJtYxaB/U48ahtztUqI2Wo/zR5FszrZcfNNo1h7qGOWweTCtXynT8hB+y tcDxVEcao7uBiZI6kkpxalN9lnldKI8hgHnTU2vw3/DxHp/GefdYXo3FX9x+d8k1qUsB YY1O9RkIWsEwUIqpvIpn2yYW7SAJNdu6SYV5Jk2dkyOzN6GTGWoW0JNc5i5mEn3/gAMk ADjQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:date :message-id:subject:from:to:cc; bh=ph2w8W0GvIHMh1/+5z2CtrXOkcsVAnQunsEuwsX636Q=; b=R1oHWp+6wQNzHnmk5JfkXm4/TPAdPQ3sw6Mr2fytr83BO0vtnOfvg1PpWxowzt61Xj qeWTS8fhcLt/63uQRDDfaWr+sAS1BB29fhif9//v1abipVEHsvVq2opg1vn1O+7x/8ql rqKVkn0xaksePzDAZX5L/FVwyxIhsGd8IGn9N6pJP+Hl8rUBZoimlkZ67iRmCPK7ZOIR tCcBmjaajxhMPqWTcrPmaJ0K2llm+MRFSk+vx+Z402/ukCjs89k9gc42pzzLYmKBFUFv fUExOhiEtL1J0tnj8xERjTYfasO9xjcAXG0RsgJpnGS+gTAaRCpgxZ9SB65HWl+gpIXu u+Eg== X-Gm-Message-State: AOPr4FW2dAuJC0AlU951SITsMN/w1gfKog5lRPNqNIYJMWxKKgDznWm8XCmlf5qSCRywYVSHMKMMzK0zEyLe5A== MIME-Version: 1.0 X-Received: by 10.129.101.193 with SMTP id z184mr12145937ywb.242.1461916249115; Fri, 29 Apr 2016 00:50:49 -0700 (PDT) Received: by 10.13.239.3 with HTTP; Fri, 29 Apr 2016 00:50:49 -0700 (PDT) In-Reply-To: <70BE415E8A7D439FBBEB3817F61577D5@pc1> References: <70BE415E8A7D439FBBEB3817F61577D5@pc1> Date: Fri, 29 Apr 2016 09:50:49 +0200 Message-ID: To: Matt Wilmas Cc: PHP internals , Dmitry Stogov , Xinchen Hui Content-Type: multipart/alternative; boundary=001a114c76b6b857ff05319ae6b0 Subject: Re: New HashTable implementation? From: nikita.ppv@gmail.com (Nikita Popov) --001a114c76b6b857ff05319ae6b0 Content-Type: text/plain; charset=UTF-8 On Fri, Apr 29, 2016 at 2:04 AM, Matt Wilmas wrote: > Hi all, > > Last June, it was briefly mentioned about changing PHP's string hash > function [1] (DJB33 *seems* pretty horrible, after all, as far as > collisions...). So 8 months ago I tried almost, if not, a half-dozen of > them (including Murmur3) that were simple enough to quickly toss in. > > The result? In all cases, I believe, fewer instructions (in > zend_hash_find, > and the hashing function), BUT also a slight-to-small increase in cache > misses in Wordpress and other scripts... > > And in a test filling an array with a million "string_$i" keys (high > collision pattern for DJB33?), the speed was halved by the *huge* cache > miss > increase. :-/ > > I couldn't quite understand what was happening, where, if there were fewer > collisions... Misses all spread out in the hash-array? > > So there didn't seem to be anything useful to gain there. > > > Now, after seeing Bogdan's hash optimization idea last month [2], and > reading Nikita's blog post again, I had some ideas I'd like to try -- > assuming nobody else is planning major changes. :-) Besides Nikita, I'm > addressing Dmitry and Xinchen because your names are on some minor hash > items on the 7.1 ideas wiki [4]. > > I'm thinking of a Robin Hood implementation with Universal hashing [5] (of > int keys and the string hashes). I haven't touched any code yet, but think > I've worked out all the details in my head, and am ready to take a stab at > it. I think it's fairly simple to get the basics working; and when I see > how that goes, I can do the additional optimizations I have in mind that it > allows (including reduced memory, on 64-bit at least). > I have some old experiments still lying around: https://github.com/nikic/php-src/commits/openAddressing => This switches to open-addressing using Python's probing function (which IIRC is a perturbed quadratic probe). https://github.com/nikic/php-src/commits/robinHood => Robin Hood hashing. The motivation for either is to reduce hash bucket sizes by 8 bytes, because we no longer need to store a next offset (and can repack the structure differently if we use a key union). I only vaguely remember what the issues / results of these experiments were. For open addressing iirc an issue was that using separate allocation policies for arHash and arData resulted in horrible cache effects, so I had to keep them together with a load factor <= 1/2. Another issue is that open addressing requires us to separately track arHash utilization (or alternatively not reduce nNumUsed for trailing tombstones). Robin Hood hashing does not have the problem with separately tracking tombstones in arHash and arData, as it uses backshift deletion. However I found that RH hashing is completely incompatible with use of a weak hash function, as it results in bad clustering. If we want to use RH hashing we need to switch to a stronger hash function (and in particular for integers this can have very high performance cost). (See also the commit comment on https://github.com/nikic/php-src/commit/b78a28e7317053360ce186b6d431a10c5532cac2 ) > Question: Can I use zval.u1.v.reserved ? I guess I'll find out otherwise. > :-O > Nope. But you can use the top bit of the hash to distinguish int/string keys, if that's what this is about. > The string hash function itself is really a separate thing [6], but > fasthash > [7] (not to be confused with "superfast") looks like a good one that I > missed before... After thinking about things, I think we could even > keep/use a 64-bit hash on 32-bit arch. > > > Well, just wanted to mention it first if anyone has a comment. :-) Should > be interesting, but no idea how it'll perform (lookups should be very, very > fast (upsizing also); but cache factors and inserts/deletes are wildcards). > Wish me luck!? > > > Thanks, > Matt > > [1] https://marc.info/?l=php-internals&m=143444845304138&w=2 > [2] https://marc.info/?t=145744248100001&r=1&w=2 > [4] https://wiki.php.net/php-7.1-ideas > [5] https://en.wikipedia.org/wiki/Universal_hashing > [6] https://github.com/rurban/smhasher > [7] https://github.com/rurban/smhasher/blob/master/doc/fasthash64 > > --001a114c76b6b857ff05319ae6b0--