Hi all,
Last June, there was brief mention of changing PHP's string hash
function [1] (DJB33 seems pretty horrible, after all, as far as
collisions go...). So 8 months ago I tried about a half-dozen of
them (including Murmur3) that were simple enough to quickly toss in.
The result? In all cases, I believe, fewer instructions (in zend_hash_find
and the hashing function), BUT also a slight-to-small increase in cache
misses in WordPress and other scripts...
And in a test filling an array with a million "string_$i" keys (a
high-collision pattern for DJB33?), the speed was halved by the huge
increase in cache misses. :-/
I couldn't quite understand what was happening, or where, if there were
fewer collisions... Were the misses spread all over the hash array?
So there didn't seem to be anything useful to gain there.
Now, after seeing Bogdan's hash optimization idea last month [2], and
reading Nikita's blog post again, I had some ideas I'd like to try --
assuming nobody else is planning major changes. :-) Besides Nikita, I'm
addressing Dmitry and Xinchen because your names are on some minor hash
items on the 7.1 ideas wiki [4].
I'm thinking of a Robin Hood implementation with Universal hashing [5] (of
int keys and the string hashes). I haven't touched any code yet, but think
I've worked out all the details in my head, and am ready to take a stab at
it. I think it's fairly simple to get the basics working; and when I see
how that goes, I can do the additional optimizations I have in mind that it
allows (including reduced memory, on 64-bit at least).
Question: Can I use zval.u1.v.reserved ? I guess I'll find out otherwise.
:-O
The string hash function itself is really a separate thing [6], but fasthash
[7] (not to be confused with "superfast") looks like a good one that I
missed before... After thinking about things, I think we could even
keep/use a 64-bit hash on 32-bit arch.
Well, just wanted to mention it first if anyone has a comment. :-) Should
be interesting, but no idea how it'll perform (lookups should be very, very
fast (upsizing also); but cache factors and inserts/deletes are wildcards).
Wish me luck!?
Thanks,
Matt
[1] https://marc.info/?l=php-internals&m=143444845304138&w=2
[2] https://marc.info/?t=145744248100001&r=1&w=2
[4] https://wiki.php.net/php-7.1-ideas
[5] https://en.wikipedia.org/wiki/Universal_hashing
[6] https://github.com/rurban/smhasher
[7] https://github.com/rurban/smhasher/blob/master/doc/fasthash64
Now, after seeing Bogdan's hash optimization idea last month [2], and
reading Nikita's blog post [3] again, I had some ideas I'd like to try [...]
My ISP's stupid mail server wouldn't allow the link to Nikita's blog.
Trying again.
[3] nikic.github.io/2014/12/22/PHPs-new-hashtable-implementation.html
Hey,
On 29.4.2016 at 02:04, Matt Wilmas php_lists@realplain.com wrote:
[...]
Question: Can I use zval.u1.v.reserved ? I guess I'll find out otherwise.
:-O
No, not really. zval.u1 typically gets overwritten on each type assignment.
(Type assignments typically go through the 32-bit type_info field directly, for performance reasons.)
Perhaps you can find a way to split up u2 in a useful way?
[...]
Wish me luck!?
Good luck! :-)
Bob
Hi Matt,
I've benchmarked DJB and other hashes before. I don't think any
crypto-safe hash can compete with the DJB hash on speed.
I fully agree that collisions with the DJB hash are too easy to produce.
However, collisions will not happen a lot under normal usage. The DJB
hash is super fast, and any hash function will be compromised as time
goes by.
Therefore, Nikita's proposal that detects and limits the number of
collisions would be the best choice, IMO.
Nikita, any progress on limiting collisions for all arrays?
This feature is a must-have item for 7.1, IMHO.
Regards,
--
Yasuo Ohgaki
yohgaki@ohgaki.net
On Fri, Apr 29, 2016 at 2:04 AM, Matt Wilmas php_lists@realplain.com
wrote:
[...]
Now, after seeing Bogdan's hash optimization idea last month [2], and
reading Nikita's blog post again, I had some ideas I'd like to try --
assuming nobody else is planning major changes. :-) Besides Nikita, I'm
addressing Dmitry and Xinchen because your names are on some minor hash
items on the 7.1 ideas wiki [4].

I'm thinking of a Robin Hood implementation with Universal hashing [5] (of
int keys and the string hashes). I haven't touched any code yet, but think
I've worked out all the details in my head, and am ready to take a stab at
it. I think it's fairly simple to get the basics working; and when I see
how that goes, I can do the additional optimizations I have in mind that it
allows (including reduced memory, on 64-bit at least).
I have some old experiments still lying around:
https://github.com/nikic/php-src/commits/openAddressing => This switches to
open-addressing using Python's probing function (which IIRC is a perturbed
quadratic probe).
https://github.com/nikic/php-src/commits/robinHood => Robin Hood hashing.
The motivation for either is to reduce hash bucket sizes by 8 bytes,
because we no longer need to store a next offset (and can repack the
structure differently if we use a key union).
I only vaguely remember the issues / results of these experiments. For
open addressing, IIRC, an issue was that using separate allocation
policies for arHash and arData resulted in horrible cache effects, so I
had to keep them together with a load factor <= 1/2. Another issue is that
open addressing requires us to separately track arHash utilization (or,
alternatively, not reduce nNumUsed for trailing tombstones).
Robin Hood hashing does not have the problem of separately tracking
tombstones in arHash and arData, as it uses backshift deletion. However, I
found that RH hashing is completely incompatible with the use of a weak
hash function, as it results in bad clustering. If we want to use RH
hashing we need to switch to a stronger hash function (and for integers in
particular this can have a very high performance cost). (See also the
commit comment on
https://github.com/nikic/php-src/commit/b78a28e7317053360ce186b6d431a10c5532cac2
)
Question: Can I use zval.u1.v.reserved ? I guess I'll find out otherwise.
:-O
Nope. But you can use the top bit of the hash to distinguish int/string
keys, if that's what this is about.
Hi Matt,
I also tried different hash functions (including Murmur3) and came to a
similar result: a faster function, but more hash collisions.
Actually, these collisions occur not because of the hash values, but
because of the small hash array size.
The way we transform the hash value into a hash array index (using the
low bits) may also matter.
If we decrease the hash load factor, we get fewer collisions, but use
more memory, which also increases cache misses.
You are welcome to try Robin Hood; if you get positive results we may switch to it (however, I'm sceptical).
Thanks. Dmitry.
-----Original Message-----
From: Dmitry Stogov [mailto:dmitry@zend.com]
Sent: Friday, April 29, 2016 10:52 AM
To: Matt Wilmas php_lists@realplain.com; internals@lists.php.net
Cc: Nikita Popov nikita.ppv@gmail.com; Xinchen Hui laruence@php.net
Subject: [PHP-DEV] Re: New HashTable implementation?

[...]
Hi Matt,
Just my two cents, after playing around a bit with the hash tables and
functions as well :) ...
As Dmitry said in other words, you can have a hash function of great
quality and still get a lot of collisions, because only a reduced number
of bits of the hash value is used for the lookup in small hash tables.
Sometimes browsing collisions is cheaper than computing gold-plated hash
values.
And browsing collisions is always cheap, I think. Another experiment I
did, based on the observation that a key is frequently looked up 2-3
consecutive times, was a simple LRU management of the colliding bucket
lists: the overhead was quite small (just refreshing the list on a
collision where the searched bucket was not in first position), but big
enough to produce overall worse results.
What we see at the microarchitecture level on IA when running PHP 7 on
WordPress, for example, is that branch mispredictions and ICACHE misses
are very high; overall we can talk about a kind of BPU (branch prediction
unit) and ICACHE saturation. Adding complexity to the implementation,
meaning bigger code and/or more conditional branches, will have a negative
impact unless there is an important gain in executing fewer instructions.
I write this just in the hope of giving some hints on possible
gains/losses in your attempts. :-)
I will not be surprised if a simple, crappy hash function together with a
cheap collision solver gives overall better speed than other sophisticated
and theoretically better approaches.
Hope it helps,
Bogdan
Have you taken a look at Go's internal hashing function?
On platforms supporting AES-NI they use the AESENC instruction to get a
fast hash, but also with some protection against collision attacks.
https://github.com/golang/go/blob/0104a31b8fbcbe52728a08867b26415d282c35d2/src/runtime/asm_amd64.s#L870
Jared