Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:51180
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Received-SPF: pass (pb1.pair.com: domain nasza-klasa.pl designates 209.85.215.42 as permitted sender)
Content-Type: text/plain; charset=utf-8; format=flowed; delsp=yes
To: "internals@lists.php.net" <internals@lists.php.net>
References: <op.vok1p6osnpl9ht@pc-mbabij> <op.vok5ayabnpl9ht@laptop>
 <4D1E56E1.5090208@sugarcrm.com>
Date: Sat, 01 Jan 2011 13:30:03 +0100
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Message-ID: <op.vomqccoenpl9ht@laptop>
In-Reply-To: <4D1E56E1.5090208@sugarcrm.com>
User-Agent: Opera Mail/11.00 (Win32)
Subject: Re: [PHP-DEV] Re: Zend engine's hashtable performance tweaks
From: marcin.babij@nasza-klasa.pl ("Marcin Babij")

> Hi!
>
>> Sorry for no attachments in previous message, I think my attachments
>> weren't redirected with message by lists.php.net email confirmation
>> system. I send them again, and for sure I attach links to public copy of
>> them over HTTP:
>> https://gist.github.com/761094 - php-5.3.4-hashtable-optimization.patch
>> https://gist.github.com/761096 - apc-3.1.6-hashtable-optimization.patch
>
> I just took a quick look, so some preliminary notes:
>
> 1. +#define ZEND_HASH_SMALL_LOAD_FACTOR_INV 1.75 etc. - maybe makes  
> sense to convert them from floats to couple of integers? I'm not sure if  
> gcc is smart enough not to use floating point here, which would probably  
> be slower than int *7/4.

My idea here was that *7 could lead to overflow on integer values, and  
division can't be made first. Way out is casting to int64 or float. And I  
found specifying one float more human-readable, than 2 integers in  
relation.

> 2. Would it be possible to clean up the patch? There are tons of diffs  
> like this:
> - HASH_UNPROTECT_RECURSION(ht1);
> - HASH_UNPROTECT_RECURSION(ht2);
> + HASH_UNPROTECT_RECURSION(ht1);
> + HASH_UNPROTECT_RECURSION(ht2);
> which make no sense and just make it harder to understand what's going  
> on.

Possibly they're some whitespace changes. I'll create bug report, and  
attach patches with removed such "changes".

>
> 3.
> - ulong h; /* Used for numeric indexing */
> + ulong h;
>
> why?

It should be back, this comment still remains valid, looks like it wasn't  
at some time of evolution of patch. :)
AFAIR it was removed, when it turned out, that with open addressing we  
can't use subsequent buckets like we used to do with numeric indexes. Then  
I've changed that by making additional hash->bucket index macro  
ZEND_HASH_BUCKET that prevents clustering.

>
> 4. zend_inline_hash_func - could you describe why you changed it? In any  
> case, it needs some comments, if you deleted the old one, please add  
> description of the new one any why it is better.

I'll provide some comments in patch. Now I can tell that the main idea was  
to handle long keys (>= 8 bytes long) in new way: take sizeof(ulong) bytes  
at once, hash them into hash variable, and to add remaining bytes, just  
take last sizeof(ulong) bytes. This makes much less instructions to  
execute, yet should keep hash function collision properties. Also, I don't  
care (why should I?) that if key length isn't multiple of sizeof(ulong)  
I'll count some bytes twice.
I would like to take such approach when nKeyLength < sizeof(ulong), which  
speeds hashing function a lot for short (most common) keys, but that needs  
to do some byte-masking and assumes that key's starting address is aligned  
to sizeof(ulong), otherwise it could read bytes out of page and lead to  
segmentation fault. Could we take such assumption, maybe just on specified  
architectures, to speed it up more?

>
> Will follow up after reading the patch more in depth.

Thank you for your comments.