Newsgroups: php.internals Path: news.php.net Xref: news.php.net php.internals:39137 Return-Path: Mailing-List: contact internals-help@lists.php.net; run by ezmlm Delivered-To: mailing list internals@lists.php.net Received: (qmail 95324 invoked from network); 21 Jul 2008 18:15:09 -0000 Received: from unknown (HELO lists.php.net) (127.0.0.1) by localhost with SMTP; 21 Jul 2008 18:15:09 -0000 Authentication-Results: pb1.pair.com header.from=scott@macvicar.net; sender-id=unknown Authentication-Results: pb1.pair.com smtp.mail=scott@macvicar.net; spf=permerror; sender-id=unknown Received-SPF: error (pb1.pair.com: domain macvicar.net from 193.227.246.108 cause and error) X-PHP-List-Original-Sender: scott@macvicar.net X-Host-Fingerprint: 193.227.246.108 ip246-108-v193.static.x-ip.net Received: from [193.227.246.108] ([193.227.246.108:56560] helo=lovelace.midden.org.uk) by pb1.pair.com (ecelerity 2.1.1.9-wez r(12769M)) with ESMTP id B1/CB-33452-A22D4884 for ; Mon, 21 Jul 2008 14:15:07 -0400 Received: from macvicar.demon.co.uk ([80.177.111.173] helo=[192.168.1.104]) by lovelace.midden.org.uk with esmtp (Exim 4.69) (envelope-from ) id 1KKzuV-0000mv-DY; Mon, 21 Jul 2008 19:15:01 +0100 Message-ID: <4884D20C.4090305@macvicar.net> Date: Mon, 21 Jul 2008 19:14:36 +0100 User-Agent: Thunderbird 2.0.0.14 (Windows/20080421) MIME-Version: 1.0 To: Andi Gutmans CC: Michal Dziemianko , internals@lists.php.net References: <698DE66518E7CA45812BD18E807866CE01D03F78@us-ex1.zend.net> In-Reply-To: <698DE66518E7CA45812BD18E807866CE01D03F78@us-ex1.zend.net> Content-Type: multipart/mixed; boundary="------------000103050406080409060906" X-Spam-Score: -3.0 X-Spam_Report: Spam detection software, running on the system "lovelace.midden.org.uk", has identified this incoming email as possible spam. The original message has been attached to this so you can view it (if it isn't spam) or label similar future email. If you have any questions, see the administrator of that system for details. Content preview: Hi Andi, The patch is attached for 5_3. I've got some time allocated tomorrow to review all of Michal's patches that have been produced for the GSoC. I'll try to post some figures from real life apps. [...] Content analysis details: (-3.0 points, 5.0 required) pts rule name description ---- ---------------------- -------------------------------------------------- 0.2 URIBL_GREY Contains an URL listed in the URIBL greylist [URIs: googlepages.com] 0.0 NORMAL_HTTP_TO_IP URI: Uses a dotted-decimal IP address in URL -2.6 BAYES_00 BODY: Bayesian spam probability is 0 to 1% [score: 0.0000] -0.6 AWL AWL: From: address is in the auto white-list Subject: Re: [PHP-DEV] zend_inline_hash_function reimplementation From: scott@macvicar.net (Scott MacVicar) --------------000103050406080409060906 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Hi Andi, The patch is attached for 5_3. I've got some time allocated tomorrow to review all of Michal's patches that have been produced for the GSoC. I'll try to post some figures from real life apps. Scott Andi Gutmans wrote: > Hi Michal, > > Can you please send a link to the patch so we can review? I didn't get > the attachment. > > Thanks, > Andi > > > > >> -----Original Message----- >> From: Michal Dziemianko [mailto:michal.dziemianko@gmail.com] >> Sent: Monday, July 21, 2008 8:29 AM >> To: internals@lists.php.net >> Subject: [PHP-DEV] zend_inline_hash_function reimplementation >> >> Hello, >> I have looked into Zend/zend_hash.h and I guess it might be sped up a > little. >> So far it uses D. Bernstein's hash which is quite fast, but I think it > might >> be worth replacing it with MurmurHash. I have tried comparison of > speed for >> them (both as separate C programs, and compiled into PHP 5_3). Results > for >> REAL keys (collected on running web server) are at the bottom of this > page: >> http://212.85.117.53/gsoc/ >> index.php?option=com_content&view=article&id=65:hash-functions-for- >> hash-tables&catid=34:profiling&Itemid=54 >> >> Murmur is Public Domain, and its author is maintaining it all the time > so I >> guess it will be faster and better. There is comparison of > distribution at >> both the page I gave and here: http:// murmurhash.googlepages.com/ >> >> The patch for 5_3 (applicable to HEAD) as attachment > --------------000103050406080409060906 Content-Type: text/plain; name="zend_inline_hash_function-5_3.diff.txt" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="zend_inline_hash_function-5_3.diff.txt" Index: zend_hash.h =================================================================== RCS file: /repository/ZendEngine2/zend_hash.h,v retrieving revision 1.78.2.2.2.2.2.4 diff -u -d -r1.78.2.2.2.2.2.4 zend_hash.h --- zend_hash.h 29 Apr 2008 08:15:17 -0000 1.78.2.2.2.2.2.4 +++ zend_hash.h 20 Jul 2008 12:37:41 -0000 @@ -222,65 +222,47 @@ ZEND_API int zend_hash_rehash(HashTable *ht); /* - * DJBX33A (Daniel J. Bernstein, Times 33 with Addition) - * - * This is Daniel J. Bernstein's popular `times 33' hash function as - * posted by him years ago on comp.lang.c. It basically uses a function - * like ``hash(i) = hash(i-1) * 33 + str[i]''. This is one of the best - * known hash functions for strings. Because it is both computed very - * fast and distributes very well. - * - * The magic of number 33, i.e. why it works better than many other - * constants, prime or not, has never been adequately explained by - * anyone. So I try an explanation: if one experimentally tests all - * multipliers between 1 and 256 (as RSE did now) one detects that even - * numbers are not useable at all. The remaining 128 odd numbers - * (except for the number 1) work more or less all equally well. They - * all distribute in an acceptable way and this way fill a hash table - * with an average percent of approx. 86%. - * - * If one compares the Chi^2 values of the variants, the number 33 not - * even has the best value. But the number 33 and a few other equally - * good numbers like 17, 31, 63, 127 and 129 have nevertheless a great - * advantage to the remaining numbers in the large set of possible - * multipliers: their multiply operation can be replaced by a faster - * operation based on just one shift plus either a single addition - * or subtraction operation. And because a hash function has to both - * distribute good _and_ has to be very fast to compute, those few - * numbers should be preferred and seems to be the reason why Daniel J. - * Bernstein also preferred it. - * - * - * -- Ralf S. Engelschall + * MURMUR Hash - http://murmurhash.googlepages.com/ */ - -static inline ulong zend_inline_hash_func(const char *arKey, uint nKeyLength) +#define MURMUR_SEED 5831 +#define MURMUR_M 0x5bd1e995 +#define MURMUR_R 24 +static inline ulong zend_inline_hash_func(const char *key, uint len) { - register ulong hash = 5381; + unsigned int h = MURMUR_SEED ^ len; + unsigned char * data = (unsigned char *)key; - /* variant with the hash unrolled eight times */ - for (; nKeyLength >= 8; nKeyLength -= 8) { - hash = ((hash << 5) + hash) + *arKey++; - hash = ((hash << 5) + hash) + *arKey++; - hash = ((hash << 5) + hash) + *arKey++; - hash = ((hash << 5) + hash) + *arKey++; - hash = ((hash << 5) + hash) + *arKey++; - hash = ((hash << 5) + hash) + *arKey++; - hash = ((hash << 5) + hash) + *arKey++; - hash = ((hash << 5) + hash) + *arKey++; - } - switch (nKeyLength) { - case 7: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */ - case 6: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */ - case 5: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */ - case 4: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */ - case 3: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */ - case 2: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */ - case 1: hash = ((hash << 5) + hash) + *arKey++; break; - case 0: break; -EMPTY_SWITCH_DEFAULT_CASE() + while(len >= 4) + { + unsigned int k = *(unsigned int *)data; + + k *= MURMUR_M; + k ^= k >> MURMUR_R; + k *= MURMUR_M; + + h *= MURMUR_M; + h ^= k; + + data += 4; + len -= 4; } - return hash; + + // Handle the last few bytes of the input array + switch(len) + { + case 3: h ^= data[2] << 16; + case 2: h ^= data[1] << 8; + case 1: h ^= data[0]; + h *= MURMUR_M; + }; + + // Do a few final mixes of the hash to ensure the last few + // bytes are well-incorporated. + h ^= h >> 13; + h *= MURMUR_M; + h ^= h >> 15; + + return h; } --------------000103050406080409060906--