Newsgroups: php.internals Path: news.php.net Xref: news.php.net php.internals:38160 Return-Path: Mailing-List: contact internals-help@lists.php.net; run by ezmlm Delivered-To: mailing list internals@lists.php.net Received: (qmail 18601 invoked from network); 11 Jun 2008 08:14:12 -0000 Received: from unknown (HELO lists.php.net) (127.0.0.1) by localhost with SMTP; 11 Jun 2008 08:14:12 -0000 Authentication-Results: pb1.pair.com smtp.mail=michal.dziemianko@gmail.com; spf=pass; sender-id=pass Authentication-Results: pb1.pair.com header.from=michal.dziemianko@gmail.com; sender-id=pass; domainkeys=bad Received-SPF: pass (pb1.pair.com: domain gmail.com designates 209.85.146.181 as permitted sender) DomainKey-Status: bad X-DomainKeys: Ecelerity dk_validate implementing draft-delany-domainkeys-base-01 X-PHP-List-Original-Sender: michal.dziemianko@gmail.com X-Host-Fingerprint: 209.85.146.181 wa-out-1112.google.com Received: from [209.85.146.181] ([209.85.146.181:35526] helo=wa-out-1112.google.com) by pb1.pair.com (ecelerity 2.1.1.9-wez r(12769M)) with ESMTP id 13/4D-26183-1598F484 for ; Wed, 11 Jun 2008 04:14:12 -0400 Received: by wa-out-1112.google.com with SMTP id v27so2694796wah.17 for ; Wed, 11 Jun 2008 01:14:07 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to :subject:in-reply-to:mime-version:content-type:references; bh=DTWSpDxcQGHY2DvcMyKCEsAkAPVT05CPi/xu5yP4F64=; b=qskmWwR/pWEmcYCeIHOzjREjnhvGkgKWi4QzEpFZzWBkxHw+9OLYbCQYcW2mjzioeu HUkKNLunoVNuKq50+7AiRBMeRtBNREr5DmDwVzEj7WSPwClbDbtJBl2E68IK+lUwbpG2 fEo/fSRzaIqW2IYTuCrv10jj4PBy7z2pa6H5I= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:in-reply-to:mime-version :content-type:references; b=mvTfjOPviI6Ci7Mk6vIi/MSNVGqKMDFVv8ca23au4OKiCDztl1mdaR0AB8KCvBlVw6 HRVoF/cBw8NdelHXjcweRdW8maGpg23FGrFfWKxeJbk3EOjaPGG7RYwEyke2zj0zGCLo 6wGHTeQInDWshsqDt7VBvYogx5jPjWzqQIsWc= Received: by 10.115.18.1 with SMTP id v1mr6107415wai.81.1213172047221; Wed, 11 Jun 2008 01:14:07 -0700 (PDT) Received: by 10.114.88.9 with HTTP; Wed, 11 Jun 2008 01:14:07 -0700 (PDT) Message-ID: <31fe29920806110114i263a47b4sd66d4849ceb22980@mail.gmail.com> Date: Wed, 11 Jun 2008 09:14:07 +0100 To: internals@lists.php.net In-Reply-To: <484D36EB.9080202@macvicar.net> MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="----=_Part_26339_22317850.1213172047209" References: <7E62CA6E-83F4-4F9C-86FB-75EBE7D489C9@gmail.com> <484D36EB.9080202@macvicar.net> Subject: Re: [PHP-DEV] Algorithm Optimizations - string search From: michal.dziemianko@gmail.com ("Michal Dziemianko") ------=_Part_26339_22317850.1213172047209 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Content-Disposition: inline Hello everybody, I have done small comparison of the 3 algorithms: the original zend_memnstr implementation, KMP, and Boyer-Moore (inefficient, "book" implementation). The first part - real life data. I have collected the data about the parameters passed to ZEND_MEMNSTR (not only while it is called from strpos) on my own server. Data is collected over 1h ~ 2 000 000 calls. (if somebody thinks how -> "the hard way" just fwrite() added to zend_memnstr to save in file data about the call) Here are some statistics: - average haystack length: 624.2 - average needle length: 1.9 ! -> 63% of needles of length 1 - avg length of haystacks shorter than avg: 41.0 -> 85% of all haystacks - avg length of haystacks longer than avg: 5685.11 I did several tests - both "human readable" and artificial texts and here are (some) results: In case of searching of words in text (piece of book copied into test bench): For short haystacks (<1000) the original implementation is actually the best one. KMP is slower due to preprocessing time and some overhead in the loop. BM is 10-30 times slower, thus not really applicable. In case of artificial texts (like search of params in kind of config - taken real life example from Joomla) KMP and original implementation are almost the same (the latter a little bit faster) BM is still much slower some results (times in seconds) - only 5 shown. If somebody wants to play with it - I can post small program with implementation of all 3 algorithms) ORIGINAL ------------------- test 1: 2.450000 test 2: 5.380000 test 3: 3.890000 test 4: 2.870000 test 5: 11.840000 KMP ------------------- test 1: 2.910000 test 2: 11.370000 test 3: 7.000000 test 4: 3.970000 test 5: 11.660000 BM ------------------- test 1: 35.030000 test 2: 35.460000 test 3: 8.570000 test 4: 30.500000 test 5: 31.860000 For longer text (<5000) KMP and original zend_memnstr are basically similar, BM starts to show it is better, the real fun is when haystack is loner than 5000 chars - than suddenly BM starts to be really good. Some examples for texts longer than 3000 chars ORIGINAL ------------------- test 1: 0.110000 <- long text with, strpos = 0 test 2: 97.620000 <- really long text (~10000chars, match ~8000) test 3:4400, 17.220000 <- match ~4400 test 4:0, 157.990000 <- degraded long text test 5:0, 99.920000 KMP ------------------- test 1:0, 0.550000 test 2:8596, 36.670000 test 3:4400, 50.610000 test 4:0, 122.000000 test 5:0, 93.800000 BM ------------------- test 1:0 28.690000 test 2:8596, 23.230000 test 3:4400, 18.420000 test 4:0 38.890000 test 5:0, 157.920000 <- this one is strange! As I haven't taken into account the information about haystack/needle KMP seemed to be good choice. But now think about 63% of calls of ZEND_MEMNSTR with needle_len = 1!! I think it is the main place for optimization. Although strpos implements fix for that, some other functions don't. My idea is than to implement ZEND_MEMNSTR once again in shape: if (needle_len = 1) here just linear sweep else if haystack_len < 5000 (5000 is arbitrary - maybe some more tests needed to choose good value) original implementation (as it is the best one in this case) else BM/KMP (i think BM will be better in this case, as some people suggested) Then cleanup all the functions that use zend_memnstr, as they can be this way simpler. I did similar test for strripos - in this case KMP is good enough (the original implementation ALWAYS run in time o(nm) - it is just memcmp at eachposition no matter if the first chars match, thus is outperformed by KMP even for short strings). Implementing BM here is not necessary in my opinion - the match is usually quite close to the end of haystack (average ~80% of haystack length). But fix for needle_len = 1 should be left as here also many searches are with short needles. Do you think it is good idea to rewrite it this way? Cheers, Michal ------=_Part_26339_22317850.1213172047209--