Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:38160
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Received-SPF: pass (pb1.pair.com: domain gmail.com designates 209.85.146.181 as permitted sender)
DomainKey-Status: bad
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=gmail.com; s=gamma;
        h=message-id:date:from:to:subject:in-reply-to:mime-version
         :content-type:references;
        b=mvTfjOPviI6Ci7Mk6vIi/MSNVGqKMDFVv8ca23au4OKiCDztl1mdaR0AB8KCvBlVw6
         HRVoF/cBw8NdelHXjcweRdW8maGpg23FGrFfWKxeJbk3EOjaPGG7RYwEyke2zj0zGCLo
         6wGHTeQInDWshsqDt7VBvYogx5jPjWzqQIsWc=
Message-ID: <31fe29920806110114i263a47b4sd66d4849ceb22980@mail.gmail.com>
Date: Wed, 11 Jun 2008 09:14:07 +0100
To: internals@lists.php.net
In-Reply-To: <484D36EB.9080202@macvicar.net>
MIME-Version: 1.0
Content-Type: multipart/alternative; 
	boundary="----=_Part_26339_22317850.1213172047209"
References: <7E62CA6E-83F4-4F9C-86FB-75EBE7D489C9@gmail.com>
	 <D406487CA14F443B81EAE37236CAD5E6@pc07654>
	 <484D36EB.9080202@macvicar.net>
Subject: Re: [PHP-DEV] Algorithm Optimizations - string search
From: michal.dziemianko@gmail.com ("Michal Dziemianko")

------=_Part_26339_22317850.1213172047209
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

Hello everybody,
I have done small comparison of the 3 algorithms: the original zend_memnstr
implementation, KMP, and Boyer-Moore (inefficient, "book" implementation).

The first part - real life data. I have collected the data about the
parameters passed to ZEND_MEMNSTR (not only while it is called from strpos)
on my own server. Data is collected over 1h ~ 2 000 000 calls. (if somebody
thinks how -> "the hard way" just fwrite() added to zend_memnstr to save in
file data about the call)

Here are some statistics:
- average haystack length: 624.2
- average needle length: 1.9 ! -> 63% of needles of length 1
- avg length of haystacks shorter than avg: 41.0 -> 85% of all haystacks
- avg length of haystacks  longer than avg: 5685.11

I did several tests - both "human readable" and artificial texts and here
are (some) results:

In case of searching of words in text (piece of book copied into test
bench):
For short haystacks (<1000) the original implementation is actually the best
one.
KMP is slower due to preprocessing time and some overhead in the loop.
BM is 10-30 times slower, thus not really applicable.

In case of artificial texts (like search of params in kind of config - taken
real life example from Joomla)
KMP and original implementation are almost the same (the latter a little bit
faster)
BM is still much slower

some results (times in seconds) - only 5 shown. If somebody wants to play
with it - I can post small program with implementation of all 3 algorithms)

ORIGINAL
-------------------
test 1: 2.450000
test 2: 5.380000
test 3: 3.890000
test 4: 2.870000
test 5: 11.840000

KMP
-------------------
test 1: 2.910000
test 2: 11.370000
test 3: 7.000000
test 4: 3.970000
test 5: 11.660000

BM
-------------------
test 1: 35.030000
test 2: 35.460000
test 3: 8.570000
test 4: 30.500000
test 5: 31.860000

For longer text (<5000) KMP and original zend_memnstr are basically similar,
BM starts to show it is better, the real fun is when haystack is loner than
5000 chars - than suddenly BM starts to be really good.

Some examples for texts longer than 3000 chars

ORIGINAL
-------------------
test 1: 0.110000  <- long text with, strpos = 0
test 2: 97.620000 <- really long text (~10000chars, match ~8000)
test 3:4400, 17.220000 <- match ~4400
test 4:0, 157.990000 <- degraded long text
test 5:0, 99.920000

KMP
-------------------
test 1:0, 0.550000
test 2:8596, 36.670000
test 3:4400, 50.610000
test 4:0, 122.000000
test 5:0, 93.800000

BM
-------------------
test 1:0 28.690000
test 2:8596, 23.230000
test 3:4400, 18.420000
test 4:0 38.890000
test 5:0, 157.920000 <- this one is strange!

As I haven't taken into account the information about haystack/needle KMP
seemed to be good choice. But now think about 63% of calls of ZEND_MEMNSTR
with needle_len = 1!!  I think it is the main place for optimization.
Although strpos implements fix for that, some other functions don't. My idea
is than to implement ZEND_MEMNSTR once again in shape:

if (needle_len = 1)
    here just linear sweep
else if haystack_len < 5000 (5000 is arbitrary - maybe some more tests
needed to choose good value)
    original implementation (as it is the best one in this case)
else
    BM/KMP (i think BM will be better in this case, as some people
suggested)

Then cleanup all the functions that use zend_memnstr, as they can be this
way simpler.

I did similar test for strripos - in this case KMP is good enough (the
original implementation ALWAYS run in time o(nm) - it is just memcmp at
eachposition no matter if the first chars match, thus is outperformed by KMP
even for short strings). Implementing BM here is not necessary in my opinion
- the match is usually quite close to the end of haystack (average ~80% of
haystack length). But fix for needle_len = 1 should be left as here also
many searches are with short needles.

Do you think it is good idea to rewrite it this way?

Cheers,
Michal

------=_Part_26339_22317850.1213172047209--