Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:51551
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Received-SPF: error (pb1.pair.com: domain damz.org from 209.85.216.42 cause and error)
MIME-Version: 1.0
In-Reply-To: <4D6A3A9C.50308@seld.be>
References: <4D6A3A9C.50308@seld.be>
Date: Thu, 3 Mar 2011 01:27:37 +0100
Message-ID: <AANLkTin8SZZZX1hQmcboPqzpQmqiFtd2XXO0fTMVRANY@mail.gmail.com>
To: Jordi Boggiano <j.boggiano@seld.be>
Cc: "internals@lists.php.net" <internals@lists.php.net>
Content-Type: multipart/alternative; boundary=0016364ef42eb37cf0049d891889
Subject: Re: [PHP-DEV] Volnitsky substring search algo
From: damz@damz.org (Damien Tournoud)

--0016364ef42eb37cf0049d891889
Content-Type: text/plain; charset=ISO-8859-1

On Sun, Feb 27, 2011 at 12:50 PM, Jordi Boggiano <j.boggiano@seld.be> wrote:

> http://volnitsky.com/project/str_search/


The algorithm seems flawed to me, at least in its reference implementation.
There does not seem to be any guarantee that the returned position is really
the *first* occurrence of the needle in the haystack.

It's easy to see with needle being a repetition of the same character:
  SS = 'a' * 1000
  W_size = 4

The hash map build will be clogged, with the value 1 stored in the first
slot for SS. As a consequence, the algorithm will step through the haystack,
trying to confirm a match for the needle at the step position. If a match is
found there, it will discard any previous matches that could be valid at
this position. All those haystack will return the same position 998:
  S = 'bbbb' * 997 + 'a' * 1000 (correct)
  S = 'bbb' * 996 + 'a' * 1001 (incorrect, should return 997)
  S = 'bb' * 995 + 'a' * 1002 (incorrect, should return 996)
  S = 'b' * 994 + 'a' * 1003 (incorrect, should return 995)

The implementation could be fixed (by adding an explicit string matching
when building the hash table, and by storing *all* the occurrences of a
given W in SS), but that will increase the overall cost (both computing and
memory) of the algorithm.

Damien Tournoud

--0016364ef42eb37cf0049d891889--