Newsgroups: php.internals Path: news.php.net Xref: news.php.net php.internals:51551 Return-Path: Mailing-List: contact internals-help@lists.php.net; run by ezmlm Delivered-To: mailing list internals@lists.php.net Received: (qmail 53942 invoked from network); 3 Mar 2011 00:27:40 -0000 Received: from unknown (HELO lists.php.net) (127.0.0.1) by localhost with SMTP; 3 Mar 2011 00:27:40 -0000 Authentication-Results: pb1.pair.com header.from=damz@damz.org; sender-id=unknown Authentication-Results: pb1.pair.com smtp.mail=damz@damz.org; spf=permerror; sender-id=unknown Received-SPF: error (pb1.pair.com: domain damz.org from 209.85.216.42 cause and error) X-PHP-List-Original-Sender: damz@damz.org X-Host-Fingerprint: 209.85.216.42 mail-qw0-f42.google.com Received: from [209.85.216.42] ([209.85.216.42:36071] helo=mail-qw0-f42.google.com) by pb1.pair.com (ecelerity 2.1.1.9-wez r(12769M)) with ESMTP id 19/D1-40003-B70EE6D4 for ; Wed, 02 Mar 2011 19:27:40 -0500 Received: by qwd6 with SMTP id 6so468214qwd.29 for ; Wed, 02 Mar 2011 16:27:37 -0800 (PST) MIME-Version: 1.0 Received: by 10.229.85.203 with SMTP id p11mr424312qcl.73.1299112057338; Wed, 02 Mar 2011 16:27:37 -0800 (PST) Received: by 10.229.218.201 with HTTP; Wed, 2 Mar 2011 16:27:37 -0800 (PST) X-Originating-IP: [82.231.176.147] In-Reply-To: <4D6A3A9C.50308@seld.be> References: <4D6A3A9C.50308@seld.be> Date: Thu, 3 Mar 2011 01:27:37 +0100 Message-ID: To: Jordi Boggiano Cc: "internals@lists.php.net" Content-Type: multipart/alternative; boundary=0016364ef42eb37cf0049d891889 Subject: Re: [PHP-DEV] Volnitsky substring search algo From: damz@damz.org (Damien Tournoud) --0016364ef42eb37cf0049d891889 Content-Type: text/plain; charset=ISO-8859-1 On Sun, Feb 27, 2011 at 12:50 PM, Jordi Boggiano wrote: > http://volnitsky.com/project/str_search/ The algorithm seems flawed to me, at least in its reference implementation. There does not seem to be any guarantee that the returned position is really the *first* occurrence of the needle in the haystack. It's easy to see with needle being a repetition of the same character: SS = 'a' * 1000 W_size = 4 The hash map build will be clogged, with the value 1 stored in the first slot for SS. As a consequence, the algorithm will step through the haystack, trying to confirm a match for the needle at the step position. If a match is found there, it will discard any previous matches that could be valid at this position. All those haystack will return the same position 998: S = 'bbbb' * 997 + 'a' * 1000 (correct) S = 'bbb' * 996 + 'a' * 1001 (incorrect, should return 997) S = 'bb' * 995 + 'a' * 1002 (incorrect, should return 996) S = 'b' * 994 + 'a' * 1003 (incorrect, should return 995) The implementation could be fixed (by adding an explicit string matching when building the hash table, and by storing *all* the occurrences of a given W in SS), but that will increase the overall cost (both computing and memory) of the algorithm. Damien Tournoud --0016364ef42eb37cf0049d891889--