Volnitsky substring search algo

14 years ago by Jordi Boggiano — view source — reply

unread

Heya,

I just wanted to attract attention to this new substring search
algorithm that seems faster better stronger.. According to the author of
course. I don't quite have the required knowledge to evaluate this
properly, but thought it was worth investigating if someone wants to.

http://volnitsky.com/project/str_search/

Note that I contacted the author and although he intends to release it
as GPL, he said he would probably give out a special license for PHP if
needed, since the GPL is afaik not compatible with the PHP license.

He also said that his algo was more useful for big haystacks than small
ones due to the startup overhead, but it seems that his code is anyway
falling back to std::search for smallish haystacks.

Cheers

--
Jordi Boggiano
@seldaek :: http://seld.be/

14 years ago by Stas Malyshev — view source — reply

unread

Hi!

http://volnitsky.com/project/str_search/

I'm not sure it'd be easy to integrate this into PHP codebase as-is,
provided it relies on C++ standard libraries which PHP makes no use of
(and thus potentially introduces a world of dependencies and
complexities into the build process). I'm sure it can be re-done in pure
standard C, then it can be tested in PHP and if it's better - I don't
see why it can't be integrated.

--
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227

14 years ago by Pierre Joye — view source — reply

unread

hi!

I don't see why it can't be
integrated.

We should wait for a new license before even looking at it.

Cheers,

Pierre

@pierrejoye | http://blog.thepimp.net | http://www.libgd.org

14 years ago by Jordi Boggiano — view source — reply

unread

I don't see why it can't be
integrated.

We should wait for a new license before even looking at it.

Then please ask him directly on behalf of PHP. He told me it was no
problem, but I didn't think it was really my place to ask.

As for looking at it and reimplementing in C, I don't think I can help,
I just wanted to raise awareness and hope that someone will pick up the
task.

Cheers

--
Jordi Boggiano
@seldaek :: http://seld.be/

14 years ago by Pierre Joye — view source — reply

unread

I don't see why it can't be
integrated.

We should wait for a new license before even looking at it.

Then please ask him directly on behalf of PHP. He told me it was no
problem, but I didn't think it was really my place to ask.

If you can take care of that, please go ahead.

Cheers,

Pierre

@pierrejoye | http://blog.thepimp.net | http://www.libgd.org

14 years ago by keisial@gmail.com — view source — reply

unread

Stas Malyshev wrote:

Hi!

http://volnitsky.com/project/str_search/

I'm not sure it'd be easy to integrate this into PHP codebase as-is,
provided it relies on C++ standard libraries which PHP makes no use of
(and thus potentially introduces a world of dependencies and
complexities into the build process). I'm sure it can be re-done in
pure standard C, then it can be tested in PHP and if it's better - I
don't see why it can't be integrated.

Not really. It only uses std::search, which would be equivalent to the
current zend_memnstr(). And std::numeric_limits can be replaced with a
limits.h macro.
I'd be more concerned about the "only for little-endian platforms and
where access to misaligned W is allowed" remark. php is also available
for big endian architectures, but that seems easy. Some architecture
supported by php won't probably accept that, so it would also need some
configure test to disable it.

14 years ago by Damien Tournoud — view source — reply

unread

http://volnitsky.com/project/str_search/

The algorithm seems flawed to me, at least in its reference implementation.
There does not seem to be any guarantee that the returned position is really
the first occurrence of the needle in the haystack.

It's easy to see with needle being a repetition of the same character:
SS = 'a' * 1000
W_size = 4

The hash map build will be clogged, with the value 1 stored in the first
slot for SS. As a consequence, the algorithm will step through the haystack,
trying to confirm a match for the needle at the step position. If a match is
found there, it will discard any previous matches that could be valid at
this position. All those haystack will return the same position 998:
S = 'bbbb' * 997 + 'a' * 1000 (correct)
S = 'bbb' * 996 + 'a' * 1001 (incorrect, should return 997)
S = 'bb' * 995 + 'a' * 1002 (incorrect, should return 996)
S = 'b' * 994 + 'a' * 1003 (incorrect, should return 995)

The implementation could be fixed (by adding an explicit string matching
when building the hash table, and by storing all the occurrences of a
given W in SS), but that will increase the overall cost (both computing and
memory) of the algorithm.

Damien Tournoud

14 years ago by Leonid Volnitsky — view source — reply

unread

Damien Tournoud <damz <at> damz.org> writes:

On Sun, Feb 27, 2011 at 12:50 PM, Jordi Boggiano <j.boggiano <at> seld.be>
wrote:
The algorithm seems flawed to me, at least in its reference implementation.
There does not seem to be any guarantee that the returned position is really
the first occurrence of the needle in the haystack.

This is a bug. This algorithm is still something very new.

The implementation could be fixed (by adding an explicit string matching
when building the hash table, and by storing all the occurrences of a
given W in SS), but that will increase the overall cost (both computing and
memory) of the algorithm.

Bug was fixed by reversing hash fill order (without increase of overall cost).

@ All

About the licence. I am not a layer, I need to figure out the licensing.
I intend to give PHP project a special license, I just need to clarify several
things.