strtr vs. str_replace runtime

13 years ago by Nicolai Scheer — view source — reply

unread

Hi!

I stumbled upon a "problem" with the function strtr() the other day... I
noticed a very long running php script, and tried to reproduce the
behaviour.

I traced it down to a single call of strtr doing text replacements using a
search => replace array.
It wasn't quit obvious why the call would take that long, so I started to
investigate the issue.

From the docs, it says that strtr "... will be the most efficient when all
the keys have the same size".
My testcase showed, that in fact I was using Keys of very different lengths
(they are determined automatically, so there's no fixed list).

I wrote a simple script to reproduce this behaviour:

<?php

$text = str_repeat( 'm', 2000 );

$long_from_a = str_repeat( 'a', 1 );
$long_from_x = str_repeat( 'x', 1500 );

$replacements = array(
$long_from_a => 'b',
$long_from_x => 'y'
);

$start = microtime( true );
$result_1 = strtr( $text, $replacements );
echo "strtr: " . number_format( microtime( true ) - $start, 4 ) . "\n";

$start = microtime( true );
$result_2 = str_replace( array_keys( $replacements ), array_values(
$replacements ), $text );
echo "str_replace: " . number_format( microtime( true ) - $start, 4 ) .
"\n";

echo $result_1 === $result_2 ? "results match!\n": "no match!\n";

On my box, this reports 2.5 seconds for strtr and 0.0 seconds for
str_replace. As far as I know the only difference between str_replace and
strtr is that strtr does not replace stuff in already replaced parts of the
string. Might be wrong here, though.

If I adjust the str_repeat for "m" from 2000 to 20000 the runtime is 45
seconds for strtr and 0.0001 for str_replace.

I might have chosen the wrong tool for what I'm trying to achieve in the
first place, but can anyone comment on the algorithmic complexity of strtr?
This is definitely not the expected behaviour for such small inputs. Since
the inputs varied and the keys where determined automatically in my
original script, I was confronted with runtimes of several hours compared
to just a few seconds with str_replace.

If this is the expected behaviour, at least the documentation should be
adjusted to state that this function is very inefficient with keylengths
that are very distant from each other...

Greetings

Nico

13 years ago by Gustavo Lopes — view source — reply

unread

Em 2013-01-02 16:53, Nicolai Scheer escreveu:

I might have chosen the wrong tool for what I'm trying to achieve in
the
first place, but can anyone comment on the algorithmic complexity of
strtr?
This is definitely not the expected behaviour for such small inputs.
Since
the inputs varied and the keys where determined automatically in my
original script, I was confronted with runtimes of several hours
compared
to just a few seconds with str_replace.

If this is the expected behaviour, at least the documentation should
be
adjusted to state that this function is very inefficient with
keylengths
that are very distant from each other...

Please open a bug to track this.

The algorithm behaves very poorly in this case because at each position
of the text, all the substrings starting there and with size between m
and n (where m is the size of the smallest pattern and n is the largest)
are checked, even if there are only two patterns with size m and n. We
could fix this easily by building a set of the pattern sizes found and
try only with those. The hashing of the substrings could also be
improved; we don't have to recalculate everything when we advance in the
text.

--
Gustavo Lopes

13 years ago by Nicolai Scheer — view source — reply

unread

Hi!

Em 2013-01-02 16:53, Nicolai Scheer escreveu:

I might have chosen the wrong tool for what I'm trying to achieve in the
first place, but can anyone comment on the algorithmic complexity of
strtr?
This is definitely not the expected behaviour for such small inputs. Since
the inputs varied and the keys where determined automatically in my
original script, I was confronted with runtimes of several hours compared
to just a few seconds with str_replace.

If this is the expected behaviour, at least the documentation should be
adjusted to state that this function is very inefficient with keylengths
that are very distant from each other...

Please open a bug to track this.

The algorithm behaves very poorly in this case because at each position of
the text, all the substrings starting there and with size between m and n
(where m is the size of the smallest pattern and n is the largest) are
checked, even if there are only two patterns with size m and n. We could
fix this easily by building a set of the pattern sizes found and try only
with those. The hashing of the substrings could also be improved; we don't
have to recalculate everything when we advance in the text.

Ok, here it goes:

https://bugs.php.net/bug.php?id=63893

Greetings,

Nico

13 years ago by Gustavo Lopes — view source — reply

unread

On Thu, 03 Jan 2013 11:40:31 +0100, Gustavo Lopes glopes@nebm.ist.utl.pt
wrote:

The algorithm behaves very poorly in this case because at each position
of the text, all the substrings starting there and with size between m
and n (where m is the size of the smallest pattern and n is the largest)
are checked, even if there are only two patterns with size m and n. We
could fix this easily by building a set of the pattern sizes found and
try only with those. The hashing of the substrings could also be
improved; we don't have to recalculate everything when we advance in the
text.

Both optimizations (the hash rolling and limiting the substrings hashed on
each iteration) worked quite well.

But I got much better results with another algorithm [1], so I'm going to
merge the branch with it [2] instead. I get these results with a 1.7 MB
string and 13 replacement strings, the smallest with 6 characters and 30
iterations (x86-64, gcc -O3):

strtr: 0.1387
str_replace: 0.4471

The algorithm doesn't perform as well when the replacement strings are
small. Adding a replacement for the pattern '_' (1 character) yields:

strtr: 0.6157
str_replace: 0.6230

But even in this case, it works better than my optimized version of the
current algorithm.

I plan on merging to 5.4 and 5.5; you may want to review it as introducing
completely new code carries some risk.

[1] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.13.2927
[2] https://github.com/cataphract/php-src/compare/strtr_wu94

--
Gustavo Lopes

13 years ago by Nicolai Scheer — view source — reply

unread

Hi!

On Thu, 03 Jan 2013 11:40:31 +0100, Gustavo Lopes glopes@nebm.ist.utl.pt
wrote:

The algorithm behaves very poorly in this case because at each position

of the text, all the substrings starting there and with size between m and
n (where m is the size of the smallest pattern and n is the largest) are
checked, even if there are only two patterns with size m and n. We could
fix this easily by building a set of the pattern sizes found and try only
with those. The hashing of the substrings could also be improved; we don't
have to recalculate everything when we advance in the text.

Both optimizations (the hash rolling and limiting the substrings hashed on
each iteration) worked quite well.

But I got much better results with another algorithm [1], so I'm going to
merge the branch with it [2] instead. I get these results with a 1.7 MB
string and 13 replacement strings, the smallest with 6 characters and 30
iterations (x86-64, gcc -O3):

strtr: 0.1387
str_replace: 0.4471

Nice :)

The algorithm doesn't perform as well when the replacement strings are
small. Adding a replacement for the pattern '_' (1 character) yields:

strtr: 0.6157
str_replace: 0.6230

Even that is way better than before :)

But even in this case, it works better than my optimized version of the
current algorithm.

I plan on merging to 5.4 and 5.5; you may want to review it as introducing
completely new code carries some risk.

[1] http://citeseerx.ist.psu.edu/**viewdoc/summary?doi=10.1.1.13.**2927 http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.13.2927
[2] https://github.com/cataphract/php-src/compare/strtr_wu94

Does "merging to 5.4" mean I can grab the head of the 5.4 branch afterwards
and try it myself?

Thanks!

Greetings

Nico

13 years ago by Christopher Jones — view source — reply

unread

The algorithm behaves very poorly in this case because at each position of the text, all the substrings starting there and with size between m and n (where m is the size of the smallest pattern and n is the largest) are checked, even if there are only
two patterns with size m and n. We could fix this easily by building a set of the pattern sizes found and try only with those. The hashing of the substrings could also be improved; we don't have to recalculate everything when we advance in the text.

Both optimizations (the hash rolling and limiting the substrings hashed on each iteration) worked quite well.

But I got much better results with another algorithm [1], so I'm going to merge the branch with it [2] instead. I get these results with a 1.7 MB string and 13 replacement strings, the smallest with 6 characters and 30 iterations (x86-64, gcc -O3):

strtr: 0.1387
str_replace: 0.4471

The algorithm doesn't perform as well when the replacement strings are small. Adding a replacement for the pattern '_' (1 character) yields:

strtr: 0.6157
str_replace: 0.6230

How does this compare with your baseline results?

But even in this case, it works better than my optimized version of the current algorithm.

I plan on merging to 5.4 and 5.5; you may want to review it as introducing completely new code carries some risk.

Depending on the improvement, it might be tempting to merge to 5.4 but I would
prefer to see it in 5.5+. Let's keep 5.4 stable.

Chris

[1] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.13.2927
[2] https://github.com/cataphract/php-src/compare/strtr_wu94

--
christopher.jones@oracle.com http://twitter.com/ghrd
Newly updated, free PHP & Oracle book:
http://www.oracle.com/technetwork/topics/php/underground-php-oracle-manual-098250.html

13 years ago by Gustavo Lopes — view source — reply

unread

Em 2013-01-11 0:32, Christopher Jones escreveu:

How does this compare with your baseline results?

I ran some benchmarks.
Configure line:
CC=gcc-mp-4.8 CFLAGS="-O3 -march=native" ./configure --disable-all
--host=x86_64-apple-darwin10 --build=x86_64-apple-darwin10
CPU: Intel(R) Core(TM) i5-2500S CPU @ 2.70GHz

The text being searched is this page copy pasted:
http://lxr.php.net/xref/PHP_TRUNK/Zend/zend_vm_execute.h
The functions are run 30 time in a loop.

All new algorithm (branch cataphract/strtr_wu94):

Replacements: 13 total, smallest 6 largest 19
strtr: 0.1534
str_replace: 0.8625

Replacements: 14 total, smallest 1 largest 19
strtr: 0.5305
str_replace: 1.0654

Replacements: 1 total, smallest 7 largest 7
strtr: 0.1142
str_replace: 0.0985

Replacements: 2 total, smallest 7 largest 2500
strtr: 0.1165
str_replace: 0.1656

===
Old algorithm improved (branch cataphract/strtr):

Replacements: 13 total, smallest 6 largest 19
strtr: 0.8922
str_replace: 0.8606

Replacements: 14 total, smallest 1 largest 19
strtr: 1.2031
str_replace: 1.0687

Replacements: 1 total, smallest 7 largest 7
strtr: 0.4130
str_replace: 0.0991

Replacements: 2 total, smallest 7 largest 2500
strtr: 0.5886
str_replace: 0.1719

===
Current (branch master)

Replacements: 13 total, smallest 6 largest 19
strtr: 20.0317
str_replace: 0.8707
results match!

Replacements: 14 total, smallest 1 largest 19
strtr: 26.6792
str_replace: 1.1017
results match!

Replacements: 1 total, smallest 7 largest 7
strtr: 1.2030
str_replace: 0.0850
results match!

Replacements: 2 total, smallest 7 largest 2500
^C (got tired of waiting after a few minutes)

--
Gustavo Lopes

13 years ago by Gustavo Lopes — view source — reply

unread

On Wed, 09 Jan 2013 23:45:03 +0100, Gustavo Lopes glopes@nebm.ist.utl.pt
wrote:

On Thu, 03 Jan 2013 11:40:31 +0100, Gustavo Lopes
glopes@nebm.ist.utl.pt wrote:

The algorithm behaves very poorly in this case because at each position
of the text, all the substrings starting there and with size between m
and n (where m is the size of the smallest pattern and n is the
largest) are checked, even if there are only two patterns with size m
and n. We could fix this easily by building a set of the pattern sizes
found and try only with those. The hashing of the substrings could also
be improved; we don't have to recalculate everything when we advance in
the text.

Both optimizations (the hash rolling and limiting the substrings hashed
on each iteration) worked quite well.

But I got much better results with another algorithm [1], so I'm going
to merge the branch with it [2] instead.

OK, so now the plan is to merge this onto 5.4:

https://github.com/cataphract/php-src/compare/php:PHP-5.4...cataphract:strtr_wu94_54

And this to 5.5:

https://github.com/cataphract/php-src/compare/php:PHP-5.5...cataphract:strtr_wu94_55

The difference is that the 5.4 version does not introduce zend_qsort_r()
and instead has dedicated simple heap sort.

Any thoughts on this?

--
Gustavo Lopes

13 years ago by Christopher Jones — view source — reply

unread

The algorithm behaves very poorly in this case because at each position of the text, all the substrings starting there and with size between m and n (where m is the size of the smallest pattern and n is the largest) are checked, even if there are only
two patterns with size m and n. We could fix this easily by building a set of the pattern sizes found and try only with those. The hashing of the substrings could also be improved; we don't have to recalculate everything when we advance in the text.

Both optimizations (the hash rolling and limiting the substrings hashed on each iteration) worked quite well.

But I got much better results with another algorithm [1], so I'm going to merge the branch with it [2] instead.

OK, so now the plan is to merge this onto 5.4:

https://github.com/cataphract/php-src/compare/php:PHP-5.4...cataphract:strtr_wu94_54

And this to 5.5:

https://github.com/cataphract/php-src/compare/php:PHP-5.5...cataphract:strtr_wu94_55

The difference is that the 5.4 version does not introduce zend_qsort_r() and instead has dedicated simple heap sort.

Any thoughts on this?

Despite temptation, I'm still erring on the side of merging to 5.5+
only. My argument is based on the potential for regressions (behavior
and performance), added code maintenance issues, merge effort etc.
The ability to differentiate the benefits of PHP 5.5 over 5.4, and
promote migration to 5.5 will also be improved.

Chris

--
christopher.jones@oracle.com http://twitter.com/ghrd
Newly updated, free PHP & Oracle book:
http://www.oracle.com/technetwork/topics/php/underground-php-oracle-manual-098250.html

13 years ago by Stas Malyshev — view source — reply

unread

Hi!

OK, so now the plan is to merge this onto 5.4:

https://github.com/cataphract/php-src/compare/php:PHP-5.4...cataphract:strtr_wu94_54

This one looks mostly harmless, so if all strtr tests still pass I think
it's OK for 5.4.

--
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227

13 years ago by Gustavo Lopes — view source — reply

unread

On Mon, 14 Jan 2013 22:55:33 +0100, Gustavo Lopes glopes@nebm.ist.utl.pt
wrote:

OK, so now the plan is to merge this onto 5.4:

https://github.com/cataphract/php-src/compare/php:PHP-5.4...cataphract:strtr_wu94_54

And this to 5.5:

https://github.com/cataphract/php-src/compare/php:PHP-5.5...cataphract:strtr_wu94_55

The difference is that the 5.4 version does not introduce zend_qsort_r()
and instead has dedicated simple heap sort.

Given the marked improvements in performance and no objection from the
RMs, I went through with this plan.

--
Gustavo Lopes