Hello,
Here: http://212.85.117.53/DIFF.txt is small patch that will speed
up following functions:
strpos,
stripos,
strrpos
strripos,
and probably some others (all that use zend_memnstr/php_memnstr
function)
The speedup of zend_memnstr is about 8% on average (about 30% in case
of artificial strings).
Functions strrpos and strripos are about 25% faster on average.
The only drawback - it needs some additional space (size of the needle).
All functions pass all the tests.
If it looks fine than I will apply for cvs account.
Cheers,
Michal
Hi Michal,
Everything looks fine here and it applies cleanly, do you think you
could make a patch against HEAD with this as well? I suspect it will be
different due to Unicode.
Scott
Michal Dziemianko wrote:
Hello,
Here: http://212.85.117.53/DIFF.txt is small patch that will speed up
following functions:
strpos,
stripos,
strrpos
strripos,
and probably some others (all that use zend_memnstr/php_memnstr function)The speedup of zend_memnstr is about 8% on average (about 30% in case of
artificial strings).
Functions strrpos and strripos are about 25% faster on average.The only drawback - it needs some additional space (size of the needle).
All functions pass all the tests.
If it looks fine than I will apply for cvs account.
Cheers,
Michal
Hello,
Here: http://212.85.117.53/DIFF.txt is small patch that will speed
up following functions:
strpos,
stripos,
strrpos
strripos,
and probably some others (all that use zend_memnstr/php_memnstr
function)
The code below is definitely wrong.
The pointers (haystack_dup & needle_dup) that were freed in the previous version
are not freed anymore, but I can see that they are still allocated, so you have a memleak there.
- efree(haystack_dup);
- efree(needle_dup);
- RETURN_FALSE;
- /* actual search - note "p" is reused! */
- p = strrpos_reverse_kmp( p, needle_dup, needle_len, e);
- if ( p != NULL) {
-
/* if somethign was found - return offset wrt to haystack beginning */
-
RETURN_LONG( p - haystack_dup );
- } else {
-
/* if nothing found - return `FALSE` */
-
RETURN_FALSE;
- }
}
/* }}} */
@@ -2530,11 +2550,9 @@
--
Wbr,
Antony Dovgal
Hello,
Here: http://212.85.117.53/DIFF.txt is small patch that will
speed up following functions:
strpos,
stripos,
strrpos
strripos,
and probably some others (all that use zend_memnstr/php_memnstr
function)The code below is definitely wrong.
The pointers (haystack_dup & needle_dup) that were freed in the
previous version are not freed anymore, but I can see that they are
still allocated, so you have a memleak there.
Hi Antony,
Will correct it in a minute! and will be more careful:)
Cheers
Michal
Hello,
Here: http://212.85.117.53/DIFF.txt is small patch that will speed
up following functions:
strpos,
stripos,
strrpos
strripos,
and probably some others (all that use zend_memnstr/php_memnstr
function)
There is also another thing - zend_memnstr() cannot use emalloc() because
there are cases when it's called before Zend MM is initialized.
One of them can be seen with this:
php --rf test
--
Wbr,
Antony Dovgal
Hi,
So some comments:
- you have some problems with the indentation. We only use tabs, so please
stick to that. Also, there are some lines that are not indented correctly - Have you considered the Boyer-Moore algorithm? I think it's a little
faster than KMP (take a look at e.g.
http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/) - please remove the //TUTAJ SKONCZYLEM comment
- revert this change (as well as a few other that are similar):
- for (r_end = r + Z_STRLEN_P(return_value) - 1; r < r_end; ) {
+ for ( r_end = r + Z_STRLEN_P( return_value ) - 1; r < r_end; ){
(we like small diffs, not long diffs with changes that also break our coding
standards. e.g. we don't use space after the '(' char. Philip wrote a nice
article about diffs at http://wiki.php.net/doc/articles/whitespace) - in strrpos_reverse_kmp() I think you allocate 4 bytes less that you want
- I think you've too many comments.. We don't need 1 comment per line :)
After fixing all these points and after running the test suite (with
valgrind) and make sure there are no regressions, I think it's safe for you
to commit. Still, I would like to see some performance figures comparing the
KMP and the Boyer-Moore (or point me some papers about the subject).
Thanks for your work and good luck for the rest of the SoC project :)
Nuno
----- Original Message -----
From: "Michal Dziemianko" michal.dziemianko@gmail.com
To: internals@lists.php.net
Sent: Monday, June 09, 2008 12:39 PM
Subject: [PHP-DEV] Algorithm Optimizations - string search
Hello,
Here: http://212.85.117.53/DIFF.txt is small patch that will speed
up following functions:
strpos,
stripos,
strrpos
strripos,
and probably some others (all that use zend_memnstr/php_memnstr
function)The speedup of zend_memnstr is about 8% on average (about 30% in case
of artificial strings).
Functions strrpos and strripos are about 25% faster on average.The only drawback - it needs some additional space (size of the needle).
All functions pass all the tests.
If it looks fine than I will apply for cvs account.
Cheers,
Michal
There is rabin-karp too but its worse case is O(nm) so that might not be
ideal, perhaps we should try to compare all of them.
Scott
Nuno Lopes wrote:
Hi,
So some comments:
- you have some problems with the indentation. We only use tabs, so
please stick to that. Also, there are some lines that are not indented
correctly- Have you considered the Boyer-Moore algorithm? I think it's a little
faster than KMP (take a look at e.g.
http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/)- please remove the //TUTAJ SKONCZYLEM comment
- revert this change (as well as a few other that are similar):
- for (r_end = r + Z_STRLEN_P(return_value) - 1; r < r_end; ) {
+ for ( r_end = r + Z_STRLEN_P( return_value ) - 1; r < r_end; ){
(we like small diffs, not long diffs with changes that also break our
coding standards. e.g. we don't use space after the '(' char. Philip
wrote a nice article about diffs at
http://wiki.php.net/doc/articles/whitespace)- in strrpos_reverse_kmp() I think you allocate 4 bytes less that you want
- I think you've too many comments.. We don't need 1 comment per line :)
After fixing all these points and after running the test suite (with
valgrind) and make sure there are no regressions, I think it's safe for
you to commit. Still, I would like to see some performance figures
comparing the KMP and the Boyer-Moore (or point me some papers about the
subject).Thanks for your work and good luck for the rest of the SoC project :)
Nuno
----- Original Message ----- From: "Michal Dziemianko"
michal.dziemianko@gmail.com
To: internals@lists.php.net
Sent: Monday, June 09, 2008 12:39 PM
Subject: [PHP-DEV] Algorithm Optimizations - string searchHello,
Here: http://212.85.117.53/DIFF.txt is small patch that will speed
up following functions:
strpos,
stripos,
strrpos
strripos,
and probably some others (all that use zend_memnstr/php_memnstr
function)The speedup of zend_memnstr is about 8% on average (about 30% in case
of artificial strings).
Functions strrpos and strripos are about 25% faster on average.The only drawback - it needs some additional space (size of the needle).
All functions pass all the tests.
If it looks fine than I will apply for cvs account.
Cheers,
Michal
Hello
There is rabin-karp too but its worse case is O(nm) so that might
not be ideal, perhaps we should try to compare all of them.
OK. I think RK will not be much better than original piece of code
which runs in time o(n + m) i most cases (but has worst time o(nm))
Scott
Nuno Lopes wrote:
Hi,
So some comments:
- you have some problems with the indentation. We only use tabs,
so please stick to that. Also, there are some lines that are not
indented correctly- Have you considered the Boyer-Moore algorithm? I think it's a
little faster than KMP (take a look at e.g. http://
www.cs.utexas.edu/users/moore/best-ideas/string-searching/)
Yes, but it requires additional space (size of alphabet) and average
running time is 3n = o(n), while KMP runs in o(n+m) that makes them
actually the same on average (BM is faster in some cases of course).
As Scott asked if I can make it running on UNICODE it seems to be
quite important (the additional space I mean).
- please remove the //TUTAJ SKONCZYLEM comment
of course
- revert this change (as well as a few other that are similar):
- for (r_end = r + Z_STRLEN_P(return_value) - 1; r <
r_end; ) {
+ for ( r_end = r + Z_STRLEN_P( return_value ) - 1; r <
r_end; ){
(we like small diffs, not long diffs with changes that also break
our coding standards. e.g. we don't use space after the '(' char.
Philip wrote a nice article about diffs at http://wiki.php.net/doc/
articles/whitespace)
that was not intentional i think
- in strrpos_reverse_kmp() I think you allocate 4 bytes less that
you want
actually the reverse - in zend_memnstr I allocate 4 bytes too much;)
- I think you've too many comments.. We don't need 1 comment per
line :)
After fixing all these points and after running the test suite
(with valgrind) and make sure there are no regressions, I think
it's safe for you to commit. Still, I would like to see some
performance figures comparing the KMP and the Boyer-Moore (or
point me some papers about the subject).
Will do
Thanks for your work and good luck for the rest of the SoC project :)
Thanks,
Michal
Nuno
----- Original Message ----- From: "Michal Dziemianko"
michal.dziemianko@gmail.com
To: internals@lists.php.net
Sent: Monday, June 09, 2008 12:39 PM
Subject: [PHP-DEV] Algorithm Optimizations - string searchHello,
Here: http://212.85.117.53/DIFF.txt is small patch that will speed
up following functions:
strpos,
stripos,
strrpos
strripos,
and probably some others (all that use zend_memnstr/php_memnstr
function)The speedup of zend_memnstr is about 8% on average (about 30% in
case
of artificial strings).
Functions strrpos and strripos are about 25% faster on average.The only drawback - it needs some additional space (size of the
needle).All functions pass all the tests.
If it looks fine than I will apply for cvs account.
Cheers,
Michal
When I looked at the code, I assumed that it wasn't intended for international use
I'll have to go back and look to give you details, but it doesn't work for international use or unicode.
It would be ok for 8859-1.
-----Original Message-----
From: Scott MacVicar [mailto:scott@macvicar.net]
Sent: Monday, June 09, 2008 6:58 AM
To: Nuno Lopes
Cc: internals@lists.php.net; Michal Dziemianko
Subject: Re: [PHP-DEV] Algorithm Optimizations - string searchThere is rabin-karp too but its worse case is O(nm) so that
might not be ideal, perhaps we should try to compare all of them.Scott
Nuno Lopes wrote:
Hi,
So some comments:
- you have some problems with the indentation. We only use tabs, so
please stick to that. Also, there are some lines that are
not indented
correctly- Have you considered the Boyer-Moore algorithm? I think
it's a little
faster than KMP (take a look at e.g.
http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/)- please remove the //TUTAJ SKONCZYLEM comment
- revert this change (as well as a few other that are similar):
- for (r_end = r + Z_STRLEN_P(return_value) - 1; r <
r_end; ) {
+ for ( r_end = r + Z_STRLEN_P( return_value ) - 1; r
< r_end;
){ (we like small diffs, not long diffs with changes that
also break
our coding standards. e.g. we don't use space after the '(' char.
Philip wrote a nice article about diffs at
http://wiki.php.net/doc/articles/whitespace)- in strrpos_reverse_kmp() I think you allocate 4 bytes
less that you
want- I think you've too many comments.. We don't need 1
comment per line
:)After fixing all these points and after running the test suite (with
valgrind) and make sure there are no regressions, I think it's safe
for you to commit. Still, I would like to see some
performance figures
comparing the KMP and the Boyer-Moore (or point me some
papers about
the subject).Thanks for your work and good luck for the rest of the SoC
project :)Nuno
----- Original Message ----- From: "Michal Dziemianko"
michal.dziemianko@gmail.com
To: internals@lists.php.net
Sent: Monday, June 09, 2008 12:39 PM
Subject: [PHP-DEV] Algorithm Optimizations - string searchHello,
Here: http://212.85.117.53/DIFF.txt is small patch that
will speed
up following functions:
strpos,
stripos,
strrpos
strripos,
and probably some others (all that use zend_memnstr/php_memnstr
function)The speedup of zend_memnstr is about 8% on average (about
30% in case
of artificial strings).
Functions strrpos and strripos are about 25% faster on average.The only drawback - it needs some additional space (size
of the needle).All functions pass all the tests.
If it looks fine than I will apply for cvs account.
Cheers,
Michal--
To
unsubscribe, visit: http://www.php.net/unsub.php
Indeed it is not meant to work with UNICODE, but as an optimization for
PHP5. I tried a lot of examples in ISO-8859-2, and ISO-8859-10 as these are
the most interesting for me - and it worked fine. If it really causes
problems while original implementation works than details are appreciated.
Michal
When I looked at the code, I assumed that it wasn't intended for
international use
I'll have to go back and look to give you details, but it doesn't work for
international use or unicode.
It would be ok for 8859-1.-----Original Message-----
From: Scott MacVicar [mailto:scott@macvicar.net]
Sent: Monday, June 09, 2008 6:58 AM
To: Nuno Lopes
Cc: internals@lists.php.net; Michal Dziemianko
Subject: Re: [PHP-DEV] Algorithm Optimizations - string searchThere is rabin-karp too but its worse case is O(nm) so that
might not be ideal, perhaps we should try to compare all of them.Scott
Nuno Lopes wrote:
Hi,
So some comments:
- you have some problems with the indentation. We only use tabs, so
please stick to that. Also, there are some lines that are
not indented
correctly- Have you considered the Boyer-Moore algorithm? I think
it's a little
faster than KMP (take a look at e.g.
http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/)- please remove the //TUTAJ SKONCZYLEM comment
- revert this change (as well as a few other that are similar):
- for (r_end = r + Z_STRLEN_P(return_value) - 1; r <
r_end; ) {
+ for ( r_end = r + Z_STRLEN_P( return_value ) - 1; r
< r_end;
){ (we like small diffs, not long diffs with changes that
also break
our coding standards. e.g. we don't use space after the '(' char.
Philip wrote a nice article about diffs at
http://wiki.php.net/doc/articles/whitespace)- in strrpos_reverse_kmp() I think you allocate 4 bytes
less that you
want- I think you've too many comments.. We don't need 1
comment per line
:)After fixing all these points and after running the test suite (with
valgrind) and make sure there are no regressions, I think it's safe
for you to commit. Still, I would like to see some
performance figures
comparing the KMP and the Boyer-Moore (or point me some
papers about
the subject).Thanks for your work and good luck for the rest of the SoC
project :)Nuno
----- Original Message ----- From: "Michal Dziemianko"
michal.dziemianko@gmail.com
To: internals@lists.php.net
Sent: Monday, June 09, 2008 12:39 PM
Subject: [PHP-DEV] Algorithm Optimizations - string searchHello,
Here: http://212.85.117.53/DIFF.txt is small patch that
will speed
up following functions:
strpos,
stripos,
strrpos
strripos,
and probably some others (all that use zend_memnstr/php_memnstr
function)The speedup of zend_memnstr is about 8% on average (about
30% in case
of artificial strings).
Functions strrpos and strripos are about 25% faster on average.The only drawback - it needs some additional space (size
of the needle).All functions pass all the tests.
If it looks fine than I will apply for cvs account.
Cheers,
Michal--
To
unsubscribe, visit: http://www.php.net/unsub.php
--
I love MacOS X....................................................
Hi,
When I looked at the code, I assumed that it wasn't intended for international use
I'll have to go back and look to give you details, but it doesn't work for international use or unicode.
It would be ok for 8859-1.
That's the default case in PHP < 6, in current PHP versions all string
operations use on "binary" strings, so all references to offset work on
byte not character base. That's one of the main reasons for PHP 6.
johannes
Ok, well then the code needs to use internationalized functions for string upper and lower.
Operating on the first character of the string without surrounding context is incorrect.
Operating on the string without locale is also incorrect.
The string operations should use ICU.
Also, ICU uses boyer-moore I believe. (Or it did last time I looked.)
Some other issues as well, but I will have to look at the code.
I wasn't thinking utf-16, so you might also look at surrogates.
Are there guidelines for php coding, and proper support for utf-16?
-----Original Message-----
From: Johannes Schlüter [mailto:johannes@schlueters.de]
Sent: Wednesday, June 11, 2008 5:32 AM
To: Texin, Tex
Cc: Scott MacVicar; Nuno Lopes; internals@lists.php.net;
Michal Dziemianko
Subject: RE: [PHP-DEV] Algorithm Optimizations - string searchHi,
When I looked at the code, I assumed that it wasn't intended for
international use I'll have to go back and look to give you
details, but it doesn't work for international use or unicode.
It would be ok for 8859-1.That's the default case in PHP < 6, in current PHP versions
all string operations use on "binary" strings, so all
references to offset work on byte not character base. That's
one of the main reasons for PHP 6.johannes
Hello everybody,
I have done small comparison of the 3 algorithms: the original zend_memnstr
implementation, KMP, and Boyer-Moore (inefficient, "book" implementation).
The first part - real life data. I have collected the data about the
parameters passed to ZEND_MEMNSTR (not only while it is called from strpos)
on my own server. Data is collected over 1h ~ 2 000 000 calls. (if somebody
thinks how -> "the hard way" just fwrite()
added to zend_memnstr to save in
file data about the call)
Here are some statistics:
- average haystack length: 624.2
- average needle length: 1.9 ! -> 63% of needles of length 1
- avg length of haystacks shorter than avg: 41.0 -> 85% of all haystacks
- avg length of haystacks longer than avg: 5685.11
I did several tests - both "human readable" and artificial texts and here
are (some) results:
In case of searching of words in text (piece of book copied into test
bench):
For short haystacks (<1000) the original implementation is actually the best
one.
KMP is slower due to preprocessing time and some overhead in the loop.
BM is 10-30 times slower, thus not really applicable.
In case of artificial texts (like search of params in kind of config - taken
real life example from Joomla)
KMP and original implementation are almost the same (the latter a little bit
faster)
BM is still much slower
some results (times in seconds) - only 5 shown. If somebody wants to play
with it - I can post small program with implementation of all 3 algorithms)
ORIGINAL
test 1: 2.450000
test 2: 5.380000
test 3: 3.890000
test 4: 2.870000
test 5: 11.840000
KMP
test 1: 2.910000
test 2: 11.370000
test 3: 7.000000
test 4: 3.970000
test 5: 11.660000
BM
test 1: 35.030000
test 2: 35.460000
test 3: 8.570000
test 4: 30.500000
test 5: 31.860000
For longer text (<5000) KMP and original zend_memnstr are basically similar,
BM starts to show it is better, the real fun is when haystack is loner than
5000 chars - than suddenly BM starts to be really good.
Some examples for texts longer than 3000 chars
ORIGINAL
test 1: 0.110000 <- long text with, strpos = 0
test 2: 97.620000 <- really long text (~10000chars, match ~8000)
test 3:4400, 17.220000 <- match ~4400
test 4:0, 157.990000 <- degraded long text
test 5:0, 99.920000
KMP
test 1:0, 0.550000
test 2:8596, 36.670000
test 3:4400, 50.610000
test 4:0, 122.000000
test 5:0, 93.800000
BM
test 1:0 28.690000
test 2:8596, 23.230000
test 3:4400, 18.420000
test 4:0 38.890000
test 5:0, 157.920000 <- this one is strange!
As I haven't taken into account the information about haystack/needle KMP
seemed to be good choice. But now think about 63% of calls of ZEND_MEMNSTR
with needle_len = 1!! I think it is the main place for optimization.
Although strpos implements fix for that, some other functions don't. My idea
is than to implement ZEND_MEMNSTR once again in shape:
if (needle_len = 1)
here just linear sweep
else if haystack_len < 5000 (5000 is arbitrary - maybe some more tests
needed to choose good value)
original implementation (as it is the best one in this case)
else
BM/KMP (i think BM will be better in this case, as some people
suggested)
Then cleanup all the functions that use zend_memnstr, as they can be this
way simpler.
I did similar test for strripos - in this case KMP is good enough (the
original implementation ALWAYS run in time o(nm) - it is just memcmp at
eachposition no matter if the first chars match, thus is outperformed by KMP
even for short strings). Implementing BM here is not necessary in my opinion
- the match is usually quite close to the end of haystack (average ~80% of
haystack length). But fix for needle_len = 1 should be left as here also
many searches are with short needles.
Do you think it is good idea to rewrite it this way?
Cheers,
Michal
Hi!
Here are some statistics:
- average haystack length: 624.2
- average needle length: 1.9 ! -> 63% of needles of length 1
- avg length of haystacks shorter than avg: 41.0 -> 85% of all haystacks
- avg length of haystacks longer than avg: 5685.11
I think it would be interesting to see same excluding 1-char needles
since in this case it should do one-char lookup (btw, if we don't do it
on C level, it might be a good idea to).
Although strpos implements fix for that, some other functions don't. My idea
is than to implement ZEND_MEMNSTR once again in shape:if (needle_len = 1)
here just linear sweep
else if haystack_len < 5000 (5000 is arbitrary - maybe some more tests
needed to choose good value)
original implementation (as it is the best one in this case)
else
BM/KMP (i think BM will be better in this case, as some people
suggested)
I'm not sure very big haystacks really worth the trouble - how many of
them are used? It may be interesting to see medians instead of averages
for that. But len=1 I think worth having special case.
Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com
Hello again
I'm not sure very big haystacks really worth the trouble - how many of them
are used? It may be interesting to see medians instead of averages for that.
But len=1 I think worth having special case.
Here are some charts showing haystack and needle length distribution.
Haystack: http://212.85.117.53/haystack.png (shows only lengths up to 2000
-> haystacks longer are only 0.7%)
Needle: http://212.85.117.53/needle.png
For haystack median is 19, for needle..1.
As both haystack and needle are in most cases short the optimization should
go in this direction.
Now lets forget about the first patch I made (if I did this profiling before
I wouldn't have came up with this idea). And consider for now only the patch
for better execution in case of needle_len = 1
Here are some results of a comparison (10 000 000 iterations, first row is
original zend_memnstr implementation, second with patch) time in seconds.
match at 11:
0.72
0.43
match at 773
12.61
12.43
Here is diff for the patch:
--- Zend/zend_operators.h 19 Mar 2008 12:44:13 -0000
1.94.2.4.2.10.2.7
+++ Zend/zend_operators.h 11 Jun 2008 21:35:58 -0000
@@ -220,6 +220,10 @@
char *p = haystack;
char ne = needle[needle_len-1];
-
if (needle_len == 1){
-
return (char *)memchr(p, *needle, (end-p+1));
-
}
-
end -= needle_len; while (p <= end) {
To avoid any discussion about UNICODE etc: this is patch for PHP_5_3 -
function memchr is used in original implementation of zend_memnstr (followed
by memcmp) I will take a look into HEAD later today.
Nathan Nobbe has shown me this:
http://marc.info/?t=121254018200001&r=1&w=2
http://marc.info/?t=121254018200001&r=1&w=2(thx Nathan). I did a
small comparison of different implementations of
stripos - the original one, and with lowering each caracter separately just
before comparing (2 different versions). Here are some results (1st row is
original implementation, 2nd with separate lowering), time in seconds, each
stripos executed 1000000 times:
string length 31, match at 15
0.63
0.53
str len 31, match at 28
0.72
0.93
str len 18 match at 15
0.44
0.50
str len 18 match at 5
0.42
0.32
string len 575 match at 296
9.03
6.07
len 575 match at 570
10.59
11.92
len 1220 match at 400
19.06
10.76
len 1220 match at 1206
22.190000
24.680000
It shows that in case of short haystacks (assuming that there is a match,
and it is in first half of the haystack) implementation that does not lower
the whole haystack at the beginning is better. Even in case of match closely
to the end of string the difference is not that big (2 seconds in last test
shown). As said before - the median is 19...so maybe it is worth
considering.
All comments are highly appreciated,
regards
Michal
I;ve just noticed that the diff I posted is form my test-bed, not the actual
final version. the code should be:
if (needle_len == 1){
return (char *)memchr(p, *needle, (end-p));
}
instead of
-
if (needle_len == 1){
-
return (char *)memchr(p, *needle, (end-p+1)); <- this fails
for strpos("", NULL);
-
}
Christopher Jones wrote:
- Hi Michal,
*This is always a fun topic.
*I think your stripos tests should show the no-match performance case.
*Chris
I haven't said the implementation with lowering chars at the compare time
is better: I've just shown some comparison results. Of course it is slower
in no-match case - it is even slower when the match is closer to end of the
string (it is why I posted 2 different results for the same string length -
one for match in the middle, and one close to the end - which is actually
similar to no-match). I just said it MIGHT be worth considering. I will
analyse my logs to find out what is the average match position, to check if
it is really worth implementing.
Hello again
I have setup small page where I am publishing all the profiling and
evaluation results. It is here: http://83.168.205.202/~michal/
standard15/
So far I have put there function usage profile, zend_memnstr
analysis, stripos and strrpos analysis including some charts etc. CVS
diffs where applicable are also posted along with comparison of
original and patched code.
Michal
Hi!
Here are some statistics:
- average haystack length: 624.2
- average needle length: 1.9 ! -> 63% of needles of length 1
- avg length of haystacks shorter than avg: 41.0 -> 85% of all
haystacks- avg length of haystacks longer than avg: 5685.11
I think it would be interesting to see same excluding 1-char
needles since in this case it should do one-char lookup (btw, if we
don't do it on C level, it might be a good idea to).Although strpos implements fix for that, some other functions
don't. My idea
is than to implement ZEND_MEMNSTR once again in shape:
if (needle_len = 1)
here just linear sweep
else if haystack_len < 5000 (5000 is arbitrary - maybe some more
tests
needed to choose good value)
original implementation (as it is the best one in this case)
else
BM/KMP (i think BM will be better in this case, as some people
suggested)I'm not sure very big haystacks really worth the trouble - how many
of them are used? It may be interesting to see medians instead of
averages for that. But len=1 I think worth having special case.Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com
Hi,
Sorry for taking so long to answer, but I'm trying to catch up last stuff.
It's known that usually to optimize things for longer inputs you usually end
up making things for short inputs worst. So IMHO, I think you should have
the len==1 optimization and then use the KMP algorithm. Your implementation
can be further optimized (at least on 32 bits machines), but seems ok for
now.
I suggest you to produce a final patch, send it here, and then move on to
optimize other things (like strtr()
that I'm sure wikipedia guys will
appreciate).
Nuno
----- Original Message -----
Hello again
I have setup small page where I am publishing all the profiling and
evaluation results. It is here: http://83.168.205.202/~michal/ standard15/
So far I have put there function usage profile, zend_memnstr analysis,
stripos and strrpos analysis including some charts etc. CVS diffs where
applicable are also posted along with comparison of original and patched
code.
MichalHi!
Here are some statistics:
- average haystack length: 624.2
- average needle length: 1.9 ! -> 63% of needles of length 1
- avg length of haystacks shorter than avg: 41.0 -> 85% of all
haystacks- avg length of haystacks longer than avg: 5685.11
I think it would be interesting to see same excluding 1-char needles
since in this case it should do one-char lookup (btw, if we don't do it
on C level, it might be a good idea to).Although strpos implements fix for that, some other functions don't. My
idea
is than to implement ZEND_MEMNSTR once again in shape:
if (needle_len = 1)
here just linear sweep
else if haystack_len < 5000 (5000 is arbitrary - maybe some more tests
needed to choose good value)
original implementation (as it is the best one in this case)
else
BM/KMP (i think BM will be better in this case, as some people
suggested)I'm not sure very big haystacks really worth the trouble - how many of
them are used? It may be interesting to see medians instead of averages
for that. But len=1 I think worth having special case.Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com
Hi,
I am also sorry for delay - got ill recently and spend a day in bed
after night at emergency. I am working on other things now, and hope
to post some patches soon. I will create patch for zend_memnstr as
you suggest and post it here probably tomorrow. I have some ideas/
implementations/data already prepared for other functions (not only
string related) but haven't got time to publish it neither here, nor
on the page (http://212.85.117.53/gsoc/), but will try to do it till
Monday.
Michal
Hi,
Sorry for taking so long to answer, but I'm trying to catch up last
stuff.
It's known that usually to optimize things for longer inputs you
usually end up making things for short inputs worst. So IMHO, I
think you should have the len==1 optimization and then use the KMP
algorithm. Your implementation can be further optimized (at least
on 32 bits machines), but seems ok for now.
I suggest you to produce a final patch, send it here, and then move
on to optimize other things (likestrtr()
that I'm sure wikipedia
guys will appreciate).Nuno
----- Original Message -----
Hello again
I have setup small page where I am publishing all the profiling
and evaluation results. It is here: http://83.168.205.202/~michal/
standard15/
So far I have put there function usage profile, zend_memnstr
analysis, stripos and strrpos analysis including some charts etc.
CVS diffs where applicable are also posted along with comparison
of original and patched code.
MichalHi!
Here are some statistics:
- average haystack length: 624.2
- average needle length: 1.9 ! -> 63% of needles of length 1
- avg length of haystacks shorter than avg: 41.0 -> 85% of all
haystacks- avg length of haystacks longer than avg: 5685.11
I think it would be interesting to see same excluding 1-char
needles since in this case it should do one-char lookup (btw, if
we don't do it on C level, it might be a good idea to).Although strpos implements fix for that, some other functions
don't. My idea
is than to implement ZEND_MEMNSTR once again in shape:
if (needle_len = 1)
here just linear sweep
else if haystack_len < 5000 (5000 is arbitrary - maybe some
more tests
needed to choose good value)
original implementation (as it is the best one in this case)
else
BM/KMP (i think BM will be better in this case, as some people
suggested)I'm not sure very big haystacks really worth the trouble - how
many of them are used? It may be interesting to see medians
instead of averages for that. But len=1 I think worth having
special case.Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com
Hello,
Here goes first diff - to keep it simple and avoid confusion I will
post them one-by-one after the previous is accepted/rejected.
Optimization of STRRPOS/STRRIPOS for PHP_5_3.
2 things are changed:
-
implementation of search loop from theta(NM) to o(N), omega(NM) -
it is now the same as original zend_memnstr - using zend_memrchr
instead of memchr function. Here is comparison of performance: http://
212.85.117.53/gsoc/index.php?
option=com_content&view=article&id=48:strrpos-patch-
proposal&catid=36:patches&Itemid=56
This implementation is faster than original for any reasonable input
(a little slower for "aaab" in "aaaaaaaab", but that is really bad
case), and I think in this case we should not bother with KMP/BM as I
haven't found any call to this functions with haystack longer than
100 (NOTE: on MY server, over ~1h - this can be different for
wikipedia guys;))). In case somebody needs KMP - I have it
implemented to work "backwards" thus suitable for strrpos. -
second thing is pre-lowering the haystack in strripos in case of
needle len=1 - why is explained here: http://212.85.117.53/gsoc/
index.php?option=com_content&view=article&id=49:strripos-possible-
patches&catid=36:patches&Itemid=56
Here is the diff:
diff -u -d -r1.445.2.14.2.69.2.22 string.c
--- ext/standard/string.c 27 May 2008 10:29:33 -0000
1.445.2.14.2.69.2.22
+++ ext/standard/string.c 20 Jun 2008 08:08:34 -0000
@@ -1876,18 +1876,19 @@
if (needle_len == 1) {
/* Single character search can shortcut memcmps */
-
while (e >= p) {
-
if (*e == *needle) {
-
RETURN_LONG(e - p + (offset > 0 ?
offset : 0));
-
}
-
e--;
-
}
-
if ((e = (char *)zend_memrchr(p, *needle, e - p +
1)) != NULL) {
-
RETURN_LONG(e - p + (offset > 0 ? offset : 0));
-
} RETURN_FALSE; } while (e >= p) {
-
if (memcmp(e, needle, needle_len) == 0) {
-
RETURN_LONG(e - p + (offset > 0 ? offset : 0));
-
if ((e = (char *)zend_memrchr(p, *needle, e - p +
1)) != NULL) {
-
if (!memcmp(needle, e, needle_len)) {
-
RETURN_LONG(e - p + (offset > 0 ?
offset : 0));
-
}
-
} else {
-
RETURN_FALSE; } e--; }
@@ -1925,6 +1926,11 @@
if ((haystack_len == 0) || (needle_len == 0)) {
RETURN_FALSE;
}
-
haystack_dup = estrndup(haystack, haystack_len);
-
php_strtolower(haystack_dup, haystack_len);
-
needle_dup = estrndup(needle, needle_len);
-
php_strtolower(needle_dup, needle_len); if (needle_len == 1) { /* Single character search can shortcut memcmps
@@ -1932,34 +1938,36 @@
if (offset >= 0) {
if (offset > haystack_len) {
php_error_docref(NULL TSRMLS_CC,
E_NOTICE, "Offset is greater than the length of haystack string");
-
efree(needle_dup);
-
efree(haystack_dup); RETURN_FALSE; }
-
p = haystack + offset;
-
e = haystack + haystack_len - 1;
-
p = haystack_dup + offset;
-
e = haystack_dup + haystack_len - 1; } else {
-
p = haystack;
-
p = haystack_dup; if (-offset > haystack_len || offset < -
INT_MAX) {
php_error_docref(NULL TSRMLS_CC,
E_NOTICE, "Offset is greater than the length of haystack string");
-
efree(needle_dup);
-
efree(haystack_dup); RETURN_FALSE; }
-
e = haystack + haystack_len + offset;
-
e = haystack_dup + haystack_len + offset; }
-
/* Borrow that ord_needle buffer to avoid repeatedly
tolower()ing needle */
-
*ord_needle = tolower(*needle);
-
while (e >= p) {
-
if (tolower(*e) == *ord_needle) {
-
if (*e == *needle_dup) {
-
efree(needle_dup);
-
efree(haystack_dup); RETURN_LONG(e - p + (offset > 0 ?
offset : 0));
}
e--;
}
-
efree(needle_dup);
-
efree(haystack_dup); RETURN_FALSE; }
-
needle_dup = estrndup(needle, needle_len);
-
php_strtolower(needle_dup, needle_len);
-
haystack_dup = estrndup(haystack, haystack_len);
-
php_strtolower(haystack_dup, haystack_len);
-
if (offset >= 0) { if (offset > haystack_len) { efree(needle_dup);
@@ -1985,11 +1993,13 @@
}
while (e >= p) {
-
if (memcmp(e, needle_dup, needle_len) == 0) {
-
efree(haystack_dup);
-
efree(needle_dup);
-
RETURN_LONG(e - p + (offset > 0 ? offset : 0));
-
}
-
if ((e = (char *)zend_memrchr(p, *needle, e - p +
1)) != NULL) {
-
if (!memcmp(needle, e, needle_len)) {
-
efree(haystack_dup);
-
efree(needle_dup);
-
RETURN_LONG(e - p + (offset > 0 ?
offset : 0));
-
}
-
} e--; }
Tested and PASS. Valgrind does not show any memory leaks, but if
somebody more competent can confirm it, than I will really appreciate
it.
Cheers,
Michal
PS If somebody can take a look at that: http://212.85.117.53/gsoc/
index.php?option=com_content&view=article&id=52:small-
buginconsistence&catid=38:other&Itemid=58 it would be also great. If
it should be the same in both functions than I have patch and test
cases for it prepared.
Hi,
Both patches seem ok to me. Please ask for a CVS account and prepare
yourself to commit (when your mentor says so) :)
In the second patch maybe you could use zend_memrchr() as well, but you
should get the same results. Also I think that bundling the backwards KMP
would be a nice addition. People use PHP for other things than regular web
scripting..
Nuno
----- Original Message -----
From: "Michal Dziemianko" michal.dziemianko@gmail.com
To: internals@lists.php.net
Sent: Friday, June 20, 2008 9:36 AM
Subject: Re: [PHP-DEV] Algorithm Optimizations - string search
Hello,
Here goes first diff - to keep it simple and avoid confusion I will
post them one-by-one after the previous is accepted/rejected.
Optimization of STRRPOS/STRRIPOS for PHP_5_3.2 things are changed:
implementation of search loop from theta(NM) to o(N), omega(NM) -
it is now the same as original zend_memnstr - using zend_memrchr
instead of memchr function. Here is comparison of performance: http://
212.85.117.53/gsoc/index.php?
option=com_content&view=article&id=48:strrpos-patch-
proposal&catid=36:patches&Itemid=56
This implementation is faster than original for any reasonable input
(a little slower for "aaab" in "aaaaaaaab", but that is really bad
case), and I think in this case we should not bother with KMP/BM as I
haven't found any call to this functions with haystack longer than
100 (NOTE: on MY server, over ~1h - this can be different for
wikipedia guys;))). In case somebody needs KMP - I have it
implemented to work "backwards" thus suitable for strrpos.second thing is pre-lowering the haystack in strripos in case of
needle len=1 - why is explained here: http://212.85.117.53/gsoc/
index.php?option=com_content&view=article&id=49:strripos-possible-
patches&catid=36:patches&Itemid=56Here is the diff:
diff -u -d -r1.445.2.14.2.69.2.22 string.c
--- ext/standard/string.c 27 May 2008 10:29:33 -0000
1.445.2.14.2.69.2.22
+++ ext/standard/string.c 20 Jun 2008 08:08:34 -0000
@@ -1876,18 +1876,19 @@if (needle_len == 1) { /* Single character search can shortcut memcmps */
while (e >= p) {
if (*e == *needle) {
RETURN_LONG(e - p + (offset > 0 ?
offset : 0));
}
e--;
}
if ((e = (char *)zend_memrchr(p, *needle, e - p +
1)) != NULL) {
RETURN_LONG(e - p + (offset > 0 ? offset : 0));
} RETURN_FALSE; } while (e >= p) {
if (memcmp(e, needle, needle_len) == 0) {
RETURN_LONG(e - p + (offset > 0 ? offset : 0));
if ((e = (char *)zend_memrchr(p, *needle, e - p +
1)) != NULL) {
if (!memcmp(needle, e, needle_len)) {
RETURN_LONG(e - p + (offset > 0 ?
offset : 0));
}
} else {
RETURN_FALSE; } e--; }
@@ -1925,6 +1926,11 @@
if ((haystack_len == 0) || (needle_len == 0)) {
RETURN_FALSE;
}
haystack_dup = estrndup(haystack, haystack_len);
php_strtolower(haystack_dup, haystack_len);
needle_dup = estrndup(needle, needle_len);
php_strtolower(needle_dup, needle_len); if (needle_len == 1) { /* Single character search can shortcut memcmps
@@ -1932,34 +1938,36 @@
if (offset >= 0) {
if (offset > haystack_len) {
php_error_docref(NULL TSRMLS_CC,
E_NOTICE, "Offset is greater than the length of haystack string");
efree(needle_dup);
efree(haystack_dup); RETURN_FALSE; }
p = haystack + offset;
e = haystack + haystack_len - 1;
p = haystack_dup + offset;
e = haystack_dup + haystack_len - 1; } else {
p = haystack;
p = haystack_dup; if (-offset > haystack_len || offset < -
INT_MAX) {
php_error_docref(NULL TSRMLS_CC,
E_NOTICE, "Offset is greater than the length of haystack string");
efree(needle_dup);
efree(haystack_dup); RETURN_FALSE; }
e = haystack + haystack_len + offset;
e = haystack_dup + haystack_len + offset; }
/* Borrow that ord_needle buffer to avoid repeatedly
tolower()ing needle */
*ord_needle = tolower(*needle);
while (e >= p) {
if (tolower(*e) == *ord_needle) {
if (*e == *needle_dup) {
efree(needle_dup);
efree(haystack_dup); RETURN_LONG(e - p + (offset > 0 ?
offset : 0));
}
e--;
}
efree(needle_dup);
efree(haystack_dup); RETURN_FALSE; }
needle_dup = estrndup(needle, needle_len);
php_strtolower(needle_dup, needle_len);
haystack_dup = estrndup(haystack, haystack_len);
php_strtolower(haystack_dup, haystack_len);
if (offset >= 0) { if (offset > haystack_len) { efree(needle_dup);
@@ -1985,11 +1993,13 @@
}while (e >= p) {
if (memcmp(e, needle_dup, needle_len) == 0) {
efree(haystack_dup);
efree(needle_dup);
RETURN_LONG(e - p + (offset > 0 ? offset : 0));
}
if ((e = (char *)zend_memrchr(p, *needle, e - p +
1)) != NULL) {
if (!memcmp(needle, e, needle_len)) {
efree(haystack_dup);
efree(needle_dup);
RETURN_LONG(e - p + (offset > 0 ?
offset : 0));
}
} e--; }
Tested and PASS. Valgrind does not show any memory leaks, but if
somebody more competent can confirm it, than I will really appreciate
it.
Cheers,
MichalPS If somebody can take a look at that: http://212.85.117.53/gsoc/
index.php?option=com_content&view=article&id=52:small-
buginconsistence&catid=38:other&Itemid=58 it would be also great. If
it should be the same in both functions than I have patch and test
cases for it prepared.
Hello,
I was again in bed for few days:/// Hope this time I have fully
recovered:/
Anyways the patch for today is for auxiliary php_charmask function
used by trim, str_word_count and addcslashes. It is basically just
bunch of if-statements. But it can sped up by reordering them.
Original needs 4+3 tests executed in case of single char, and 4 in
case of range. Optimized needs only 1 test for single char + 3 for
range. Maybe not a big improvement, but it is way faster for chars,
and about the same speed for ranges. Here are some evaluation results
(the results for MOD1):
http://212.85.117.53/gsoc/index.php?
option=com_content&view=article&id=62:phpmask-possible-
reimplementation&catid=36:patches&Itemid=56
The diff is below.
Cheers
Michal
Index: ext/standard/string.c
RCS file: /repository/php-src/ext/standard/string.c,v
retrieving revision 1.445.2.14.2.69.2.29
diff -u -d -r1.445.2.14.2.69.2.29 string.c
--- ext/standard/string.c 3 Jul 2008 14:00:20 -0000 1.445.2.14.2.69.2.29
+++ ext/standard/string.c 8 Jul 2008 20:27:50 -0000
@@ -663,36 +663,38 @@
memset(mask, 0, 256);
for (end = input+len; input < end; input++) {
c=*input;
-
if ((input+3 < end) && input[1] == '.' && input[2] == '.'
-
&& input[3] >= c) {
-
memset(mask+c, 1, input[3] - c + 1);
-
input+=3;
-
} else if ((input+1 < end) && input[0] == '.' && input[1] == '.') {
-
/* Error, try to be as helpful as possible:
-
(a range ending/starting with '.' won't be captured here) */
-
if (end-len >= input) { /* there was no 'left' char */
-
php_error_docref(NULL TSRMLS_CC, E_WARNING, "Invalid '..'-range,
no character to the left of '..'");
-
result = FAILURE;
-
continue;
-
}
-
if (input+2 >= end) { /* there is no 'right' char */
-
php_error_docref(NULL TSRMLS_CC, E_WARNING, "Invalid '..'-range,
no character to the right of '..'");
-
if (input[1] != '.') {
-
mask[c] = 1;
-
} else {
-
if ((input+3 < end) && input[2] == '.' && input[3] >= c) {
-
memset(mask+c, 1, input[3] - c + 1);
-
input+=3;
-
} else if ((input+1 < end) && input[0] == '.' && input[1] == '.') {
-
if (end-len >= input) { /* there was no 'left' char */
-
php_error_docref(NULL TSRMLS_CC, E_WARNING, "Invalid '..'-
range, no character to the left of '..'");
-
result = FAILURE;
-
continue;
-
}
-
if (input+2 >= end) { /* there is no 'right' char */
-
php_error_docref(NULL TSRMLS_CC, E_WARNING, "Invalid '..'-
range, no character to the right of '..'");
-
result = FAILURE;
-
continue;
-
}
-
if (input[-1] > input[2]) { /* wrong order */
-
php_error_docref(NULL TSRMLS_CC, E_WARNING, "Invalid '..'-
range, '..'-range needs to be incrementing");
-
result = FAILURE;
-
continue;
-
}
-
/* FIXME: better error (a..b..c is the only left possibility?) */
-
php_error_docref(NULL TSRMLS_CC, E_WARNING, "Invalid '..'-range"); result = FAILURE; continue;
-
} else {
-
mask[c] = 1; }
-
if (input[-1] > input[2]) { /* wrong order */
-
php_error_docref(NULL TSRMLS_CC, E_WARNING, "Invalid '..'-range,
'..'-range needs to be incrementing");
-
result = FAILURE;
-
continue;
-
}
-
/* FIXME: better error (a..b..c is the only left possibility?) */
-
php_error_docref(NULL TSRMLS_CC, E_WARNING, "Invalid '..'-range");
-
result = FAILURE;
-
continue;
-
} else {
-
mask[c]=1; }
- }
- }
- return result;
}
/* }}} */