You probably saw that I have committed initial implementation of
TextIterator. The impetus for this is that direct indexing of Unicode
strings via [] operator is slow, very slow, at least currently. The
reason is that [] cannot simply perform random-offset indexing into
UCHar* strings. It needs to start from the beginning of the string and
iterate forward until it reaches the desired offset, because our
default unit is a codepoint, which can take up 1 or 2 UChar's.
So here are some (rough) numbers on the relative performance of
TextIterator vs. []. The script I used was a simple one (attached after
the signature). Each test was 10000 runs over 500-character string.
[] operator: 27.16373 s
TextIterator: 1.89697 s (!)
For comparison, running the same [] operator test on a 500-character
binary (old-style) string gives me 9.11334 s. Quite interesting, I'd
say.
I am not sure how we can optimize [] to be faster than the iterator
approach. Food for thought?
- Andrei
<?php
$a = str_repeat('a\U010201bcß', 100);
var_dump($a);
/* warm up the engine */
for ($x = 0; $x < 100; $x++) {
foreach (new TextIterator($a) as $c) {
}
}
/* measure [] */
$start = microtime(true);
for ($x = 0; $x < 10000; $x++) {
$len = strlen($a);
for ($i = 0; $i < $len; $i++) {
$c = $a[$i];
}
}
$end = microtime(true);
printf("[] run time: %.5f\n", $end - $start);
/* measure iterator */
$start = microtime(true);
for ($x = 0; $x < 10000; $x++) {
foreach (new TextIterator($a) as $c) {
}
}
$end = microtime(true);
printf("iterator run time: %.5f\n", $end - $start);
?
For yet another comparison, the [] operator test under PHP 4 gives
7.24410 s.
- Andrei
You probably saw that I have committed initial implementation of
TextIterator. The impetus for this is that direct indexing of Unicode
strings via [] operator is slow, very slow, at least currently. The
reason is that [] cannot simply perform random-offset indexing into
UCHar* strings. It needs to start from the beginning of the string and
iterate forward until it reaches the desired offset, because our
default unit is a codepoint, which can take up 1 or 2 UChar's.So here are some (rough) numbers on the relative performance of
TextIterator vs. []. The script I used was a simple one (attached
after the signature). Each test was 10000 runs over 500-character
string.[] operator: 27.16373 s
TextIterator: 1.89697 s (!)For comparison, running the same [] operator test on a 500-character
binary (old-style) string gives me 9.11334 s. Quite interesting, I'd
say.I am not sure how we can optimize [] to be faster than the iterator
approach. Food for thought?
- Andrei
<?php
$a = str_repeat('a\U010201bcß', 100);
var_dump($a);/* warm up the engine */
for ($x = 0; $x < 100; $x++) {
foreach (new TextIterator($a) as $c) {
}
}/* measure [] */
$start = microtime(true);
for ($x = 0; $x < 10000; $x++) {
$len = strlen($a);
for ($i = 0; $i < $len; $i++) {
$c = $a[$i];
}
}
$end = microtime(true);printf("[] run time: %.5f\n", $end - $start);
/* measure iterator */
$start = microtime(true);
for ($x = 0; $x < 10000; $x++) {
foreach (new TextIterator($a) as $c) {
}
}
$end = microtime(true);printf("iterator run time: %.5f\n", $end - $start);
?
Andrei Zmievski wrote:
I am not sure how we can optimize [] to be faster than the iterator
approach. Food for thought?
You could cache the last position (PHP- and Unicode string index) and
start from there. This assumes that most accesses are (more or less)
sequential. If you can step backward as well as forward you could use
the cached version for both directions but even if you can only go
forward it would cover the most common case I guess.
Very simple idea but maybe it helps,
- Chris
Cache it where? In the zval or the opcode? What if the string changes?
How do you detect that and invalidate the cached position?
-Andrei
You could cache the last position (PHP- and Unicode string index) and
start from there. This assumes that most accesses are (more or less)
sequential. If you can step backward as well as forward you could use
the cached version for both directions but even if you can only go
forward it would cover the most common case I guess.Very simple idea but maybe it helps,
- Chris
Hello Christian,
caching? There is nothing to cache. And even if we would do that we would
make every string an object since we would need to invalidate the position
cache on write operations. Also i agree with the others that most common
usage would be accessing a few chars probably changing them.
And I never had code where I used the same position twice. Besides the all
time favorite search for backlsash and forward slash. But that can be done
better using the right search functions anyway.
Also looking for backslashes and changing them to forward slashes can be done
with iterators. Then checking if the second char is a ':' (common usecase
under windows) is best done with [], but that's a one time read access.
The place caching and its optimization effect i see left is sequential
scanning. But for all of that iterators and functions are much better.
So i am convinced that the cache would only blow up the code, make everything
much more complex and in the end slow down php.
best regards
marcus
Friday, February 3, 2006, 2:19:27 AM, you wrote:
Andrei Zmievski wrote:
I am not sure how we can optimize [] to be faster than the iterator
approach. Food for thought?
You could cache the last position (PHP- and Unicode string index) and
start from there. This assumes that most accesses are (more or less)
sequential. If you can step backward as well as forward you could use
the cached version for both directions but even if you can only go
forward it would cover the most common case I guess.
Very simple idea but maybe it helps,
- Chris
Best regards,
Marcus
First of all I was simply proposing a very generic concept without
bothering about the implementation on purpose. If it's not feasible then
simply ignore it.
Marcus Boerger wrote:
caching? There is nothing to cache. And even if we would do that we would
make every string an object since we would need to invalidate the position
cache on write operations. Also i agree with the others that most common
Tracking changes to the string could be tricky, agreed. I don't know
enough about the internal handling of strings, from the user perspective
PHP strings look somewhat immutable but that could be very wrong
internally, you know better than me. Changing Unicode strings in place
sounds kinda tricky to me too so I'd have expected that to the
encapsulated somewhere.
And I never had code where I used the same position twice. Besides the all
You don't need to access the exact same position. If you know the last
array index plus the Unicode offset then you can step by Unicode
characters from there which would result to one single Unicode step for
iterating over a string. But would also work for $a[$i += 2] as opposed
to the originally proposed TextIterator. And if there's a way to step
backwards then $a[$i -= 2] could work too.
So i am convinced that the cache would only blow up the code, make everything
much more complex and in the end slow down php.
Could well be. It was just an idea, feel free to ignore it ;-)
- Chris