Unicode string iterator performance

19 years ago by Andrei Zmievski

You probably saw that I have committed an initial implementation of
TextIterator. The impetus for this is that direct indexing of Unicode
strings via the [] operator is slow, very slow, at least currently. The
reason is that [] cannot simply perform random-offset indexing into
UChar* strings. It needs to start from the beginning of the string and
iterate forward until it reaches the desired offset, because our
default unit is a codepoint, which can take up 1 or 2 UChars.
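
To make that cost concrete, here is a minimal sketch in present-day PHP. It is not the engine's C code: it assumes the mbstring extension and a UTF-8 source file, and the helper names utf16_units() and unit_offset_of_codepoint() are made up for illustration.

```php
<?php
// Illustrative only: hypothetical helpers, not part of PHP or ICU.

/** Decode a UTF-8 string into a list of UTF-16 code units (integers). */
function utf16_units(string $utf8): array
{
    $utf16 = mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8');
    return $utf16 === '' ? [] : array_values(unpack('n*', $utf16));
}

/**
 * Find the code unit offset at which code point number $index starts.
 * There is no way to jump there directly: every earlier code point has
 * to be inspected to learn whether it occupies one unit or two.
 */
function unit_offset_of_codepoint(array $units, int $index): int
{
    if ($index < 0) {
        throw new OutOfRangeException("Negative index $index");
    }
    $offset = 0;
    for ($cp = 0; $cp < $index; $cp++) {
        if (!isset($units[$offset])) {
            throw new OutOfRangeException("No code point at index $index");
        }
        // A lead (high) surrogate starts a two-unit code point.
        $isLead = $units[$offset] >= 0xD800 && $units[$offset] <= 0xDBFF;
        $offset += $isLead ? 2 : 1;
    }
    if (!isset($units[$offset])) {
        throw new OutOfRangeException("No code point at index $index");
    }
    return $offset;
}

// 'a', U+10201 (outside the BMP), 'b', 'c', 'ß': 5 code points, 6 code units.
$units = utf16_units("a\u{10201}bcß");
printf("%d code units\n", count($units));
printf("'b' starts at unit %d\n", unit_offset_of_codepoint($units, 2));
```

Because each lookup restarts at offset 0, a loop that indexes every position of an n-character string this way does work proportional to n squared, while an iterator only ever moves forward by one code point.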

So here are some (rough) numbers on the relative performance of
TextIterator vs. []. The script I used was a simple one (attached after
the signature). Each test was 10000 runs over a 500-character string.

[] operator: 27.16373 s
TextIterator: 1.89697 s (!)

For comparison, running the same [] operator test on a 500-character
binary (old-style) string gives me 9.11334 s. Quite interesting, I'd
say.

I am not sure how we can optimize [] to be faster than the iterator
approach. Food for thought?

Andrei

<?php
$a = str_repeat('a\U010201bcß', 100);
var_dump($a);

/* warm up the engine */
for ($x = 0; $x < 100; $x++) {
    foreach (new TextIterator($a) as $c) {
    }
}

/* measure [] */
$start = microtime(true);
for ($x = 0; $x < 10000; $x++) {
    $len = strlen($a);
    for ($i = 0; $i < $len; $i++) {
        $c = $a[$i];
    }
}
$end = microtime(true);

printf("[] run time: %.5f\n", $end - $start);

/* measure iterator */
$start = microtime(true);
for ($x = 0; $x < 10000; $x++) {
    foreach (new TextIterator($a) as $c) {
    }
}
$end = microtime(true);

printf("iterator run time: %.5f\n", $end - $start);
?>

19 years ago by Andrei Zmievski

For yet another comparison, the [] operator test under PHP 4 gives
7.24410 s.

Andrei

19 years ago by Christian Schneider

Andrei Zmievski wrote:

> I am not sure how we can optimize [] to be faster than the iterator
> approach. Food for thought?

You could cache the last position (PHP- and Unicode string index) and
start from there. This assumes that most accesses are (more or less)
sequential. If you can step backward as well as forward you could use
the cached version for both directions but even if you can only go
forward it would cover the most common case I guess.

Very simple idea but maybe it helps,

Chris

19 years ago by Andrei Zmievski

Cache it where? In the zval or the opcode? What if the string changes?
How do you detect that and invalidate the cached position?

-Andrei

19 years ago by Marcus Boerger

Hello Christian,

Caching? There is nothing to cache. And even if we did that, we would
have to make every string an object, since we would need to invalidate the
position cache on write operations. Also, I agree with the others that the
most common usage would be accessing a few chars and probably changing them.

And I never had code where I used the same position twice, besides the
all-time favorite search for backslash and forward slash. But that can be
done better using the right search functions anyway.

Also, looking for backslashes and changing them to forward slashes can be
done with iterators. Checking whether the second char is a ':' (a common use
case under Windows) is best done with [], but that's a one-time read access.

The only place where I still see caching having an optimization effect is
sequential scanning. But for all of that, iterators and functions are much
better.

So I am convinced that the cache would only blow up the code, make everything
much more complex, and in the end slow down PHP.

best regards
marcus

Friday, February 3, 2006, 2:19:27 AM, you wrote:

> Andrei Zmievski wrote:
>
>> I am not sure how we can optimize [] to be faster than the iterator
>> approach. Food for thought?
>
> You could cache the last position (PHP- and Unicode string index) and
> start from there. This assumes that most accesses are (more or less)
> sequential. If you can step backward as well as forward you could use
> the cached version for both directions but even if you can only go
> forward it would cover the most common case I guess.
>
> Very simple idea but maybe it helps,
>
> Chris

19 years ago by Christian Schneider

First of all I was simply proposing a very generic concept without
bothering about the implementation on purpose. If it's not feasible then
simply ignore it.

Marcus Boerger wrote:

> caching? There is nothing to cache. And even if we would do that we would
> make every string an object since we would need to invalidate the position
> cache on write operations. Also i agree with the others that most common

Tracking changes to the string could be tricky, agreed. I don't know
enough about the internal handling of strings. From the user perspective
PHP strings look somewhat immutable, but that could be very wrong
internally; you know better than me. Changing Unicode strings in place
sounds kinda tricky to me too, so I'd have expected that to be
encapsulated somewhere.

> And I never had code where I used the same position twice. Besides the all

You don't need to access the exact same position. If you know the last
array index plus the Unicode offset, then you can step by Unicode
characters from there, which would result in one single Unicode step per
access when iterating over a string. But it would also work for
$a[$i += 2], as opposed to the originally proposed TextIterator. And if
there's a way to step backwards, then $a[$i -= 2] could work too.
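
A rough sketch of that idea in present-day PHP follows. It sidesteps the invalidation question by wrapping an immutable string, which is an assumption of this example and not something the engine guarantees. CachedIndexString is a hypothetical class name; the code assumes the mbstring extension, a UTF-8 source file, and well-formed input.

```php
<?php
// Illustrative only: CachedIndexString is a hypothetical class, not a PHP API.
final class CachedIndexString
{
    /** @var int[] UTF-16 code units of the (immutable) string */
    private array $units;
    /** Code point index of the most recent access. */
    private int $cachedIndex = 0;
    /** Code unit offset at which that code point starts. */
    private int $cachedOffset = 0;
    /** Number of single code point steps taken so far (for demonstration). */
    public int $steps = 0;

    public function __construct(string $utf8)
    {
        $utf16 = mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8');
        $this->units = $utf16 === '' ? [] : array_values(unpack('n*', $utf16));
    }

    /** Return the code point at $index as a UTF-8 string. */
    public function at(int $index): string
    {
        if ($index < 0) {
            throw new OutOfRangeException("Negative index $index");
        }
        $count = count($this->units);
        // Step forward from the cached position, one code point at a time.
        while ($this->cachedIndex < $index && $this->cachedOffset < $count) {
            $this->cachedOffset += $this->isLead($this->units[$this->cachedOffset]) ? 2 : 1;
            $this->cachedIndex++;
            $this->steps++;
        }
        // Step backward from the cached position.
        while ($this->cachedIndex > $index) {
            $this->cachedOffset--;
            if ($this->isTrail($this->units[$this->cachedOffset])) {
                $this->cachedOffset--; // also skip back over the lead surrogate
            }
            $this->cachedIndex--;
            $this->steps++;
        }
        if ($this->cachedOffset >= $count) {
            throw new OutOfRangeException("No code point at index $index");
        }
        $len = $this->isLead($this->units[$this->cachedOffset]) ? 2 : 1;
        $slice = array_slice($this->units, $this->cachedOffset, $len);
        return mb_convert_encoding(pack('n*', ...$slice), 'UTF-8', 'UTF-16BE');
    }

    private function isLead(int $unit): bool
    {
        return $unit >= 0xD800 && $unit <= 0xDBFF;
    }

    private function isTrail(int $unit): bool
    {
        return $unit >= 0xDC00 && $unit <= 0xDFFF;
    }
}

// 'a', U+10201, 'b', 'c', 'ß': a sequential scan costs one step per access.
$s = new CachedIndexString("a\u{10201}bcß");
for ($i = 0; $i < 5; $i++) {
    $s->at($i);
}
echo $s->steps, "\n";  // 4
echo $s->at(2), "\n";  // b (two steps back from index 4)
```

The cost of an access becomes the distance from the previous access instead of the distance from the start of the string, so sequential and strided loops stay cheap while a truly random access pattern gains little.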

> So i am convinced that the cache would only blow up the code, make everything
> much more complex and in the end slow down php.

Could well be. It was just an idea, feel free to ignore it ;-)

Chris

19 years ago by Andrei Zmievski

> The real test however would be random character access, rather than
> sequential scans from start to end :-).

How often do you access random characters in a string vs. sequential
scans? Which is the more likely scenario in PHP scripts? I think it's
the latter.

-Andrei

19 years ago by Stefan Walk

I think the most common usage would be grabbing one or a few characters
and then going to do something else... if that happens a lot, it will
look more like "random" string access than sequential scans.

Regards,
Stefan

19 years ago by Andrei Zmievski

That's still sequential, and not random access.

-Andrei

19 years ago by Xuefer

> Hi Andrei,
>
> Pardon me for my ignorance, as I have not even looked at the Unicode
> stuff, but based on what you wrote, what about always allocating two
> UChars per codepoint? It would take a bit more space, but then
> random-offset indexing is fast and easy (the codepoint would always
> start at "index << 1").

What you describe is UCS-2, not UTF-16.
I know little about ICU either, but for those who are not familiar:
according to the ICU manual, UCS-2 is a subset of UTF-16 and is
deprecated. UTF-32 (UCS-4?) takes more memory, which "kills
performance" through memory bandwidth. There is also a list of
advantages of UTF-16 over UTF-8 (in the ICU manual). These reasons are
why ICU uses UTF-16 all the way.

I see no advantage of UTF-16 over UTF-8 on the problem of random string
offset access, because both take a variable number of code units per
code point.
Both UCS-2 and UCS-4 are good for random access.
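
As a small illustration of that trade-off in present-day PHP: with a fixed-width encoding the position of a code point is a plain multiplication, at the price of more bytes. UTF-32 is used here because UCS-2 cannot represent U+10201 at all. The helper name codepoint_at_utf32() is hypothetical, and the sketch assumes the mbstring extension and a UTF-8 source file.

```php
<?php
// Illustrative only: codepoint_at_utf32() is a hypothetical helper.

/** Return code point number $index from a UTF-32BE byte string, as UTF-8. */
function codepoint_at_utf32(string $utf32be, int $index): string
{
    if ($index < 0 || ($index + 1) * 4 > strlen($utf32be)) {
        throw new OutOfRangeException("No code point at index $index");
    }
    // Fixed width: the byte offset is a multiplication, no scanning needed.
    return mb_convert_encoding(substr($utf32be, $index * 4, 4), 'UTF-8', 'UTF-32BE');
}

$text  = "a\u{10201}bcß";
$utf16 = mb_convert_encoding($text, 'UTF-16BE', 'UTF-8');
$utf32 = mb_convert_encoding($text, 'UTF-32BE', 'UTF-8');

// Same five code points, different storage cost.
printf("UTF-16: %d bytes, UTF-32: %d bytes\n", strlen($utf16), strlen($utf32));
echo codepoint_at_utf32($utf32, 2), "\n";
```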

IMHO, while you guys are solving the problem in some way that is not
zero-cost, it would be nice to have a compile-time mode that uses UCS-2
instead of UTF-16 for those who care about performance, something like
php-src/configure --with-icu-encoding=UCS-2 (defaulting to UTF-16). (The
code would have to be aware of whether UCS-2 is in use.) Code points
outside the BMP aren't useful in all cases.