Reasonability of changing the mbfl library

6 years ago by Legale Legage

Hello, internals!

While I was working on a new function mb_str_split
(https://wiki.php.net/rfc/mb_str_split) for the extension mbstring, I
noticed a place to seriously improve the mbfl library performance for
the utf-16 encoding.
Currently, all variable-length encodings are processed byte-by-byte.

for(int i = 0; i < string_length; ++i){
.......
}

utf-8 strings are processed with a precomputed character-length table.

while (i < string_length) {
int m = mbtab[*p];
i += m;
.....
}
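For readers unfamiliar with the technique, here is a minimal sketch of table-driven stepping through a utf-8 string. It is not the mbfl code: the names utf8_len_table, init_utf8_len_table and utf8_count_chars are made up for illustration, and invalid lead bytes are simply treated as one-byte characters so that the loop always advances.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table: byte length of a utf-8 sequence, indexed by lead byte. */
static unsigned char utf8_len_table[256];

/* Fill the table once before use. */
static void init_utf8_len_table(void)
{
    for (int b = 0; b < 256; ++b) {
        if (b < 0x80)      utf8_len_table[b] = 1; /* ASCII */
        else if (b < 0xC0) utf8_len_table[b] = 1; /* stray continuation byte */
        else if (b < 0xE0) utf8_len_table[b] = 2;
        else if (b < 0xF0) utf8_len_table[b] = 3;
        else if (b < 0xF8) utf8_len_table[b] = 4;
        else               utf8_len_table[b] = 1; /* invalid lead byte */
    }
}

/* Count characters by jumping from lead byte to lead byte.
   A sequence truncated at the end of the buffer still counts as one. */
static size_t utf8_count_chars(const unsigned char *p, size_t len)
{
    assert(p != NULL || len == 0);
    size_t i = 0, n = 0;
    while (i < len) {
        i += utf8_len_table[p[i]];
        ++n;
    }
    return n;
}
```

Call init_utf8_len_table() once, then utf8_count_chars(buf, len). The loop touches one byte per character instead of every byte, which is where the saving comes from.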

This concept can be used for the utf-16 encoding, but the table size
would be 65536 bytes against 256 bytes for the utf-8 table. Moreover,
two tables would be needed: one for utf-16 big endian and one for
utf-16 little endian.

The results of my tests show a more than 2 times speed increase.
The implementation of the proposed concept is here:

https://github.com/php/php-src/pull/3715/commits/d868059626290b7ba773b957045e08c3efb1d603#diff-22d593ced03b2cb94450d9f9990865c8R38

To do, or not to do: that is the question.
What do you think?

Regards,
Ruslan

6 years ago by Rowan Collins

> This concept can be used for the utf-16 encoding, but the table size
> would be 65536 bytes against 256 bytes for the utf-8 table.

Rather than two 65 kilobyte lookup tables with most entries identical,
would it be reasonable to use a bit mask to check for the range we care
about?

I may have this slightly wrong, but something like:

#define UTF16_LE_CODE_UNIT_IS_HIGH_SURROGATE(code_unit) (((code_unit) & 0xFC00) == 0xD800)
#define UTF16_BE_CODE_UNIT_IS_HIGH_SURROGATE(code_unit) (((code_unit) & 0x00FC) == 0x00D8)

/* both masks assume the 16-bit load happens on a little-endian host */
m = UTF16_LE_CODE_UNIT_IS_HIGH_SURROGATE(*(uint16_t *)p) ? 4 : 2;
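A slightly fuller sketch of the same idea, in case it helps. It assembles each code unit from its two bytes, so it does not depend on host byte order or pointer alignment. All names here are hypothetical and nothing is taken from mbfl; a trailing odd byte is ignored and a truncated surrogate pair is counted as one character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* High surrogates are 0xD800..0xDBFF: the top six bits are 110110. */
static int utf16_is_high_surrogate(uint16_t u)
{
    return (u & 0xFC00u) == 0xD800u;
}

/* Assemble one code unit from two bytes in the given byte order. */
static uint16_t utf16le_unit(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint16_t utf16be_unit(const unsigned char *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Count code points in a utf-16 buffer of len bytes.
   big_endian selects the byte order of the data. */
static size_t utf16_count_chars(const unsigned char *p, size_t len, int big_endian)
{
    assert(p != NULL || len == 0);
    size_t i = 0, n = 0;
    while (i + 1 < len) {
        uint16_t u = big_endian ? utf16be_unit(p + i) : utf16le_unit(p + i);
        i += utf16_is_high_surrogate(u) ? 4 : 2; /* skip the low surrogate too */
        ++n;
    }
    return n;
}
```

The same mask serves both byte orders because the swap happens while reading, so only one test is needed.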

Regards,

--
Rowan Collins
[IMSoP]

6 years ago by Legale Legage

Good idea, thanks. It should be a bit slower than a lookup table, but
faster than now.


6 years ago by Dan Ackroyd

> https://github.com/php/php-src/pull/3715/commits/d868059626290b7ba773b957045e08c3efb1d603#diff-22d593ced03b2cb94450d9f9990865c8R38
>
> To do, or not to do: that is the question.
> What do you think?

Opening separate pull requests for separate changes is good as it
allows them to be discussed separately. That change is bundled with
the mb_str_split() changes, so it's quite hard to see what is
optimisation and what is part of the approved RFC.

Although memory is cheap, the change appears to increase the static
allocation of memory by 128KB for something that >95% of PHP
programmers will never use, which is not a good idea.
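To make the 128KB figure concrete, here is a sketch under assumptions: one byte per table entry, and each table indexed by the first two bytes of a code unit in memory order. The names are hypothetical and the layout in the actual pull request may differ.

```c
#include <assert.h>
#include <stddef.h>

enum { UTF16_TABLE_SIZE = 1 << 16 }; /* one entry per 16-bit value */

static unsigned char utf16be_len_table[UTF16_TABLE_SIZE];
static unsigned char utf16le_len_table[UTF16_TABLE_SIZE];

/* Index is (first byte << 8) | second byte, as the bytes appear in memory. */
static void init_utf16_len_tables(void)
{
    for (unsigned idx = 0; idx < UTF16_TABLE_SIZE; ++idx) {
        utf16be_len_table[idx] = ((idx & 0xFC00u) == 0xD800u) ? 4 : 2;
        utf16le_len_table[idx] = ((idx & 0x00FCu) == 0x00D8u) ? 4 : 2;
    }
}

/* Total static storage used by both tables, in bytes. */
static size_t utf16_tables_bytes(void)
{
    return sizeof utf16be_len_table + sizeof utf16le_len_table;
}

/* Number of entries that differ from the common value 2. */
static size_t count_surrogate_entries(const unsigned char *table)
{
    assert(table != NULL);
    size_t n = 0;
    for (size_t idx = 0; idx < UTF16_TABLE_SIZE; ++idx) {
        if (table[idx] == 4) {
            ++n;
        }
    }
    return n;
}
```

Two tables of 65536 bytes come to 131072 bytes, which is 128 KiB. In this layout only 1024 entries per table hold the value 4, so the rest of each table repeats the same byte.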

> show a more than 2 times speed increase.

Lies, damn lies and statistics.

If it reduces the time to parse a megabyte string from 0.000002 to
0.000001 seconds, no one cares.
If it reduces the time to parse a megabyte string from 2 seconds to 1
second, wow that's great!

i.e. Saying a two times speed increase without context doesn't give
people enough information to evaluate it.

But this would be easier to discuss as a separate PR.

cheers
Dan

6 years ago by Legale Legage

Got it. Thanks.

On Sun, 10 Feb 2019 at 12:29, Legale Legage legale.legale@gmail.com
wrote:

> https://github.com/php/php-src/pull/3715/commits/d868059626290b7ba773b957045e08c3efb1d603#diff-22d593ced03b2cb94450d9f9990865c8R38
>
> To do, or not to do: that is the question.
> What do you think?


6 years ago by Legale Legage

Hello, internals.

As Rowan Collins suggested, I've replaced the lookup table with simple macros:
#define UTF16_LE_CODE_UNIT_IS_HIGH_SURROGATE(code_unit) (((code_unit) & 0xFC00) == 0xD800)
#define UTF16_BE_CODE_UNIT_IS_HIGH_SURROGATE(code_unit) (((code_unit) & 0x00FC) == 0x00D8)

I ran the benchmarks again. Here are the results:

String foobar was repeated 1000000 times. Result string size is 11.4mb
mb_str_split(): string was splitted by 50 into 120000 chunks 1 in 0.400670 s
mb_str_split_utf16(): string was splitted by 50 into 120000 chunks 1 in 0.038947 s
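To give these timings some context, here is a small sketch that turns them into a speedup ratio and an absolute throughput. The helper names are hypothetical, and the input size of 12,000,000 bytes is my assumption (6 characters, 2 bytes each, repeated 1000000 times, no BOM), which matches the reported 11.4mb.

```c
#include <assert.h>
#include <stddef.h>

/* Ratio of the old time to the new time; above 1.0 means faster. */
static double speedup(double before_s, double after_s)
{
    assert(after_s > 0.0);
    return before_s / after_s;
}

/* Absolute throughput in MiB per second for a given input size and time. */
static double throughput_mib_per_s(size_t bytes, double seconds)
{
    assert(seconds > 0.0);
    return ((double)bytes / (1024.0 * 1024.0)) / seconds;
}
```

With the figures above, speedup(0.400670, 0.038947) is roughly 10.3, and the throughput goes from about 28.6 MiB/s to about 294 MiB/s. This is a single round on one machine with ASCII-only input, so treat it as an indication only.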

I have satisfied my research interest. The question is: is there any
practical value? I am interested in your opinion.

PHP benchmark code:

<?php
/**
 * Benchmark helper: measures a function's performance by calling it a
 * given number of times and returning the total elapsed time in seconds.
 *
 * bmark(int $rounds, string $function, mixed $arg [, mixed $... ] ): ?float
 */
function bmark(): ?float
{
    $args = func_get_args();
    $len = count($args);

    if ($len < 3) {
        trigger_error("At least 3 args expected. Only $len given.", E_USER_ERROR);
        return null;
    }

    // the first two arguments are the round count and the function name;
    // everything after them is passed through to the function
    $cnt = array_shift($args);
    $fun = array_shift($args);

    $start = microtime(true);
    for ($i = 0; $i < $cnt; ++$i) {
        call_user_func_array($fun, $args);
    }
    return microtime(true) - $start;
}
/* Convert a data size in bytes to the most suitable unit of measurement. */
function convert($size)
{
    if ($size == 0) {
        return 0;
    }
    $unit = array('b', 'kb', 'mb', 'gb', 'tb', 'pb');
    $i = (int)floor(log($size, 1024));
    return round($size / pow(1024, $i), 1) . $unit[$i];
}

$string = "foobar";
$utf16 = mb_convert_encoding($string, "UTF-16");
$k = 1000000;
$long = str_repeat($utf16, $k);
$size = convert(strlen($long));
$rounds = 1;
$split_length = 50;

echo "String $string was repeated $k times. Result string size is $size\n";
printf("mb_str_split(): string was splitted by %d into %d chunks %d in %f s\n"
    , $split_length
    , count(mb_str_split($long, $split_length, "UTF-16"))
    , $rounds
    , bmark($rounds, "mb_str_split", $long, $split_length, "UTF-16")
);

// mb_str_split_utf16() is not a stock PHP function; it has to come from
// the patched build under test
printf("mb_str_split_utf16(): string was splitted by %d into %d chunks %d in %f s\n"
    , $split_length
    , count(mb_str_split_utf16($long, $split_length, "UTF-16"))
    , $rounds
    , bmark($rounds, "mb_str_split_utf16", $long, $split_length, "UTF-16")
);
