Hello, internals.
As Rowan Collins suggested, I've replaced the lookup table with simple macros:
#define UTF16_LE_CODE_UNIT_IS_HIGH_SURROGATE ((code_unit & 0xFC00) == 0xD800)
#define UTF16_BE_CODE_UNIT_IS_HIGH_SURROGATE ((code_unit & 0x00FC) == 0x00D8)
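For readers who want to try the check in isolation, here is a minimal sketch in C. The macro and function names are hypothetical and take the code unit as a parameter; they are not the names used in the pull request. One thing to keep in mind is that in C `==` binds more tightly than `&`, so the masked value has to be parenthesized before the comparison. The second macro assumes the 16-bit value was loaded with its two bytes swapped relative to the host byte order, which is why the mask and the constant sit in the low byte.

```c
#include <stdint.h>

/* Hypothetical, parameterized variants of the macros discussed above. */

/* Code unit already in host byte order: high surrogates are 0xD800..0xDBFF. */
#define IS_HIGH_SURROGATE(u) (((u) & 0xFC00u) == 0xD800u)

/* Code unit loaded with its two bytes swapped (assumption: e.g. UTF-16BE
 * bytes read as a 16-bit integer on a little-endian host). */
#define IS_HIGH_SURROGATE_SWAPPED(u) (((u) & 0x00FCu) == 0x00D8u)

/* Thin function wrappers so the checks can be called and tested directly. */
int is_high_surrogate(uint16_t u)
{
    return IS_HIGH_SURROGATE(u);
}

int is_high_surrogate_swapped(uint16_t u)
{
    return IS_HIGH_SURROGATE_SWAPPED(u);
}
```

For example, 0xD83D is a high surrogate, while 0xDC00 (a low surrogate) and 0x0041 are not; the byte-swapped form of 0xD83D is 0x3DD8.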
I ran the benchmarks again. Here are the results:
String foobar was repeated 1000000 times. Result string size is 11.4mb
mb_str_split(): string was splitted by 50 into 120000 chunks 1 in 0.400670 s
mb_str_split_utf16(): string was splitted by 50 into 120000 chunks 1 in 0.038947 s
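As a sanity check on these numbers, the arithmetic can be sketched in C. The helper names are hypothetical. The sketch assumes "foobar" encodes to 6 UTF-16 code units of 2 bytes each with no byte order mark, which matches the reported 11.4mb. With the timings above, the ratio comes out at roughly 10.3 times, or about 0.36 s saved in absolute terms for this input.

```c
#include <stddef.h>

/* Size of the test input in MiB: each repeat of "foobar" is assumed to be
 * 6 UTF-16 code units of 2 bytes, with no byte order mark. */
double input_size_mib(size_t repeats)
{
    return (double)(repeats * 6 * 2) / (1024.0 * 1024.0);
}

/* Number of chunks when splitting into pieces of split_length characters
 * (the last chunk may be shorter, hence the rounding up). */
size_t expected_chunks(size_t repeats, size_t split_length)
{
    size_t chars = repeats * 6;
    return (chars + split_length - 1) / split_length;
}

/* Relative speedup of the optimized run over the baseline run. */
double speedup(double baseline_s, double optimized_s)
{
    return baseline_s / optimized_s;
}

/* Absolute time saved, in seconds. */
double seconds_saved(double baseline_s, double optimized_s)
{
    return baseline_s - optimized_s;
}
```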
I have satisfied my research curiosity. The question is: does this have any practical value?
I'm interested in your opinion.
PHP benchmark code:
<?php
/**
 * Benchmark a function by calling it a given number of times.
 *
 * bmark(int $rounds, string $function, mixed $arg [, mixed $... ]): ?float
 */
function bmark(): ?float
{
    $args = func_get_args();
    $len = count($args);
    if ($len < 3) {
        // E_USER_ERROR has the value 256
        trigger_error("At least 3 args expected. Only $len given.", E_USER_ERROR);
        return null;
    }
    $cnt = array_shift($args); // number of rounds
    $fun = array_shift($args); // name of the function to benchmark
    $start = microtime(true);
    $i = 0;
    while ($i < $cnt) {
        ++$i;
        $res = call_user_func_array($fun, $args);
    }
    // Elapsed wall-clock time in seconds
    $end = microtime(true) - $start;
    return $end;
}
/* Convert a size in bytes to the most suitable unit of measurement. */
function convert($size)
{
    if ($size == 0) {
        return 0;
    }
    $unit = array('b', 'kb', 'mb', 'gb', 'tb', 'pb');
    $i = (int)floor(log($size, 1024));
    return round($size / pow(1024, $i), 1) . $unit[$i];
}
$string = "foobar";
$utf16 = mb_convert_encoding($string, "UTF-16");
$k = 1e6;
$long = str_repeat($utf16, $k);
$size = convert(strlen($long));
$rounds = 1;
$split_length = 50;
echo "String $string was repeated $k times. Result string size is $size\n";
printf("mb_str_split(): string was splitted by %d into %d chunks %d in %f s\n"
    , $split_length
    , count(mb_str_split($long, $split_length, "UTF-16"))
    , $rounds
    , bmark($rounds, "mb_str_split", $long, $split_length, "UTF-16")
);
printf("mb_str_split_utf16(): string was splitted by %d into %d chunks %d in %f s\n"
    , $split_length
    , count(mb_str_split_utf16($long, $split_length, "UTF-16"))
    , $rounds
    , bmark($rounds, "mb_str_split_utf16", $long, $split_length, "UTF-16")
);
On Sun, 10 Feb 2019 at 12:29, Legale Legage <legale.legale@gmail.com> wrote:
> https://github.com/php/php-src/pull/3715/commits/d868059626290b7ba773b957045e08c3efb1d603#diff-22d593ced03b2cb94450d9f9990865c8R38
> To do, or not to do: that is the question.
> What do you think?
Opening separate pull requests for separate changes is good as it
allows them to be discussed separately. That change is bundled with
the mb_str_split() changes, so it's quite hard to see what is
optimisation and what is part of the approved RFC.
Although memory is cheap, the change appears to increase the static
allocation of memory by 128KB for something that >95% of PHP
programmers will never use, which is not a good idea.
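To illustrate where a figure like 128KB could come from, here is a small sketch in C. It assumes a statically allocated table with one 2-byte entry for every possible 16-bit code unit; that layout is an assumption made for illustration, not something confirmed by the pull request, and the names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: one table entry per possible 16-bit code unit value. */
enum { CODE_UNIT_VALUES = 65536 };

/* Bytes needed by such a table if every entry is a uint16_t. */
size_t lookup_table_bytes(void)
{
    return (size_t)CODE_UNIT_VALUES * sizeof(uint16_t);
}

/* The same size expressed in KiB. */
size_t lookup_table_kib(void)
{
    return lookup_table_bytes() / 1024;
}
```

Under that assumption the table takes 131072 bytes, i.e. 128 KiB, for the whole lifetime of the process, whereas a mask-and-compare check needs no table at all.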
> show a more than 2 times speed increase.
Lies, damn lies and statistics.
If it cuts the time to parse a megabyte string from 0.000002 to
0.000001 seconds, no one cares.
If it cuts the time to parse a megabyte string from 2 seconds to 1
second, wow that's great!
In other words, claiming a two-times speed increase without context
doesn't give people enough information to evaluate it.
But this would be easier to discuss as a separate PR.
cheers
Dan