Not sure if this message belongs on php-general@lists.php.net,
internals@lists.php.net, or elsewhere; I'll just try here first and see what
happens.
I recently ran some tests to check the performance of UTF-8 validators, and in
that (simple, non-comprehensive) test, preg_match() is ~33 times faster than
mb_check_encoding(), and mb_check_encoding() is less than half the speed of
iconv() (even though iconv isn't designed for validating at all, and makes
a full copy of the string).
Can someone shed some light on this? Why does mb_check_encoding seem to be
so much slower than the alternatives?
Benchmark code and results are here: https://stackoverflow.com/a/68690757/1067003
Hi Hans,
Since you ran the test on PHP 7.4, the relevant implementation is here:
https://heap.space/xref/PHP-7.4/ext/mbstring/mbstring.c?r=0cafd53d#php_mb_check_encoding_impl
As you can maybe see, it takes a rather "brute force" approach: it runs
the entire string through a conversion routine, and then checks (among
other things) that the output is identical to the input. That makes it
scale horribly with string length, with no optimization for returning
false early.
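To make the cost of that approach concrete, here is a minimal C sketch of "convert, then compare" validation. It is not the ext/mbstring code: utf8_seq_len and check_by_round_trip are hypothetical names, and the "conversion" is assumed to simply replace each invalid byte with '?'. Note that it always walks the whole input and allocates a full copy, even when the very first byte is already invalid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: length of the well-formed UTF-8 sequence starting at
 * s (n bytes available), or 0 if the bytes there are not well-formed. */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    if (n == 0) return 0;
    unsigned char b0 = s[0];
    if (b0 < 0x80) return 1;

    size_t len;
    unsigned char lo = 0x80, hi = 0xBF;      /* allowed range of 2nd byte */
    if (b0 >= 0xC2 && b0 <= 0xDF) {
        len = 2;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
        len = 3;
        if (b0 == 0xE0) lo = 0xA0;           /* reject overlong forms */
        if (b0 == 0xED) hi = 0x9F;           /* reject surrogates */
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        len = 4;
        if (b0 == 0xF0) lo = 0x90;           /* reject overlong forms */
        if (b0 == 0xF4) hi = 0x8F;           /* reject above U+10FFFF */
    } else {
        return 0;
    }
    if (n < len) return 0;                   /* truncated sequence */
    if (s[1] < lo || s[1] > hi) return 0;
    for (size_t i = 2; i < len; i++)
        if (s[i] < 0x80 || s[i] > 0xBF) return 0;
    return len;
}

/* Validate by converting the whole input into a scratch buffer (invalid
 * bytes become '?') and then comparing the result with the input. */
static bool check_by_round_trip(const char *input, size_t len)
{
    const unsigned char *in = (const unsigned char *)input;
    unsigned char *out = malloc(len ? len : 1);
    assert(out != NULL);

    size_t i = 0, o = 0;
    while (i < len) {                        /* no early exit on errors */
        size_t n = utf8_seq_len(in + i, len - i);
        if (n == 0) {
            out[o++] = '?';
            i++;
        } else {
            memcpy(out + o, in + i, n);
            o += n;
            i += n;
        }
    }
    bool same = (o == len) && memcmp(out, in, len) == 0;
    free(out);
    return same;
}
```

With this shape, a string that is invalid at byte 0 costs about as much as a fully valid one, which matches the three nearly identical PHP 7.4 timings for mbstring.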
The good news is that Alex Dowad has been doing a lot of work to improve
ext/mbstring recently, and landed a completely new implementation for
mb_check_encoding a few months ago:
https://github.com/php/php-src/commit/be1a2155 although it was then
changed slightly by later cleanup:
https://github.com/php/php-src/commit/3e7acf90
That was too late for PHP 8.0, so I compiled an up-to-date git checkout
and ran your benchmark (with 100_000 iterations instead of 1_000_000; I
guess my PC's a lot slower than yours!):
PHP 7.4:
mbstring: 57000 / 57100 / 56200
PCRE: 1500 / 1200 / 12400
PHP 8.1 beta:
mbstring: 35600 / 1200 / 36700
PCRE: 1400 / 1200 / 12100
So, mbstring now detects a failure at the start of the string as quickly
as PCRE does, because the new algorithm has an early return, but is
still slower than PCRE when it has to check the whole string.
Looking at the PCRE source, I think the relevant code is this:
https://vcs.pcre.org/pcre2/code/trunk/src/pcre2_valid_utf.c?view=markup
It has the advantage of only handling a handful of encodings, and only
needing to do a few operations on them. The main problem ext/mbstring
has is that it supports a lot of operations, on a lot of different
encodings, so it's still reusing a general purpose "convert and filter"
algorithm.
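For comparison, a dedicated single-encoding validator can be quite small. The following is a minimal C sketch, not the PCRE2 code: utf8_validate and fail_at are made-up names, and the byte ranges are taken from the well-formed UTF-8 table in the Unicode Standard. It makes one pass, allocates nothing, and returns at the first bad sequence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Record where validation stopped and report failure. */
static bool fail_at(size_t *err_offset, size_t offset)
{
    if (err_offset != NULL) *err_offset = offset;
    return false;
}

/* Returns true if the len bytes at input are well-formed UTF-8. On failure,
 * *err_offset (if non-NULL) is set to the start of the offending sequence. */
static bool utf8_validate(const char *input, size_t len, size_t *err_offset)
{
    assert(input != NULL || len == 0);
    const unsigned char *s = (const unsigned char *)input;
    size_t i = 0;

    while (i < len) {
        unsigned char b0 = s[i];
        unsigned char lo = 0x80, hi = 0xBF;  /* allowed range of 2nd byte */
        size_t need;                         /* continuation bytes needed */

        if (b0 < 0x80) { i++; continue; }    /* ASCII fast path */
        if (b0 >= 0xC2 && b0 <= 0xDF) {
            need = 1;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            need = 2;
            if (b0 == 0xE0) lo = 0xA0;       /* reject overlong forms */
            if (b0 == 0xED) hi = 0x9F;       /* reject surrogates */
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            need = 3;
            if (b0 == 0xF0) lo = 0x90;       /* reject overlong forms */
            if (b0 == 0xF4) hi = 0x8F;       /* reject above U+10FFFF */
        } else {
            return fail_at(err_offset, i);   /* early return: bad lead byte */
        }
        if (len - i - 1 < need) return fail_at(err_offset, i);
        if (s[i + 1] < lo || s[i + 1] > hi) return fail_at(err_offset, i);
        for (size_t k = 2; k <= need; k++)
            if ((s[i + k] & 0xC0) != 0x80) return fail_at(err_offset, i);
        i += need + 1;
    }
    return true;
}
```

Because the function only has to answer "valid or not" for one encoding, it never builds an output string, which is the kind of advantage a special-purpose validator has over a general conversion pipeline.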
Regards,
--
Rowan Tommins
[IMSoP]
On Mon, Aug 9, 2021 at 10:14 PM Rowan Tommins rowan.collins@gmail.com
wrote:
> The main problem ext/mbstring has is that it supports a lot of
> operations, on a lot of different encodings, so it's still reusing a
> general purpose "convert and filter" algorithm.
I think a key problem with the mbstring implementation is that input
(encoding to wchar) filters work by handling one byte at a time. This means
that state has to be managed internally by the filter, and we need to use a
filter-chain interface.
What would be better is an interface along the lines of
int decode(char **input, size_t *input_len), where the filter returns the
decoded character while advancing the input/input_len pointers, possibly with
an indication that the input is incomplete and more characters are necessary,
to allow streaming use.
This would allow the filter to handle one Unicode character at a time
(regardless of how many bytes it is encoded as), and would allow the
calling code to use a simple while loop rather than a filter chain.
Of course, this would require rewriting all our filter code...
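As a rough illustration, such an interface could look like the following C sketch for UTF-8. This is not existing mbstring code: the names utf8_decode, DECODE_INVALID, DECODE_INCOMPLETE and check_utf8 are invented for the example, the signature uses const char ** rather than char **, and the choice to consume one byte on invalid input (and nothing on incomplete input) is just one possible convention.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DECODE_INVALID    (-1)  /* malformed input; one byte was consumed */
#define DECODE_INCOMPLETE (-2)  /* sequence cut short; nothing was consumed */

/* Decode one code point from *input, advancing *input and *input_len.
 * Requires *input_len > 0. Returns the code point or a negative status. */
static int utf8_decode(const char **input, size_t *input_len)
{
    assert(input && *input && input_len && *input_len > 0);
    const unsigned char *s = (const unsigned char *)*input;
    size_t avail = *input_len, need;
    unsigned char b0 = s[0], lo = 0x80, hi = 0xBF;
    int cp;

    if (b0 < 0x80) {
        need = 0; cp = b0;
    } else if (b0 >= 0xC2 && b0 <= 0xDF) {
        need = 1; cp = b0 & 0x1F;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
        need = 2; cp = b0 & 0x0F;
        if (b0 == 0xE0) lo = 0xA0;           /* reject overlong forms */
        if (b0 == 0xED) hi = 0x9F;           /* reject surrogates */
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        need = 3; cp = b0 & 0x07;
        if (b0 == 0xF0) lo = 0x90;           /* reject overlong forms */
        if (b0 == 0xF4) hi = 0x8F;           /* reject above U+10FFFF */
    } else {
        *input += 1; *input_len -= 1;
        return DECODE_INVALID;
    }

    for (size_t k = 1; k <= need; k++) {
        if (k >= avail) return DECODE_INCOMPLETE;  /* caller may feed more */
        unsigned char b = s[k];
        unsigned char min = (k == 1) ? lo : 0x80;
        unsigned char max = (k == 1) ? hi : 0xBF;
        if (b < min || b > max) {
            *input += 1; *input_len -= 1;
            return DECODE_INVALID;
        }
        cp = (cp << 6) | (b & 0x3F);
    }
    *input += need + 1;
    *input_len -= need + 1;
    return cp;
}

/* Calling code becomes a plain while loop with an early return. */
static bool check_utf8(const char *s, size_t len)
{
    while (len > 0) {
        if (utf8_decode(&s, &len) < 0) return false;
    }
    return true;
}
```

All per-character state lives in local variables of a single call, so the decoder needs no internal state machine; a streaming caller would react to DECODE_INCOMPLETE by appending more bytes and calling again.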
Regards,
Nikita