mb_check_encoding slow performance?

4 years ago by Hans Henrik Bergan

I'm not sure whether this message belongs on php-general@lists.php.net,
internals@lists.php.net, or elsewhere, so I'll try here first and see what
happens.

I recently ran some tests to check the performance of UTF-8 validators. In
that (simple, non-comprehensive) test, preg_match() is ~33 times faster than
mb_check_encoding(), and mb_check_encoding() is less than half the speed of
iconv() (even though iconv isn't designed for validation at all, and makes
a full copy of the string).

Can someone shed some light on this? Why does mb_check_encoding seem to be
so much slower than the alternatives?
Benchmark code and results are here: https://stackoverflow.com/a/68690757/1067003

4 years ago by Rowan Tommins

Hi Hans,

Since you ran the test on PHP 7.4, the relevant implementation is here:
https://heap.space/xref/PHP-7.4/ext/mbstring/mbstring.c?r=0cafd53d#php_mb_check_encoding_impl

As you can maybe see, it takes a rather "brute force" approach: it runs
the entire string through a conversion routine, and then checks (among
other things) that the output is identical to the input. That makes it
scale horribly with string length, with no optimization for returning
false early.

The good news is that Alex Dowad has been doing a lot of work to improve
ext/mbstring recently, and landed a completely new implementation for
mb_check_encoding a few months ago:
https://github.com/php/php-src/commit/be1a2155 although it was then
changed slightly by later cleanup:
https://github.com/php/php-src/commit/3e7acf90

That was too late for PHP 8.0, so I compiled an up-to-date git checkout
and ran your benchmark (with 100_000 iterations instead of 1_000_000; I
guess my PC's a lot slower than yours!).

PHP 7.4:
mbstring: 57000 / 57100 / 56200
PCRE: 1500 / 1200 / 12400

PHP 8.1 beta:
mbstring: 35600 / 1200 / 36700
PCRE: 1400 / 1200 / 12100

So, mbstring now detects a failure at the start of the string as quickly
as PCRE does, because the new algorithm has an early return, but is
still slower than PCRE when it has to check the whole string.

Looking at the PCRE source, I think the relevant code is this:
https://vcs.pcre.org/pcre2/code/trunk/src/pcre2_valid_utf.c?view=markup

It has the advantage of only handling a handful of encodings, and only
needing to do a few operations on them. The main problem ext/mbstring
has is that it supports a lot of operations, on a lot of different
encodings, so it's still reusing a general purpose "convert and filter"
algorithm.
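
To make the contrast concrete, here is a minimal sketch of a validator that handles only UTF-8 and returns at the first malformed byte. It is neither the PCRE nor the mbstring code; the function name utf8_is_valid is hypothetical and the checks follow the usual well-formedness rules (no overlong forms, no surrogates, nothing above U+10FFFF).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not PHP or PCRE code): returns true if the first
 * len bytes of s are well-formed UTF-8, bailing out at the first bad byte. */
static bool utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char b = s[i];
        size_t need;        /* number of continuation bytes expected */
        unsigned int min;   /* smallest code point allowed for this length */
        unsigned int cp;    /* code point being assembled */

        if (b < 0x80) { i++; continue; }   /* ASCII fast path */
        else if ((b & 0xE0) == 0xC0) { need = 1; min = 0x80;    cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { need = 2; min = 0x800;   cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { need = 3; min = 0x10000; cp = b & 0x07; }
        else return false;   /* stray continuation byte or invalid lead byte */

        if (len - i - 1 < need) return false;   /* truncated sequence */

        for (size_t k = 1; k <= need; k++) {
            unsigned char c = s[i + k];
            if ((c & 0xC0) != 0x80) return false;   /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min) return false;                       /* overlong encoding */
        if (cp > 0x10FFFF) return false;                  /* beyond Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   /* UTF-16 surrogate */

        i += need + 1;
    }
    return true;
}
```

Because nothing is converted or copied, a bad byte at offset 0 costs a single comparison, while a fully valid string still has to be scanned to the end.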

Regards,

--
Rowan Tommins
[IMSoP]

3 years ago by Nikita Popov

On Mon, Aug 9, 2021 at 10:14 PM Rowan Tommins rowan.collins@gmail.com
wrote:

> It has the advantage of only handling a handful of encodings, and only
> needing to do a few operations on them. The main problem ext/mbstring
> has is that it supports a lot of operations, on a lot of different
> encodings, so it's still reusing a general purpose "convert and filter"
> algorithm.

I think a key problem with the mbstring implementation is that input
(encoding to wchar) filters work by handling one byte at a time. This means
that state has to be managed internally by the filter, and we need to use a
filter-chain interface.

What would be better is an interface along the lines of int decode(char
**input, size_t *input_len), where the filter returns the decoded
character, while advancing the input/input_len pointers. Possibly with an
indication that the input is incomplete and more characters are necessary
to allow streaming use.

This would allow the filter to handle one Unicode character at a time
(regardless of how many bytes it is encoded as), and would allow the
calling code to use a simple while loop rather than a filter chain.

Of course, this would require rewriting all our filter code...
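
The proposed decode(char **input, size_t *input_len) shape can be sketched for UTF-8 as follows. This is only an illustration under assumptions, not mbstring code: the names utf8_decode, utf8_check, DECODE_INVALID and DECODE_INCOMPLETE are hypothetical, and the "incomplete" status is one possible way to signal that more input is needed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes; every valid code point is >= 0. */
#define DECODE_INVALID    (-1)
#define DECODE_INCOMPLETE (-2)

/* Decodes one UTF-8 code point from *input (requires *input_len > 0).
 * On success or DECODE_INVALID the pointer and length are advanced;
 * on DECODE_INCOMPLETE nothing is consumed, so a streaming caller can
 * append more bytes and retry. */
static int32_t utf8_decode(const unsigned char **input, size_t *input_len)
{
    const unsigned char *p = *input;
    size_t len = *input_len, need, used = 1;
    uint32_t cp, min;
    bool ok = true;

    if (p[0] < 0x80)                { need = 0; min = 0;       cp = p[0]; }
    else if ((p[0] & 0xE0) == 0xC0) { need = 1; min = 0x80;    cp = p[0] & 0x1F; }
    else if ((p[0] & 0xF0) == 0xE0) { need = 2; min = 0x800;   cp = p[0] & 0x0F; }
    else if ((p[0] & 0xF8) == 0xF0) { need = 3; min = 0x10000; cp = p[0] & 0x07; }
    else {
        /* Invalid lead byte: skip it so the caller can resynchronise. */
        *input = p + 1;
        *input_len = len - 1;
        return DECODE_INVALID;
    }

    if (len - 1 < need) return DECODE_INCOMPLETE;

    while (used <= need) {
        if ((p[used] & 0xC0) != 0x80) { ok = false; break; }
        cp = (cp << 6) | (p[used] & 0x3F);
        used++;
    }
    /* Reject overlong forms, out-of-range values and surrogates. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) ok = false;

    *input = p + used;
    *input_len = len - used;
    return ok ? (int32_t)cp : DECODE_INVALID;
}

/* The calling code becomes a plain while loop instead of a filter chain. */
static bool utf8_check(const unsigned char *s, size_t len)
{
    while (len > 0) {
        if (utf8_decode(&s, &len) < 0) return false;
    }
    return true;
}
```

All decoder state lives in local variables between the first and last byte of one character, which is what removes the need for per-byte state inside the filter.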

Regards,
Nikita

3 years ago by Nikita Popov

> Dear Nikita,
>
> It looks like we think alike.
>
> I already have a local dev branch on my PC with the beginnings of the
> internal interface change which you describe here. In some cases which I
> tested, it makes mbstring encoding conversion operations about 3 times
> faster. However, the specific function signatures which I am using are
> different from what you showed. Here is an example:
>
> static void* mb_utf16_to_wchar(zend_string *str, uint32_t *buf, size_t
> bufsize, mb_wchar_consumer consumer, void *context);
> static void* mb_wchar_to_utf16le(uint32_t *input, size_t len, bool end,
> void *context);
>
> Your comments on the interfaces would be appreciated. Here are some points:
>
> The "legacy to wchar" functions take a zend_string as input, since
> that is what mbstring receives from user code anyways. I don't know if
> there are some places where we would like to use these functions
> internally, but don't have a zend_string readily available. If so, it could
> take a pointer to char/length pair instead.
>
> We do not want to force any part of mbstring to have to convert an
> entire input string to wchars before processing them, since this could
> cause memory usage spikes (when someone passes in a huge input string).
> Hence, the "legacy to wchar" functions work in "chunks", repeatedly filling
> a uint32_t buffer with wchars and passing it through to whatever needs to
> process the wchars. The uint32_t buffer is reused each time. Working in
> chunks avoids huge memory allocations, but also amortizes the overheads
> which we are currently paying for processing legacy text one character at a
> time.
>
> There is really no reason why the "legacy to wchar" functions need
> to take the uint32_t buffer from their caller. It should be fine for each
> such function to stack-allocate its own buffer of whatever size it wants.
> But when I was benchmarking, it seemed to be faster if the caller
> stack-allocates the buffer and passes it in. (Why???)
>
> Because there are a variety of places where we need to convert
> legacy text to wchars, the "legacy to wchar" functions take a generic
> function pointer to a "wchar consumer", so they can be reused for various
> purposes. As is usual in such C APIs, we need to pass through an opaque
> void pointer.
>
> For encoding conversion, the void* will actually be something like
> an "output buffer". It may need to be realloc'd partway through a
> conversion operation. That is why all the functions return a void*; the
> returned void* is usually the opaque pointer that was originally passed in,
> but if that pointer was realloc'd, it is the new pointer.
>
> To get more perf, I am using a dirty hack... the dynamically-grown
> output buffers have the same memory layout as a zend_string, and I set
> things up such that when a conversion operation is done, the output buffer
> can be cast to zend_string* and returned. This avoids an extra memory copy.
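
A rough sketch of the chunked producer/consumer scheme described above, under several assumptions: Latin-1 stands in for the "legacy" encoding because every byte maps directly to a code point, the input is a pointer/length pair rather than a zend_string, and struct outbuf is a hypothetical stand-in for the zend_string-compatible output buffer. None of these names come from mbstring.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical growable output buffer (header followed by the bytes). */
struct outbuf {
    size_t len, cap;
    unsigned char data[];
};

/* A consumer receives one chunk of wchars and returns the (possibly moved) context. */
typedef void *(*wchar_consumer)(const uint32_t *chunk, size_t n, bool end, void *context);

static struct outbuf *outbuf_new(size_t cap)
{
    struct outbuf *b = malloc(sizeof(*b) + cap);
    if (b) { b->len = 0; b->cap = cap; }
    return b;
}

/* Consumer: encodes code points below U+0800 as UTF-8 (enough for Latin-1
 * input). Returns the new pointer if the buffer had to be realloc'd. */
static void *wchar_to_utf8(const uint32_t *chunk, size_t n, bool end, void *context)
{
    struct outbuf *b = context;
    (void)end;   /* nothing to flush for UTF-8 output */

    if (b->cap - b->len < 2 * n) {   /* worst case: 2 output bytes per wchar */
        size_t cap = 2 * b->cap + 2 * n;
        struct outbuf *nb = realloc(b, sizeof(*nb) + cap);
        if (!nb) abort();
        b = nb;
        b->cap = cap;
    }
    for (size_t i = 0; i < n; i++) {
        uint32_t w = chunk[i];
        if (w < 0x80) {
            b->data[b->len++] = (unsigned char)w;
        } else {
            b->data[b->len++] = (unsigned char)(0xC0 | (w >> 6));
            b->data[b->len++] = (unsigned char)(0x80 | (w & 0x3F));
        }
    }
    return b;
}

/* Producer: fills the caller-supplied buffer chunk by chunk and hands each
 * chunk to the consumer, threading the context pointer through every call. */
static void *latin1_to_wchar(const unsigned char *in, size_t in_len,
                             uint32_t *buf, size_t bufsize,
                             wchar_consumer consumer, void *context)
{
    while (in_len > bufsize) {
        for (size_t i = 0; i < bufsize; i++) buf[i] = in[i];
        context = consumer(buf, bufsize, false, context);
        in += bufsize;
        in_len -= bufsize;
    }
    for (size_t i = 0; i < in_len; i++) buf[i] = in[i];
    return consumer(buf, in_len, true, context);   /* final chunk */
}
```

A caller would write something like b = latin1_to_wchar(in, in_len, chunk, chunk_size, wchar_to_utf8, b); and must keep using the returned pointer, since the one it passed in may have been freed by realloc.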

The main thing I'm unsure about in this scheme is whether we want to keep
the current approach where the filter is responsible for calling the
consumer. To keep with the general spirit of your approach, a possible
alternative would be something like this:

// Advances in/in_size.
// Returns number of characters written to out.
static size_t mb_utf16_to_wchar(unsigned char **in, size_t *in_size,
uint32_t *out, size_t out_len);

Then the caller would be responsible for doing something with the output,
e.g. for encoding validation it would go something like this:

uint32_t out[16];
size_t out_len;
// Pass the capacity of out in elements, not bytes.
while ((out_len = to_wchar(&in, &in_size, out, sizeof(out) / sizeof(out[0])))) {
    for (size_t i = 0; i < out_len; i++) {
        if (out[i] & MBFL_WCSGROUP_THROUGH) return false;
    }
}
return true;

I think this may make for nicer implementations because we don't need to
deal with callback functions, and I would expect it to be more efficient as
well, as we save on the virtual dispatch. This would end up being pretty
similar to the iconv interface.

What I originally had in mind is to just return a single codepoint on each
to_wchar call (but consuming potentially multiple input code units), which
would make for the simplest implementations. Of course, processing multiple
at a time is more efficient.
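
For illustration, a self-contained version of this caller-driven interface might look like the sketch below. It is an assumption-laden example, not the eventual mbstring code: it decodes UTF-16LE, and it marks undecodable input with a hypothetical sentinel WCHAR_ERROR instead of the MBFL_WCSGROUP_THROUGH flag used in the snippet above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WCHAR_ERROR 0xFFFFFFFFu   /* hypothetical marker for undecodable input */

/* Advances *in / *in_size. Returns the number of wchars written to out
 * (at most out_len, which must be at least 1); 0 means the input is exhausted. */
static size_t utf16le_to_wchar(const unsigned char **in, size_t *in_size,
                               uint32_t *out, size_t out_len)
{
    const unsigned char *p = *in;
    size_t left = *in_size, n = 0;

    while (left > 0 && n < out_len) {
        if (left < 2) {   /* dangling odd byte at the end */
            out[n++] = WCHAR_ERROR;
            left = 0;
            break;
        }
        uint32_t u = p[0] | ((uint32_t)p[1] << 8);
        p += 2;
        left -= 2;

        if (u >= 0xD800 && u <= 0xDBFF && left >= 2) {
            /* High surrogate: combine with a following low surrogate. */
            uint32_t lo = p[0] | ((uint32_t)p[1] << 8);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                p += 2;
                left -= 2;
                out[n++] = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
                continue;
            }
        }
        /* Any surrogate that reaches this point is unpaired. */
        out[n++] = (u >= 0xD800 && u <= 0xDFFF) ? WCHAR_ERROR : u;
    }

    *in = p;
    *in_size = left;
    return n;
}

/* Encoding validation as a plain loop in the caller, with no callbacks. */
static bool utf16le_is_valid(const unsigned char *in, size_t in_size)
{
    uint32_t out[16];
    size_t n;

    while ((n = utf16le_to_wchar(&in, &in_size, out, sizeof(out) / sizeof(out[0]))) > 0) {
        for (size_t i = 0; i < n; i++) {
            if (out[i] == WCHAR_ERROR) return false;
        }
    }
    return true;
}
```

The 16-element buffer bounds memory use regardless of input size, and validation still stops at the first chunk that contains an error.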

> Why I am not upstreaming this right now:
>
> I tried running gcov on mbstring and discovered that our test suite
> coverage is very bad. (And after all the work I have done adding more
> tests! Sigh.) I am just trying to get the test coverage close to 100%,
> before doing any major surgery.

Yeah, mbstring test coverage leaves something to be desired :) Looking
forward to improvements in this area!

Regards,
Nikita

> By the way, I would welcome special-casing commonly used text encodings in
> mb_check_encoding. AFTER we have close to 100% test coverage, that is.
>
> Thanks,
> Alex

On Mon, Aug 16, 2021 at 10:11 AM Nikita Popov nikita.ppv@gmail.com
wrote:

> I think a key problem with the mbstring implementation is that input
> (encoding to wchar) filters work by handling one byte at a time. This means
> that state has to be managed internally by the filter, and we need to use a
> filter-chain interface.
>
> What would be better is an interface along the lines of int decode(char
> **input, size_t *input_len), where the filter returns the decoded
> character, while advancing the input/input_len pointers. Possibly with an
> indication that the input is incomplete and more characters are necessary
> to allow streaming use.
>
> This would allow the filter to handle one Unicode character at a time
> (regardless of how many bytes it is encoded as), and would allow the
> calling code to use a simple while loop rather than a filter chain.
>
> Of course, this would require rewriting all our filter code...
>
> Regards,
> Nikita