Newsgroups: php.internals Path: news.php.net Xref: news.php.net php.internals:115799 Return-Path: Delivered-To: mailing list internals@lists.php.net Received: (qmail 47950 invoked from network); 25 Aug 2021 08:03:51 -0000 Received: from unknown (HELO php-smtp4.php.net) (45.112.84.5) by pb1.pair.com with SMTP; 25 Aug 2021 08:03:51 -0000 Received: from php-smtp4.php.net (localhost [127.0.0.1]) by php-smtp4.php.net (Postfix) with ESMTP id CF4571804DB for ; Wed, 25 Aug 2021 01:37:59 -0700 (PDT) X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on php-smtp4.php.net X-Spam-Level: X-Spam-Status: No, score=-0.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,HTML_MESSAGE, PDS_OTHER_BAD_TLD,RCVD_IN_DNSWL_NONE,RCVD_IN_MSPIKE_H2,SPF_HELO_NONE, SPF_PASS autolearn=no autolearn_force=no version=3.4.2 X-Spam-ASN: AS15169 209.85.128.0/17 X-Spam-Virus: No X-Envelope-From: Received: from mail-lf1-f50.google.com (mail-lf1-f50.google.com [209.85.167.50]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange ECDHE (P-256) server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by php-smtp4.php.net (Postfix) with ESMTPS for ; Wed, 25 Aug 2021 01:37:59 -0700 (PDT) Received: by mail-lf1-f50.google.com with SMTP id m28so4983242lfj.6 for ; Wed, 25 Aug 2021 01:37:59 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=eI+FPqE0jW35AGuv3Ms1JW7nunRbrF8ucbha4HY1yA8=; b=FjrbZZJEHsIGDA7wQ9Ypyvprgd7hEAWKPfGLuJzC5BwF7itxuBXCOkgSL/wnVao7D4 v4JqDLgAFBHHcGsuCyKjdFmmfU8saU9Fbxe0BcS3w4wxwQDnStJXwrhMmyGgch1FX658 xfUw1UD0l+NfGCP4F4tCWO5FzJRWBJEW8lbEt0LmOhiKw98z2422XfWAcXbRux2JEVEZ 3vOEN4w5j52C+O/BrXn0SKF0SKggAiRs8N8pE2NrRpHqF0aFH700U8B0qj3KLgR3Ec1i nNlNCybxJUvw2xSPUZShb3sRb/J/5001LxscUzFNyX9grjeGN7w3dy1Bf9CA6cE71YN6 w+rQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=eI+FPqE0jW35AGuv3Ms1JW7nunRbrF8ucbha4HY1yA8=; b=lGPR1b13TkoK9vzjW8r98dYJ1RSbN139iq736AtPXRTdyz+/aRuoGYOclDSEVU4nKu 3ZdtuhVG4/+kPIGZ4odVk83bouYXyYqqu13uQVA5rxpg4z2MBCOgePQEFRo6FltDf7UR crksVBn1MGJxOC/xkIpljwQ3xo5gpIlZZDpAMP/cSyYrK80s+fej53j/7HrJoj0M5Pg1 IJJLKRTaL7WY2ia6x3AwPRmS55JuPxo7xRXqfW4iTJQchNAu9IBciyR6kGIajD4y+pzo 8LkzQlhfOHeWIqVFWPaQy67UODh4EAsIVhlrLJbY8oVqd6WYBlMHs6gbunWQqs3XV6o+ xQRQ== X-Gm-Message-State: AOAM530MI4axLkH13nA8vhPFuOTiiTAS/HO3/Ea0f+QVP4Hg+8/GZIh6 twfG5po3sg6x96gEUmeyggvNyCJ3xaVggk3v6JY= X-Google-Smtp-Source: ABdhPJwwprcT+1RFrYTQt1d3RNbFyKJ5cg4IWjFiAAmfrLFBWbKDEOEcyB3EDzbIycrDUAEyL9DZ9WTE5WRwaeTdzVg= X-Received: by 2002:a05:6512:21a:: with SMTP id a26mr1099380lfo.223.1629880676210; Wed, 25 Aug 2021 01:37:56 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: Date: Wed, 25 Aug 2021 10:37:40 +0200 Message-ID: To: Alex Cc: Rowan Tommins , PHP internals Content-Type: multipart/alternative; boundary="000000000000bb5d0b05ca5e2a14" Subject: Re: [PHP-DEV] mb_check_encoding slow performance? From: nikita.ppv@gmail.com (Nikita Popov) --000000000000bb5d0b05ca5e2a14 Content-Type: text/plain; charset="UTF-8" On Mon, Aug 16, 2021 at 10:56 AM Alex wrote: > Dear Nikita, > > It looks like we think alike. > > I already have a local dev branch on my PC with the beginnings of the > internal interface change which you describe here. In some cases which I > tested, it makes mbstring encoding conversion operations about 3 times > faster. However, the specific function signatures which I am using are > different from what you showed. Here is an example: > > static void* mb_utf16_to_wchar(zend_string *str, uint32_t *buf, size_t > bufsize, mb_wchar_consumer consumer, void *context); > static void* mb_wchar_to_utf16le(uint32_t *input, size_t len, bool end, > void *context); > > Your comments on the interfaces would be appreciated. Here are some points: > > - The "legacy to wchar" functions take a zend_string as input, since > that is what mbstring receives from user code anyways. I don't know if > there are some places where we would like to use these functions > internally, but don't have a zend_string readily available. If so, it could > take a pointer to char/length pair instead. > - We do *not* want to force any part of mbstring to have to convert an > entire input string to wchars before processing them, since this could > cause memory usage spikes (when someone passes in a huge input string). > Hence, the "legacy to wchar" functions work in "chunks", repeatedly filling > a uint32_t buffer with wchars and passing it through to whatever needs to > process the wchars. The uint32_t buffer is reused each time. Working in > chunks avoids huge memory allocations, but also amortizes the overheads > which we are currently paying for processing legacy text one character at a > time. > - There is really no reason why the "legacy to wchar" functions need > to take the uint32_t buffer from their caller. It should be fine for each > such function to stack-allocate its own buffer of whatever size it wants. > But when I was benchmarking, it seemed to be faster if the caller > stack-allocates the buffer and passes it in. (Why???) > - Because there are a variety of places where we need to convert > legacy text to wchars, the "legacy to wchar" functions take a generic > function pointer to a "wchar consumer", so they can be reused for various > purposes. As is usual in such C APIs, we need to pass through an opaque > void pointer. > - For encoding conversion, the void* will actually be something like > an "output buffer". It may need to be realloc'd partway through a > conversion operation. That is why all the functions return a void*; the > returned void* is usually the opaque pointer that was originally passed in, > but if that pointer was realloc'd, it is the new pointer. > - To get more perf, I am using a dirty hack... the dynamically-grown > output buffers have the same memory layout as a zend_string, and I set > things up such that when a conversion operation is done, the output buffer > can be cast to zend_string* and returned. This avoids an extra memory copy. > > The main thing I'm unsure about in this scheme is whether we want to keep the current approach where the filter is responsible for calling the consumer. To keep with the general spirit of your approach, a possible alternative would be something like this: // Advances in/in_size. // Returns number of characters written to out. static size_t mb_utf16_to_wchar(unsigned char **in, size_t *in_size, uint32_t *out, size_t out_len); Then the caller would be responsible for doing something with the output, e.g. for encoding validation it would go something like this: uint32_t out[16]; while ((out_len = to_wchar(&in, &in_size, out, sizeof(out)))) { for (size_t i = 0; i < out_len; i++) { if (out[i] & MBFL_WCSGROUP_THROUGH) return false; } } return true; I think this may make for nicer implementations because we don't need to deal with callback functions, and I would expect it to be more efficient as well, as we save on the virtual dispatch. This would end up being pretty similar to the iconv interface. What I originally had in mind is to just return a single codepoint on each to_wchar call (but consuming potentially multiple input code units), which would make for the simplest implementations. Of course, processing multiple at a time is more efficient. Why I am not upstreaming this right now: > > I tried running gcov on mbstring and discovered that our test suite > coverage is very bad. (And after all the work I have done adding more > tests! Sigh.) I am just trying to get the test coverage close to 100%, > before doing any major surgery. > Yeah, mbstring test coverage leaves something to be desired :) Looking forward to improvements in this area! Regards, Nikita By the way, I would welcome special-casing commonly used text encodings in > mb_check_encoding. *AFTER* we have close to 100% test coverage, that is. > > Thanks, > Alex > > On Mon, Aug 16, 2021 at 10:11 AM Nikita Popov > wrote: > >> On Mon, Aug 9, 2021 at 10:14 PM Rowan Tommins >> wrote: >> >>> On 07/08/2021 18:57, Hans Henrik Bergan wrote: >>> > can someone shed some light on this? why does mb_check_encoding seem >>> to be >>> > so much slower than the alternatives? >>> > benchmark code+results is here >>> https://stackoverflow.com/a/68690757/1067003 >>> >>> >>> Hi Hans, >>> >>> Since you ran the test on PHP 7.4, the relevant implementation is here: >>> >>> https://heap.space/xref/PHP-7.4/ext/mbstring/mbstring.c?r=0cafd53d#php_mb_check_encoding_impl >>> >>> As you can maybe see, it takes a rather "brute force" approach: it runs >>> the entire string through a conversion routine, and then checks (among >>> other things) that the output is identical to the input. That makes it >>> scale horribly with string length, with no optimization for returning >>> false early. >>> >>> The good news is that Alex Dowad has been doing a lot of work to improve >>> ext/mbstring recently, and landed a completely new implementation for >>> mb_check_encoding a few months ago: >>> https://github.com/php/php-src/commit/be1a2155 although it was then >>> changed slightly by later cleanup: >>> https://github.com/php/php-src/commit/3e7acf90 >>> >>> That was too late for PHP 8.0, so I compiled an up to date git checkout, >>> and ran your benchmark (with 100_000 iterations instead of 1_000_000; I >>> guess my PC's a lot slower than yours!) >>> >>> PHP 7.4: >>> mbstring: 57000 / 57100 / 56200 >>> PCRE: 1500 / 1200 / 12400 >>> >>> PHP 8.1 beta: >>> mbstring: 35600 / 1200 / 36700 >>> PCRE: 1400 / 1200 / 12100 >>> >>> So, mbstring now detects a failure at the start of the string as quickly >>> as PCRE does, because the new algorithm has an early return, but is >>> still slower than PCRE when it has to check the whole string. >>> >>> Looking at the PCRE source, I think the relevant code is this: >>> https://vcs.pcre.org/pcre2/code/trunk/src/pcre2_valid_utf.c?view=markup >>> >>> It has the advantage of only handling a handful of encodings, and only >>> needing to do a few operations on them. The main problem ext/mbstring >>> has is that it supports a lot of operations, on a lot of different >>> encodings, so it's still reusing a general purpose "convert and filter" >>> algorithm. >>> >> >> I think a key problem with the mbstring implementation is that input >> (encoding to wchar) filters work by handling one byte at a time. This means >> that state has to be managed internally by the filter, and we need to use a >> filter-chain interface. >> >> What would be better is an interface along the lines of int decode(char >> **input, size_t *input_len), where the filter returns the decoded >> character, while advancing the input/input_len pointers. Possibly with an >> indication that the input is incomplete and more characters are necessary >> to allow streaming use. >> >> This would allow the filter to handle one unicode character at a time >> (regardless of how many bytes it is encoded as), and would allow to use the >> calling code to use a simple while loop rather than a filter chain. >> >> Of course, this would require rewriting all our filter code... >> >> Regards, >> Nikita >> > --000000000000bb5d0b05ca5e2a14--