Newsgroups: php.internals
Xref: news.php.net php.internals:115755
Date: Mon, 16 Aug 2021 10:11:12 +0200
From: nikita.ppv@gmail.com (Nikita Popov)
To: Rowan Tommins, Alex
Cc: PHP internals
Subject: Re: [PHP-DEV] mb_check_encoding slow performance?

On Mon, Aug 9, 2021 at 10:14 PM Rowan Tommins wrote:

> On 07/08/2021 18:57, Hans Henrik Bergan wrote:
> > can someone shed some light on this? why does mb_check_encoding seem to be
> > so much slower than the alternatives?
> > benchmark code+results is here https://stackoverflow.com/a/68690757/1067003
>
> Hi Hans,
>
> Since you ran the test on PHP 7.4, the relevant implementation is here:
> https://heap.space/xref/PHP-7.4/ext/mbstring/mbstring.c?r=0cafd53d#php_mb_check_encoding_impl
>
> As you can maybe see, it takes a rather "brute force" approach: it runs
> the entire string through a conversion routine, and then checks (among
> other things) that the output is identical to the input.
> That makes it scale horribly with string length, with no optimization for
> returning false early.
>
> The good news is that Alex Dowad has been doing a lot of work to improve
> ext/mbstring recently, and landed a completely new implementation for
> mb_check_encoding a few months ago:
> https://github.com/php/php-src/commit/be1a2155 although it was then
> changed slightly by later cleanup:
> https://github.com/php/php-src/commit/3e7acf90
>
> That was too late for PHP 8.0, so I compiled an up to date git checkout,
> and ran your benchmark (with 100_000 iterations instead of 1_000_000; I
> guess my PC's a lot slower than yours!)
>
> PHP 7.4:
> mbstring: 57000 / 57100 / 56200
> PCRE: 1500 / 1200 / 12400
>
> PHP 8.1 beta:
> mbstring: 35600 / 1200 / 36700
> PCRE: 1400 / 1200 / 12100
>
> So, mbstring now detects a failure at the start of the string as quickly
> as PCRE does, because the new algorithm has an early return, but is
> still slower than PCRE when it has to check the whole string.
>
> Looking at the PCRE source, I think the relevant code is this:
> https://vcs.pcre.org/pcre2/code/trunk/src/pcre2_valid_utf.c?view=markup
>
> It has the advantage of only handling a handful of encodings, and only
> needing to do a few operations on them. The main problem ext/mbstring
> has is that it supports a lot of operations, on a lot of different
> encodings, so it's still reusing a general purpose "convert and filter"
> algorithm.

I think a key problem with the mbstring implementation is that input (encoding to wchar) filters work by handling one byte at a time. This means that state has to be managed internally by the filter, and we need to use a filter-chain interface. What would be better is an interface along the lines of int decode(char **input, size_t *input_len), where the filter returns the decoded character, while advancing the input/input_len pointers.
Possibly it would also indicate that the input is incomplete and more input is necessary, to allow streaming use. This would allow the filter to handle one Unicode character at a time (regardless of how many bytes it is encoded as), and would let the calling code use a simple while loop rather than a filter chain. Of course, this would require rewriting all our filter code...

Regards,
Nikita
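
To make the proposed decode() interface concrete, here is a minimal sketch for UTF-8 only. It is not mbstring code: the names utf8_decode, utf8_is_valid, DECODE_INVALID and DECODE_INCOMPLETE are hypothetical, and it uses const unsigned char ** and int32_t rather than the char ** and int from the signature above, purely to keep the byte arithmetic free of sign issues. The point is only to show the decoder consuming a whole character per call and the caller reducing to a plain while loop with an early return.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes for this sketch; not part of ext/mbstring. */
#define DECODE_INVALID    (-1) /* malformed sequence */
#define DECODE_INCOMPLETE (-2) /* input ends mid-character; more bytes needed */

/*
 * Decode one UTF-8 character from *input.
 * On success, returns the code point and advances *input / *input_len past it.
 * On DECODE_INVALID, one byte is consumed so the caller can resynchronize.
 * On DECODE_INCOMPLETE, nothing is consumed (a streaming caller would append
 * more bytes and call again).
 */
static int32_t utf8_decode(const unsigned char **input, size_t *input_len)
{
    const unsigned char *p = *input;
    size_t len = *input_len;
    size_t trail, i;
    uint32_t cp, min;

    if (len == 0) {
        return DECODE_INCOMPLETE;
    }

    /* Classify the lead byte: number of trailing bytes, payload bits,
     * and the smallest code point this length may encode (rejects overlongs). */
    if (p[0] < 0x80) {
        trail = 0; cp = p[0]; min = 0;
    } else if (p[0] >= 0xC2 && p[0] <= 0xDF) {
        trail = 1; cp = p[0] & 0x1F; min = 0x80;
    } else if ((p[0] & 0xF0) == 0xE0) {
        trail = 2; cp = p[0] & 0x0F; min = 0x800;
    } else if (p[0] >= 0xF0 && p[0] <= 0xF4) {
        trail = 3; cp = p[0] & 0x07; min = 0x10000;
    } else {
        goto invalid; /* stray continuation byte, 0xC0/0xC1, or 0xF5..0xFF */
    }

    for (i = 1; i <= trail; i++) {
        if (i >= len) {
            return DECODE_INCOMPLETE;
        }
        if ((p[i] & 0xC0) != 0x80) {
            goto invalid; /* not a continuation byte */
        }
        cp = (cp << 6) | (p[i] & 0x3F);
    }

    /* Reject overlong forms, values above U+10FFFF, and surrogates. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        goto invalid;
    }

    *input = p + trail + 1;
    *input_len = len - (trail + 1);
    return (int32_t)cp;

invalid:
    *input = p + 1;
    *input_len = len - 1;
    return DECODE_INVALID;
}

/* The calling code is a simple loop; it returns false at the first bad
 * character. For a one-shot check, incomplete trailing input counts as invalid. */
static bool utf8_is_valid(const unsigned char *s, size_t len)
{
    while (len > 0) {
        if (utf8_decode(&s, &len) < 0) {
            return false;
        }
    }
    return true;
}
```

With this shape, all per-character state lives in local variables of a single call, so no filter object has to remember a partially decoded sequence between bytes. Whether this would actually outperform the existing filters has not been measured here.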