Newsgroups: php.internals
To: internals@lists.php.net
Date: Mon, 9 Aug 2021 21:14:17 +0100
Subject: Re: [PHP-DEV] mb_check_encoding slow performance?
From: rowan.collins@gmail.com (Rowan Tommins)

On 07/08/2021 18:57, Hans Henrik Bergan wrote:
> can someone shed some light on this? why does mb_check_encoding seem to be
> so much slower than the alternatives?
> benchmark code+results is here https://stackoverflow.com/a/68690757/1067003

Hi Hans,

Since you ran the test on PHP 7.4, the relevant implementation is here:
https://heap.space/xref/PHP-7.4/ext/mbstring/mbstring.c?r=0cafd53d#php_mb_check_encoding_impl

As you can maybe see, it takes a rather "brute force" approach: it runs the entire string through a conversion routine, and then checks (among other things) that the output is identical to the input. That makes it scale horribly with string length, with no optimization for returning false early.

The good news is that Alex Dowad has been doing a lot of work to improve ext/mbstring recently, and landed a completely new implementation for mb_check_encoding a few months ago:
https://github.com/php/php-src/commit/be1a2155
although it was then changed slightly by later cleanup:
https://github.com/php/php-src/commit/3e7acf90

That was too late for PHP 8.0, so I compiled an up to date git checkout, and ran your benchmark (with 100_000 iterations instead of 1_000_000; I guess my PC's a lot slower than yours!)

PHP 7.4:
  mbstring: 57000 / 57100 / 56200
  PCRE:      1500 /  1200 / 12400

PHP 8.1 beta:
  mbstring: 35600 /  1200 / 36700
  PCRE:      1400 /  1200 / 12100

So, mbstring now detects a failure at the start of the string as quickly as PCRE does, because the new algorithm has an early return, but is still slower than PCRE when it has to check the whole string.

Looking at the PCRE source, I think the relevant code is this:
https://vcs.pcre.org/pcre2/code/trunk/src/pcre2_valid_utf.c?view=markup

It has the advantage of only handling a handful of encodings, and only needing to do a few operations on them. The main problem ext/mbstring has is that it supports a lot of operations, on a lot of different encodings, so it's still reusing a general purpose "convert and filter" algorithm.

Regards,

--
Rowan Tommins
[IMSoP]
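
P.S. To make the difference between the two strategies concrete, here is a rough sketch in Python. It is not the actual mbstring or PCRE code, and both function names are made up for illustration. The first function mimics the "convert everything, then compare" idea; the second is a dedicated UTF-8 scan (using the well-formed byte ranges from the Unicode Standard) that can give up at the first bad byte.

```python
def check_by_round_trip(data: bytes) -> bool:
    """Hypothetical 'convert and compare' check: always processes the whole input."""
    # Invalid sequences become U+FFFD, so the re-encoded bytes differ from the input.
    converted = data.decode("utf-8", errors="replace").encode("utf-8")
    return converted == data


def check_with_early_return(data: bytes) -> bool:
    """Hypothetical dedicated UTF-8 scan: stops at the first ill-formed sequence."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                      # ASCII, single byte
            i += 1
            continue
        # Pick the number of continuation bytes and the allowed range of the
        # first continuation byte (rules out overlongs, surrogates, > U+10FFFF).
        if 0xC2 <= b <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF
        elif b == 0xE0:
            need, lo, hi = 2, 0xA0, 0xBF
        elif b == 0xED:
            need, lo, hi = 2, 0x80, 0x9F
        elif 0xE1 <= b <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif b == 0xF0:
            need, lo, hi = 3, 0x90, 0xBF
        elif 0xF1 <= b <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b == 0xF4:
            need, lo, hi = 3, 0x80, 0x8F
        else:
            return False                  # invalid lead byte: bail out immediately
        if i + need >= n:
            return False                  # sequence is truncated
        if not lo <= data[i + 1] <= hi:
            return False
        for k in range(2, need + 1):      # remaining continuation bytes
            if not 0x80 <= data[i + k] <= 0xBF:
                return False
        i += need + 1
    return True
```

With a bad byte at the very start of a long string, the second function returns after inspecting one byte, while the first still converts and compares the whole thing. On fully valid input both have to walk the entire string, which is the case where the cost of the general-purpose machinery matters.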