Newsgroups: php.internals
To: PHP internals
Message-ID: <09dd1b84-ed33-a059-82f9-5efd179e69d6@gmx.de>
Date: Tue, 3 Mar 2020 23:16:54 +0100
Subject: iconv vs. mbstring
From: cmbecker69@gmx.de ("Christoph M. Becker")

Hi all,

we still have two bundled extensions for working with strings in different encodings: ext/mbstring and ext/iconv. While working on bug #79200[1], I've noticed that the implementation of many of the iconv_*() functions is rather suboptimal.
This is mostly because iconv() is meant only for character encoding *conversion*. ext/iconv layers several other useful string functions on top of that, but cannot really optimize them, because the extension doesn't know anything about the character encodings themselves.

For instance, iconv_strlen() is basically implemented by converting the input string to UCS-4 and then simply counting the UCS-4 characters. mb_strlen(), on the other hand, makes use of length tables (where appropriate), and as such does not even need to convert the string in many typical cases. Some quick benchmarks on getting the length of UTF-8 strings show that mb_strlen() is roughly 10 times faster than iconv_strlen().

Now, it would be trivial to improve the iconv_strlen() implementation by converting a larger number of characters in one go (instead of at most two, as is currently the case[2]), which would make the function much faster (roughly 3 to 4 times for a 1024 character buffer), but mb_strlen() would obviously still beat that. The situation for the other iconv_*() functions is more or less similar.

However, it seems that iconv() can be much faster than mb_convert_encoding(); quick benchmarks show a factor of 2 to 3.

So I wonder whether we would be better off unbundling ext/iconv, while moving the iconv() function (and possibly the convert.iconv.* stream filter) into ext/standard. It shouldn't be hard to update code which uses any of the iconv_*() functions to use the respective mb_*() functions, and users who can't do this, or don't want to for whatever reason, could still use the iconv package available from PECL. Users who switch to mbstring, however, would likely get better performance for their applications. For core developers, not having to maintain both extensions would obviously save time.
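To make the iconv_strlen() vs. mb_strlen() difference described above a bit more concrete, here is a minimal C sketch of the two counting strategies for UTF-8. It is not the actual ext/iconv or ext/mbstring code: the names count_by_decoding(), count_by_table() and utf8_len_by_high_nibble are made up for illustration, and both functions assume well-formed UTF-8 input.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical length table: byte length of a UTF-8 sequence, indexed by
 * the high nibble of its lead byte. Assumes well-formed input; stray
 * continuation bytes (0x80..0xBF) are simply counted as length 1. */
static const unsigned char utf8_len_by_high_nibble[16] = {
    1, 1, 1, 1, 1, 1, 1, 1,   /* 0x00..0x7F: ASCII */
    1, 1, 1, 1,               /* 0x80..0xBF: continuation bytes */
    2, 2,                     /* 0xC0..0xDF: two-byte sequences */
    3,                        /* 0xE0..0xEF: three-byte sequences */
    4                         /* 0xF0..0xFF: four-byte sequences */
};

/* Table-based counting: only lead bytes are inspected, nothing is decoded. */
static size_t count_by_table(const char *s, size_t n)
{
    size_t count = 0, i = 0;
    while (i < n) {
        i += utf8_len_by_high_nibble[(unsigned char)s[i] >> 4];
        count++;
    }
    return count;
}

/* Conversion-based counting: every character is decoded to a UCS-4 value
 * first, and the decoded values are then counted. */
static size_t count_by_decoding(const char *s, size_t n)
{
    size_t count = 0, i = 0;
    while (i < n) {
        unsigned char lead = (unsigned char)s[i];
        uint32_t cp;
        size_t len;

        if (lead < 0xC0)      { cp = lead;        len = 1; }
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }
        else                  { cp = lead & 0x07; len = 4; }

        /* Fold in the payload bits of the continuation bytes. */
        for (size_t k = 1; k < len && i + k < n; k++)
            cp = (cp << 6) | ((unsigned char)s[i + k] & 0x3F);

        (void)cp;   /* a real converter would write this to an output buffer */
        i += len;
        count++;
    }
    return count;
}
```

For the 15-byte UTF-8 encoding of "Grüße, 世界", both functions return 9. The table variant touches one byte per character and does no bit manipulation, which is roughly the kind of shortcut a conversion-only library cannot take.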
For users learning PHP, and also for new code, it would be beneficial not to have to decide which of these extensions to use: for character encoding conversion, iconv() would be preferable; for more general string functionality, ext/mbstring.

Thoughts?

[1]
[2]

--
Christoph M. Becker