Newsgroups: php.internals
Xref: news.php.net php.internals:108898
To: internals@lists.php.net
References: <09dd1b84-ed33-a059-82f9-5efd179e69d6@gmx.de>
Message-ID: <3952f4f7-a782-b392-50f7-27c2ef05fbb2@gmail.com>
Date: Sun, 8 Mar 2020 15:42:35 +0000
Subject: Re: [PHP-DEV] iconv vs. mbstring
From: rowan.collins@gmail.com (Rowan Tommins)

On 08/03/2020 14:08, Dan Ackroyd wrote:
> Related to this discussion, please could someone remind me why the
> mbstring extension is an extension and not part of core PHP?
>
> I realise at the time it was introduced, UTF-8 was far less widely
> used: https://en.wikipedia.org/wiki/UTF-8#/media/File:Utf8webgrowth.svg
>
> But now UTF-8 is pretty much the default for the vast majority of
> projects, so does that decision to keep it as an optional extension
> still hold up?

From what I can make out, mbstring was not actually built for Unicode
string-handling, but for what we would now consider "legacy encodings".
Its original niche seems to have been support for various Japanese text
encodings, and UTF-8 support was added relatively late.

That has some implications for its design:

- every function takes encoding as a parameter, and defaults to a
  run-time global setting
- on the other hand, there is no support for locales in functions which
  would benefit, e.g. mb_convert_case, mb_stripos
- Unicode is treated as just another character encoding, so there is no
  support for concepts like normalisation, graphemes, character
  properties, etc
- instead, there are lots of niche functions for CJK languages like
  mb_convert_kana and mb_strwidth

It also includes some things which probably wouldn't pass review if
proposed today:

- a lot of global state, with combined get-or-set functions like
  mb_detect_order(), mb_substitute_character(), etc
- mb_send_mail seems oddly specific, and has its own concept of
  "language" not shared by anything else
- there's an entire regex implementation, with its own API and some
  compatibility with the removed ereg_* functions; I believe the preg_*
  functions included in core already support UTF-8

For handling of Unicode, ext/intl is generally superior, with a more
structured API based on Unicode-specific concepts, rather than
attempting to map them to concepts used in older character encodings.
There may be a need for a more user-friendly subset of this (a "UString"
class is a common suggestion), but it shouldn't look like ext/mbstring,
IMHO.
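
As a rough illustration of the points above, here is a minimal sketch
(not taken from either extension's documentation). It assumes PHP 7.0+
with both ext/mbstring and ext/intl loaded; the variable names and sample
strings are purely illustrative.

```php
<?php
// Sketch only: assumes PHP 7.0+ with ext/mbstring and ext/intl loaded.

$composed   = "\u{00E9}";   // U+00E9, precomposed "é" (2 bytes in UTF-8)
$decomposed = "e\u{0301}";  // "e" + U+0301 combining acute: same visible character

// 1. Global state: a combined get-or-set function changes what later calls do.
$previous = mb_internal_encoding();        // no argument: acts as a getter
mb_internal_encoding('ISO-8859-1');        // with argument: acts as a setter
$lenLatin1 = mb_strlen($composed);         // 2: each byte is counted as a character
mb_internal_encoding($previous);           // restore the global setting
$lenUtf8 = mb_strlen($composed, 'UTF-8');  // 1: explicit encoding parameter

// 2. mbstring counts code points; graphemes and normalisation come from ext/intl.
$codePoints   = mb_strlen($decomposed, 'UTF-8');  // 2
$graphemes    = grapheme_strlen($decomposed);     // 1
$sameAfterNfc = Normalizer::normalize($decomposed, Normalizer::FORM_C) === $composed; // true

// 3. preg_* in UTF-8 mode via the /u modifier, with Unicode character properties.
$isAllLetters = preg_match('/^\p{L}+$/u', "na\u{00EF}ve"); // 1

var_dump($lenLatin1, $lenUtf8, $codePoints, $graphemes, $sameAfterNfc, $isAllLetters);
```

The first part shows why the run-time global default is awkward: the same
mb_strlen() call gives a different answer depending on an earlier, possibly
distant, call to mb_internal_encoding().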

I believe both extensions require fairly large external libraries, which
probably justifies them being optional. From what I've read, ICU, which
ext/intl is built on, would have been bundled with PHP 6, but its size
and performance contributed to the failure of that project.

Regards,

-- 
Rowan Tommins (né Collins)
[IMSoP]