Newsgroups: php.internals
From: imsop.php@rwec.co.uk ("Rowan Tommins [IMSoP]")
Date: Mon, 12 Aug 2024 21:25:53 +0100
Subject: Re: [PHP-DEV][Discussion] Should All String Functions Become Multi-Byte Safe?
To: internals@lists.php.net
In-Reply-To: <1AFE8300-D363-43D8-A989-15D001B9879C@newclarity.net>
References: <1AFE8300-D363-43D8-A989-15D001B9879C@newclarity.net>

On 12/08/2024 17:37, Mike Schinkel wrote:
> A really standout paragraph from that link is:
>
> "IMO, the whole situation is a shame. Unicode should be
> in the stdlib of every language by default. It’s the lingua
> franca of the internet! It’s not even new: we’ve been living
> with Unicode for 20 years now."

I actually think that paragraph rather ignores everything else the article has just explained. "Putting Unicode in the stdlib" is an incredibly difficult task, and it's not entirely clear what it should even mean.

In PHP, we have ext/intl, built around a library called ICU, now maintained by the Unicode Consortium. Unfortunately, ext/intl only exposes a small selection of ICU's functions, e.g. there's nothing for locale-based case folding of whole strings. The ext/intl documentation is also very patchy, and the actual ICU documentation isn't always much better.

The main reason ext/intl is not *mandatory* for all builds of PHP, just "bundled", is that the sheer complexity of Unicode means that the ICU library is rather large - somebody (Rasmus, I think?) joked that relying on it for PHP 6 would have made PHP a small library attached to the side of ICU.

We also have the "mbstring" extension, which was *not* designed around Unicode, but was originally built for various encodings popular in Japan 20+ years ago. It doesn't have the databases of code point information that ICU does, so it can't answer questions like "what script does this code point belong to?" or "what is the uppercase equivalent of this grapheme, assuming a Turkish locale?"

-- 
Rowan Tommins
[IMSoP]