Newsgroups: php.internals
From: imsop.php@rwec.co.uk ("Rowan Tommins [IMSoP]")
Date: Sat, 17 Aug 2024 00:37:43 +0100
To: internals@lists.php.net
Subject: Re: [PHP-DEV][Discussion] Should All String Functions Become Multi-Byte Safe?
In-Reply-To: <8360937cc7ca31bf3bd0f8e3050c53cb32663428.camel@ageofdream.com>
References: <1AFE8300-D363-43D8-A989-15D001B9879C@newclarity.net> <270D6057-626D-4720-B44A-3CB7A7B9320B@newclarity.net> <8360937cc7ca31bf3bd0f8e3050c53cb32663428.camel@ageofdream.com>
Message-ID: <36BC79B6-718D-4A01-B23C-8F0C652ED1C2@rwec.co.uk>

On 16 August 2024 23:11:18 BST, Nick Lockheart wrote:
>I used the rather broad title "Should All String Functions Become
>Multi-Byte Safe" because there are many smaller related topics, but my
>intention was to discuss multi-byte in general

I think it was probably not the best choice, because it seems like what
you're specifically interested in is mostly not about existing
functions, and not particularly about encodings being more than one
byte wide.

For instance, even good old 7-bit-per-character ASCII contains control
characters you might want help sanitising out; and plenty of
8-bit-per-character encodings include more than one script, even more
than one writing direction (e.g. ISO 8859-8 Latin/Hebrew).

But, the specific topic of safe input handling is definitely an interesting
one. And focussing on Unicode, rather than every possible encoding
(multibyte or not) makes sense in modern usage.

>There's a lot of potential pitfalls for dealing with Unicode input, and
>there are some best practices per the Unicode Consortium

It's worth looking into whether the ICU library has explicit functions
to help with those recommendations (if you can navigate its slightly
patchy documentation). Since most of ext/intl is just a thin wrapper on
that library, that could make our lives a lot easier.

>For example, there should be a function that removes unassigned code
>points.
>
>There should also be a function that removes "scripts" (as defined by
>Unicode).
>
>We should have an easy way to remove private use code points (unless
>you're running a Star Trek fan site and really do need Klingon).

These all seem like good ideas. I think you can do at least some of it
with regular expressions, but dedicated functions have potential to be
both easier to use and more efficient.

>And the default replacement character for `mb_scrub` shouldn't be `?`.

This is trickier, and where mixing the terms "multibyte" and "Unicode"
actually matters. The mbstring extension supports a number of different
text encodings, most of which don't have a dedicated replacement
character to use. It also has the ability to set the default in global
state with mb_substitute_character() so it's not immediately obvious
how a different default could be applied based on the specified
encoding. (I'm not a fan of that API design, but it's what we've got!)
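
As a rough illustration of the two points above, here is an untested
sketch of what is possible in userland today. It assumes valid UTF-8
input and a PCRE build with Unicode property support. The function
names (strip_unassigned, strip_private_use, strip_script, scrub_utf8)
are made up for this example and are not existing or proposed PHP
functions.

```php
<?php
declare(strict_types=1);

// Hypothetical helpers: illustration only, not part of PHP or any RFC.
// With the /u modifier, preg_replace() returns null on invalid UTF-8.

function strip_unassigned(string $text): ?string
{
    // \p{Cn} is the "unassigned" general category in PCRE's Unicode tables.
    return preg_replace('/\p{Cn}+/u', '', $text);
}

function strip_private_use(string $text): ?string
{
    // \p{Co} is the "private use" general category.
    return preg_replace('/\p{Co}+/u', '', $text);
}

function strip_script(string $text, string $script): ?string
{
    // Only allow plain names such as "Cyrillic" or "Hebrew", because the
    // value is interpolated into the pattern.
    if (preg_match('/^[A-Za-z_]+$/', $script) !== 1) {
        throw new InvalidArgumentException('Unexpected script name');
    }
    return preg_replace('/\p{' . $script . '}+/u', '', $text);
}

function scrub_utf8(string $text): string
{
    // mb_substitute_character() is global state, so remember the current
    // setting and restore it afterwards.
    $previous = mb_substitute_character();
    mb_substitute_character(0xFFFD); // U+FFFD REPLACEMENT CHARACTER
    try {
        return mb_scrub($text, 'UTF-8');
    } finally {
        mb_substitute_character($previous);
    }
}
```

The save-and-restore dance in scrub_utf8() is exactly the awkwardness of
the global setting: the replacement character only makes sense once you
know the target encoding is a Unicode one.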

>Each of these and other ideas could be part of an RFC, or we could
>brainstorm a Unicode built-in class that handles lots of the common use
>cases.

I don't think a single class that tries to "do Unicode" makes sense; it
would be like having a "maths class" that contains methods for anything
dealing with numbers.

In fact, I think the group of functions you're suggesting are a great
illustration of what I was saying in my last message to Rob: they make
perfect sense as standalone features, and don't need any grand plan to
"have Unicode in core" before we proceed with them.

Regards,
Rowan Tommins
[IMSoP]