Newsgroups: php.internals
Subject: Re: [PHP-DEV][Discussion] Should All String Functions Become Multi-Byte Safe?
From: lists@ageofdream.com (Nick Lockheart)
To: internals@lists.php.net
Date: Fri, 16 Aug 2024 18:11:18 -0400

I wanted to reply generally to this and not to any person in particular, as I'm the one who started the thread.

I used the rather broad title "Should All String Functions Become Multi-Byte Safe" because there are many smaller related topics. My intention was to discuss multi-byte handling in general, and to see whether there was consensus on action items, each of which could get its own RFC with a more limited scope.

My overall goal was to make PHP safer against multi-byte attacks by giving developers tools that could become best practices for handling user input strings, the same way we had mysql_real_escape_string, and then PDO prepared statements, for SQL.

There are a lot of potential pitfalls in dealing with Unicode input, and there are some best practices per the Unicode Consortium that I'm not sure how to implement in PHP. Since everyone needs them, they might be better as a shared library in core.

For example, there should be a function that removes unassigned code points. There should also be a function that removes characters belonging to given "scripts" (as defined by Unicode). We should have an easy way to remove private use code points (unless you're running a Star Trek fan site and really do need Klingon). And the default replacement character for `mb_scrub` shouldn't be `?`.

Each of these and other ideas could be part of an RFC, or we could brainstorm a built-in Unicode class that handles many of the common use cases. Having a team-built and audited Unicode class would benefit almost everyone using PHP.
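
To make the wish list above concrete, here is a rough userland sketch of what such helpers might look like today. The function names (`scrub_utf8`, `strip_property`, `strip_unassigned`, `strip_private_use`, `strip_script`) are hypothetical and do not exist in PHP; the sketch assumes the mbstring extension and a PCRE build with Unicode property support. Which code points count as "unassigned" depends on the Unicode version PCRE was built against, which is one reason an audited core implementation might be preferable.

```php
<?php
declare(strict_types=1);

// Hypothetical helpers: these names are illustrative and are not part of PHP.

/** Replace ill-formed UTF-8 with U+FFFD instead of the default "?". */
function scrub_utf8(string $input): string
{
    $previous = mb_substitute_character();   // remember the current setting
    mb_substitute_character(0xFFFD);         // U+FFFD REPLACEMENT CHARACTER
    try {
        return mb_scrub($input, 'UTF-8');
    } finally {
        mb_substitute_character($previous);  // restore the caller's setting
    }
}

/** Remove code points matching a PCRE Unicode property such as "Cn" or "Han". */
function strip_property(string $input, string $property): string
{
    // Only allow plain property names so nothing else reaches the pattern.
    if (preg_match('/^[A-Za-z_]+$/', $property) !== 1) {
        throw new InvalidArgumentException("Invalid property name: $property");
    }
    // Scrub first: the /u modifier fails on ill-formed UTF-8.
    $result = preg_replace('/\p{' . $property . '}+/u', '', scrub_utf8($input));
    if ($result === null) {
        throw new RuntimeException('preg_replace failed: ' . preg_last_error_msg());
    }
    return $result;
}

/** Remove unassigned code points (general category Cn). */
function strip_unassigned(string $input): string
{
    return strip_property($input, 'Cn');
}

/** Remove private use code points (general category Co). */
function strip_private_use(string $input): string
{
    return strip_property($input, 'Co');
}

/** Remove every character of one Unicode script, e.g. "Cyrillic". */
function strip_script(string $input, string $script): string
{
    return strip_property($input, $script);
}
```

A call such as `strip_script($name, 'Cyrillic')` would drop Cyrillic letters from `$name`, and `strip_private_use()` would drop the private use range where fan-made Klingon mappings live. Whether removal, rejection, or replacement is the right policy is a separate design question that an RFC would need to settle.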