Newsgroups: php.internals
Message-ID: <9bce8dd60b23486ec34caa0174d844096eed26af.camel@ageofdream.com>
Subject: Re: [PHP-DEV][Discussion] Should All String Functions Become Multi-Byte Safe?
From: lists@ageofdream.com (Nick Lockheart)
To: internals@lists.php.net
Date: Mon, 12 Aug 2024 13:15:40 -0400

> Currently, PHP strings are binary safe (thus can store any encoding).
> I generally think of PHP strings as being an array of bytes vs. a
> "string" you are familiar with in other languages. The name is
> unfortunate in that regard, but working with them is straightforward
> (imagine having an actual array of bytes in PHP and trying to work on
> them).

PHP was the first language I learned to program in, followed by JavaScript. At that point, and for many years thereafter, I never thought of strings as anything more than chunks of human-readable text.

It wasn't until I started to learn C++ that my understanding of strings changed. A string was no longer "text", but a sequence of bytes.

The problem is, when people start out learning a higher-level language like PHP, they don't start by understanding what the computer actually stores in memory, or how data structures are represented internally. They write:

    echo "hello world!";

...and give no thought to how the letters they typed become the letters that come out.

Character encoding doesn't cross your mind until a spammer tries to paste foreign characters into a contact form and your application crashes.

And when you try to learn more, the most you find is advice to slap a few "utf-8" stickers in some places like the HTTP response header and the charset meta tag, and to use some mb_* internal/output encoding functions.

I've been looking for the past few weeks now, and I've asked on some community groups as well, and I have been unable to find a good, comprehensive, security-minded guide for dealing with multi-byte characters and character attacks in PHP.

There's general guidance from the Unicode Consortium on what should be done, but no guides on how to implement their security recommendations in PHP. One report is:

https://www.unicode.org/reports/tr36

There are several things in their guide. They recommend that illegal byte sequences not be deleted, as this can create an attack vector where two bytes that fit together are split by an illegal sequence that, once removed, puts the two bytes back together to make something new, *after* the program has checked for dangerous characters:

https://www.unicode.org/reports/tr36/#SecureEncodingConversion

In PHP, you should be able to do that with:

    $ScrubbedBody = mb_scrub($_POST['body'], 'UTF-8');

But there's a pitfall here! By default, `mb_scrub` and several other PHP conversion functions replace illegal byte sequences with a `?` instead of `U+FFFD`, the designated replacement character.

A question mark is an important character with special meaning, and the default behavior of `mb_scrub` allows an attacker to put a `?` anywhere they want by inserting illegal bytes at that position.

To get the correct behavior, a developer must know to call:

    mb_substitute_character(0xFFFD);
    $ScrubbedBody = mb_scrub($_POST['body'], 'UTF-8');

There are also some Unicode Consortium recommendations on sets of characters that should be stripped from user input.

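To make the `mb_scrub` pitfall concrete, here is a minimal sketch that scrubs the same input under both substitute characters. The helper name `scrub_utf8` and the sample input are my own assumptions for illustration, not part of mbstring; the helper restores the previous global setting so other code is not affected.

```php
<?php
// Hypothetical helper: scrub invalid UTF-8 using a given substitute code
// point, then restore whatever substitute character was set before.
function scrub_utf8(string $input, int $substitute = 0xFFFD): string
{
    $previous = mb_substitute_character();   // current global setting
    mb_substitute_character($substitute);
    $clean = mb_scrub($input, 'UTF-8');
    mb_substitute_character($previous);      // put the old setting back
    return $clean;
}

// Hypothetical input: 0xFF can never appear in well-formed UTF-8.
$raw = "ab\xFFcd";

// 0x3F is "?", the usual default substitute character.
$withQuestionMark = scrub_utf8($raw, 0x3F);

// U+FFFD is the designated replacement character.
$withReplacement = scrub_utf8($raw);
```

With the `?` substitute, the attacker-supplied bad byte turns into a literal question mark in the middle of the string; with `0xFFFD` it turns into U+FFFD, which carries no special meaning in URLs or SQL.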
The report's general recommendations are at:

https://www.unicode.org/reports/tr36/#Recommendations_General

The report says:

"Private use characters must be avoided in identifiers, except in closed environments. There is no predicting what either the visual display or the programmatic interpretation will be on any given machine, so this can obviously lead to security problems."

They go on to say, "What is true for private use characters is doubly true of unassigned code points. Secure systems will not use them: any future Unicode Standard could assign those codepoints to any new character. This is especially important in the case of certification."

But how do we remove these private use characters and unassigned code points using PHP? You can use `mb_ereg` or `preg` with `/u` to remove character ranges, but this is clumsy at best.

The guide warns against trying to restrict characters by language, and recommends using a "writing system" (script) instead:

https://www.unicode.org/reports/tr36/#Language_Based_Security

"Creating 'safe character sets' is an important goal in a security context, and it would appear that the characters used in a language is an obvious choice. However, because of the indeterminate set of characters used for a language, it is typically more effective to move to the higher level, the script, which can be more easily specified and tested."

While I could probably hack together an array of regular expressions for identifying white-listed (language) scripts, this seems like something that should be built in as a single function.

In any application that reflects text back to other users, securely processing incoming Unicode is as important to stopping XSS attacks as PDO prepared statements are to stopping SQL injection.

As for the second recommendation, removing "unassigned code points", I have not even started to work out how to do this with PHP.

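One possible starting point, offered only as a sketch: PCRE's Unicode properties (available through `preg_*` with the `/u` modifier) include the general categories `Co` (private use) and `Cn` (unassigned), plus script names such as `Latin`, `Common` and `Inherited`. The two helper names below are hypothetical. What counts as "unassigned" depends on the Unicode tables bundled with the PCRE library PHP was built against, so results may differ between installations. Allowing `Common` and `Inherited` alongside the script is my own assumption, made so that punctuation, digits and combining marks pass.

```php
<?php
// Hypothetical helper: replace private-use (Co) and unassigned (Cn) code
// points rather than deleting them, in line with the TR36 advice quoted
// earlier about not silently removing data.
function replace_unsafe_code_points(string $text, string $replacement = "\u{FFFD}"): string
{
    // /u makes PCRE treat both the pattern and the subject as UTF-8.
    $result = preg_replace('/[\p{Co}\p{Cn}]/u', $replacement, $text);
    if ($result === null) {
        // preg_replace() returns null on error, e.g. malformed UTF-8 input.
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }
    return $result;
}

// Hypothetical helper: allow-list by script instead of by language.
// Accepts Latin letters plus script-neutral characters (Common, Inherited).
function is_latin_script_text(string $text): bool
{
    // preg_match() returns false for malformed UTF-8, so that is rejected too.
    return preg_match('/^[\p{Latin}\p{Common}\p{Inherited}]*$/u', $text) === 1;
}
```

A caller would scrub the bytes first, then call `replace_unsafe_code_points()`, then apply the script check to fields that need it. This does not address confusable characters within a single script, which TR36 treats separately.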
Since Unicode presents a security concern, I think it is important that function behavior with regard to Unicode be well documented, and also, that we have some functions that are easy to use to properly handle the complexities of Unicode security.