Newsgroups: php.internals
Message-ID: <479E80D8.5020206@lerdorf.com>
Date: Mon, 28 Jan 2008 17:26:48 -0800
From: rasmus@lerdorf.com (Rasmus Lerdorf)
To: Peter Brodersen
CC: Stanislav Malyshev, internals Mailing List
Subject: Re: [PHP-DEV] [PATCH] Bug #43896 htmlspecialchars returns empty string on invalid unicode sequence

Peter Brodersen wrote:
> On Fri, 25 Jan 2008 14:22:52 -0800, in php.internals stas@zend.com
> (Stanislav Malyshev) wrote:
>
>>> Should really theses functions discard the whole string for a single
>>> incomplete sequence ?
>> I think since it is not possible to recover true content of the string,
>> it is ok to return failure value. Cutting it in random places or
>> ignoring problems doesn't seem a good idea - it might lead to all kinds
>> of nasty things, such as security filtering checking one data and
>> database getting entirely different data.
>
> On the other hand utf8_decode() also expects the input to be UTF-8
> encoded, but it replaces incomplete sequences with the character "?".
>
> I don't know if it is a recommended standard for invalid input but I
> have seen this conversion as well in a couple of other applications,
> e.g. Firefox.

utf8_decode() doesn't replace invalid chars with a ?

e.g.

  php -r '$a="abcd".chr(0xE0);echo iconv("utf-8","utf-8",$a)."\n".utf8_decode($a);' | od -t x1
  0000000 61 62 63 64 0a 61 62 63 64 03

So, iconv() when told to take utf-8 as input and spit out utf-8 as
output strips out invalid utf-8 chars whereas utf8_decode() does who
knows what. 0xE0 gets converted to 0x03?

It would be a horrendously bad idea to replace invalid chars with some
other valid char. Way worse than returning nothing. Think about what
would happen in a regex, for example, if a user was able to inject a '?'
by sending an invalid utf-8 sequence that ends up in a regular expression.

If we are going to do anything here, it would be to strip the invalid
utf-8 bytes, but technically that's not a great solution from a security
perspective. The results could be quite unexpected. The most secure
approach is to fail on invalid input.
It's your job to validate input and feed the function the input it expects.

-Rasmus
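
For illustration, the "validate first, fail on invalid input" approach could look roughly like the sketch below. It is not from the patch or from PHP itself: `escape_utf8_or_fail()` is a hypothetical helper name, and the code assumes a PHP version with nullable return types (7.1 or later) and the bundled PCRE extension. The check relies on the `/u` pattern modifier, which makes `preg_match()` return false when the subject is not valid UTF-8.

```php
<?php
// Hypothetical helper (not part of PHP): escape a string only if it is
// valid UTF-8, and otherwise report failure to the caller.
function escape_utf8_or_fail(string $input): ?string
{
    // With the /u modifier PCRE validates the subject as UTF-8;
    // preg_match() returns false (rather than 0 or 1) if it is invalid.
    if (preg_match('//u', $input) !== 1) {
        return null; // fail, rather than strip or substitute bytes
    }
    return htmlspecialchars($input, ENT_QUOTES, 'UTF-8');
}

$good = "abcd<";
$bad  = "abcd" . chr(0xE0); // the same incomplete sequence as in the od example

var_dump(escape_utf8_or_fail($good)); // escaped string
var_dump(escape_utf8_or_fail($bad));  // NULL
```

Returning null keeps the decision with the caller: it can reject the request or log it, and nothing has been silently stripped or replaced along the way.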