From: penguin@php.net (Peter Brodersen)
To: rasmus@lerdorf.com (Rasmus Lerdorf)
Cc: Stanislav Malyshev, internals Mailing List
Date: Tue, 29 Jan 2008 03:03:48 +0100
Newsgroups: php.internals
Subject: Re: [PHP-DEV] [PATCH] Bug #43896 htmlspecialchars returns empty string on invalid unicode sequence

On Mon, 28 Jan 2008 17:26:48
-0800, in php.internals rasmus@lerdorf.com (Rasmus Lerdorf) wrote:

>> On the other hand utf8_decode() also expects the input to be UTF-8
>> encoded, but it replaces incomplete sequences with the character "?".
>
>utf8_decode() doesn't replace invalid chars with a ?
>
>eg.
>
>php -r '$a="abcd".chr(0xE0);echo
>iconv("utf-8","utf-8",$a)."\n".utf8_decode($a);' | od -t x1
>
>0000000 61 62 63 64 0a 61 62 63 64 03

Yes it does, but not in your case :-)

However:

$ php -r '$a="abcd".chr(0xE0)."e"; echo iconv("utf-8","utf-8",$a)."\n".utf8_decode($a);'|hd
00000000 61 62 63 64 0a 61 62 63 64 3f |abcd.abcd?|

$ php -r 'print utf8_decode("Fløde på æblegrød");'
Fl?p?blegr?

>It would be a horrendously bad idea to replace invalid chars with some
>other valid char. Way worse than returning nothing. Think about what
>would happen in a regex, for example, if a user was able to inject a '?'
>by sending an invalid utf-8 sequence that ends up in a regular expression.

I don't disagree with you, and I have thought about the same issue
(although I suppose any sanitization should happen after any given
conversion; charsets other than utf-8 might be able to encode low-bit
characters such as "?" as well - but this is beside the point).

I'm not fond of the "?" feature either, but it is present in
utf8_decode() and in other non-PHP applications that do utf-8
conversion. My guess is still that some standard recommends this kind
of replacement as a possible fallback for error handling.

-- 
- Peter Brodersen
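
The two positions in this exchange (reject invalid input outright, or substitute a "?") and the regex concern can be illustrated with a small sketch. It is written in Python rather than PHP, purely as an analogue: it does not reproduce utf8_decode(), iconv() or htmlspecialchars(), and the helper names decode_strict and decode_with_substitute are hypothetical. The input mirrors the "abcd" + 0xE0 + "e" example above.

```python
import re


def decode_strict(raw: bytes):
    """Return the decoded text, or None if raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


def decode_with_substitute(raw: bytes, substitute: str = "?") -> str:
    """Decode raw as UTF-8, replacing malformed input with substitute.

    Caveat of this sketch: a genuine U+FFFD already present in the
    input is mapped to the substitute as well.
    """
    # errors="replace" yields U+FFFD for malformed input; mapping that
    # to "?" imitates the fallback discussed in the thread.
    return raw.decode("utf-8", errors="replace").replace("\ufffd", substitute)


# 0xE0 opens a three-byte sequence that the following "e" cannot continue.
raw = b"abcd\xe0e"

strict = decode_strict(raw)          # invalid input is rejected
lossy = decode_with_substitute(raw)  # invalid byte becomes "?"

# If the substituted text reaches a pattern unescaped, the "?" acts as a
# quantifier on the preceding "d" instead of a literal character.
unescaped = re.compile("^" + lossy + "$")
escaped = re.compile("^" + re.escape(lossy) + "$")

print(strict)
print(lossy)
print(bool(unescaped.match("abce")), bool(escaped.match("abce")))
```

In this sketch the unescaped pattern accepts "abce", a string the sender never supplied, while the escaped pattern only accepts the literal substituted text. That is in line with the point that any escaping has to happen after the conversion, whichever error-handling policy is chosen.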