Newsgroups: php.internals
From: penguin@php.net (Peter Brodersen)
To: rasmus@lerdorf.com (Rasmus Lerdorf)
Cc: internals Mailing List
Date: Tue, 29 Jan 2008 04:41:52 +0100
In-Reply-To: <479E80D8.5020206@lerdorf.com>
Subject: Re: [PHP-DEV] [PATCH] Bug #43896 htmlspecialchars returns empty string on invalid unicode sequence

On Mon, 28 Jan 2008 17:26:48 -0800, in
php.internals rasmus@lerdorf.com (Rasmus Lerdorf) wrote:

>It would be a horrendously bad idea to replace invalid chars with some
>other valid char.  Way worse than returning nothing.  Think about what
>would happen in a regex, for example, if a user was able to inject a '?'
>by sending an invalid utf-8 sequence that ends up in a regular expression.

By the way, Unicode characters that don't exist in ISO-8859-1 are also
replaced with a question mark:

$ php -r 'print utf8_decode(pack("c*",0xe2,0x98,0x83));'|od -t x1
0000000 3f

http://php.net/xml also documents this replacement:
==
If PHP encounters characters in the parsed XML document that can not
be represented in the chosen target encoding, the problem characters
will be "demoted". Currently, this means that such characters are
replaced by a question mark.
==

http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt mentions:
==
According to ISO 10646-1:2000, sections D.7 and 2.3c, a device
receiving UTF-8 shall interpret a "malformed sequence in the same way
that it interprets a character that is outside the adopted subset" and
"characters that are not within the adopted subset shall be indicated
to the user" by a receiving device. A quite commonly used approach in
UTF-8 decoders is to replace any malformed UTF-8 sequence by a
replacement character (U+FFFD), which looks a bit like an inverted
question mark, or a similar symbol. It might be a good idea to visually
distinguish a malformed UTF-8 sequence from a correctly encoded Unicode
character that is just not available in the current font but otherwise
fully legal, even though ISO 10646-1 doesn't mandate this. In any case,
just ignoring malformed sequences or unavailable characters does not
conform to ISO 10646, will make debugging more difficult, and can lead
to user confusion.
==

-- 
- Peter Brodersen
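
For comparison, the three policies discussed above (demoting to '?', replacing with U+FFFD, and silently ignoring) can be sketched outside PHP. The following is only an illustration in Python, not PHP's implementation; the function names and the sample malformed input are made up for the example. The first sample is the same three bytes as in the php one-liner (the UTF-8 encoding of U+2603).

```python
# Illustrative sketch in Python (not PHP): compares three error-handling policies.
# All names below are hypothetical and exist only for this example.

SNOWMAN_UTF8 = bytes([0xE2, 0x98, 0x83])  # valid UTF-8 for U+2603, absent from ISO-8859-1
MALFORMED = b"abc\xffdef"                 # assumed sample input; 0xFF never occurs in UTF-8


def demote_to_latin1(data: bytes) -> bytes:
    """Decode valid UTF-8, then replace characters ISO-8859-1 lacks with '?'."""
    return data.decode("utf-8").encode("iso-8859-1", errors="replace")


def decode_replacing(data: bytes) -> str:
    """Replace each malformed sequence with the replacement character U+FFFD."""
    return data.decode("utf-8", errors="replace")


def decode_ignoring(data: bytes) -> str:
    """Silently drop malformed sequences (the approach the quoted text warns against)."""
    return data.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    print(demote_to_latin1(SNOWMAN_UTF8).hex())  # 3f
    print(ascii(decode_replacing(MALFORMED)))    # 'abc\ufffddef'
    print(ascii(decode_ignoring(MALFORMED)))     # 'abcdef'
```

With the replacing policy the damaged position stays visible in the output, whereas the ignoring policy joins "abc" and "def" with no trace that anything was dropped.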