Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:34993
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
To: rasmus@lerdorf.com (Rasmus Lerdorf)
Cc: internals Mailing List <internals@lists.php.net>
Date: Tue, 29 Jan 2008 04:41:52 +0100
Message-ID: <ci4tp3hro32ukdru6ik4d3t0kqsfounccf@4ax.com>
References: <200801241426.39756.arnaud.lb@gmail.com> <479A613C.8030604@zend.com> <3hpsp3hmn2de4fard8lkpentg24k70jrhg@4ax.com> <479E80D8.5020206@lerdorf.com>
In-Reply-To: <479E80D8.5020206@lerdorf.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: quoted-printable
Subject: Re: [PHP-DEV] [PATCH] Bug #43896 htmlspecialchars returns empty string on invalid unicode sequence
From: penguin@php.net (Peter Brodersen)

On Mon, 28 Jan 2008 17:26:48 -0800, in php.internals rasmus@lerdorf.com
(Rasmus Lerdorf) wrote:

>It would be a horrendously bad idea to replace invalid chars with some
>other valid char.  Way worse than returning nothing.  Think about what
>would happen in a regex, for example, if a user was able to inject a '?'
>by sending an invalid utf-8 sequence that ends up in a regular
>expression.

By the way, Unicode characters that don't exist in ISO-8859-1 are also
replaced with a question mark:

$ php -r 'print utf8_decode(pack("c*",0xe2,0x98,0x83));'|od -t x1
0000000 3f
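
The same demotion can be reproduced outside PHP. Here is a minimal sketch
in Python (not PHP, and only meant as an illustration), assuming the
target encoding is ISO-8859-1 and relying on the codec's "replace" error
handler. The helper name demote() is hypothetical:

```python
def demote(utf8_bytes: bytes) -> bytes:
    """Decode UTF-8, then re-encode as ISO-8859-1 with lossy replacement."""
    text = utf8_bytes.decode("utf-8")
    # errors="replace" emits b"?" for each character that has no
    # ISO-8859-1 equivalent instead of raising UnicodeEncodeError.
    return text.encode("iso-8859-1", errors="replace")


snowman = bytes([0xE2, 0x98, 0x83])  # U+2603 SNOWMAN, same bytes as above
print(demote(snowman).hex())  # 3f
```

Characters that do exist in ISO-8859-1 pass through unchanged; only the
unrepresentable ones collapse to 0x3f.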

http://php.net/xml also documents this replacement:
==
If PHP encounters characters in the parsed XML document that can not be
represented in the chosen target encoding, the problem characters will be
"demoted". Currently, this means that such characters are replaced by a
question mark.
==

http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt mentions:
==
According to ISO 10646-1:2000, sections D.7 and 2.3c, a device
receiving UTF-8 shall interpret a "malformed sequence in the same way
that it interprets a character that is outside the adopted subset" and
"characters that are not within the adopted subset shall be indicated
to the user" by a receiving device. A quite commonly used approach in
UTF-8 decoders is to replace any malformed UTF-8 sequence by a
replacement character (U+FFFD), which looks a bit like an inverted
question mark, or a similar symbol. It might be a good idea to
visually distinguish a malformed UTF-8 sequence from a correctly
encoded Unicode character that is just not available in the current
font but otherwise fully legal, even though ISO 10646-1 doesn't
mandate this. In any case, just ignoring malformed sequences or
unavailable characters does not conform to ISO 10646, will make
debugging more difficult, and can lead to user confusion.
==
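
To make the three options concrete (reject the input, substitute U+FFFD,
or silently drop the bad bytes), here is a small sketch in Python, again
only as an illustration and not a statement about how PHP behaves. The
input string is made up; 0xFF is used because it can never occur in
well-formed UTF-8:

```python
malformed = b"abc\xffdef"  # 0xFF is never valid in UTF-8

# Substitute: the bad byte becomes U+FFFD REPLACEMENT CHARACTER.
replaced = malformed.decode("utf-8", errors="replace")

# Ignore: the bad byte vanishes without a trace.
ignored = malformed.decode("utf-8", errors="ignore")

# Reject: strict decoding refuses the whole input.
try:
    malformed.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True

print(ascii(replaced))  # 'abc\ufffddef'
print(ascii(ignored))   # 'abcdef'
print(rejected)         # True
```

Note that the substitute here is U+FFFD and not an ASCII '?', so it has
no special meaning in a regular expression, while the "ignore" variant is
the one the quoted text warns against.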


-- 
- Peter Brodersen