Newsgroups: php.internals
Date: Mon, 28 Jan 2008 21:21:49 -0800
Message-ID: <479EB7ED.8070606@lerdorf.com>
To: Peter Brodersen
CC: internals Mailing List
Subject: Re: [PHP-DEV] [PATCH] Bug #43896 htmlspecialchars returns empty string on invalid unicode sequence
From: rasmus@lerdorf.com (Rasmus Lerdorf)

Peter Brodersen wrote:
> http://php.net/xml also documents this replacement:
> ==
> If PHP encounters characters in the parsed XML document that can not be
> represented in the chosen target encoding, the problem characters will be
> "demoted". Currently, this means that such characters are replaced by a
> question mark.
> ==

That was back in the expat days. We don't use that xml parser anymore.

> http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt mentions:
> ==
> According to ISO 10646-1:2000, sections D.7 and 2.3c, a device
> receiving UTF-8 shall interpret a "malformed sequence in the same way
> that it interprets a character that is outside the adopted subset" and
> "characters that are not within the adopted subset shall be indicated
> to the user" by a receiving device. A quite commonly used approach in
> UTF-8 decoders is to replace any malformed UTF-8 sequence by a
> replacement character (U+FFFD), which looks a bit like an inverted
> question mark, or a similar symbol. It might be a good idea to
> visually distinguish a malformed UTF-8 sequence from a correctly
> encoded Unicode character that is just not available in the current
> font but otherwise fully legal, even though ISO 10646-1 doesn't
> mandate this. In any case, just ignoring malformed sequences or
> unavailable characters does not conform to ISO 10646, will make
> debugging more difficult, and can lead to user confusion.
> ==

That part is completely different. That's at the display level.
Replacing it in the backend makes no sense to me.

Don't use utf8_decode. Use iconv() so you know what the heck is going on.

-Rasmus
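
A minimal sketch of the iconv() advice above, assuming PHP 7 or later with the iconv extension on a glibc-based build. utf8_decode() substitutes "?" for anything it cannot convert, so the caller never learns that the input was bad; iconv() without //IGNORE or //TRANSLIT returns false instead, which can be turned into an explicit error. The helper name to_latin1_strict() is hypothetical and is not part of PHP or of the patch under discussion.

```php
<?php
/**
 * Convert UTF-8 to ISO-8859-1, failing loudly instead of substituting "?".
 * Hypothetical helper for illustration only; not a PHP built-in.
 */
function to_latin1_strict(string $utf8): string
{
    // iconv() raises a notice and returns false when it hits a byte
    // sequence that is illegal in the source encoding. The @ only hides
    // the notice; the false return value is still checked below.
    $out = @iconv('UTF-8', 'ISO-8859-1', $utf8);
    if ($out === false) {
        throw new InvalidArgumentException('input cannot be converted from UTF-8');
    }
    return $out;
}

try {
    // "é" is C3 A9 in UTF-8 and E9 in ISO-8859-1.
    echo bin2hex(to_latin1_strict("caf\xC3\xA9")), "\n"; // 636166e9

    // A lone 0xFF byte is never valid UTF-8, so this call throws.
    to_latin1_strict("a\xFFb");
} catch (InvalidArgumentException $e) {
    echo 'rejected: ', $e->getMessage(), "\n";
}
```

How iconv() treats characters that are valid UTF-8 but have no ISO-8859-1 equivalent depends on the underlying libc, so that case is best checked on the target platform rather than assumed.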