Newsgroups: php.internals
Message-ID: <8A.30.24593.DBF2E384@pb1.pair.com>
From: edwardzyang@thewritingpot.com ("Edward Z. Yang")
To: internals@lists.php.net
Date: Thu, 29 May 2008 00:23:24 -0400
Subject: Unicode and XML

In PHP 6, incoming user data will automatically be in (unicode) form. (That is, assuming that the JIT functionality for converting it gets implemented.) One of the implementation details I'd like to consider involves non-XML and/or non-SGML codepoints inside markup.

As per the Unicode specification, it is perfectly valid for a Unicode string to contain the codepoints U+0000 (the null character), U+FFFF (a noncharacter) and friends. However, it is not valid for an XML document to contain these characters; either of them will cause a conforming XML parser to report a fatal error.

Historically, it has been very difficult for PHP scripts to implement UTF-8 support completely correctly. Many implementations check that the UTF-8 is well-formed, but neglect to strip out null bytes and the like.
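To make the gap concrete, here is a minimal sketch of the kind of check that tends to get skipped. It is written in Python purely for illustration; nothing in it is an existing or proposed PHP API, and the names `is_xml_safe` and `strip_non_xml_chars` are hypothetical. It assumes the target is the XML 1.0 Char production; a filter aimed at HTML would probably need to be stricter.

```python
import re

# Matches any code point that is NOT allowed by the XML 1.0 Char production:
#   Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# Note: this production still permits U+007F-U+009F, so it is not
# sufficient on its own for HTML.
_NON_XML_CHAR = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def is_xml_safe(text: str) -> bool:
    """Return True if every code point in text matches the XML 1.0 Char production."""
    return _NON_XML_CHAR.search(text) is None


def strip_non_xml_chars(text: str) -> str:
    """Drop every code point that the XML 1.0 Char production does not allow."""
    return _NON_XML_CHAR.sub("", text)


# Well-formed UTF-8 that nevertheless carries U+0000 and U+FFFF.
raw = b"caf\xc3\xa9\x00\xef\xbf\xbf"
decoded = raw.decode("utf-8")           # decodes without error
cleaned = strip_non_xml_chars(decoded)  # only "café" survives
```

The point of the example is that the decode step succeeds, so a script that only checks for well-formed UTF-8 would pass this input straight into its markup. Whether offending code points should be stripped, replaced, or rejected outright is a separate design decision that this sketch does not settle.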
I am considering validation/filtering against the XML Char production (or perhaps something even more restrictive, as that production allows some control characters that are not allowed in HTML). How should we go about making this easy in PHP 6? Perhaps a web_encoding (terrible name, I know) function is in order?

-- 
Edward Z. Yang                        GnuPG: 0x869C48DA
HTML Purifier Anti-XSS Filter
[[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]]