Newsgroups: php.internals
Xref: news.php.net php.internals:37998
From: chrisstocktonaz@gmail.com ("Chris Stockton")
Date: Thu, 29 May 2008 10:00:36 -0700
To: "Edward Z. Yang"
Cc: internals@lists.php.net
Subject: Re: [PHP-DEV] Unicode and XML
In-Reply-To: <8A.30.24593.DBF2E384@pb1.pair.com>
References: <8A.30.24593.DBF2E384@pb1.pair.com>

I think internal string handling should closely respect the
specification, as you said. For code points that are not valid under a
separate specification, protocol, etc., perhaps the conversion should be
done in the functions that deal with those formats. For example, if the
extension family xmlfoo does not like null bytes, a BOM, high surrogates,
or whatever, then have xmlfoo_strip_invalid (bad name too ;p).

-Chris

On Wed, May 28, 2008 at 9:23 PM, Edward Z. Yang
<edwardzyang@thewritingpot.com> wrote:

> In PHP 6, incoming user data will automatically be in (unicode) form.
> (That is, assuming that the JIT functionality for converting gets
> implemented).
>
> One of the implementation details I'd like to consider involves non-XML
> and/or non-SGML codepoints inside markup. As per the Unicode
> specification, it is perfectly valid for a Unicode string to contain the
> codepoints U+0000 (null byte), U+FFFF (non-character) and friends.
> However, it is not valid for an XML document to contain these
> characters; either of these will result in a fatal error.
>
> Classically, it was very difficult for PHP scripts to implement UTF-8
> support completely correctly. Many implementations check that the UTF-8
> is well-formed, but neglect to strip out null-bytes and the like.
> I consider validation/filtering against the XML char production (or
> perhaps even more restrictive, as that allows some control characters
> not allowed in HTML).
>
> How should we go about making this easy in PHP 6? Perhaps a web_encoding
> (terrible name, I know) function is in order?
> --
> Edward Z. Yang GnuPG: 0x869C48DA
> HTML Purifier Anti-XSS Filter
> [[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]]
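
For illustration only, here is a rough sketch of the kind of filter being discussed: dropping every code point that falls outside the XML 1.0 Char production. It is written in Python rather than PHP, the name strip_non_xml_chars is hypothetical, and nothing in this thread proposes this exact function or behaviour. It assumes the input is already a decoded Unicode string.

```python
import re

# XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# The character class below matches anything NOT in that set, which
# includes U+0000, most C0 controls, lone surrogates, U+FFFE and U+FFFF.
_NOT_XML_CHAR = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_non_xml_chars(text: str) -> str:
    """Return text with every code point outside XML 1.0 Char removed.

    Hypothetical helper for illustration; silently dropping characters
    is only one possible policy (rejecting the input is another).
    """
    return _NOT_XML_CHAR.sub("", text)
```

With this sketch, strip_non_xml_chars("a\x00b\uffffc") returns "abc", while tabs, line feeds and carriage returns pass through unchanged. A stricter variant for HTML would also need to exclude further control characters, as the quoted message points out.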