In PHP 6, incoming user data will automatically be in Unicode form
(assuming, that is, that the JIT conversion functionality gets
implemented).
One of the implementation details I'd like to consider involves non-XML
and/or non-SGML codepoints inside markup. As per the Unicode
specification, it is perfectly valid for a Unicode string to contain the
codepoints U+0000 (null byte), U+FFFF (non-character) and friends.
However, it is not valid for an XML document to contain these
characters; either of these will result in a fatal error.
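To make the gap concrete, here is a minimal sketch of a check against the XML 1.0 Char production. The function name is_valid_xml_text() is hypothetical, and the code assumes a UTF-8 byte string on a current PHP with PCRE Unicode support, not PHP 6's native Unicode strings.

```php
<?php
// Hypothetical helper, not an existing PHP function.
// XML 1.0 Char production:
//   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
function is_valid_xml_text(string $utf8): bool
{
    $pattern = '/^[\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]*$/Du';
    // preg_match() returns false for malformed UTF-8, so that is rejected too.
    return preg_match($pattern, $utf8) === 1;
}

var_dump(is_valid_xml_text("plain text\n"));
var_dump(is_valid_xml_text("null\0byte"));
var_dump(is_valid_xml_text("non-character \u{FFFF}"));
```

Only the first string passes; the other two are legal Unicode but would be rejected by an XML parser.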
Historically, it has been very difficult for PHP scripts to implement UTF-8
support completely correctly. Many implementations check that the UTF-8
is well-formed, but neglect to strip out null bytes and the like. I
consider validation/filtering against the XML Char production (or
perhaps something even more restrictive, as that production allows some
control characters not allowed in HTML) to be essential.
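The difference between "well-formed" and "safe for XML" can be sketched as follows. strip_non_xml_chars() is a hypothetical name, the empty pattern with the /u modifier is used here as a stand-in well-formedness check, and the input is assumed to be a UTF-8 byte string.

```php
<?php
// Hypothetical helper: drop every code point outside the XML 1.0 Char production.
// Returns null if the input is not well-formed UTF-8.
function strip_non_xml_chars(string $utf8): ?string
{
    return preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $utf8
    );
}

$input = "user\0name\u{FFFE}";

// A bare well-formedness check accepts the string: NUL and U+FFFE encode fine in UTF-8.
var_dump(preg_match('//u', $input) === 1);

// Filtering against the Char production removes both code points.
var_dump(strip_non_xml_chars($input));
```

Whether to strip silently or reject the input outright is a policy decision; this sketch strips.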
How should we go about making this easy in PHP 6? Perhaps a web_encoding
(terrible name, I know) function is in order?
Edward Z. Yang GnuPG: 0x869C48DA
HTML Purifier http://htmlpurifier.org Anti-XSS Filter
[[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]]
I think that internal string handling should be very respectful of the
specification, as you said. Perhaps for code points which are not valid under a
separate specification, protocol, etc., the conversion should be done in the
functions dealing with those formats. For example, if the extension family xmlfoo
does not like null bytes or BOMs or high surrogates or whatever, then have
xmlfoo_strip_invalid (bad name too ;p).
-Chris
On Wed, May 28, 2008 at 9:23 PM, Edward Z. Yang <edwardzyang@thewritingpot.com> wrote:
[...]
Chris Stockton wrote:
I think that internal string handling should be very respectful of the
specification, as you said. Perhaps for code points which are not valid under a
separate specification, protocol, etc., the conversion should be done in the
functions dealing with those formats. For example, if the extension family xmlfoo
does not like null bytes or BOMs or high surrogates or whatever, then have
xmlfoo_strip_invalid (bad name too ;p).
The trouble is that no such function exists for HTML output. ;-) We'd be
adding another function to the htmlspecialchars($var) cadre. (A
counter-argument is that most people have already defined a function _() or
the like for this sort of thing. I think PHP can do things out of the box,
though.)
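For illustration, this is roughly the kind of userland wrapper people end up writing today. The name h() is an assumption, not a built-in, and the extra exclusion of DEL and the C1 controls reflects my understanding of what HTML 4 disallows beyond XML 1.0; treat it as a sketch, not a vetted policy.

```php
<?php
// Hypothetical output helper; the name h() is an assumption, not a PHP built-in.
function h(string $utf8): string
{
    // Remove code points outside the XML 1.0 Char production, plus DEL and the
    // C1 controls (U+007F-U+009F), which XML 1.0 permits but HTML 4 does not.
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{7E}\x{A0}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $utf8
    );
    if ($clean === null) {
        return ''; // malformed UTF-8: emit nothing rather than broken markup
    }
    return htmlspecialchars($clean, ENT_QUOTES, 'UTF-8');
}

echo h("<b>bold\0</b> & \u{85}more"), "\n";
```

The point is that escaping and codepoint stripping are separate steps, and htmlspecialchars() alone only does the first.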
SOLUTION?
Before I propose my solution, I believe we should distinguish between
functions like strip_tags() and the conjectured xml_strip_invalid().
Here are the primary differences:

strip_tags()
- Most appropriate on outbound, when the original data is preserved
- Makes clear changes to what the user sees
- Only used some of the time; universal application (e.g. magic quotes)
  is not a good idea

xml_strip_invalid()
- Most appropriate on inbound, as these codepoints are not supposed to
  be used at all
- Most of these forbidden characters are invisible if/when they show up
  and don't cause fatal errors
- What works for XML almost works for everything, except binary data
  (notably), which shouldn't be in Unicode anyway
My proposal is to introduce a new filter (for the filter extension)
which performs codepoint sanitization appropriate for HTML/XML contexts
(alternatively, this could be an option on the FILTER_DEFAULT filter,
which would be for Unicode strings, I assume). This filter would be
turned ON by default, and users could turn it off using a special
option. Thus, codepoint sanitization would work invisibly for users who
don't care, and would be accessible to users who do (i.e. those who
don't mind mucking around with unpaired surrogates or the like). This [1]
gives quite a good explanation of what this is all about.
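As a rough approximation of how this might look from userland, the existing FILTER_CALLBACK filter can stand in for the proposed built-in one. The callback name sanitize_codepoints() is hypothetical, and no such dedicated filter constant exists in the filter extension as far as I know.

```php
<?php
// Hypothetical callback standing in for the proposed built-in filter.
function sanitize_codepoints(string $utf8): string
{
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $utf8
    );
    return $clean ?? ''; // malformed UTF-8 collapses to an empty string
}

$raw = "comment\0 text";

// Default behaviour under the proposal: sanitization is applied.
$value = filter_var($raw, FILTER_CALLBACK, ['options' => 'sanitize_codepoints']);

// Opting out would correspond to asking for the raw value explicitly.
$untouched = filter_var($raw, FILTER_UNSAFE_RAW);

var_dump($value, strlen($untouched));
```

In the actual proposal the sanitizing behaviour would be the default and the raw form the opt-in, which is the reverse of how FILTER_CALLBACK has to be requested here.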
The filter would also work auto-magically on traditional retrieval of
values using the $_VAR super-globals. It would hook in with the regular
JIT decoding of GPC (as described here [2]) and could not be turned off,
except by reading the input in as binary (which I do not know how to do).
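A userland approximation of that behaviour, applied after the fact instead of during JIT decoding, might look like the sketch below. sanitize_input_array() is a hypothetical name, and the sample array merely simulates what $_GET or $_POST could contain.

```php
<?php
// Hypothetical userland approximation of sanitizing GPC data during decoding.
// Only string values are cleaned; array keys are left untouched in this sketch.
function sanitize_input_array(array $input): array
{
    array_walk_recursive($input, function (&$value): void {
        if (is_string($value)) {
            $value = preg_replace(
                '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
                '',
                $value
            ) ?? '';
        }
    });
    return $input;
}

// Simulated request data; in a real script this might be $_GET or $_POST.
$get = ['q' => "search\0term", 'tags' => ["a\u{FFFF}", 'b']];
print_r(sanitize_input_array($get));
```

Doing this in the engine's decoding step, as proposed, would avoid the second pass over the data that this version needs.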
As some extra functionality, filter should make it easy for users to
sanitize inputs to only contain codepoints of certain ranges. Because
this functionality would hook into the decoding process, it would be
much faster than using a TextIterator and hand-screening out codepoints.
Of course, this functionality should support Unicode properties. [3]
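To sketch what range- and property-based restriction could look like, here is a version built on PCRE's Unicode support. keep_only() is a hypothetical helper, and its second argument is assumed to be a trusted character-class body, not user input.

```php
<?php
// Hypothetical helper: keep only code points matching the given PCRE
// character-class body (explicit ranges and/or Unicode properties).
// $allowedClass is interpolated into the pattern, so it must be trusted.
function keep_only(string $utf8, string $allowedClass): ?string
{
    return preg_replace('/[^' . $allowedClass . ']/u', '', $utf8);
}

// Letters, numbers and space separators only (Unicode properties).
echo keep_only("Zo\u{EB} 42 \u{1F600}!\0", '\p{L}\p{N}\p{Zs}'), "\n";

// Explicit ranges: A-Z plus U+00C0 through U+00DE.
echo keep_only("abc-\u{C0}\u{C9}-123", '\x{41}-\x{5A}\x{C0}-\x{DE}'), "\n";
```

This runs a regular expression over the already-decoded string, so it does not capture the speed advantage of filtering during decoding; it only shows the kind of interface I have in mind.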
It would be interesting to survey what other languages (such as Python)
do in said situations, although PHP is in somewhat of a unique position
due to its legacy. Let's do this, and let's do this right.
DISCLAIMER: I'm not sure anyone even cares about this issue. I mean,
surely, the PHP devs have bigger fish to fry. But I think it is
important, and I'll keep squeaking about it. I can RFC-ize this if
desired. Thanks all for reading this far.
[1] http://xml.coverpages.org/unicode30Ann19990918.html
[2] http://marc.info/?l=php-internals&m=116631089122369&w=2
[3] http://docs.php.net/manual/en/regexp.reference.php#regexp.reference.unicode
--
Edward Z. Yang GnuPG: 0x869C48DA
Edward Z. Yang wrote:
My proposal is to introduce a new filter (for the filter extension)
which performs codepoint sanitization appropriate for HTML/XML contexts
(alternatively, this could be an option on the FILTER_DEFAULT filter,
which would be for Unicode strings, I assume). This filter would be
turned ON by default, and users could turn it off using a special
option. Thus, codepoint sanitization would work invisibly for users who
don't care, and would be accessible to users who do (i.e. those who
don't mind mucking around with unpaired surrogates or the like). This [1]
gives quite a good explanation of what this is all about.
Time to squeak. Are there any comments on this proposal?
--
Edward Z. Yang GnuPG: 0x869C48DA