Providing improved functionality for escaping html (and other) output.

12 years ago by Adam Jon Richardson — view source

unread

It's important to escape output according to context. PHP provides
functions such as htmlspecialchars() to escape output when the context
is HTML. However, one often desires to allow some subset of HTML
through without escaping (e.g., , , etc.)

Functions such as strip_tags() do allow whitelisting, but their usage
poses security risks due to lingering attributes (e.g., strip_tags('click me', ''.)

One can develop a more robust mechanism in userland that first escapes
input using htmlspecialchars() and then unescapes whitelisted
sequences. Because of the variance in html tags due to potential
attributes (e.g., optionally including various classes, img src
attributes, etc), offering the ability to optionally specify a
whitelist sequence through use of a regex could also offer significant
benefits (e.g., any string sequence starting and ending with '/' will
be handled as a regex.) However, the common nature of this need,
coupled with the performance benefits of implementing this internally
prompts my interest in two options.

Add a fifth parameter to htmlspecialchars() that takes an array of
whitelisted sequences. Even though this seems like a terribly long
function to call, one could easily wrap the call in a facade function.
Add a new function called str_escape(), but this introduces
potential BC issues.

There are of course other options (e.g., integrate this as an
additional filter, etc.)

I've built an extension that, while focused on an old web framework of
mine, contains a function that can serve as a proof-of-concept that
implements the functionality I've outlined above (see
nephtali_str_escape_html):

https://github.com/AdamJonR/nephtali-php-ext/blob/master/nephtali.c

I've tossed out the idea on this list before, but it was only
tangentially related to the discussion at the time. At this point, I'd
really like to focus on this idea directly to see what approach might
seem wisest (including doing nothing, if the frequency of use does not
justify bringing the functionality into the core.)

Thoughts?

Adam

12 years ago by Stas Malyshev — view source

unread

Hi!

It's important to escape output according to context. PHP provides
functions such as htmlspecialchars() to escape output when the context
is HTML. However, one often desires to allow some subset of HTML
through without escaping (e.g., , , etc.)

I think what you are looking for is HtmlPurifier and such. Doing it in
the core properly would be pretty hard.

https://github.com/AdamJonR/nephtali-php-ext/blob/master/nephtali.c

Could you describe in detail what that function actually does, with
examples?

--
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227

12 years ago by Adam Jon Richardson — view source

unread

Hi!

It's important to escape output according to context. PHP provides
functions such as htmlspecialchars() to escape output when the context
is HTML. However, one often desires to allow some subset of HTML
through without escaping (e.g., , , etc.)

I think what you are looking for is HtmlPurifier and such. Doing it in
the core properly would be pretty hard.

Hi Stas,

HtmlPurifier is a fantastic tool, but it offers far more than I
typically need/want (e.g., CSS validation, attempting to remove
malicious code, fixes to illegal nesting, etc.) I guess I'm wondering
about something that allows more through than htmlspecialchars()
through some form of whitelisting, but not as much through as
strip_tags() (as noted, attributes can be problematic.)

https://github.com/AdamJonR/nephtali-php-ext/blob/master/nephtali.c

Could you describe in detail what that function actually does, with
examples?

Sure!

Function:
// Example userland code commented to explain the flow:
function str_escape_html($string, $allowed_html = array(), $charset = 'UTF-8')
{
// use htmlspecialchars because it only does the 5 most important
chars, and doesn't mess them up
// start out safely by ensureing everything is html escaped
(whitelisting approache)
$escaped_string = htmlspecialchars($string, ENT_QUOTES, $charset);
// check if there are whitelisted sequences which, if present, we
can safely revert
if ($allowed_html) {
// cycle through the whitelisted sequences
foreach($allowed_html as $sequence) {
// Save escaped version of sequence so we know what to revert safely
// This also works for regexes fairly well because <, >, &,
', ", don't have special meaning in regexes, but character sets cause
trouble, something I've just learned to work around
// http://php.net/manual/en/regexp.reference.meta.php
$escaped_sequence = htmlspecialchars($sequence, ENT_QUOTES, $charset);
// if the sequence begins and ends with a '/', treat it as a regex
if (($sequence[0] == '/') && ($sequence[strlen($sequence) -
1] == '/')) {
// revert regex matches
$escaped_string = preg_replace_callback($escaped_sequence,
function($matches){return html_entity_decode($matches[0]);},
$escaped_string);
// otherwise, treat it as a standard string sequence
} else {
// revert string sequences
$escaped_string = str_replace($escaped_sequence,
$sequence, $escaped_string);
}
}
}

return $escaped_string;
}

$input = '<div class="expected">click
me do not click the other
bolded text</div>';
$draconian_bold_tag_regex = '/<b( class="([a-z]+)")?>[a-zA-Z_.,! ]+/';
echo "strip tags: " . strip_tags($input, '<div> ') . " ";
echo "htmlspecialchars: " . htmlspecialchars($input, ENT_QUOTES,
'UTF-8') . " ";
echo "str_escape_html: " . str_escape_html($input,
array($draconian_bold_tag_regex, ' ', '<div class="expected">',
'</div>'), 'UTF-8');

Problems with above implementation:
Regexing large chunks of HTML is a pain, and easy to get wrong.
Additionally, you have to enter both the opening and closing tag in
the whitelist separately for literals, and character sets can break
things due to the escaping (e.g., [^<>].)

Solutions:
What I'm looking to build is functionality that adds to
htmlspecialchars by implementing whitelisting through some primitive
parsing (identify the tags without regard for validating HTML, similar
to strip_tags) and some regexing (validate attribute contents) using a
similar approach to the above function. The new function would better
break up the tag components, making the required regexes much easier
to work with.

For example:
$new = str_escape_html("<a class='important' href='test'>Test</a>", array(
'a' => [
'href' => '/^(https?://)?([\da-z.-]+).([a-z.]{2,6})([/\w
.-])/?$/' // strings beginning and ending with '/' are considered
regexes
'class' => 'important' // other strings are just evaluated as literals
],
'br' => [] // has no allowed attributes
), "UTF-8");

Conclusion:
Bridging the gap between strip_tags and htmlspecialchars seems like a
reasonable consideration for PHP's core. While I do use HTMLPurifier
or other tools for more involved HTML processing, I often wish the
core had something just a bit more flexible than htmlspecialchars, but
just a bit more protective than strip_tags that I could use for my
typical escaping needs. I'm going to spend some time refining this
approach in an extension, but I was looking for feedback before
proceeding further.

Thanks,

Adam

12 years ago by Stas Malyshev — view source

unread

Hi!

if ($allowed_html) {
// cycle through the whitelisted sequences
foreach($allowed_html as $sequence) {

What is supposed to be in $allowed_html? If those are simple fixed
strings and such, why can't you just do preg_split with
PREG_SPLIT_DELIM_CAPTURE and encode each other element of the result, or
PREG_SPLIT_OFFSET_CAPTURE if you need something more interesting?

I would seriously advise though against trying to do HTML parsing with
regexps unless they are very simple, since browsers will accept a lot of
broken HTML and will happily run scripts in it, etc.

Bridging the gap between strip_tags and htmlspecialchars seems like a
reasonable consideration for PHP's core. While I do use HTMLPurifier

I think with level of complexity that is needed to cover anything but
the most primitive cases, you need a full-blown HTML/XML parser there.
Which we do have, so why not use any of them instead of reinventing
them, if that's what you need?

Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227

12 years ago by Adam Jon Richardson — view source

unread

What is supposed to be in $allowed_html? If those are simple fixed
strings and such, why can't you just do preg_split with
PREG_SPLIT_DELIM_CAPTURE and encode each other element of the result, or
PREG_SPLIT_OFFSET_CAPTURE if you need something more interesting?

I like to start out conservatively, with everything in a "safe" (i.e.,
fully escaped) state, and then revert from there. If something slips
through, it slips through in a conservative state.

I would seriously advise though against trying to do HTML parsing with
regexps unless they are very simple, since browsers will accept a lot of
broken HTML and will happily run scripts in it, etc.

I agree, that's why I proposed an improved approach in which regexes
would only optionally be used to validate attributes.

I think with level of complexity that is needed to cover anything but
the most primitive cases, you need a full-blown HTML/XML parser there.
Which we do have, so why not use any of them instead of reinventing
them, if that's what you need?

I don't think the simple state machine utilized by strip_tags() is a
full-blown HTML/XML parser, yet I find it does provide practical
value. Merging this type of state machine with the ability to check
attributes (via literal string or regex) would be an incremental step
beyond what is present now, and would prove practically beneficial.

I gave this example as one way to implement this approach through an API:

$new = str_escape_html("<a class='important' href='test'>Test</a>", array(
'a' => [
'href' =>
'/^(https?://)?([\da-z.-]+).([a-z.]{2,6})([/\w.-])/?$/',
'class' => 'important'
],
'br' => []
), "UTF-8");

The idea is that a string would first be escaped using
htmlspecialchars, a state machine similar to that used by strip_tags
would parse the text for the escaped form of tags. When whitelisted
tags are encountered, their attributes are checked against string
literals or regexes. If the tag and its attributes match the
whitelisted form, the tag sequence is unescaped.

One could also augment the strip_tags() function so the whitelist
items could include the ability to only allow specific attributes
through:

$new = strip_tags("<a class='important' href='test'>Test</a>", "<a :class :url>");

The colon-prepended symbols could allow predefined attributes
according to a regex. Any unlisted attributes would be stripped.

Thank you for the feedback, Stas.

Adam

Providing improved functionality for escaping html (and other) output.

I think with level of complexity that is needed to cover anything but the most primitive cases, you need a full-blown HTML/XML parser there. Which we do have, so why not use any of them instead of reinventing them, if that's what you need?

I think with level of complexity that is needed to cover anything but
the most primitive cases, you need a full-blown HTML/XML parser there.
Which we do have, so why not use any of them instead of reinventing
them, if that's what you need?