Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:64602
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Received-SPF: pass (pb1.pair.com: domain gmail.com designates 209.85.216.172 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <50E8B6B2.1030404@sugarcrm.com>
References: <CAJzDObd8hNVFQkek1LvAtpEhtCMUPERgovNfWfxw1p=or-JKXQ@mail.gmail.com>
	<50E8B6B2.1030404@sugarcrm.com>
Date: Sun, 6 Jan 2013 04:46:39 -0500
Message-ID: <CAJzDObcxHOE96Q0B3atZYAii9UGMgem0x6HP2GWTRSar6nFhsw@mail.gmail.com>
To: Stas Malyshev <smalyshev@sugarcrm.com>
Cc: "internals@lists.php.net" <internals@lists.php.net>
Content-Type: text/plain; charset=ISO-8859-1
Subject: Re: [PHP-DEV] Providing improved functionality for escaping html (and
 other) output.
From: adamjonr@gmail.com (Adam Jon Richardson)

On Sat, Jan 5, 2013 at 6:26 PM, Stas Malyshev <smalyshev@sugarcrm.com> wrote:
> Hi!
>
>> It's important to escape output according to context. PHP provides
>> functions such as htmlspecialchars() to escape output when the context
>> is HTML. However, one often desires to allow some subset of HTML
>> through without escaping (e.g., <br />, <b></b>, etc.)
>
> I think what you are looking for is HtmlPurifier and such. Doing it in
> the core properly would be pretty hard.

Hi Stas,

HtmlPurifier is a fantastic tool, but it offers far more than I
typically need or want (e.g., CSS validation, removal of malicious
code, fixes for illegal nesting). I'm looking for something that lets
more through than htmlspecialchars() via some form of whitelisting,
but less than strip_tags() (as noted, attributes can be problematic).
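
To make the gap concrete, here is a minimal sketch. The sample markup
is made up purely for illustration; the two calls are the standard
functions with their documented behavior.

```php
<?php
// Hypothetical sample input, chosen only to illustrate the gap.
$html = '<b onclick="alert(1)">hi</b><i>there</i>';

// strip_tags() removes the disallowed <i> tag but keeps the allowed <b>
// tag verbatim, attributes included, so the onclick handler survives.
$stripped = strip_tags($html, '<b>');

// htmlspecialchars() escapes everything, including the <b> tag
// I would like to let through.
$escaped = htmlspecialchars($html, ENT_QUOTES, 'UTF-8');

echo $stripped, "\n", $escaped, "\n";
```

What I'm after sits between these two results: keep a plain <b>, but
never an attribute that was not explicitly allowed.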

>> https://github.com/AdamJonR/nephtali-php-ext/blob/master/nephtali.c
>>
>
> Could you describe in detail what that function actually does, with
> examples?

Sure!

Function:
// Example userland code, commented to explain the flow:
function str_escape_html($string, $allowed_html = array(), $charset = 'UTF-8')
{
   // Use htmlspecialchars() because it only touches the 5 most important
   // chars and doesn't mess them up. Start out safely by ensuring that
   // everything is HTML-escaped (whitelisting approach).
   $escaped_string = htmlspecialchars($string, ENT_QUOTES, $charset);
   // Check for whitelisted sequences which, if present, we can safely
   // revert.
   if ($allowed_html) {
      // Cycle through the whitelisted sequences.
      foreach ($allowed_html as $sequence) {
         // Save the escaped version of the sequence so we know what to
         // revert safely. This also works fairly well for regexes because
         // <, >, &, ', and " are not regex metacharacters, but character
         // sets cause trouble, something I've just learned to work around.
         // http://php.net/manual/en/regexp.reference.meta.php
         $escaped_sequence = htmlspecialchars($sequence, ENT_QUOTES, $charset);
         // If the sequence begins and ends with a '/', treat it as a regex.
         if (($sequence[0] == '/') && ($sequence[strlen($sequence) - 1] == '/')) {
            // Revert regex matches. htmlspecialchars_decode() with
            // ENT_QUOTES is the exact inverse of the escaping above
            // (it also reverts &#039;).
            $escaped_string = preg_replace_callback(
               $escaped_sequence,
               function ($matches) {
                  return htmlspecialchars_decode($matches[0], ENT_QUOTES);
               },
               $escaped_string
            );
         } else {
            // Otherwise, treat it as a standard string sequence and
            // revert it literally.
            $escaped_string = str_replace($escaped_sequence, $sequence, $escaped_string);
         }
      }
   }

   return $escaped_string;
}

$input = '<div class="expected">'
   . '<b onclick="alert(\'Oh no!\')">click me</b>'
   . '<br id="whyIdMe" />'
   . '<b class="emphasize">do not click the other bolded text</b>'
   . '</div>';
$draconian_bold_tag_regex = '/<b( class="([a-z]+)")?>[a-zA-Z_.,! ]+<\/b>/';
echo "strip tags: " . strip_tags($input, '<b><div><br>') . "<br /><br />";
echo "htmlspecialchars: " . htmlspecialchars($input, ENT_QUOTES, 'UTF-8')
   . "<br /><br />";
echo "str_escape_html: " . str_escape_html(
   $input,
   array($draconian_bold_tag_regex, '<br />', '<div class="expected">', '</div>'),
   'UTF-8'
);

Problems with above implementation:
Regexing large chunks of HTML is a pain and easy to get wrong.
Additionally, literal opening and closing tags have to be entered in
the whitelist separately, and character sets can break because the
pattern itself gets escaped (e.g., [^<>] becomes [^&lt;&gt;], which
excludes the characters &, l, t, ;, and g rather than < and >).
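
A small sketch of the character-set problem, assuming a made-up
whitelist pattern for italic tags. It applies the same escaping step
that the function above applies to each whitelist entry:

```php
<?php
// Hypothetical whitelist pattern: <i> tags around text without < or >.
$pattern = '/<i>[^<>]+<\/i>/';

// The same escaping the function above applies to a whitelist entry.
$escaped_pattern = htmlspecialchars($pattern, ENT_QUOTES, 'UTF-8');
// $escaped_pattern is now '/&lt;i&gt;[^&lt;&gt;]+&lt;\/i&gt;/', so the
// class excludes the characters & l t ; g instead of < and >.

$abc   = htmlspecialchars('<i>abc</i>', ENT_QUOTES, 'UTF-8');
$hello = htmlspecialchars('<i>hello</i>', ENT_QUOTES, 'UTF-8');

var_dump(preg_match($escaped_pattern, $abc));   // int(1)
var_dump(preg_match($escaped_pattern, $hello)); // int(0): "hello" contains "l"
```

The second subject is perfectly harmless, yet it is not reverted
because the escaped class no longer means what was written.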

Solutions:
What I'm looking to build is functionality that adds to
htmlspecialchars by implementing whitelisting through some primitive
parsing (identify the tags without regard for validating HTML, similar
to strip_tags) and some regexing (validate attribute contents) using a
similar approach to the above function. The new function would better
break up the tag components, making the required regexes much easier
to work with.

For example:
$new = str_escape_html("<a class='important' href='test'>Test</a>", array(
   'a' => [
      // strings beginning and ending with '/' are considered regexes
      'href' => '/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/',
      // other strings are just evaluated as literals
      'class' => 'important'
   ],
   'br' => []   // has no allowed attributes
), "UTF-8");
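
A rough userland sketch of how such a function might work. This is not
the extension's code: the names str_escape_html_sketch() and
tag_is_whitelisted(), the tokenizing regexes, and the decision to keep
whitelisted closing tags even when their opening tag was escaped are
all assumptions made for illustration.

```php
<?php
// Rough sketch only; names, regexes, and closing-tag handling are assumed.

// Decide whether a single tag token passes the whitelist.
function tag_is_whitelisted($tag, array $allowed)
{
    // Accept only <name attr="value" ...>, <name ... /> and </name>.
    $shape = '~^<(/?)([a-zA-Z][a-zA-Z0-9]*)'
           . '((?:\s+[a-zA-Z-]+=(?:"[^"<>]*"|\'[^\'<>]*\'))*)\s*(/?)>$~';
    if (!preg_match($shape, $tag, $m)) {
        return false;
    }
    $name = strtolower($m[2]);
    if (!isset($allowed[$name])) {
        return false;
    }
    if ($m[1] === '/') {
        // Closing tags may not carry attributes or a trailing slash.
        return $m[3] === '' && $m[4] === '';
    }
    // Pull out each name/value pair and check it against its rule.
    preg_match_all('~([a-zA-Z-]+)=(["\'])(.*?)\2~s', $m[3], $attrs, PREG_SET_ORDER);
    foreach ($attrs as $attr) {
        $attr_name = strtolower($attr[1]);
        if (!isset($allowed[$name][$attr_name])) {
            return false;
        }
        $rule = $allowed[$name][$attr_name];
        // Strings beginning and ending with '/' are regexes, others literals.
        $is_regex = strlen($rule) > 1 && $rule[0] === '/' && substr($rule, -1) === '/';
        $ok = $is_regex ? preg_match($rule, $attr[3]) === 1 : $attr[3] === $rule;
        if (!$ok) {
            return false;
        }
    }
    return true;
}

function str_escape_html_sketch($string, array $allowed = array(), $charset = 'UTF-8')
{
    // Primitive parsing: split into text and tag-like tokens without
    // validating the HTML.
    $tokens = preg_split(
        '~(</?[a-zA-Z][a-zA-Z0-9]*(?:\s[^<>]*)?>)~',
        $string, -1, PREG_SPLIT_DELIM_CAPTURE
    );
    $out = '';
    foreach ($tokens as $i => $token) {
        // With one capture group, odd indexes hold the tag-like tokens.
        $keep = ($i % 2 === 1) && tag_is_whitelisted($token, $allowed);
        // Everything that is not an approved tag gets escaped.
        $out .= $keep ? $token : htmlspecialchars($token, ENT_QUOTES, $charset);
    }
    return $out;
}

$allowed = array(
    'a'  => ['href' => '/^https?:\/\/[a-z0-9.\/-]+\z/', 'class' => 'important'],
    'br' => [],
);
echo str_escape_html_sketch('<a class="important" href="http://example.com/">ok</a><br />', $allowed), "\n";
echo str_escape_html_sketch('<a href="javascript:alert(1)">bad</a>', $allowed), "\n";
```

Because each attribute value is checked on its own, the regexes stay
short and are applied to unescaped text, so character sets behave as
written. Tags are only identified, not balanced, much like
strip_tags().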

Conclusion:
Bridging the gap between strip_tags and htmlspecialchars seems like a
reasonable consideration for PHP's core. While I do use HTMLPurifier
or other tools for more involved HTML processing, I often wish the
core had something just a bit more flexible than htmlspecialchars, but
just a bit more protective than strip_tags that I could use for my
typical escaping needs. I'm going to spend some time refining this
approach in an extension, but I was looking for feedback before
proceeding further.

Thanks,

Adam