Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:64608
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Received-SPF: pass (pb1.pair.com: domain gmail.com designates 209.85.216.172 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <50E95C4E.3060609@sugarcrm.com>
References: <CAJzDObd8hNVFQkek1LvAtpEhtCMUPERgovNfWfxw1p=or-JKXQ@mail.gmail.com>
	<50E8B6B2.1030404@sugarcrm.com>
	<CAJzDObcxHOE96Q0B3atZYAii9UGMgem0x6HP2GWTRSar6nFhsw@mail.gmail.com>
	<50E95C4E.3060609@sugarcrm.com>
Date: Sun, 6 Jan 2013 15:14:10 -0500
Message-ID: <CAJzDObcgtnJTX=hQxbS2hz1cq_ob1gOMckxptMKZDaWnf_XqoQ@mail.gmail.com>
To: Stas Malyshev <smalyshev@sugarcrm.com>
Cc: "internals@lists.php.net" <internals@lists.php.net>
Content-Type: text/plain; charset=ISO-8859-1
Subject: Re: [PHP-DEV] Providing improved functionality for escaping html (and
 other) output.
From: adamjonr@gmail.com (Adam Jon Richardson)

On Sun, Jan 6, 2013 at 6:13 AM, Stas Malyshev <smalyshev@sugarcrm.com> wrote:
> What is supposed to be in $allowed_html? If those are simple fixed
> strings and such, why can't you just do preg_split with
> PREG_SPLIT_DELIM_CAPTURE and encode each other element of the result, or
> PREG_SPLIT_OFFSET_CAPTURE if you need something more interesting?

I like to start out conservatively, with everything in a "safe" (i.e.,
fully escaped) state, and then selectively unescape from there. If
something slips through, it slips through still escaped.

> I would seriously advise though against trying to do HTML parsing with
> regexps unless they are very simple, since browsers will accept a lot of
> broken HTML and will happily run scripts in it, etc.

I agree; that's why I proposed an improved approach in which regexes
would only optionally be used to validate attributes.

> I think with level of complexity that is needed to cover anything but
> the most primitive cases, you need a full-blown HTML/XML parser there.
> Which we do have, so why not use any of them instead of reinventing
> them, if that's what you need?

I don't think the simple state machine utilized by strip_tags() is a
full-blown HTML/XML parser, yet I find it does provide practical
value. Merging this type of state machine with the ability to check
attributes (via literal string or regex) would be an incremental step
beyond what is present now, and would prove practically beneficial.

I gave this example as one way to implement this approach through an API:

$new = str_escape_html("<a class='important' href='test'>Test</a>", array(
   'a' => [
      'href' => '/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w\.-]*)*\/?$/',
      'class' => 'important'
   ],
   'br' => []
), "UTF-8");

The idea is that a string would first be escaped using
htmlspecialchars(). A state machine similar to the one used by
strip_tags() would then scan the text for the escaped form of tags.
When whitelisted tags are encountered, their attributes are checked
against string literals or regexes. If the tag and its attributes
match the whitelisted form, the tag sequence is unescaped.
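
To make those steps concrete, here is a rough userland sketch. It is
only an illustration under several assumptions: the function name
sketch_escape_html() is hypothetical, a regex scan stands in for the
strip_tags()-style state machine, a rule beginning with "/" is treated
as a regex and anything else as a literal, only quoted attribute values
are recognized, and it needs PHP 7.1 or later.

```php
<?php
// Hypothetical sketch; the proposed str_escape_html() does not exist in PHP.
function sketch_escape_html(string $input, array $allowed, string $charset = 'UTF-8'): string
{
    // Step 1: escape everything, so anything left unmatched stays inert.
    $escaped = htmlspecialchars($input, ENT_QUOTES, $charset);

    // Step 2: scan for the escaped form of tags: "&lt;a ...&gt;" or "&lt;/a&gt;".
    // (A regex scan is used here in place of a real state machine.)
    $tag = '~&lt;(/?)([a-z][a-z0-9]*)((?:(?!&lt;|&gt;).)*)&gt;~is';

    return preg_replace_callback($tag, function (array $m) use ($allowed, $charset): string {
        [$whole, $slash, $name, $rest] = $m;
        $name = strtolower($name);
        if (!isset($allowed[$name])) {
            return $whole; // tag is not whitelisted: leave it escaped
        }
        if ($slash === '/') {
            // Closing tags may not carry anything besides whitespace.
            return trim($rest) === '' ? htmlspecialchars_decode($whole, ENT_QUOTES) : $whole;
        }

        // Step 3: walk the attributes (still in escaped form) one at a time.
        $attr = '~\G\s+([a-z][a-z0-9-]*)\s*=\s*(&quot;|&#039;)(.*?)\2~is';
        $pos = 0;
        while (preg_match($attr, $rest, $a, 0, $pos) === 1) {
            $attrName = strtolower($a[1]);
            if (!isset($allowed[$name][$attrName])) {
                return $whole; // unlisted attribute: leave the tag escaped
            }
            $rule = $allowed[$name][$attrName];
            // Validate the value roughly as a browser would see it:
            // undo our own escaping, then decode any entities the author wrote.
            $value = html_entity_decode(
                htmlspecialchars_decode($a[3], ENT_QUOTES),
                ENT_QUOTES | ENT_HTML5,
                $charset
            );
            $ok = ($rule !== '' && $rule[0] === '/')
                ? preg_match($rule, $value) === 1
                : $value === $rule;
            if (!$ok) {
                return $whole; // value failed its rule: leave the tag escaped
            }
            $pos += strlen($a[0]);
        }
        // Anything left over (unquoted values, stray text) keeps the tag escaped.
        if (preg_match('~\G\s*/?\s*\z~', $rest, $tail, 0, $pos) !== 1) {
            return $whole;
        }

        // Step 4: tag and attributes match the whitelist, so unescape the tag.
        return htmlspecialchars_decode($whole, ENT_QUOTES);
    }, $escaped);
}

// Usage with a deliberately simple, hypothetical URL rule:
$rules = [
    'a'  => ['href' => '/^https?:\/\/[a-z0-9.-]+(\/[\w.\/-]*)?$/i', 'class' => 'important'],
    'br' => [],
];
echo sketch_escape_html("<a class='important' href='http://example.com/x'>Test</a>", $rules), "\n";
// prints: <a class='important' href='http://example.com/x'>Test</a>
```

Note that this sketch unescapes a whitelisted closing tag even when its
opening tag was rejected; a real implementation would probably want to
pair them up.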

One could also augment the strip_tags() function so the whitelist
items could include the ability to only allow specific attributes
through:

$new = strip_tags("<a class='important' href='test'>Test</a>",
   "<a :class :url>");

Each colon-prefixed symbol would allow a predefined attribute whose
value is validated against a regex. Any unlisted attributes would be
stripped.
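
A userland approximation of that idea might look like the sketch below.
Everything beyond strip_tags() itself is an assumption made for
illustration: the function name sketch_strip_tags_attrs(), the
SKETCH_ATTR_RULES table, the mapping of ":url" to the href attribute,
and both regexes are hypothetical. It needs PHP 7.1 or later.

```php
<?php
// Hypothetical table: colon-prefixed symbol => [attribute name, validating regex].
const SKETCH_ATTR_RULES = [
    ':class' => ['class', '/^[a-z][\w-]*$/i'],
    ':url'   => ['href',  '/^https?:\/\/[^\s"\'<>]+$/i'],
];

function sketch_strip_tags_attrs(string $html, string $spec): string
{
    // Parse a spec such as "<a :class :url><br>" into
    // ['a' => [':class', ':url'], 'br' => []].
    preg_match_all('~<([a-z][a-z0-9]*)((?:\s+:[a-z]+)*)\s*>~i', $spec, $specs, PREG_SET_ORDER);
    $rules = [];
    $plain = '';
    foreach ($specs as $s) {
        $rules[strtolower($s[1])] = preg_split('~\s+~', trim($s[2]), -1, PREG_SPLIT_NO_EMPTY);
        $plain .= '<' . $s[1] . '>';
    }

    // Let the existing strip_tags() remove every tag that is not listed at all.
    $kept = strip_tags($html, $plain);

    // Rebuild each surviving opening tag with only the attributes that pass a rule.
    // Quoted values are skipped as a unit so a ">" inside quotes does not end the tag.
    $tag = '~<([a-z][a-z0-9]*)((?:[^>"\']|"[^"]*"|\'[^\']*\')*)>~i';

    return preg_replace_callback($tag, function (array $m) use ($rules): string {
        $name = strtolower($m[1]);
        $out = '<' . $name;
        preg_match_all('~([a-z-]+)\s*=\s*(["\'])(.*?)\2~is', $m[2], $attrs, PREG_SET_ORDER);
        foreach ($attrs as $a) {
            foreach ($rules[$name] ?? [] as $symbol) {
                [$attrName, $regex] = SKETCH_ATTR_RULES[$symbol] ?? [null, null];
                if ($attrName === strtolower($a[1]) && preg_match($regex, $a[3]) === 1) {
                    // Re-quote the value; existing entities are not double-encoded.
                    $out .= ' ' . $attrName . '="'
                        . htmlspecialchars($a[3], ENT_QUOTES, 'UTF-8', false) . '"';
                }
            }
        }
        return $out . '>';
    }, $kept);
}

// Usage: href='test' fails the hypothetical :url rule, and onclick is unlisted.
echo sketch_strip_tags_attrs(
    "<a class='important' href='test' onclick='x()'>Test</a><b>bold</b>",
    "<a :class :url>"
), "\n";
// prints: <a class="important">Test</a>bold
```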

Thank you for the feedback, Stas.

Adam