Newsgroups: php.internals
In-Reply-To: <50E8B6B2.1030404@sugarcrm.com>
References: <50E8B6B2.1030404@sugarcrm.com>
Date: Sun, 6 Jan 2013 04:46:39 -0500
To: Stas Malyshev
Cc: "internals@lists.php.net"
Subject: Re: [PHP-DEV] Providing improved functionality for escaping html (and other) output.
From: adamjonr@gmail.com (Adam Jon Richardson)

On Sat, Jan 5, 2013 at 6:26 PM, Stas Malyshev wrote:
> Hi!
>
>> It's important to escape output according to context. PHP provides
>> functions such as htmlspecialchars() to escape output when the context
>> is HTML. However, one often desires to allow some subset of HTML
>> through without escaping (e.g., , , etc.)
>
> I think what you are looking for is HtmlPurifier and such. Doing it in
> the core properly would be pretty hard.

Hi Stas,

HtmlPurifier is a fantastic tool, but it offers far more than I typically
need/want (e.g., CSS validation, attempting to remove malicious code,
fixes to illegal nesting, etc.). I guess I'm wondering about something
that, through some form of whitelisting, lets more through than
htmlspecialchars() but less than strip_tags() (as noted, attributes can
be problematic).

>> https://github.com/AdamJonR/nephtali-php-ext/blob/master/nephtali.c
>>
>
> Could you describe in detail what that function actually does, with
> examples?

Sure!

Function:

// Example userland code, commented to explain the flow:
function str_escape_html($string, $allowed_html = array(), $charset = 'UTF-8')
{
    // use htmlspecialchars because it only does the 5 most important chars,
    // and doesn't mess them up
    // start out safely by ensuring everything is html escaped (whitelisting approach)
    $escaped_string = htmlspecialchars($string, ENT_QUOTES, $charset);
    // check if there are whitelisted sequences which, if present, we can safely revert
    if ($allowed_html) {
        // cycle through the whitelisted sequences
        foreach ($allowed_html as $sequence) {
            // Save escaped version of sequence so we know what to revert safely.
            // This also works for regexes fairly well because <, >, &, ', and "
            // don't have special meaning in regexes, but character sets cause
            // trouble, something I've just learned to work around
            // http://php.net/manual/en/regexp.reference.meta.php
            $escaped_sequence = htmlspecialchars($sequence, ENT_QUOTES, $charset);
            // if the sequence begins and ends with a '/', treat it as a regex
            if (($sequence[0] == '/') && ($sequence[strlen($sequence) - 1] == '/')) {
                // revert regex matches, decoding with the same flags used to escape
                $escaped_string = preg_replace_callback(
                    $escaped_sequence,
                    function ($matches) use ($charset) {
                        return html_entity_decode($matches[0], ENT_QUOTES, $charset);
                    },
                    $escaped_string
                );
            // otherwise, treat it as a standard string sequence
            } else {
                // revert string sequences
                $escaped_string = str_replace($escaped_sequence, $sequence, $escaped_string);
            }
        }
    }
    return $escaped_string;
}

$input = '
click me
do not click the other bolded text
';

$draconian_bold_tag_regex = '/<b>[a-zA-Z_.,! ]+<\/b>/';

echo "strip tags: " . strip_tags($input, '

') . "

";
echo "htmlspecialchars: " . htmlspecialchars($input, ENT_QUOTES, 'UTF-8') . "

";
echo "str_escape_html: " . str_escape_html($input, array($draconian_bold_tag_regex, '
', '
', '
'), 'UTF-8');

Problems with above implementation:

Regexing large chunks of HTML is a pain, and easy to get wrong.
Additionally, for literals you have to enter the opening and closing tag
in the whitelist separately, and character sets can break because the
pattern itself is escaped before matching (e.g., [^<>] becomes
[^&lt;&gt;], which is a different character set).

Solutions:

What I'm looking to build is functionality that adds to htmlspecialchars
by implementing whitelisting through some primitive parsing (identify the
tags without regard for validating HTML, similar to strip_tags) and some
regexing (validate attribute contents), using a similar approach to the
above function. The new function would better break up the tag
components, making the required regexes much easier to work with. For
example:

$new = str_escape_html("Test", array(
    'a' => [
        // strings beginning and ending with '/' are considered regexes
        'href' => '/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/',
        // other strings are just evaluated as literals
        'class' => 'important'
    ],
    'br' => [] // has no allowed attributes
), "UTF-8");

Conclusion:

Bridging the gap between strip_tags and htmlspecialchars seems like a
reasonable consideration for PHP's core. While I do use HTMLPurifier or
other tools for more involved HTML processing, I often wish the core had
something just a bit more flexible than htmlspecialchars, but just a bit
more protective than strip_tags, that I could use for my typical escaping
needs.

I'm going to spend some time refining this approach in an extension, but
I was looking for feedback before proceeding further.

Thanks,

Adam
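
P.S. To make the "Solutions" idea a little more concrete, here is a rough
userland sketch of tag-based whitelisting. It is only an illustration
under assumptions, not the extension's code: the names
str_escape_html_tags() and attr_value_allowed() are hypothetical, the
sample input and the simplified href rule are made up, and the
"parsing" is a single preg_split() that accepts only quoted
name="value" attributes. Everything is escaped by default, and a tag is
emitted as markup only if its name is whitelisted and every attribute
passes its rule.

```php
<?php
// Hypothetical helper (not from the post): does $value satisfy $rule?
function attr_value_allowed(string $value, string $rule): bool
{
    // Convention from the post: rules that begin and end with '/' are regexes.
    if (strlen($rule) > 1 && $rule[0] === '/' && substr($rule, -1) === '/') {
        return preg_match($rule, $value) === 1;
    }
    // Any other rule is compared as a literal.
    return $value === $rule;
}

// Hypothetical name; $allowed maps tag name => [attribute name => rule].
function str_escape_html_tags(string $string, array $allowed = [], string $charset = 'UTF-8'): string
{
    // Primitive tokenizing: odd-numbered parts are tag-like, even-numbered parts are text.
    $parts = preg_split('/(<\/?[a-zA-Z][a-zA-Z0-9]*[^<>]*>)/', $string, -1, PREG_SPLIT_DELIM_CAPTURE);
    if ($parts === false) {
        return htmlspecialchars($string, ENT_QUOTES, $charset);
    }
    $out = '';
    foreach ($parts as $i => $part) {
        // The escaped form is the default; a tag has to earn its way back.
        $escaped = htmlspecialchars($part, ENT_QUOTES, $charset);
        if ($i % 2 === 0
            || !preg_match('/^<(\/?)([a-zA-Z][a-zA-Z0-9]*)([^<>]*?)\s*(\/?)>$/', $part, $m)
            || !array_key_exists(strtolower($m[2]), $allowed)) {
            $out .= $escaped;
            continue;
        }
        $name = strtolower($m[2]);
        if ($m[1] === '/') {
            // Closing tags pass only when they carry nothing else.
            $out .= ($m[3] === '' && $m[4] === '') ? "</$name>" : $escaped;
            continue;
        }
        // Accept only name="value" or name='value' pairs; anything else stays escaped.
        if (!preg_match('/^(?:\s+[a-zA-Z][a-zA-Z0-9-]*=(?:"[^"]*"|\'[^\']*\'))*$/', $m[3])) {
            $out .= $escaped;
            continue;
        }
        preg_match_all('/([a-zA-Z][a-zA-Z0-9-]*)=(["\'])(.*?)\2/s', $m[3], $attrs, PREG_SET_ORDER);
        $tag = "<$name";
        $ok = true;
        foreach ($attrs as $a) {
            $attr = strtolower($a[1]);
            // Validate the decoded value, since that is what a browser would see.
            $value = html_entity_decode($a[3], ENT_QUOTES, $charset);
            if (!isset($allowed[$name][$attr]) || !attr_value_allowed($value, $allowed[$name][$attr])) {
                $ok = false;
                break;
            }
            // Rebuild the attribute from validated parts instead of reusing raw input.
            $tag .= ' ' . $attr . '="' . htmlspecialchars($value, ENT_QUOTES, $charset) . '"';
        }
        $out .= $ok ? $tag . ($m[4] === '/' ? ' /' : '') . '>' : $escaped;
    }
    return $out;
}

// Made-up input: the <a> and <br> come back as markup, the <b onclick> stays escaped.
$rules = array(
    'a'  => ['href' => '/^https?:\/\/[a-z0-9.\/-]+$/', 'class' => 'important'],
    'br' => []
);
echo str_escape_html_tags('<a href="http://example.com/x" class="important">Test</a><br><b onclick="evil()">hi</b>', $rules);
```

One consequence of keeping the parsing this primitive: tags are judged
one at a time, so when an opening <a> is rejected for a bad href, its
bare closing </a> is still let through. That should be harmless, but a
real implementation would probably want to pair them.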