Newsgroups: php.internals
Date: Sun, 6 Jan 2013 15:14:10 -0500
Subject: Re: [PHP-DEV] Providing improved functionality for escaping html (and other) output.
From: adamjonr@gmail.com (Adam Jon Richardson)
To: Stas Malyshev
Cc: "internals@lists.php.net"
In-Reply-To: <50E95C4E.3060609@sugarcrm.com>
References: <50E8B6B2.1030404@sugarcrm.com> <50E95C4E.3060609@sugarcrm.com>

On Sun, Jan 6, 2013 at 6:13 AM, Stas Malyshev wrote:

> What is supposed to be in $allowed_html? If those are simple fixed
> strings and such, why can't you just do preg_split with
> PREG_SPLIT_DELIM_CAPTURE and encode each other element of the result, or
> PREG_SPLIT_OFFSET_CAPTURE if you need something more interesting?

I like to start out conservatively, with everything in a "safe" (i.e., fully escaped) state, and then revert from there. If something slips through, it slips through in a conservative state.

> I would seriously advise though against trying to do HTML parsing with
> regexps unless they are very simple, since browsers will accept a lot of
> broken HTML and will happily run scripts in it, etc.

I agree, which is why I proposed an improved approach in which regexes would only optionally be used to validate attributes.

> I think with level of complexity that is needed to cover anything but
> the most primitive cases, you need a full-blown HTML/XML parser there.
> Which we do have, so why not use any of them instead of reinventing
> them, if that's what you need?

I don't think the simple state machine utilized by strip_tags() is a full-blown HTML/XML parser, yet I find it does provide practical value. Merging this type of state machine with the ability to check attributes (via literal string or regex) would be an incremental step beyond what is present now, and would prove practically beneficial.

I gave this example as one way to implement this approach through an API:

    $new = str_escape_html("Test", array(
        'a' => [
            'href' => '/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w\.-]*)*\/?$/',
            'class' => 'important'
        ],
        'br' => []
    ), "UTF-8");

The idea is that a string would first be escaped using htmlspecialchars(), and then a state machine similar to the one used by strip_tags() would scan the text for the escaped form of tags. When whitelisted tags are encountered, their attributes are checked against string literals or regexes. If the tag and its attributes match the whitelisted form, the tag sequence is unescaped.

One could also augment the strip_tags() function so that whitelist items could allow only specific attributes through:

    $new = strip_tags("Test", "");

The colon-prepended symbols could allow predefined attributes according to a regex. Any unlisted attributes would be stripped.

Thank you for the feedback, Stas.

Adam
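
To make the escape-first, selectively-unescape idea described above concrete, here is a rough userland sketch. It is not the proposed str_escape_html() itself. The function name sketch_escape_html() is hypothetical, a regex stands in for the strip_tags()-style state machine, only double-quoted attributes are recognized, and treating any rule that starts with "/" as a regex (and anything else as a literal) is an assumption made purely for illustration. The URL pattern in the usage lines is also a simplified stand-in.

```php
<?php
// Hypothetical sketch: escape everything, then restore only whitelisted tags.
function sketch_escape_html(string $input, array $allowed, string $charset = 'UTF-8'): string
{
    // Step 1: start from the fully escaped, conservative state.
    $escaped = htmlspecialchars($input, ENT_QUOTES, $charset);

    // Step 2: find the escaped form of tags. A regex replaces the state
    // machine here; attribute values may not contain an escaped quote.
    $value = '(?:(?!&quot;).)*';
    $tagPattern = '/&lt;(\/?)([a-zA-Z][a-zA-Z0-9]*)'
        . '((?:\s+[a-zA-Z-]+=&quot;' . $value . '&quot;)*)\s*&gt;/s';
    $attrPattern = '/\s+([a-zA-Z-]+)=&quot;(' . $value . ')&quot;/s';

    return preg_replace_callback($tagPattern, function (array $m) use ($allowed, $attrPattern) {
        list($whole, $slash, $name, $attrText) = $m;
        $name = strtolower($name);

        if (!array_key_exists($name, $allowed)) {
            return $whole; // tag not whitelisted: stays escaped
        }
        if ($slash === '/') {
            // Closing tags must not carry attributes.
            return $attrText === '' ? "</$name>" : $whole;
        }

        // Step 3: check every attribute against its literal or regex rule.
        preg_match_all($attrPattern, $attrText, $attrs, PREG_SET_ORDER);
        $rebuilt = '';
        foreach ($attrs as $attr) {
            $attrName = strtolower($attr[1]);
            if (!isset($allowed[$name][$attrName])) {
                return $whole; // unlisted attribute: whole tag stays escaped
            }
            $rule = $allowed[$name][$attrName];
            $decoded = htmlspecialchars_decode($attr[2], ENT_QUOTES);
            // Assumption: a rule beginning with "/" is a regex, else a literal.
            $ok = (is_string($rule) && $rule !== '' && $rule[0] === '/')
                ? preg_match($rule, $decoded) === 1
                : $decoded === $rule;
            if (!$ok) {
                return $whole;
            }
            // The value itself stays in escaped form inside the attribute.
            $rebuilt .= ' ' . $attrName . '="' . $attr[2] . '"';
        }

        // Step 4: only the tag delimiters and quotes are unescaped.
        return "<$name$rebuilt>";
    }, $escaped);
}

$rules = array(
    'a' => ['href' => '/^https?:\/\/[a-z0-9.-]+(\/[\w.\/-]*)?$/i', 'class' => 'important'],
    'br' => []
);
echo sketch_escape_html('<a href="http://example.com" class="important">Test</a><br><script>alert(1)</script>', $rules), "\n";
```

Because the whole string is escaped before anything else happens, a tag the pattern fails to recognize simply remains escaped. One wart of this simplified version is that opening and closing tags are judged independently, so a rejected opening tag can leave a restored closing tag behind; a real state machine could track that.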