[RFC] Decoding HTML and the Ambiguous Ampersand

1 year ago by Dennis Snell — view source — reply

unread

Hi Dennis,

Overall it sounds like a reasonable RFC.

Dennis:

Niels:

I'm not so sure that the name "decode_html" is self-descriptive enough, it sounds very generic.

The name is not very important to me. For the sake of history, the reason I have chosen “decode HTML” is because, unlike an HTML parser, this is focused on taking a snippet of HTML “text” content and decoding it into a “plain PHP string.”

Why not make it two methods called "decode_html_text" and "decode_html_attribute"?
Consider the following reasons:

The function doesn't actually decode html as such, it decodes either an html text node string or an html attribute string.

Thanks Jakob. In WordPress I did just this.
https://developer.wordpress.org/reference/classes/wp_html_decoder/

Part of the reason for that was the inability to require something like an enum (due to PHP version support requirements). The Enum solution feels very nice too.

Saves the $context parameter and the constants/enums, making the call significantly shorter.

In my PR I’ve actually expanded the Enum to include a few other contexts. I feel like there’s a balance we have to do if we want to ride the line between fully reliable and fully convenient. On one hand, we could say “don’t send the text content of a SCRIPT element to this function!” But on the other hand, that kind of forces people to expect that SCRIPT content is different.

With the Enum there is that in-built training material when someone looks and finds Attribute | BodyText | ForeignText | Script | Style (the contexts I’ve explored in my PR).

We could make the same argument for decode_html_script() and decode_foreign_text_node() and decode_html_style(). Somehow the context feels cleaner to me, and like a single entry point for learning instead of five.

Yes. With 5 different contexts it's starting to shift in favor of a single function :-)
I only saw the RFC which from what I can tell still only features 2 of them. I haven't seen the PR (RFC Implementation section says "Yet to come").

Oops, I’ll get to this!

It feels like decoding either text or attribute are two significantly different things. I admit I could be wrong, if code like decode_html($e->isAttritbute() ? HtmlContext::Attribute : HtmlContext::Text, $e->getContent()) is likely to be seen.

None of these contexts are significantly different, which is one of the major dangers of using html_entity_decode(). The results will look just about right most of the time. It’s the subtle differences that matter most, I suppose.

Well, that was kind of what I meant - even if the differences are usually absent or subtle, they are significant (i.e. not necessarily big, but meaningful), meaning using it wrong would give the wrong result, right? Saying that they are not significantly different to me means that the result would just be a little less good sometimes, not directly wrong.

In hindsight I think I misunderstood what you were saying and got it backwards. I meant that the algorithms are subtly different, but as you point out, yes, the outcomes can be significant. ln the better cases we get data corruption, but these do lead to misidentification of unsafe content.

For example, “&#x6aa\x00avascript” should decode as “javascript” when rendered by a browser when found inside the BODY of a page, but an attribute should read “ja�vascript.”

The lesson I have drawn is that people frequently have what they understand to be a text node or an attribute value, but they aren’t aware that they are supposed to decode differently, and they also aren’t reaching to interact with a full parser to get these values. If PHP could train people as they use these functions, purely through their interfaces, I think that could help elevate the level of reliability out there in the wild, as long as they aren’t too cumbersome (hence explicitly no default context argument or using separately-named functions).

Having the Enum I think enhances the ease with which people can reliably also decode things like SCRIPT and STYLE nodes. “I know html_decode_text() but I don’t know what the rules for SCRIPT are or if they’re different so I’ll just stick with that.” vs “My IDE suggests that Script is a different context, that’s interesting, I’ll try that and see how it’s different."

That is a good point and using enums favours that learning push since they are inherently grouped together.

Best,
Jakob

Thanks for your input. I’m grateful for the discussions and that people are sharing.

Cheers!

Warmly,
Dennis Snell