[RFC] Decoding HTML and the Ambiguous Ampersand

10 months ago by Dennis Snell — view source

unread

Hi Dennis,

Overall it sounds like a reasonable RFC.

Dennis:

Niels:

I'm not so sure that the name "decode_html" is self-descriptive enough, it sounds very generic.

The name is not very important to me. For the sake of history, the reason I have chosen “decode HTML” is because, unlike an HTML parser, this is focused on taking a snippet of HTML “text” content and decoding it into a “plain PHP string.”

Why not make it two methods called "decode_html_text" and "decode_html_attribute"?
Consider the following reasons:

The function doesn't actually decode html as such, it decodes either an html text node string or an html attribute string.

Thanks Jakob. In WordPress I did just this.
https://developer.wordpress.org/reference/classes/wp_html_decoder/

Part of the reason for that was the inability to require something like an enum (due to PHP version support requirements). The Enum solution feels very nice too.

Saves the $context parameter and the constants/enums, making the call significantly shorter.

In my PR I’ve actually expanded the Enum to include a few other contexts. I feel like there’s a balance we have to do if we want to ride the line between fully reliable and fully convenient. On one hand, we could say “don’t send the text content of a SCRIPT element to this function!” But on the other hand, that kind of forces people to expect that SCRIPT content is different.

With the Enum there is that in-built training material when someone looks and finds Attribute | BodyText | ForeignText | Script | Style (the contexts I’ve explored in my PR).

We could make the same argument for decode_html_script() and decode_foreign_text_node() and decode_html_style(). Somehow the context feels cleaner to me, and like a single entry point for learning instead of five.

It feels like decoding either text or attribute are two significantly different things. I admit I could be wrong, if code like decode_html($e->isAttritbute() ? HtmlContext::Attribute : HtmlContext::Text, $e->getContent()) is likely to be seen.

None of these contexts are significantly different, which is one of the major dangers of using html_entity_decode(). The results will look just about right most of the time. It’s the subtle differences that matter most, I suppose. Thankfully, in most places I’ve seen them blurred together, the intent of the code someone is writing understands which is which.

preg_replace_callback(
    ‘~<a[^>]+href=“([^”]+)”[^>]*>([^<]+)</a>~’,
    function ( $m ) {
        $title = str_replace( ‘]’, ‘\]’, html_entity_decode( $m[2] ) );
        $url = str_replace( ‘)’, ‘\)’, html_entity_decode( $m[1] ) );
        return “[{$title}]({$url})”;
    }
    $post_content
);

The lesson I have drawn is that people frequently have what they understand to be a text node or an attribute value, but they aren’t aware that they are supposed to decode differently, and they also aren’t reaching to interact with a full parser to get these values. If PHP could train people as they use these functions, purely through their interfaces, I think that could help elevate the level of reliability out there in the wild, as long as they aren’t too cumbersome (hence explicitly no default context argument or using separately-named functions).

Having the Enum I think enhances the ease with which people can reliably also decode things like SCRIPT and STYLE nodes. “I know html_decode_text() but I don’t know what the rules for SCRIPT are or if they’re different so I’ll just stick with that.” vs “My IDE suggests that Script is a different context, that’s interesting, I’ll try that and see how it’s different."

But I somehow don't foresee a lot of situations where text and attribute strings end up in the same code path?

The underlying reason I started this work was in support of building an HTML parser. We have a streaming parser which relies on a different parsing model than those built purely on the state machine in the specification, taking advantage of what we can to eek out performance in PHP code. For this, the strings are in the same path, and in this work I’ve come across a number of other common use-cases where the flow is the same but the decoder needs to know the context.

Normalizing HTML from “tag soup” to standard serialized form.
Sanitizing code wanting to inspect values from different parts of the markup.
Sanitizing rules engines providing configurations or DSLs for sanitization.
Live optimizers or analyzers to improve the output HTML leaving a server.

It’s one of those things that when it becomes trivial to start getting reliable transforms from the HTML syntax to the decoded text, more opportunities appear that never seemed practical before.

A couple of other options that would silence anyone opposed to implicitly favouring utf-8:
html_text_to_utf8 and html_attribute_to_utf8

The names started with these 😀. I do agree that it gets a bit excessive though to the point where it risks people not adopting them purely because they don’t want to type that long of a name every time they use it. Perhaps some of these 🙃

str_from_html( HtmlContext $context, string $html ): string {}

utf8_from_html( HtmlContext $context, string $html ): string {}

html_to_utf8( HtmlContext $context, string $html ): string {}

Best,
Jakob

Thanks for your input. I’m grateful for the discussions and that people are sharing.

Dennis Snell