[RFC] Decoding HTML and the Ambiguous Ampersand

10 months ago by Jakob Givoni — view source

unread

On Sat, Aug 24, 2024 at 10:31 PM Dennis Snell dennis.snell@automattic.com
wrote:

Hi Dennis,

Overall it sounds like a reasonable RFC.

Dennis:

Niels:

I'm not so sure that the name "decode_html" is self-descriptive
enough, it sounds very generic.

The name is not very important to me. For the sake of history, the
reason I have chosen “decode HTML” is because, unlike an HTML parser, this
is focused on taking a snippet of HTML “text” content and decoding it into
a “plain PHP string.”

Why not make it two methods called "decode_html_text" and
"decode_html_attribute"?
Consider the following reasons:

The function doesn't actually decode html as such, it decodes either an
html text node string or an html attribute string.

Thanks Jakob. In WordPress I did just this.
https://developer.wordpress.org/reference/classes/wp_html_decoder/

Part of the reason for that was the inability to require something like an
enum (due to PHP version support requirements). The Enum solution feels
very nice too.

Saves the $context parameter and the constants/enums, making the call
significantly shorter.

In my PR I’ve actually expanded the Enum to include a few other contexts.
I feel like there’s a balance we have to do if we want to ride the line
between fully reliable and fully convenient. On one hand, we could
say “don’t send the text content of a SCRIPT element to this function!” But
on the other hand, that kind of forces people to expect that SCRIPT content
is different.

With the Enum there is that in-built training material when someone looks
and finds Attribute | BodyText | ForeignText | Script | Style (the
contexts I’ve explored in my PR).

We could make the same argument for decode_html_script() and
decode_foreign_text_node() and decode_html_style(). Somehow the context
feels cleaner to me, and like a single entry point for learning instead of
five.

Yes. With 5 different contexts it's starting to shift in favor of a single
function :-)
I only saw the RFC which from what I can tell still only features 2 of
them. I haven't seen the PR (RFC Implementation section says "Yet to
come").

It feels like decoding either text or attribute are two significantly
different things. I admit I could be wrong, if code like
decode_html($e->isAttritbute() ? HtmlContext::Attribute :
HtmlContext::Text, $e->getContent()) is likely to be seen.

None of these contexts are significantly different, which is one of the
major dangers of using html_entity_decode(). The results will look just
about right most of the time. It’s the subtle differences that matter most,
I suppose.

Well, that was kind of what I meant - even if the differences are usually
absent or subtle, they are significant (i.e. not necessarily big, but
meaningful), meaning using it wrong would give the wrong result, right?
Saying that they are not significantly different to me means that the
result would just be a little less good sometimes, not directly wrong.

The lesson I have drawn is that people frequently have what they
understand to be a text node or an attribute value, but they aren’t aware
that they are supposed to decode differently, and they also aren’t reaching
to interact with a full parser to get these values. If PHP could train
people as they use these functions, purely through their interfaces, I
think that could help elevate the level of reliability out there in the
wild, as long as they aren’t too cumbersome (hence explicitly no
default context argument or using separately-named functions).

Having the Enum I think enhances the ease with which people can reliably
also decode things like SCRIPT and STYLE nodes. “I know
html_decode_text() but I don’t know what the rules for SCRIPT are or if
they’re different so I’ll just stick with that.” vs “My IDE suggests that
Script is a different context, that’s interesting, I’ll try that and see
how it’s different."

That is a good point and using enums favours that learning push since they
are inherently grouped together.

Best,
Jakob

Thanks for your input. I’m grateful for the discussions and that people
are sharing.

Cheers!