[RFC] Decoding HTML and the Ambiguous Ampersand

1 year ago by Jakob Givoni

Hi Dennis,

Overall it sounds like a reasonable RFC.

Dennis wrote:

> Niels wrote:
>
>> I'm not so sure that the name "decode_html" is self-descriptive enough,
>> it sounds very generic.
>
> The name is not very important to me. For the sake of history, the reason
> I have chosen “decode HTML” is because, unlike an HTML parser, this is
> focused on taking a snippet of HTML “text” content and decoding it into a
> “plain PHP string.”

Why not make it two methods called "decode_html_text" and
"decode_html_attribute"?
Consider the following reasons:

1. The function doesn't actually decode HTML as such; it decodes either an
   HTML text node string or an HTML attribute string.
2. It saves the $context parameter and the constants/enums, making the call
   significantly shorter.
3. Decoding text and decoding attributes feel like two significantly
   different things. I admit I could be wrong if code like
   decode_html($e->isAttribute() ? HtmlContext::Attribute :
   HtmlContext::Text, $e->getContent()) is likely to be seen, but I
   don't foresee many situations where text and attribute strings end up
   in the same code path.

A couple of other options that would silence anyone opposed to implicitly
favouring UTF-8:
html_text_to_utf8 and html_attribute_to_utf8
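
To make the comparison concrete, here is a rough sketch of how the two call
shapes might look side by side. Everything in it is an assumption for
illustration only: the HtmlContext enum and the decode_html() signature are
written the way they are used in this thread, and the body simply delegates
to the existing html_entity_decode() as a stand-in, so it does not implement
the context-sensitive rules the RFC is about.

```php
<?php
// Hypothetical stand-in for the enum discussed in the RFC thread.
enum HtmlContext
{
    case Attribute;
    case Text;
}

// Hypothetical single entry point. The real proposal would treat the two
// contexts differently; this placeholder ignores $context and delegates to
// html_entity_decode() purely so the sketch runs.
function decode_html(HtmlContext $context, string $html): string
{
    return html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// The two-function shape suggested above: the context is part of the name,
// so callers need neither the extra parameter nor the enum.
function decode_html_text(string $html): string
{
    return decode_html(HtmlContext::Text, $html);
}

function decode_html_attribute(string $html): string
{
    return decode_html(HtmlContext::Attribute, $html);
}

// Call sites compared.
$viaContext = decode_html(HtmlContext::Text, 'Fish &amp; chips');
$viaName    = decode_html_text('Fish &amp; chips');
```

If text and attribute values rarely share a code path, the second call style
reads shorter at every call site; if they often do, the single function with
a $context argument avoids a branch in the caller.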

Best,
Jakob