Hey Dennis,
This may look like top-posting because you've written a lot to read (and it's well written), but I want to reply to some points inline.
On Fri, Aug 16, 2024, at 02:59, Dennis Snell wrote:
Thanks for the question, Rob, I hope this finds you well!
The RFC mentions that encoding must be UTF-8. How are programmers supposed to work with this if the PHP file itself isn't UTF-8?
In my experience it's the opposite case that is more important to consider: what happens when we mix UTF-8 source code with latin1, or UTF-8 source HTML with the system-set locale? I tried to hint at this scenario in the "Character encodings and UTF-8" section.
Let’s examine the fundamental breakdown case:
"é" === decode_html( "&eacute;" );
If the source is UTF-8 there's no problem. If the source is ISO-8859-1 this will fail because xE9 is on the left while xC3 xA9 is on the right. Except if `zend.multibyte=1` and (`zend.script_encoding=iso-8859-1` or `declare(encoding='iso-8859-1')` is set). The source code may or may not be converted into a different encoding based on configurations that most developers won't have access to, or won't examine.

Even with source code in ISO-8859-1 and `zend.script_encoding` and `zend.multibyte` set, `html_entity_decode()` still reports UTF-8 unless `zend.default_charset` is set or one of the `iconv` or `mbstring` internal charsets is set.
I just want to pause here and say, “holy crap.” That is quite complex and those edges seem sharp!
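To make those sharp edges concrete for myself, here's the byte-level view (a sketch only; since `decode_html()` is the RFC's proposal and not yet shipped, I'm using `html_entity_decode()` to show the bytes):

// The character reference always decodes to the UTF-8 bytes C3 A9 here:
$decoded = html_entity_decode( '&eacute;', ENT_QUOTES | ENT_HTML5, 'UTF-8' );
var_dump( bin2hex( $decoded ) ); // string(4) "c3a9"

// A literal "é" typed into a file saved as ISO-8859-1 is the single byte E9:
var_dump( bin2hex( "\xE9" ) ); // string(2) "e9"

// So the comparison pits E9 against C3 A9 and fails, unless the engine
// transcoded the script first (zend.multibyte + zend.script_encoding).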
The point I'm trying to make is that the situation today is a minefield due to a dizzying array of system-dependent settings. Most modern code will either be running UTF-8 source code, or will be converting source code to UTF-8, or many other things will already be hopelessly broken beyond this one issue.
Unfortunately, we don't always get to choose the code we work on. There is someone on this list using SHIFT_JIS. They probably know more about the ins and outs of dealing with UTF-8-centric systems from that encoding. Hopefully they can comment on why this would or would not be a bad idea.
UTF-8 is the unifier that lets us escape this by having a defined and explicit encoding at the input and output.
UTF-8 is pretty good right now, but I don't think we should marry the language to it. Will it be "the standard" in 10 years, 20 years, 100 years? Languages change, cultures change. Some people I know use a font that renders a literal === as ≡. How long until PHP recognizes that as a literal operator?
But anyway, to get back on topic: I, personally, would rather see something more flexible, with sane defaults for UTF-8.
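Purely as a hypothetical shape (this signature and its $encoding parameter are my invention, not the RFC's):

// Hypothetical only, not the RFC's API:
// function decode_html( string $html, string $encoding = 'UTF-8' ): string

decode_html( $html ); // sane default: assume UTF-8
decode_html( $html, 'ISO-8859-1' ); // parse in the document's own encoding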
or the input is meaningless in UTF-8, or if changing it to UTF-8 and back would result in invalid text?
There shouldn't be input that's meaningless in UTF-8 if it's valid in any other encoding. Indeed, I have placed the burden on the calling code to convert into UTF-8 beforehand, but that's not altogether different from asking someone to declare into what encoding the character references ought to be decoded.
There's a huge performance difference between converting a string to and from different encodings versus instructing a function what encoding to parse in; the latter would also be useful when the page itself is not UTF-8.
-html_entity_decode( $html, ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, 'ISO-8859-1' );
+$html = mb_convert_encoding( $html, 'UTF-8', 'ISO-8859-1' );
+$html = decode_html( HTML_TEXT, $html );
+$html = mb_convert_encoding( $html, 'ISO-8859-1', 'UTF-8' );
If an encoding can go into UTF-8 (which it should) then it should also be able to return for all supported inputs. That is, we cannot convert into UTF-8 and produce a character that is unrepresentable in the source encoding, because that would imply it was there in the source to begin with. Furthermore, if the HTML decodes into a code point unsupported in the destination encoding, it would be invalid either directly via decoding, or indirectly via conversion.
-"\x1A" === html_entity_decode( "&#x1F170;", ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, 'ISO-8859-1' );
+"?" === mb_convert_encoding( decode_html( HTML_TEXT, "&#x1F170;" ), 'ISO-8859-1', 'UTF-8' );
This gets really confusing because neither of these outputs is a proper decoding, as character encodings that don't support the full Unicode code space cannot adequately represent all valid HTML inputs. HTML decoding is defined in terms of Unicode by specification, so even in a browser with `<meta charset="ISO-8859-1">`, the markup `&#x1F170;` still becomes the text content 🅰, not `?` or the invisible ASCII control code SUB.
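Running the two branches side by side shows the difference (a sketch; `decode_html()` isn't shipped, so the second line substitutes the equivalent `mb_convert_encoding()` call on the already-decoded code point):

// What html_entity_decode() does today, per the example above:
$sub = html_entity_decode( '&#x1F170;', ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, 'ISO-8859-1' );
var_dump( bin2hex( $sub ) ); // "1a", the invisible SUB control byte

// What convert-at-the-boundary yields instead (mbstring's default
// substitute character for unmappable code points is "?"):
var_dump( mb_convert_encoding( "\u{1F170}", 'ISO-8859-1', 'UTF-8' ) ); // string(1) "?"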
I was under the impression that meta charset came too late to set the encoding (but it's been a while since I've read the HTML5 spec) and that the charset needed to be set on the html tag itself. I suppose browsers simply rewind upon hitting meta charset, but browsers have to deal with all kinds of shenanigans.
That being said, there is nothing in the spec (that I remember seeing) stating it was Unicode only; just that it was the default.
Further, HTML may be stored in a database in some particular encoding (as in content systems like WordPress or Drupal) where it may not be straightforward (or desirable) to convert to UTF-8.
—
I'm sorry for being long-winded, but I think it's necessary to frame these questions in the context of the problem today. We see very frequent errors that result from having the wrong defaults and a confusion of text encodings. I've seen far more problems from source code that is UTF-8 assuming its input is too, than from source code in anything else (likely ISO-8859-1 if not UTF-8) assuming the input isn't.
- It should be possible to convert any string into UTF-8 regardless of its origin character set, and then transitively, if it originated there, it should be able to convert back if the HTML represents text that is representable in the original character set.
There are a number of scripts/languages not yet supported (especially on older machines) that would result in "�" and cannot be transcribed back to their original encoding. For example, new scripts were still being added as recently as two years ago: https://www.unicode.org/standard/supported.html
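To be fair, for text that does map cleanly, the round trip is lossless (a sketch assuming the mbstring extension):

$latin1 = "\xE9"; // "é" in ISO-8859-1
$utf8 = mb_convert_encoding( $latin1, 'UTF-8', 'ISO-8859-1' ); // "\xC3\xA9"
var_dump( $latin1 === mb_convert_encoding( $utf8, 'ISO-8859-1', 'UTF-8' ) ); // bool(true)

// But once text arrives as U+FFFD ("�"), the original bytes are already
// gone; no conversion can recover them.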
- Converting at the boundaries of the application is the way to escape the confusion of wrestling with an arbitrary number of different character sets.
I totally agree with this statement, but we should provide tools instead of dictating a policy.
Proper HTML decoding requires a character set capable of representing all of Unicode, as the code points in numeric character references refer to Unicode Code Points and not any particular code units or byte sequences in any particular encoding.
Almost every other character set is ASCII-compatible, including UTF-8, making the domain of problems where this arises even smaller than it might otherwise seem. For example, `&amp;` decodes to `&` in all of the common character sets.
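A quick check of that claim, showing the same numeric reference landing on the same code point with only the byte serialization differing:

var_dump( bin2hex( html_entity_decode( '&#233;', ENT_QUOTES | ENT_HTML5, 'UTF-8' ) ) ); // string(4) "c3a9"
var_dump( bin2hex( html_entity_decode( '&#233;', ENT_QUOTES | ENT_HTML5, 'ISO-8859-1' ) ) ); // string(2) "e9"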
Have a lovely weekend! And sorry for the potentially mis-threaded reply. I couldn't figure out how to reply to your message directly because the digest emails were still stuck in 2020 for my account, and I didn't switch subscriptions until after your email went out, meaning I didn't have a copy of your email.
— Rob