Hey Dennis,
This looks like top-posting because there is a lot to read (and it is well written), but I want to reply to some points inline.
Rob, no worries! I love your questions and I love being able to work together again, even in some limited fashion. Let me preface this for you and for everyone on the list: this is really hairy stuff, and it can at times require concentrated focus. When I started down this path a long time ago I knew very little about it. I’ve been knee-deep in it for years and I still feel like I learn something new every day.
On Fri, Aug 16, 2024, at 02:59, Dennis Snell wrote:
Thanks for the question, Rob, I hope this finds you well!
The RFC mentions that encoding must be utf-8. How are programmers supposed to work with this if the php file itself isn’t utf-8
From my experience it’s the opposite case that is more important to consider. That is, what happens when we mix UTF-8 source code with latin1, or UTF-8 source HTML with the system-set locale. I tried to hint at this scenario in the “Character encodings and UTF-8” section.
Let’s examine the fundamental breakdown case:
"é" === decode_html( "&eacute;" );
If the source is UTF-8 there’s no problem. If the source is ISO-8859-1 this will fail because xE9 is on the left while xC3 xA9 is on the right, except if zend.multibyte=1 and either zend.script_encoding=iso-8859-1 or declare(encoding='iso-8859-1') is set. The source code may or may not be converted into a different encoding based on configurations that most developers won’t have access to, or won’t examine. Even with source code in ISO-8859-1 and zend.script_encoding and zend.multibyte set, html_entity_decode() still reports UTF-8 unless zend.default_charset is set or one of the iconv or mbstring internal charsets is set.

I just want to pause here and say, “holy crap.” That is quite complex and those edges seem sharp!
The point I’m trying to make is that the current situation is a minefield due to a dizzying array of system-dependent settings. Most modern code will either be running UTF-8 source code, or will be converting source code to UTF-8, or many other things will already be hopelessly broken beyond this one issue.
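A rough sketch of that confusion with the existing function (assuming a UTF-8 source file; only the third call below depends on system settings, but all three are easy to get wrong):

// The same character reference produces different bytes depending on the charset argument.
var_dump( bin2hex( html_entity_decode( '&eacute;', ENT_QUOTES | ENT_HTML5, 'UTF-8' ) ) );      // string(4) "c3a9"
var_dump( bin2hex( html_entity_decode( '&eacute;', ENT_QUOTES | ENT_HTML5, 'ISO-8859-1' ) ) ); // string(2) "e9"
var_dump( bin2hex( html_entity_decode( '&eacute;', ENT_QUOTES | ENT_HTML5 ) ) );               // depends on default_charset and friends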
Unfortunately, we don’t always get to choose the code we work on. There is someone on this list using SHIFT_JIS. They probably know more about the ins and outs of dealing with utf-8 centric systems from that encoding. Hopefully they can comment more about why this would or would not be a bad idea.
This is, in fact, one of my primary motivations for standardizing on UTF-8. Keep in mind that HTML not only has a set of character encodings that must be supported, but also a requirement that parsers not support additional encodings https://html.spec.whatwg.org/#character-encodings outside of that list. This is based on security grounds, for good and even more complicated reasons.
All of the required character sets round-trip through UTF-8 as long as the text isn’t modified. In fact, almost every character set out there should round-trip in this way, because the Unicode Consortium’s goal, as far as I understand it, is to capture every possible character in writing in a single universal character set. This appears first in the introduction to the HTML specification https://html.spec.whatwg.org/#suggested-reading and is reiterated throughout the document: HTML requires the use of UTF-8 https://html.spec.whatwg.org/#charset, though allows legacy encodings (there really are no “invalid” HTML documents because parse errors have deterministic resolutions).
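As a small sketch of that round-trip property (using ISO-8859-1 as the legacy encoding purely for illustration):

$latin1 = "\xE9";                                                 // "é" as stored in ISO-8859-1
$utf8   = mb_convert_encoding( $latin1, 'UTF-8', 'ISO-8859-1' );  // becomes "\xC3\xA9"
$back   = mb_convert_encoding( $utf8, 'ISO-8859-1', 'UTF-8' );    // back to "\xE9"
var_dump( $latin1 === $back );                                    // bool(true)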
UTF-8 is the unifier that lets us escape this by having a defined and explicit encoding at the input and output.
Utf-8 is pretty good, right now, but I don’t think we should marry the language to it. Will it be “the standard” in 10 years, 20 years, 100 years? Languages change, cultures change. Some people I know use a font to change triple equals from a literal === to ≡. How long until php recognizes that as a literal operator?
But anyway, to get back on topic; I, personally, would rather see something more flexible, with sane defaults for utf-8.
Guarding against a future where UTF-8 is replaced is planning for an extremely unlikely scenario. UTF-8 is the most universal standard for the interchange of text content, prevalent in software, systems, and programming languages, even those with UTF-16 internals.
It’s a good moment to remind ourselves, however, that Unicode defines a table of character “code points,” which are a mapping from a natural number to a character. UTF-8 is an algorithm for storing those natural numbers in byte sequences.
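For a concrete sketch of that distinction (assuming this snippet lives in a UTF-8 source file):

// U+00E9 is the code point for "é"; UTF-8 is one way of writing that number as bytes.
var_dump( mb_ord( 'é', 'UTF-8' ) );  // int(233), i.e. 0xE9, the Unicode Code Point
var_dump( bin2hex( 'é' ) );          // string(4) "c3a9", the UTF-8 byte sequence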
It’s absolutely possible to plan for too much extensibility, and this is what I’ve seen happen with the existing HTML functions in PHP (with options to choose what to decode, which entities to use, into which encoding to decode, etc.). There’s an appearance of an awareness of text encoding, but the design of the function interfaces leads people to make decisions that open up all sorts of doors to corruption and security exploits.
So it wouldn’t matter to my RFC if another encoding were standardized, as long as one encoding is standardized. Today, I see no legitimate competition to UTF-8. The only encodings that come close are the two UTF-16 variants because of their prevalence in Java, JavaScript, and Objective-C strings, but the UTF-16 variable-width encoding suffers from a number of shortcomings compared to UTF-8 without providing much value in exchange.
When the day comes that UTF-8 is deprecated or replaced, major swaths of the internet will need an overhaul far beyond PHP. Or at least, I have a hard time imagining that going any other way.
or the input is meaningless in utf-8 or if changing it to utf-8 and back would result in invalid text?
There shouldn't be input that’s meaningless in UTF-8 if it’s valid in any other encoding. Indeed, I have placed the burden on the calling code to convert into UTF-8 beforehand, but that’s not altogether different than asking someone to declare into what encoding the character references ought to be decoded.
There’s a huge performance difference between converting a string to and from different encodings and instructing a function to parse in the string’s current encoding; the latter would also be useful when the page itself is not utf8.
It definitely seems this way when examining a single function in isolation, but I would challenge folks to look at how these functions are used in practice out in the wild. Typically I see strings transcoded multiple times, and usually based on the wrong encoding. For example, WordPress currently looks at its defined “blog_charset” to perform decoding, but most of the time the HTML input it gets isn’t encoded in the blog charset.
What would be a performance win is to decode and encode text at the application boundaries: convert once on input, process in a pipeline where everything agrees on the encoding, and convert once more on output. In a UTF-8 world this requires no conversion at all, and UTF-8 is the overwhelming majority of code in web applications today.
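A minimal sketch of that boundary discipline (assuming the legacy inbound encoding is ISO-8859-1; the guard here is illustrative, not part of the RFC):

// Validate or convert once at the inbound boundary; everything after this
// point can assume UTF-8 and skip further conversions.
if ( ! mb_check_encoding( $input, 'UTF-8' ) ) {
    $input = mb_convert_encoding( $input, 'UTF-8', 'ISO-8859-1' ); // assumed legacy encoding
}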
We can keep in mind too that there are two encodings in the picture. The HTML source document may be encoded in one encoding while the output might need to appear in another. Consider HTML stored in a database as latin1/ISO-8859-1. It stores “é is &eacute;”, except that, unlike in this email, the leading character é is the single byte xE9.

This output likely should be sent to a browser as UTF-8. It’s acceptable to send latin1, but most pages will have characters unrepresentable in latin1. The backend, in decoding the HTML, must then internally convert the input character encoding so that the leading é becomes the two-byte sequence xC3 xA9, and then decode &eacute; into xC3 xA9 as well.
Were the input “&#x1f170;” it simply could not be decoded into any single-byte encoding, and so the HTML could not be fully decoded. html_entity_decode() simply leaves the character reference in place. This kind of behavior tends to lead to double-encoding of the character references, and what the browser ends up rendering is the literal text “&#x1f170;” instead of 🅰.
-html_entity_decode( $html, ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, 'ISO-8859-1' );
+$html = mb_convert_encoding( $html, 'UTF-8', 'ISO-8859-1' );
+$html = decode_html( HTML_TEXT, $html );
+$html = mb_convert_encoding( $html, 'ISO-8859-1', 'UTF-8' );
If text in a given encoding can be converted into UTF-8 (and it should be able to), then it should also be able to convert back for all supported inputs. That is, converting into UTF-8 cannot produce a character that is unrepresentable in the source encoding, because any such character would have had to be in the source to begin with. Furthermore, if the HTML decodes into a code point unsupported in the destination encoding, the result would be invalid either directly via decoding or indirectly via conversion.
-"\x1A" === html_entity_decode( "&#x1f170;", ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, 'ISO-8859-1' );
+"?" === mb_convert_encoding( decode_html( HTML_TEXT, "&#x1f170;" ), 'ISO-8859-1', 'UTF-8' );
This gets really confusing because neither of these outputs is a proper decoding, as character encodings that don’t support the full Unicode code space cannot adequately represent all valid HTML inputs. HTML decoding is, by specification, a decoding into Unicode, so even in a browser with <meta charset="ISO-8859-1">&#x1f170; the text content will still be 🅰, not “?” or the invisible ASCII control code SUB.

I was of the understanding that meta charset was too late to set the encoding (but it’s been a while since I’ve read the HTML5 spec) and that the charset needed to be set in the html tag itself. I suppose browsers simply rewind upon hitting meta charset, but browsers have to deal with all kinds of shenanigans.
The algorithm for determining a document’s character set is straightforward, albeit with many steps. META elements within the first kilobyte of a document may determine the inferred encoding if one is not provided externally or from an HTTP header. Fun fact: if you find <meta charset="utf16"> then a browser will properly set the document encoding to UTF-8 (just as it ignores DOCTYPE declarations and treats all text/html content as HTML5).
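A rough illustration only (this is an assumed simplification, not the spec’s prescan algorithm, which also handles http-equiv declarations and stricter attribute parsing):

// Peek at the first 1,024 bytes for a META charset declaration.
$prefix = substr( $html, 0, 1024 );
if ( preg_match( '~<meta\s[^>]*charset\s*=\s*["\']?([^"\'\s>;]+)~i', $prefix, $m ) ) {
    $label = strtolower( trim( $m[1] ) );
    // Per the spec, UTF-16 labels resolve to UTF-8 at this stage.
    if ( in_array( $label, array( 'utf-16', 'utf-16le', 'utf-16be' ), true ) ) {
        $label = 'utf-8';
    }
}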
A while ago I ran some analysis on roughly 300,000 pages from a list of top-ranked domains. You can examine the raw data https://github.com/WordPress/wordpress-develop/files/15171826/charset-detections.csv and find some interesting bits in there. For instance, many HTML documents claim multiple incompatible character sets. Thankfully the HTML specification is clear on how to handle these situations. There really aren’t any shenanigans since 2008 when HTML5 formalized the parsing error modes.
That being said, there is nothing in the spec (that I remember seeing) stating it was Unicode only; just that it was the default.
See the above note on character encoding vs. Unicode character set. Unicode is in the introduction to the HTML spec and mentioned throughout; it’s in the “big picture” at the start. Even encodings like ISO-2022-JP and GB18030 map to Unicode code points and are simply different ways of representing those code points as sequences of bytes.
Further, html may be stored in a database in a certain encoding (as in content systems like WordPress or Drupal) where it may not be straightforward (or desirable) to convert to utf8.
See above again: this is actually one of the most dangerous parts of suggesting in a function signature that a developer pick a character encoding, particularly since it invites incompatible decoding of the source document. It’s completely fine to store content in a database in another encoding, and many legacy systems do. Those are best served by converting when reading from the database into UTF-8 and then encoding from UTF-8 when saving into the database. The database character set confusions make HTML’s look simple, but those are out of scope for this RFC. Stating clearly that the function expects UTF-8 is about the best way I’ve seen in practice to partner with application developers both to educate them and help them accomplish their goals.
The primary point to consider here is that these legacy systems unintentionally oversimplify the state of encoded text. Typically they are running UTF-8 source code matching against a mixture of encodings from various inputs, only one of which is the database. For example, these systems will often assume that the encoding of the content in the database is the same encoding outbound to a browser, inbound in $_POST parameters, and escaped in $_GET query arguments. If the database is not using UTF-8 these assumptions are almost always wrong, and thus security issues abound.
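A rough sketch of that boundary conversion (the $row variable, the column name, and the latin1 table encoding here are hypothetical):

// On read: convert from the table's legacy encoding into UTF-8.
$content = mb_convert_encoding( $row['post_content'], 'UTF-8', 'ISO-8859-1' );

// ... the application works entirely in UTF-8 ...

// On write: convert back into the table's legacy encoding.
$stored = mb_convert_encoding( $content, 'ISO-8859-1', 'UTF-8' );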
—
I’m sorry for being long-winded, but I think it’s necessary to frame these questions in the context of the problem today. We have very frequent errors that result from having the wrong defaults and a confusion of text encodings. I’ve seen far more problems from UTF-8 source code assuming that its input is also UTF-8 than from source code in some other encoding (likely ISO-8859-1, if not UTF-8) assuming that its input isn’t.
- It should be possible to convert any string into UTF-8 regardless of its origin character set, and then, transitively, it should be possible to convert back if the HTML represents text that is representable in the original character set.
There are a number of scripts/languages not yet supported (especially on older machines) that would result in “�” and could not be converted back to their original encoding. For example, there are still new scripts being added, as recently as two years ago: https://www.unicode.org/standard/supported.html
It’s absolutely true that new scripts are added, and someone else can confirm or correct me, but typically these appear first in Unicode, since Unicode has attempted to already swallow up all recorded digital text. When new scripts appear, it’s usually because someone found evidence of their use in physical writings and there was no previous digital record of them.
Do you have examples of languages that have digital records which are supported in the HTML specification which would result in substitution when decoding? Since HTML only encodes Unicode code points, I think the problem is that HTML cannot represent these characters, if they exist.
This is unrelated to UTF-8, because new scripts, characters, and emoji get assigned natural numbers: the code points. It’s up to text encodings to represent those indices into the character database tables.
- Converting at the boundaries of the application is the way to escape the confusion of wrestling an arbitrary number of different character sets.
I totally agree with this statement, but we should provide tools instead of dictating a policy.
It’s my intention never to take control from a developer. In this situation that freedom appears by converting before and after. For most situations, and for most systems, the most reliable, safe, and convenient thing to do will be to assume UTF-8, or to check whether a string is UTF-8 and reject it otherwise. In those situations doing nothing is the right behavior, and it preserves the correct parse within the domain in which decode_html() operates (again, neither preservation nor proper decoding can happen if it attempts to decode into latin1, as HTML documents are entirely Unicode documents and no single-byte encoding is able to capture that).
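A minimal sketch of that “assume UTF-8, or reject” approach (decode_html() and HTML_TEXT are the interfaces proposed in the RFC; throwing an exception is just one way a caller might reject non-UTF-8 input):

if ( ! mb_check_encoding( $html, 'UTF-8' ) ) {
    throw new InvalidArgumentException( 'Expected UTF-8 input for decode_html().' );
}
$decoded = decode_html( HTML_TEXT, $html );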
This is why I personally feel strongly about having safe defaults instead of dangerous ones, as is the unfortunate case with html_entity_decode(). All the better if we can educate each other through the function interfaces to clarify what is happening and what the expectations are or need to be.
Proper HTML decoding requires a character set capable of representing all of Unicode, as the code points in numeric character references refer to Unicode Code Points and not any particular code units or byte sequences in any particular encoding.
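For example (a small sketch, assuming UTF-8 as the output encoding):

// &#x1F170; names the Unicode Code Point U+1F170 ("🅰") no matter which encoding is in play;
// UTF-8 happens to store that code point as the four bytes F0 9F 85 B0.
var_dump( bin2hex( mb_chr( 0x1F170, 'UTF-8' ) ) ); // string(8) "f09f85b0"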
Almost every other character set is ASCII-compatible, including UTF-8, making the domain of problems where this arises even smaller than it might otherwise seem. For example, & is the same single byte in all of the common character sets.

Have a lovely weekend! And sorry for the potentially mis-threaded reply. I couldn’t figure out how to reply to your message directly because the digest emails were still stuck in 2020 for my account and I didn’t switch subscriptions until after your email went out, meaning I didn’t have a copy of your email.
— Rob
Warmly,
Dennis Snell