[RFC] Decoding HTML and the Ambiguous Ampersand

1 year ago by Dennis Snell

If we could have a single implementation, that would be great. I do understand of course your concern that DOM is not a required extension, and therefore basing the internals on Lexbor makes it tied to the DOM extension which may not be available. I however suspect that a large chunk of people needing a function like this have DOM available (as DOM is required by many HTML-processing-related packages). I can also look into it sometime soon if you want; anyway feel free to ping me.

I’m also very open to lexbor-based approaches, but so far I’ve found it more complicated than I expected. In part this is because it involves setting up the parser and state machine for the HTML specification, whereas much of the actual decoding can be safely done without them.

The other part is the extension aspect. I hear you, that you would expect calling code to have the DOM extensions available, but that’s simply not the case when developing a platform like WordPress, which I do. We don’t have control over the servers or environments where people are deploying this, and the availability of the DOM extensions is low enough that WordPress code simply cannot use DOMDocument (and it shouldn’t anyway, given the wild problems DOMDocument has when attempting to parse HTML).

People resort to html_entity_decode() because that’s the only option. In WordPress we now have a spec-compliant decoder, but as it’s in user-space PHP its performance is far below what’s possible.
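To make the constraint concrete, here is a minimal sketch of how platform code might probe for the optional DOM tooling and otherwise fall back to html_entity_decode(). The function names available_html_tooling() and decode_text_fallback() are hypothetical, chosen only for illustration; they are not part of WordPress or php-src. The fallback is the built-in decoder, which is exactly the non-spec-compliant option described above.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: reports which HTML tooling this runtime offers.
 * Nothing here is guaranteed on a given host except html_entity_decode(),
 * which lives in the always-present standard extension.
 */
function available_html_tooling(): array
{
    return [
        // ext/dom is optional and may be missing on shared hosts.
        'ext_dom'       => extension_loaded('dom'),
        // The lexbor-backed HTML document class needs PHP 8.4+ and ext/dom.
        'html_document' => class_exists('Dom\HTMLDocument'),
    ];
}

/**
 * Hypothetical fallback: the only decoder that is always available.
 * It does not apply the HTML specification's context-dependent rules,
 * so treat its output as best-effort.
 */
function decode_text_fallback(string $text): string
{
    return html_entity_decode($text, ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, 'UTF-8');
}

// Usage: inspect what this environment provides, then decode a simple string.
var_dump(available_html_tooling());
echo decode_text_fallback('Fish &amp; chips'), PHP_EOL;
```

A platform that cannot dictate its hosting environment ends up on the fallback path everywhere, because code written against the optional path cannot be relied on.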

I’d love your help in setting up lexbor’s state machine to decode text nodes. I’d love it even more if this could be part of the PHP language. It constantly surprises me that the language of the web (PHP) doesn’t have the tools to speak the language of the web (HTML). This RFC is all about taking a step towards ensuring that PHP developers can rely on PHP to be a reliable middle-man between the HTML domain and the PHP domain.

In other words, requiring the DOM extension or DOM\HtmlDocument would be such a non-starter for WordPress (accounting for 43% of the web today) that it would be completely unavailable.

Well, I don't think it would be a big deal to move the bundled lexbor to
somewhere where it is always available. I mean, so far it's only used
by ext/dom so it's bundled there, but if other parts of the php-src code
base would use it, we could put it elsewhere.

Having a DOM parser for HTML in PHP itself without requiring an extension would open up many new possibilities. For example, WordPress test suites don’t have any functional “assertEquivalentMarkup()” functions because there’s no spec-compliant parser in PHP. We finally wrote our own user-space HTML parser, but relying on DOM\HtmlDocument would be much easier.

These test suites need to run on a variety of environments and PHP versions, so it’s moot thinking we could hasten the use of a native class to get the job done, but if it remains locked inside an optional extension, it may be borderline impossible to ever migrate to it.

Christoph

Dennis Snell