Pre-RFC: Fixing spec bugs in the DOM extension

1 year ago by Niels Dossche

Hi internals

The DOM extension in PHP is used to parse, query and manipulate XML/HTML documents. The DOM extension is based on the DOM specification.
Originally this was the DOM Core Level 3 specification, but nowadays, that specification has evolved into the current "Living Specification" maintained by WHATWG.

Unfortunately, there are many bugs in PHP's DOM extension. Most of those bugs are related to namespace and attribute handling. This leads to people trying to work around those bugs by relying on more bugs, or on undocumented side-effects of incorrect behaviour, leading to even more issues in the end. Furthermore, some of these bugs may have security implications [1].

Some of these bugs exist because a method or property was implemented incorrectly back in the day, or because the original specification was unclear. A smaller share exists because the specification made breaking changes when HTML 5 first came along and the specification authors had to unify what browsers implemented into a single specification that everyone agreed on.

It's not possible to "just fix" these bugs because people actually rely on these bugs. They are also often unaware that what they're doing is actually incorrect or causes the internal document state to be inconsistent. We therefore have to fix this in a backwards-compatible way: i.e. a hard requirement is that all code written for the current DOM extension keeps working without requiring changes.
In short: the main problem is that 20 years of buggy behaviour means that the bugs have become ingrained into the system.

Some people have implemented userland DOM libraries on top of the existing DOM extension. However, even userland solutions can't fully work around issues caused by PHP's DOM extension. The real solution is to provide a BC-preserving fix at PHP's side.

Roughly 1.5 months ago I merged my HTML 5 RFC [2] into the PHP 8.4 development branch. This RFC introduced new document classes: DOM\HTMLDocument and DOM\XMLDocument. The idea here was to preserve backwards compatibility: users who want to keep using HTML 4 can keep using the DOMDocument class. Users who want to work with HTML 5 and are currently using workarounds can migrate at their own pace (without deprecations or anything) to the new classes. New code can use DOM{HTML,XML}Document from the start without touching the old classes.

The HTML 5 RFC has left us with an interesting opportunity to also introduce the spec bugfixes in a BC-preserving way. The idea is that when the new DOM{HTML,XML}Document classes are used, then the DOM extension will follow the DOM specification and therefore get rid of bugs. When you are using the DOMDocument class, the old implementations will be used. This means that backwards compatibility is kept.

For the past 2.5 weeks I've been working on getting all spec bugs that I know of fixed. The full list of bugs that this proposal fixes can be found here: https://github.com/nielsdos/php-src/blob/dom-spec-compliance-pub/bugs.md. I also found some discussion [3] from some years ago where C. Scott shared a list of problems they encountered at Wikimedia [4]. All behavioural issues are fixed in my PR [5], although my PR could always use more testing. Currently I have tested that existing DOM code does not break (I have tested veewee's XML library, Mensbeam library, some SimpleSAML libraries). I have added tests to test the new spec-compliant behaviour. I also ported some of the WHATWG's WPT DOM tests (DOM spec-compliance testsuite) to PHP and those that I've ported all pass [6].

Implementation PR can be found here: https://github.com/php/php-src/pull/13031

Note that this is not a new extension, but an improvement to the existing DOM extension. As for "why not an entirely new extension?", please see the reasoning in my HTML 5 RFC. All interactions with SimpleXML, XSL, XPath etc will remain possible like you are used to. Implementation-wise, a lot of code internally is shared between the spec-compliant and old implementations.

I intend to put this up for RFC. There is however one last detail that needs to be cleared up: what about "type issues"?
To give an example of a "type issue": there is a string DOMNode::$prefix property. The DOM spec tells us that this should be nullable: when there is no prefix for a node, the prefix should be NULL. However, because the property is typed as string, PHP currently returns an empty string instead. Not a big deal maybe, but there are many of these subtle inconsistencies: null vs false return values, arguments that should accept ?string instead of string, etc.
Sadly, it's not possible to fix the typing issues for properties and methods for DOMNode, DOMElement, ... because of BC: properties and methods can be overridden.
Or is it?
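
A minimal sketch of the $prefix example, assuming PHP 8 and the legacy DOMDocument class. The document content and variable names are made up for illustration:

```php
<?php
// Legacy class family: DOMNode::$prefix is typed as string.
$doc = new DOMDocument();
$doc->loadXML('<root><x:child xmlns:x="urn:example"/></root>');

$root  = $doc->documentElement; // element without a prefix
$child = $root->firstChild;     // element with the prefix "x"

var_dump($root->prefix);  // string(0) "" (the DOM spec would give null here)
var_dump($child->prefix); // string(1) "x"
```

Code that checks `$node->prefix === null` therefore never matches on the legacy classes, which is the kind of subtle mismatch meant above.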

Currently, as a result of the HTML 5 RFC, the new DOM{HTML,XML}Document classes keep using the DOMNode, DOMElement, ... classes.
For consistency, the DOMNode etc. classes were aliased into the DOM namespace, i.e. DOM\Node is an alias for DOMNode, DOM\Element an alias for DOMElement etc.
Being an alias, this means that fixing types for DOM\Node is not possible because it's really just another name for DOMNode, so changing it for DOM\Node means changing it for DOMNode.
Unless we no longer alias the classes but make them proper classes instead. This means we can fix the typing for DOM\Node while keeping DOMNode untouched, preserving BC. The downside is that it becomes more difficult for interoperability. One of the reasons the HTML 5 RFC introduced aliases instead of proper classes is so that code taking a DOMNode as an argument could also be passed a DOM\Node. However, if we make it a proper class instead, such code has to either transition fully to the new DOM classes or use a type union, e.g. DOMNode|DOM\Node.
In my opinion, having them become proper classes instead of aliases has my preference: either we fix everything in one go now while we have the opportunity, or never.
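
To illustrate the interoperability cost, here is a rough sketch of what the type union could look like in userland code, assuming PHP 8.0+ (union types). The function name describeNode() is hypothetical:

```php
<?php
// Hypothetical helper that accepts nodes from both class families.
// If DOM\Node is a proper class rather than an alias, a plain DOMNode
// type declaration would reject the new nodes, hence the union.
function describeNode(DOMNode|DOM\Node $node): string
{
    return get_class($node) . ': ' . $node->nodeName;
}

$legacy = new DOMDocument();
$legacy->loadXML('<root/>');

echo describeNode($legacy->documentElement), "\n"; // DOMElement: root
```

The union only describes what the function accepts. Any property or method whose type differs between the two families (such as $prefix) still has to be handled by the caller.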

Let me know what you think, especially regarding the type issues.

Kind regards
Niels

[1] https://github.com/php/php-src/issues/8388
[2] https://wiki.php.net/rfc/domdocument_html5_parser
[3] https://externals.io/message/104687
[4] https://www.mediawiki.org/wiki/Parsoid/PHP/Help_wanted
[5] https://github.com/php/php-src/pull/13031
[6] https://github.com/nielsdos/wpt/tree/master/dom/php-out (yes, this is a dirty port)

1 year ago by tim@bastelstu.be

> In my opinion, having them become proper classes instead of aliases has my preference: either we fix everything in one go now while we have the opportunity, or never.

As I've already told you in private, I'm in favor of using this opportunity.

> Let me know what you think, especially regarding the type issues.

Will the classes be made final if they are no longer aliases? That
should (hopefully) make similar changes somewhat easier in the future.

Best regards
Tim Düsterhus

1 year ago by Niels Dossche

Hi Tim

> Will the classes be made final if they are no longer aliases? That should (hopefully) make similar changes somewhat easier in the future.

I've been thinking about that as well, but I'm not sure.
We still have the registerNodeClass() feature, and I've seen people ask to bring this even further to allow custom Element classes (e.g. MyHTMLScriptElement etc).
I'd like to hear from more people on this matter.

Kind regards
Niels

1 year ago by Robert Landers

Hi Niels,

> We still have the registerNodeClass() feature, and I've seen people ask to bring this even further to allow custom Element classes (e.g. MyHTMLScriptElement etc).
> I'd like to hear from more people on this matter.

Custom element classes would be really nice! I ended up having to
write a custom HTML5 parser in pure PHP due to the shortcomings of
PHP's extension. Having the ability to create custom elements can make
the semantics much clearer (a HeaderElement class, for example).

Robert Landers
Software Engineer
Utrecht NL

1 year ago by G. P. B.

Thank you for the work!

I agree that making them proper classes instead of aliases is the better
proposition here.
I'm not fully informed about the DOM spec, and I don't know if the current
class/interface hierarchy is in the best shape, but maybe we should also
consider having a look at this?

About making those new classes final, this would require reconsidering the
class hierarchy anyway, as nearly everything inherits from DOMNode, and
other classes (namely Comment/Text/CData nodes) extend other classes.
However, I would not necessarily be against it, especially if we add the
required interfaces, as the current mechanism of registering a custom class
is not very powerful and rather cumbersome to use as the constructor is
never called.
As such, I'm not sure if I would support adding the current mechanism to
customize the node classes returned by the extension. Indeed, the current
mechanism doesn't play nicely at all with static analysis and this is
something I stopped trying to integrate when writing my DocBook renderer
project. [1]

Best regards,

Gina P. Banyard

[1] https://gitlab.com/Girgias/docbook-renderer

1 year ago by Niels Dossche

Hi Gina

> Thank you for the work!
>
> I agree that making them proper classes instead of aliases is the better proposition here.
> I'm not fully informed about the DOM spec, and I don't know if the current class/interface hierarchy is in the best shape, but maybe we should also consider having a look at this?

Yeah, our current class hierarchy is wrong, but not "overly wrong". The incorrectness comes from the design of the pre-HTML5 era.

This is how it's supposed to be:
CharacterData extends Node (Actually an interface, but PHP does not have interfaces with properties)
Text extends CharacterData
CDATASection extends Text
ProcessingInstruction extends CharacterData
Comment extends CharacterData

However in the current implementation, the ProcessingInstruction class extends Node instead of CharacterData.
Also CharacterData is a class instead of an interface in the current implementation.

So nothing too bad, but not correct either.
There's also some functionality that should be on the Element class instead of the Node class.
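
The current (legacy) hierarchy can be inspected from userland. A small sketch, assuming PHP 8 with the DOM extension loaded. The helper name ancestry() is made up for this illustration:

```php
<?php
// Collect a class and all of its parents, closest parent first.
function ancestry(string $class): array
{
    $chain = [$class];
    while (($parent = get_parent_class($chain[count($chain) - 1])) !== false) {
        $chain[] = $parent;
    }
    return $chain;
}

// Comment and CDATASection go through DOMCharacterData, as the spec expects.
echo implode(' -> ', ancestry('DOMComment')), "\n";
echo implode(' -> ', ancestry('DOMCdataSection')), "\n";

// ProcessingInstruction skips DOMCharacterData in the current implementation.
echo implode(' -> ', ancestry('DOMProcessingInstruction')), "\n";
```

The last line is the one that deviates from the hierarchy listed above: its chain goes straight to DOMNode.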

> About making those new classes final, this would require reconsidering the class hierarchy anyway, as nearly everything inherits from DOMNode, and other classes (namely Comment/Text/CData nodes) extend other classes.
> However, I would not necessarily be against it, especially if we add the required interfaces, as the current mechanism of registering a custom class is not very powerful and rather cumbersome to use as the constructor is never called.

I'm already reconsidering the class hierarchy :-).
As for the constructor problem: I can fix that for the new classes, I can make sure the constructor is called which would already solve a pain point.
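
For readers unfamiliar with the constructor pain point, a minimal sketch of the current behaviour with the legacy classes, assuming PHP 8. The TrackedElement class is hypothetical and only exists for this illustration:

```php
<?php
// Hypothetical subclass that records whether its constructor ran.
class TrackedElement extends DOMElement
{
    public bool $constructorRan = false;

    public function __construct(string $name)
    {
        parent::__construct($name);
        $this->constructorRan = true;
    }
}

$doc = new DOMDocument();
// Ask the extension to hand out TrackedElement wherever it would create a DOMElement.
$doc->registerNodeClass('DOMElement', 'TrackedElement');
$doc->loadXML('<root/>');

$root = $doc->documentElement;
var_dump($root instanceof TrackedElement); // bool(true)
var_dump($root->constructorRan);           // bool(false): the constructor was skipped
```

Any invariant the subclass sets up in its constructor is therefore missing on nodes created by the extension itself.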

> As such, I'm not sure if I would support adding the current mechanism to customize the node classes returned by the extension. Indeed, the current mechanism doesn't play nicely at all with static analysis and this is something I stopped trying to integrate when writing my DocBook renderer project. [1]

I'm also not entirely sure, but in the JS world we do have custom elements that you can register and get an instance from back, so it has been done before at least.

> [1] https://gitlab.com/Girgias/docbook-renderer

Kind regards
Niels

1 year ago by Larry Garfield

I am also on team "yes, let's just do it right." If that means the new classes are only 99% drop ins for the old ones, I'm OK with that. People can switch over when they're ready and do all the clean up at once.

I'm not sure about making things final. I don't know the domain space well enough to have a strong opinion at the moment, but my main concern would be ensuring that it's still extensible in reasonable ways. Eg, if I wanted to add a Web Component element to a page, I want to do that without fugly workarounds. I don't have a strong opinion at this point on what the right way to do that is.

--Larry Garfield

1 year ago by Niels Dossche

Hi Larry

> I am also on team "yes, let's just do it right." If that means the new classes are only 99% drop ins for the old ones, I'm OK with that. People can switch over when they're ready and do all the clean up at once.

They are indeed going to be very similar, but better return types would be one concrete improvement, to give a particular example.
E.g. we currently have a lot of methods that can return an object or false. The current living DOM spec always throws exceptions instead of returning false on error, which is a much cleaner API.
Furthermore, we have the DOMNameSpaceNode that can be returned by some methods and has been a point of confusion for static analysis tools (I did a PR on psalm to fix one of those issues).
That node type won't be special cased in the new classes API so the (inconsistent use of the) union of DOMAttr|DOMNameSpaceNode will go away.

> I'm not sure about making things final. I don't know the domain space well enough to have a strong opinion at the moment, but my main concern would be ensuring that it's still extensible in reasonable ways. Eg, if I wanted to add a Web Component element to a page, I want to do that without fugly workarounds. I don't have a strong opinion at this point on what the right way to do that is.

Yeah indeed.

Kind regards
Niels

1 year ago by Robert Landers

Hi Niels,

> They are indeed going to be very similar, but better return types would be one concrete improvement, to give a particular example.
> E.g. we currently have a lot of methods that can return an object or false. The current living DOM spec always throws exceptions instead of returning false on error, which is a much cleaner API.
> Furthermore, we have the DOMNameSpaceNode that can be returned by some methods and has been a point of confusion for static analysis tools (I did a PR on psalm to fix one of those issues).
> That node type won't be special cased in the new classes API so the (inconsistent use of the) union of DOMAttr|DOMNameSpaceNode will go away.

Actually, I'm not sure it is supposed to be throwing exceptions (if we
look at https://html.spec.whatwg.org/multipage/parsing.html#parse-errors);
in fact, I'd argue there are three different ways to handle errors
(from some experience in writing a parser from scratch):

1. Acting as a user-agent: in this case, errors should be handled as
   described in the spec for a user-agent, e.g., switching to Text-Mode
   in some cases and gobbling up the rest of the document.
2. Acting as a conformance checker: in this case, a list of errors
   should be available to the programmer instead of bailing when parsing
   (e.g., not switching to Text-Mode, but trying to continue parsing the
   document, as described in the parser spec for conformance checking).
3. Acting as a document builder: Putting the document into an invalid
   state should emit at least a warning. However, it's likely better to
   let the user-agent handle the invalid DOM (as this is probably more
   forward-thinking for new HTML that currently doesn't exist). This is
   actually one of the biggest draw-backs to the current implementation
   as it requires a number of "hacks" to build valid HTML.

1 year ago by Niels Dossche

Hi Robert

> Actually, I'm not sure it is supposed to be throwing exceptions (if we
> look at https://html.spec.whatwg.org/multipage/parsing.html#parse-errors);
> in fact, I'd argue there are three different ways to handle errors
> (from some experience in writing a parser from scratch):

I'm not talking about handling parser errors.
Parser errors indeed should not be handled via exceptions, they emit a warning and continue with error recovery as described in spec.
This was part of my HTML 5 RFC: https://wiki.php.net/rfc/domdocument_html5_parser

I'm talking about methods like createElement, setAttributeNode, ... that can fail due to errors.
In DOM 3 (and therefore PHP too), there was a "strictErrorChecking" boolean option.
When enabled, exceptions were thrown when the constraints of such methods were not met.
When disabled, no exception is thrown but a warning is emitted and false is returned instead.
The DOM living spec no longer has that option and always uses exceptions.

In the new classes I would also only use exceptions and not include the strictErrorChecking option, as spec demands.
This cleans up return types.

For example: $doc->createElement("") should throw.
Or $element->setAttributeNode($attr) should throw when $attr is already used by another element.
Etc.
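
A rough sketch of the two legacy modes, assuming PHP 8 and the existing DOMDocument class. It uses a name containing spaces instead of the empty string from the example above, on the assumption that both are rejected as invalid names:

```php
<?php
$doc = new DOMDocument();

// Default mode: strictErrorChecking is true, so the failed constraint throws.
$threw = false;
try {
    $doc->createElement('not a valid name');
} catch (DOMException $e) {
    $threw = true;
    echo 'DOMException with code ', $e->getCode(), "\n";
}

// Legacy opt-out: a warning is emitted and false is returned instead.
$doc->strictErrorChecking = false;
$result = @$doc->createElement('not a valid name'); // @ silences the warning here
var_dump($result);
```

With only the exception path left in the new classes, callers would no longer need to check for false after each call.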

> Acting as a user-agent: in this case, errors should be handled as
> described in the spec for a user-agent, e.g., switching to Text-Mode
> in some cases and gobbling up the rest of the document.

The HTML 5 RFC follows the spec error recovery rules for user agents.

Kind regards
Niels

1 year ago by Sebastian Bergmann

On 29.12.2023 at 17:58, Larry Garfield wrote:

> I am also on team "yes, let's just do it right." If that means the new classes are only 99% drop ins for the old ones, I'm OK with that. People can switch over when they're ready and do all the clean up at once.