[RFC] [Discussion] DOM HTML5 parsing and serialization support

1 year ago by naitsirch@e.mail.de — view source

unread

Am 02-Sep-2023 21:41:50 +0200 schrieb dossche.niels@gmail.com:

Hello internals

I'm opening the discussion for my RFC "DOM HTML5 parsing and serialization support".
https://wiki.php.net/rfc/domdocument_html5_parser

Kind regards
Niels

--

To unsubscribe, visit: https://www.php.net/unsub.php

Hi Niels,

thank you for your proposal and your work on this.

This proposal introduces the DOM\HTML5Document class that extends the
DOMDocument class. The reason we introduce a new class instead of replacing
the methods of the existing class is to ensure full backwards compatibility.

Although I do not dislike your idea with a new class for HTML5 parsing I have one question. Why not make the decision, which parser to use, dependent from the doctype declaration at the start of the html document?

Best regards
Christian

1 year ago by Niels Dossche — view source

unread

Hey Christian

Thank you for going through my proposal.

Am 02-Sep-2023 21:41:50 +0200 schrieb dossche.niels@gmail.com:

Hello internals

I'm opening the discussion for my RFC "DOM HTML5 parsing and serialization support".
https://wiki.php.net/rfc/domdocument_html5_parser

Kind regards
Niels

--

To unsubscribe, visit: https://www.php.net/unsub.php

Hi Niels,

thank you for your proposal and your work on this.

This proposal introduces the DOM\HTML5Document class that extends the
DOMDocument class. The reason we introduce a new class instead of replacing
the methods of the existing class is to ensure full backwards compatibility.

Although I do not dislike your idea with a new class for HTML5 parsing I have one question. Why not make the decision, which parser to use, dependent from the doctype declaration at the start of the html document?
There's three major reasons to not depend solely on the doctype:

Not all documents have a doctype. So in that case, do we have to pick the old parser or the HTML5 parser?
People create DOM documents from scratch, without loading some code or file first.
In that case, we can't choose upfront if the document is an HTML5 document or a legacy document.
I believe that by using a class, and thus having a clear choice upfront, this minimizes surprises.
Having a separate class allows us to use the type system to restrict some users to only allow HTML5 documents.

Best regards
Christian

Kind regards
Niels

1 year ago by Dennis Snell via internals — view source

unread

Thanks for the proposal Niels,

I’ve dealt with my own grief working through issues in DOMDocument and wanting it to work but finding it inadequate.

HTML5

This would be a great starting point; I would love it if we took the opportunity to fix named character reference decoding, as PHP has (to my knowledge) never respected (at least in HTML5) that they decode differently inside attributes as they do inside markup, considering rules such as the ambiguous ampersand and decode errors.

It’s also been frustrating that DOMDocument parses tags in RCDATA sections where they don’t exist, such as in TITLE or TEXTAREA elements, escapes certain types of invalid comments so that they appear rendered in the saved document, and misses basic semantic rules (e.g. creating a BUTTON element as a child of a BUTTON element instead of closing out the already-open BUTTON).

I’d like to share some what a few of us have been working on inside WordPress, which is to build a conformant streaming HTML5 parser:

It’s just food for thought right now because adding HTML5 support to DOMDocument would benefit everyone, but we decided we had common need in PHP to work with HTML not in a DOM, but in a streaming fashion, one with very little runtime overhead. My long-term plan has been to get a good grasp for the interface needs and thoroughly test it within the WordPress community and then propose its inclusion into PHP. It’s been incredibly handy so far, and on my laptop runs at around 20 MB/s, which is not great, but good enough for many needs. My naive C port runs on the same laptop at around 80 MB/s and I believe that we can likely triple or quadruple that speed again if any of us working on it knew how to take advantage of SIMD instrinsics.

It tries to accomplish a few goals:

be fast enough
interpret HTML as an HTML5-compliant browser will
find specific locations within an HTML document and then read or modify them
pass through any invalid HTML it encounters for the browser to resolve/fix unless modifying the part of the document containing those invalid constructions

I only bring up this different interface because once we started digging deep into DOMDocument we found that the problems with it were far from superficial; that there is a host of problems and a mismatched interface to our common needs. It has surprised me that PHP, the language of the web, has had such trouble handling HTML, the language of the web, and we wanted to completely resolve this issue once and for all within WordPress so we can clean up decades’ old problems with encoding, decoding, security, and sanitization.

Warmly,
Dennis Snell

I'm opening the discussion for my RFC "DOM HTML5 parsing and serialization support".
https://wiki.php.net/rfc/domdocument_html5_parser

Kind regards
Niels

1 year ago by Niels Dossche — view source

unread

Hi Dennis

Thanks for the proposal Niels,

I’ve dealt with my own grief working through issues in DOMDocument and wanting it to work but finding it inadequate.

HTML5

This would be a great starting point; I would love it if we took the opportunity to fix named character reference decoding, as PHP has (to my knowledge) never respected (at least in HTML5) that they decode differently inside attributes as they do inside markup, considering rules such as the ambiguous ampersand and decode errors.

It’s also been frustrating that DOMDocument parses tags in RCDATA sections where they don’t exist, such as in TITLE or TEXTAREA elements, escapes certain types of invalid comments so that they appear rendered in the saved document, and misses basic semantic rules (e.g. creating a BUTTON element as a child of a BUTTON element instead of closing out the already-open BUTTON).

With this proposal: a real HTML5 parser, these above mentioned problems will fortunately be a problem from the past :)

I’d like to share some what a few of us have been working on inside WordPress, which is to build a conformant streaming HTML5 parser:
- https://developer.wordpress.org/reference/classes/wp_html_tag_processor/ https://developer.wordpress.org/reference/classes/wp_html_tag_processor/
- https://make.wordpress.org/core/2023/08/19/progress-report-html-api/ https://make.wordpress.org/core/2023/08/19/progress-report-html-api/

It’s just food for thought right now because adding HTML5 support to DOMDocument would benefit everyone, but we decided we had common need in PHP to work with HTML not in a DOM, but in a streaming fashion, one with very little runtime overhead. My long-term plan has been to get a good grasp for the interface needs and thoroughly test it within the WordPress community and then propose its inclusion into PHP. It’s been incredibly handy so far, and on my laptop runs at around 20 MB/s, which is not great, but good enough for many needs. My naive C port runs on the same laptop at around 80 MB/s and I believe that we can likely triple or quadruple that speed again if any of us working on it knew how to take advantage of SIMD instrinsics.

It tries to accomplish a few goals:
- be fast enough
- interpret HTML as an HTML5-compliant browser will
- find specific locations within an HTML document and then read or modify them
- pass through any invalid HTML it encounters for the browser to resolve/fix unless modifying the part of the document containing those invalid constructions

I've seen someone link this on Reddit today, it's a really nice project!
It reminds me of Cloudflare's lol-html, which is also a streaming parser used to modify and sanitize documents linearly.
I believe this could be a great addition, it solves a different problem that the ext/dom extension solves. So I think it would be a great complementary addition.

I only bring up this different interface because once we started digging deep into DOMDocument we found that the problems with it were far from superficial; that there is a host of problems and a mismatched interface to our common needs. It has surprised me that PHP, the language of the web, has had such trouble handling HTML, the language of the web, and we wanted to completely resolve this issue once and for all within WordPress so we can clean up decades’ old problems with encoding, decoding, security, and sanitization.

Yes, I was also quite surprised of the lacking support for modern web features, and also the problems with spec compliance.
I only recently got into maintaining ext/dom. So there's still a lot of work to do.
I had already started with adding more DOM APIs in the 8.3 release cycle and plan to continue that effort in 8.4.
Another major project I want to do for 8.4, besides HTML5 support, is fixing the spec compliance issues in an opt-in manner. This would help with security & sanitization problems (HTML5 should help with the encoding&decoding).

Warmly,
Dennis Snell

Kind regards
Niels

I'm opening the discussion for my RFC "DOM HTML5 parsing and serialization support".
https://wiki.php.net/rfc/domdocument_html5_parser https://wiki.php.net/rfc/domdocument_html5_parser

Kind regards
Niels

1 year ago by Ben Ramsey — view source

unread

I'm opening the discussion for my RFC "DOM HTML5 parsing and serialization support".
https://wiki.php.net/rfc/domdocument_html5_parser

Thanks, Niels. This is much needed.

This proposal introduces the DOM\HTML5Document class that extends the DOMDocument class.

In light of the recent discussion regarding casing of acronyms in class
and method names 1, perhaps this proposal should consider
Dom\Html5Document instead? I know this isn't consistent with the other
DOMDocument classes, so maybe remaining consistent is better in this case.

I propose to use Lexbor.

Lexbor is licensed under the Apache-2.0 license 2. In general, I don't
think this will present any problems, as long as we adhere to the terms
of the license (i.e., include the NOTICE file, etc.), but I did want to
call attention to it in case bundling the source code of an Apache-2.0
project within a PHP-3.01 project is a cause for concern for anyone.

Cheers,
Ben

1 year ago by Niels Dossche — view source

unread

Hello internals

I'm opening the discussion for my RFC "DOM HTML5 parsing and serialization support".
https://wiki.php.net/rfc/domdocument_html5_parser

Kind regards
Niels

Hi internals

I'd like to announce a change to the RFC. The new RFC version is 0.5.1, the old one was 0.4.0.
The diff can be viewed via the revision history button on the right.

I had a productive discussion with Tim and Arne about the class hierarchy.
Here's a summary of the changes and the rationale.

Until now, the RFC specified that DOM\HTML5Document extends DOMDocument.
However, as we're introducing a new class anyway, we believe we should take the opportunity to improve the API.
We have the following concerns:
a) It's a bit of an awkward class hierarchy. If we hypothetically would want to get rid of DOMDocument in the far far future, we can't easily do that.
b) API is messy. Some methods are useless for HTML5Document. E.g.: validate(), loadXML(), loadXMLFile(). They can be a source of confusion.
c) The fact that you can pass HTML5Document to methods accepting DOMDocument may result in unexpected behaviour when the method expects a particular behaviour. It would be better if developers could "opt-in" to accepting both DOMDocument and HTML5Document in a method using a common base class.
d) The properties set by DOMDocument's constructor are overridden by load methods, which is surprising. That's even mentioned as the second top comment on https://www.php.net/manual/en/domdocument.loadxml.php. Furthermore, the XML version argument of the constructor is even useless for HTML5 documents.

So we propose the following changes to the RFC.

We'll add a common abstract base class DOM\Document (name taken from the DOM spec & Javascript world).
DOM\Document contains the properties and abstract methods common to both HTML and XML documents.
Examples of what it includes/excludes:

includes: firstElementChild, lastElementChild, ...
excludes: xmlStandalone, xmlVersion, validate(), ...
Then we'll have two subclasses: DOM\HTMLDocument (previously we called this DOM\HTML5Document) and DOM\XMLDocument. We dropped the 5 from the name to be more resilient to version changes and match the DOM spec name.
DOMDocument will also use DOM\Document as a base class to make it interchangeable with the new classes.

The above would solve points a, b, and c.
To solve point d, we can use "factory methods":
This means HTMLDocument's constructor will be made private, and instead we'll have three static methods that create a new instance:

HTMLDocument::fromHTMLString(string $xml): HTMLDocument;
HTMLDocument::fromHTMLFile(string $filename): HTMLDocument;
HTMLDocument::fromEmptyDocument(string $encoding="UTF-8"): HTMLDocument;

Or to put it in PHP code:

namespace DOM {
	// The base abstract document class
	abstract class Document extends DOM\Node implements DOM\ParentNode {
		/* all properties and methods that are common and sensible for both XML & HTML documents */
	}
	
	class XMLDocument extends Document {
		/* insert specific XML methods and properties (e.g. xmlVersion, validate(), ...) here */

		private function __construct() {}
		
		public static function fromEmptyDocument(string $version = "1.0", string $encoding = "UTF-8");
		public static function fromFile(string $path);
		public static function fromString(string $source);
	}
	
	class HTMLDocument extends Document {
		/* insert specific Html methods and properties here */

		private function __construct() {}
		
		public static function fromEmptyDocument(string $encoding = "UTF-8");
		public static function fromFile(string $path);
		public static function fromString(string $source);
	}
}

class DOMDocument extends DOM\Document {
	/* Keep methods, properties, and constructor the same as they are now */
}

We're only adding XMLDocument for completeness and API parity. It's a drop-in replacement for DOMDocument, and behaves the exact same.
The difference is that the API is on par with HTMLDocument, and the construction is designed to be more misuse-resistant.
DOMDocument will NOT change, and remains for the foreseeable future.

We also have to change the $ownerDocument field in DOMNode to have type ?DOM\Document instead of ?DOMDocument.
Problem is that this breaks BC (but only a minor break): https://3v4l.org/El7Ve.
Overriding properties is kind of useless, but if someone does it, then the compiler will complain loudly during compilation and it should be easy to fix.

Of course, these changes means that the discussion period will run a bit longer than originally foreseen.

Kind regards
Niels

1 year ago by Larry Garfield — view source

unread

Hello internals

I'm opening the discussion for my RFC "DOM HTML5 parsing and serialization support".
https://wiki.php.net/rfc/domdocument_html5_parser

Kind regards
Niels

Hi internals

I'd like to announce a change to the RFC. The new RFC version is 0.5.1,
the old one was 0.4.0.
The diff can be viewed via the revision history button on the right.

I had a productive discussion with Tim and Arne about the class hierarchy.
Here's a summary of the changes and the rationale.

Until now, the RFC specified that DOM\HTML5Document extends DOMDocument.
However, as we're introducing a new class anyway, we believe we should
take the opportunity to improve the API.
We have the following concerns:
a) It's a bit of an awkward class hierarchy. If we hypothetically
would want to get rid of DOMDocument in the far far future, we can't
easily do that.
b) API is messy. Some methods are useless for HTML5Document. E.g.:
validate(), loadXML(), loadXMLFile(). They can be a source of confusion.
c) The fact that you can pass HTML5Document to methods accepting
DOMDocument may result in unexpected behaviour when the method expects
a particular behaviour. It would be better if developers could "opt-in"
to accepting both DOMDocument and HTML5Document in a method using a
common base class.
d) The properties set by DOMDocument's constructor are overridden by
load methods, which is surprising. That's even mentioned as the second
top comment on https://www.php.net/manual/en/domdocument.loadxml.php.
Furthermore, the XML version argument of the constructor is even
useless for HTML5 documents.

So we propose the following changes to the RFC.

We'll add a common abstract base class DOM\Document (name taken from
the DOM spec & Javascript world).
DOM\Document contains the properties and abstract methods common to
both HTML and XML documents.
Examples of what it includes/excludes:

includes: firstElementChild, lastElementChild, ...

excludes: xmlStandalone, xmlVersion, validate(), ...
Then we'll have two subclasses: DOM\HTMLDocument (previously we called
this DOM\HTML5Document) and DOM\XMLDocument. We dropped the 5 from the
name to be more resilient to version changes and match the DOM spec
name.
DOMDocument will also use DOM\Document as a base class to make it
interchangeable with the new classes.

The above would solve points a, b, and c.
To solve point d, we can use "factory methods":
This means HTMLDocument's constructor will be made private, and instead
we'll have three static methods that create a new instance:

HTMLDocument::fromHTMLString(string $xml): HTMLDocument;

That should be string $html, yes?

HTMLDocument::fromHTMLFile(string $filename): HTMLDocument;

HTMLDocument::fromEmptyDocument(string $encoding="UTF-8"):
HTMLDocument;

Or to put it in PHP code:
namespace DOM {
	// The base abstract document class
	abstract class Document extends DOM\Node implements DOM\ParentNode {
		/* all properties and methods that are common and sensible for both 
XML & HTML documents */
	}
  
	class XMLDocument extends Document {
		/* insert specific XML methods and properties (e.g. xmlVersion, 
validate(), ...) here */

		private function __construct() {}
  	
		public static function fromEmptyDocument(string $version = "1.0", 
string $encoding = "UTF-8");
		public static function fromFile(string $path);
		public static function fromString(string $source);
	}
  
	class HTMLDocument extends Document {
		/* insert specific Html methods and properties here */

		private function __construct() {}
  	
		public static function fromEmptyDocument(string $encoding = "UTF-8");
		public static function fromFile(string $path);
		public static function fromString(string $source);
	}
}

class DOMDocument extends DOM\Document {
	/* Keep methods, properties, and constructor the same as they are now */
}
We're only adding XMLDocument for completeness and API parity. It's a
drop-in replacement for DOMDocument, and behaves the exact same.
The difference is that the API is on par with HTMLDocument, and the
construction is designed to be more misuse-resistant.
DOMDocument will NOT change, and remains for the foreseeable future.

We also have to change the $ownerDocument field in DOMNode to have type
?DOM\Document instead of ?DOMDocument.
Problem is that this breaks BC (but only a minor break):
https://3v4l.org/El7Ve.
Overriding properties is kind of useless, but if someone does it, then
the compiler will complain loudly during compilation and it should be
easy to fix.

Of course, these changes means that the discussion period will run a
bit longer than originally foreseen.

Kind regards
Niels

This all makes sense to me, and I like this as a way forward. Nice work!

--Larry Garfield

1 year ago by Niels Dossche — view source

unread

Hi Larry

Hello internals

I'm opening the discussion for my RFC "DOM HTML5 parsing and serialization support".
https://wiki.php.net/rfc/domdocument_html5_parser

Kind regards
Niels

Hi internals

I'd like to announce a change to the RFC. The new RFC version is 0.5.1,
the old one was 0.4.0.
The diff can be viewed via the revision history button on the right.

I had a productive discussion with Tim and Arne about the class hierarchy.
Here's a summary of the changes and the rationale.

Until now, the RFC specified that DOM\HTML5Document extends DOMDocument.
However, as we're introducing a new class anyway, we believe we should
take the opportunity to improve the API.
We have the following concerns:
a) It's a bit of an awkward class hierarchy. If we hypothetically
would want to get rid of DOMDocument in the far far future, we can't
easily do that.
b) API is messy. Some methods are useless for HTML5Document. E.g.:
validate(), loadXML(), loadXMLFile(). They can be a source of confusion.
c) The fact that you can pass HTML5Document to methods accepting
DOMDocument may result in unexpected behaviour when the method expects
a particular behaviour. It would be better if developers could "opt-in"
to accepting both DOMDocument and HTML5Document in a method using a
common base class.
d) The properties set by DOMDocument's constructor are overridden by
load methods, which is surprising. That's even mentioned as the second
top comment on https://www.php.net/manual/en/domdocument.loadxml.php.
Furthermore, the XML version argument of the constructor is even
useless for HTML5 documents.

So we propose the following changes to the RFC.

We'll add a common abstract base class DOM\Document (name taken from
the DOM spec & Javascript world).
DOM\Document contains the properties and abstract methods common to
both HTML and XML documents.
Examples of what it includes/excludes:

includes: firstElementChild, lastElementChild, ...

excludes: xmlStandalone, xmlVersion, validate(), ...
Then we'll have two subclasses: DOM\HTMLDocument (previously we called
this DOM\HTML5Document) and DOM\XMLDocument. We dropped the 5 from the
name to be more resilient to version changes and match the DOM spec
name.
DOMDocument will also use DOM\Document as a base class to make it
interchangeable with the new classes.

The above would solve points a, b, and c.
To solve point d, we can use "factory methods":
This means HTMLDocument's constructor will be made private, and instead
we'll have three static methods that create a new instance:

HTMLDocument::fromHTMLString(string $xml): HTMLDocument;

That should be string $html, yes?

Indeed, silly copy paste mistake on my part.
In fact, the bullet point list is a bit outdated: it uses an older proposed method name (i.e. it says fromHTMLString instead of fromString; the HTML part is unnecessary duplication as the method is in the HTMLDocument class). I forgot to update that in my email draft when the final name was decided.
Anyway, the code snippet from my email and the RFC text both have the right method name and parameter name.

HTMLDocument::fromHTMLFile(string $filename): HTMLDocument;

HTMLDocument::fromEmptyDocument(string $encoding="UTF-8"):
HTMLDocument;

Or to put it in PHP code:
namespace DOM {
	// The base abstract document class
	abstract class Document extends DOM\Node implements DOM\ParentNode {
		/* all properties and methods that are common and sensible for both 
XML & HTML documents */
	}
	
	class XMLDocument extends Document {
		/* insert specific XML methods and properties (e.g. xmlVersion, 
validate(), ...) here */

		private function __construct() {}
		
		public static function fromEmptyDocument(string $version = "1.0", 
string $encoding = "UTF-8");
		public static function fromFile(string $path);
		public static function fromString(string $source);
	}
	
	class HTMLDocument extends Document {
		/* insert specific Html methods and properties here */

		private function __construct() {}
		
		public static function fromEmptyDocument(string $encoding = "UTF-8");
		public static function fromFile(string $path);
		public static function fromString(string $source);
	}
}

class DOMDocument extends DOM\Document {
	/* Keep methods, properties, and constructor the same as they are now */
}
We're only adding XMLDocument for completeness and API parity. It's a
drop-in replacement for DOMDocument, and behaves the exact same.
The difference is that the API is on par with HTMLDocument, and the
construction is designed to be more misuse-resistant.
DOMDocument will NOT change, and remains for the foreseeable future.

We also have to change the $ownerDocument field in DOMNode to have type
?DOM\Document instead of ?DOMDocument.
Problem is that this breaks BC (but only a minor break):
https://3v4l.org/El7Ve.
Overriding properties is kind of useless, but if someone does it, then
the compiler will complain loudly during compilation and it should be
easy to fix.

Of course, these changes means that the discussion period will run a
bit longer than originally foreseen.

Kind regards
Niels
This all makes sense to me, and I like this as a way forward. Nice work!

--Larry Garfield

Glad to hear that :)

Cheers
Niels

1 year ago by drealecs@gmail.com — view source

unread

We'll add a common abstract base class DOM\Document (name taken from the
DOM spec & Javascript world).
DOM\Document contains the properties and abstract methods common to both
HTML and XML documents.

Hi,

Yes looks a lot better.
Great work overall! And thank you for taking on this effort.

I would have a small suggestion: to make the abstract class an interface.
This will allow even more flexibility on how things can be build further,
suggesting composition over inheritance.
In user land we cannot have interfaces with properties (yet) but in php
internal interfaces we have example with interface UnitEnum that has name
property, extendes by BackedEnum that adds value property.

Thank you,
Alex

1 year ago by Niels Dossche — view source

unread

Hi Alexandru

We'll add a common abstract base class DOM\Document (name taken from the
DOM spec & Javascript world).
DOM\Document contains the properties and abstract methods common to both
HTML and XML documents.

Hi,

Yes looks a lot better.
Great work overall! And thank you for taking on this effort.

I would have a small suggestion: to make the abstract class an interface.
This will allow even more flexibility on how things can be build further,
suggesting composition over inheritance.
In user land we cannot have interfaces with properties (yet) but in php
internal interfaces we have example with interface UnitEnum that has name
property, extendes by BackedEnum that adds value property.

Right, we discussed the use of an interface internally too.
Indeed as you suggest, we chose an abstract class over an interface because of the property limitation.
Looking at UnitEnum & BackedEnum (https://github.com/php/php-src/blob/bae30682b896b26f177f83648bd58c77ba3480a8/Zend/zend_enum.stub.php) I don't see the properties defined on the interface. In practice, all enums get the properties of course, just not via the interface.
So as we cannot represent the properties on the interfaces (yet), we'll stick with the abstract class for now.

Thank you,
Alex

Kind regards
Niels

1 year ago by Stephen Reay — view source

unread

Hi Alexandru

We'll add a common abstract base class DOM\Document (name taken from the
DOM spec & Javascript world).
DOM\Document contains the properties and abstract methods common to both
HTML and XML documents.

Hi,

Yes looks a lot better.
Great work overall! And thank you for taking on this effort.

I would have a small suggestion: to make the abstract class an interface.
This will allow even more flexibility on how things can be build further,
suggesting composition over inheritance.
In user land we cannot have interfaces with properties (yet) but in php
internal interfaces we have example with interface UnitEnum that has name
property, extendes by BackedEnum that adds value property.

Right, we discussed the use of an interface internally too.
Indeed as you suggest, we chose an abstract class over an interface because of the property limitation.
Looking at UnitEnum & BackedEnum (https://github.com/php/php-src/blob/bae30682b896b26f177f83648bd58c77ba3480a8/Zend/zend_enum.stub.php) I don't see the properties defined on the interface. In practice, all enums get the properties of course, just not via the interface.
So as we cannot represent the properties on the interfaces (yet), we'll stick with the abstract class for now.

Thank you,
Alex

Kind regards
Niels

--

To unsubscribe, visit: https://www.php.net/unsub.php
Hi Niels,

Can you expand on the reasoning for two of the decisions in your proposal? I'm not sure I really see the reason/benefit :

fromX() methods are on the individual classes, rather than the parent, which as I understand it, you're using as a poor-mans interface with properties. I'd have thought that at the very least the parent should declare those methods as abstract
Why "fromEmptyDocument()" rather than just allowing new DOM\{XML,HTML}Document via a public constructor?

The only mention of 'constructor' I see in the email or the RFC is the line below:

the properties set by DOMDocument's constructor are overridden by its load methods, which is surprising

But I don't really see how making the constructor private and forcing use of a static method changes this aspect, at all - it just introduces a different way to create an instance.

Otherwise, it's great to see some activity on DOM handling classes.

Cheers

Stephen

1 year ago by Niels Dossche — view source

unread

Hi Stephen

Hi Alexandru

We'll add a common abstract base class DOM\Document (name taken from the
DOM spec & Javascript world).
DOM\Document contains the properties and abstract methods common to both
HTML and XML documents.

Hi,

Yes looks a lot better.
Great work overall! And thank you for taking on this effort.

I would have a small suggestion: to make the abstract class an interface.
This will allow even more flexibility on how things can be build further,
suggesting composition over inheritance.
In user land we cannot have interfaces with properties (yet) but in php
internal interfaces we have example with interface UnitEnum that has name
property, extendes by BackedEnum that adds value property.

Right, we discussed the use of an interface internally too.
Indeed as you suggest, we chose an abstract class over an interface because of the property limitation.
Looking at UnitEnum & BackedEnum (https://github.com/php/php-src/blob/bae30682b896b26f177f83648bd58c77ba3480a8/Zend/zend_enum.stub.php https://github.com/php/php-src/blob/bae30682b896b26f177f83648bd58c77ba3480a8/Zend/zend_enum.stub.php) I don't see the properties defined on the interface. In practice, all enums get the properties of course, just not via the interface.
So as we cannot represent the properties on the interfaces (yet), we'll stick with the abstract class for now.

Thank you,
Alex

Kind regards
Niels

--

To unsubscribe, visit: https://www.php.net/unsub.php https://www.php.net/unsub.php
Hi Niels,

Can you expand on the reasoning for two of the decisions in your proposal? I'm not sure I really see the reason/benefit :

fromX() methods are on the individual classes, rather than the parent, which as I understand it, you're using as a poor-mans interface with properties. I'd have thought that at the very least the parent should declare those methods as abstract

At least the fromEmptyDocument signature differs between HTMLDocument & XMLDocument, so it's not possible to declare that method on the parent.
It's possible to define the other two on the parent. However, I choose against that because:

They have different behaviour if called on XML vs HTML, because they return their respective new instance. At least the other methods have the same behaviour between XML & HTML. Furthermore it would be weird to have 2 out of 3 factory methods on the parent, but the other one not.
It makes some sense at least to call e.g. createComment() while not knowing the concrete type of the document (e.g. when supporting both XML & HTML documents), so having that method in the parent makes sense. For example:
function addCopyright(DOM\Document $doc) {
$doc->append($doc->createComment("(c) foo 2023"));
}
However, you're not going to call the factory methods on a variable that can be either XML or HTML document, as you wouldn't know a priori what you're getting back.

Why "fromEmptyDocument()" rather than just allowing new DOM\{XML,HTML}Document via a public constructor?

While technically possible, I find it confusing to have both a normal constructor and factory methods.
When people see factory methods, they'll know - from experience - that every construction needs to go via factories.
Otherwise, people are going to call the constructor and search for a loadHTML/loadHTMLFile method like they do with DOMDocument.

The only mention of 'constructor' I see in the email or the RFC is the line below:

the properties set by DOMDocument's constructor are overridden by its load methods, which is surprising

But I don't really see how making the constructor private and forcing use of a static method changes this aspect, at all - it just introduces a different way to create an instance.

The problem with DOMDocument's approach is the following:
$dom = new DOMDocument(encoding: "bla");
$dom->loadXML(...); // Oops encoding gone!

With the new classes, you load the document at construction time (or no document at all).
Therefore, this confusion/API misuse is prevented by the API design.
You can see it as having 3 different constructors that disallow loading another document on the same instance afterwards.

Otherwise, it's great to see some activity on DOM handling classes.

Cheers

Stephen

Cheers
Niels

1 year ago by Stephen Reay — view source

unread

On 19 Sep 2023, at 01:00, Niels Dossche

Cheers
Niels

--

To unsubscribe, visit: https://www.php.net/unsub.php

(Resending with history removed due to apparent size limit)

Hi Niels,

Obviously, a method with different signatures (fromEmptyDocument) can't be in the parent, and I'll discuss that below, but having the other factory methods that do have the same signature in parent (even if only an abstract method, forcing concrete implementations to define it) means that it's possible to safely do meta-programming with these classes, where you load a string/file, using a classname from a variable. This use-case doesn't necessarily need to know which specific instance it's getting back, so long as it's an instance of the base class (or interface, if it had been one), but that information can still be presented, via the return type (either static, or self on the parent and the actual class name on each of the implementations).

Regarding the private constructor: I understand the issue with the old class being confusing - but your new class doesn't have that issue, because there are no "loadXXX" methods: as you said, if you're loading an existing document, you're forced to do it via the static factory methods. With that change alone, there's zero risk of confusion in allowing direct constructor calls, because once the object is instantiated there is no subsequent way to load a document and inadvertently change the encoding.

Having a regular constructor and then one or more factory methods for specific cases is already a concept in PHP (i.e. DateTime[Immutable]'s createFromXXX as well as a constructor), and IMO needing to call a factory method to get an "empty" document seems out of place in PHP - it seems a bit like a Java-ism - using a factory, where one just isn't required.

Cheers

Stephen

1 year ago by tim@bastelstu.be — view source

unread

Hi

Regarding the private constructor: I understand the issue with the old class being confusing - but your new class doesn't have that issue, because there are no "loadXXX" methods: as you said, if you're loading an existing document, you're forced to do it via the static factory methods. With that change alone, there's zero risk of confusion in allowing direct constructor calls, because once the object is instantiated there is no subsequent way to load a document and inadvertently change the encoding.

Having a regular constructor and then one or more factory methods for specific cases is already a concept in PHP (i.e. DateTime[Immutable]'s createFromXXX as well as a constructor), and IMO needing to call a factory method to get an "empty" document seems out of place in PHP - it seems a bit like a Java-ism - using a factory, where one just isn't required.

I was one of the persons who discussed this updated API with Niels in
private and looking up the discussion it was me who proposed making the
constructor private and just providing named constructors.

From the perspective of the user of the API, I like the symmetry
between all the named constructors:

Whenever I want to create a new document, I use one of the fromXyz()
methods. And when I use those, I get exactly what it says on the tin.

Making the regular constructor available for use would effectively make
whatever that constructor does a special case / the default case. This
makes sense for DateTimeImmutable, because the __construct() variant is
likely much more often used than the various createFromXyz() variants.
For the HtmlDocument I find it less obvious that creating an empty
document would be the default case compared to loading an existing
document from a file.

We should probably rename the named constructors to include the "create"
prefix for consistency with DTI though.

Best regards
Tim Düsterhus

1 year ago by Stephen Reay — view source

unread

Hi

Regarding the private constructor: I understand the issue with the old class being confusing - but your new class doesn't have that issue, because there are no "loadXXX" methods: as you said, if you're loading an existing document, you're forced to do it via the static factory methods. With that change alone, there's zero risk of confusion in allowing direct constructor calls, because once the object is instantiated there is no subsequent way to load a document and inadvertently change the encoding.
Having a regular constructor and then one or more factory methods for specific cases is already a concept in PHP (i.e. DateTime[Immutable]'s createFromXXX as well as a constructor), and IMO needing to call a factory method to get an "empty" document seems out of place in PHP - it seems a bit like a Java-ism - using a factory, where one just isn't required.

I was one of the persons who discussed this updated API with Niels in private and looking up the discussion it was me who proposed making the constructor private and just providing named constructors.

From the perspective of the user of the API, I like the symmetry between all the named constructors:

Whenever I want to create a new document, I use one of the fromXyz() methods. And when I use those, I get exactly what it says on the tin.

Making the regular constructor available for use would effectively make whatever that constructor does a special case / the default case. This makes sense for DateTimeImmutable, because the __construct() variant is likely much more often used than the various createFromXyz() variants. For the HtmlDocument I find it less obvious that creating an empty document would be the default case compared to loading an existing document from a file.

We should probably rename the named constructors to include the "create" prefix for consistency with DTI though.

Best regards
Tim Düsterhus

Hi Tim,

Thanks for providing context on where this idea came from.

The constructor + specialised factory methods pattern is not exactly new in PHP (e.g. it took about 3 minutes to find multiple Symphony classes doing this), and disabling the public constructor for purely cosmetic reasons sounds like a very weird, and ironic choice to me, when the stated goal is to make the API less surprising than the previous one.

Cheers

Stephen

1 year ago by Niels Dossche — view source

unread

Hi Stephen

Hi

Regarding the private constructor: I understand the issue with the old class being confusing - but your new class doesn't have that issue, because there are no "loadXXX" methods: as you said, if you're loading an existing document, you're forced to do it via the static factory methods. With that change alone, there's zero risk of confusion in allowing direct constructor calls, because once the object is instantiated there is no subsequent way to load a document and inadvertently change the encoding.
Having a regular constructor and then one or more factory methods for specific cases is already a concept in PHP (i.e. DateTime[Immutable]'s createFromXXX as well as a constructor), and IMO needing to call a factory method to get an "empty" document seems out of place in PHP - it seems a bit like a Java-ism - using a factory, where one just isn't required.

I was one of the persons who discussed this updated API with Niels in private and looking up the discussion it was me who proposed making the constructor private and just providing named constructors.

From the perspective of the user of the API, I like the symmetry between all the named constructors:

Whenever I want to create a new document, I use one of the fromXyz() methods. And when I use those, I get exactly what it says on the tin.

Making the regular constructor available for use would effectively make whatever that constructor does a special case / the default case. This makes sense for DateTimeImmutable, because the __construct() variant is likely much more often used than the various createFromXyz() variants. For the HtmlDocument I find it less obvious that creating an empty document would be the default case compared to loading an existing document from a file.

We should probably rename the named constructors to include the "create" prefix for consistency with DTI though.

Best regards
Tim Düsterhus

Hi Tim,

Thanks for providing context on where this idea came from.

The constructor + specialised factory methods pattern is not exactly new in PHP (e.g. it took about 3 minutes to find multiple Symphony classes doing this), and disabling the public constructor for purely cosmetic reasons sounds like a very weird, and ironic choice to me, when the stated goal is to make the API less surprising than the previous one.

Besides the points that have been mentioned by Tim and Larry, there's also the expectation of the programmer that migrates to the new classes to take into account.
They're used to calling the constructor on the old class and calling a load method afterwards.
As there is still a constructor, but no load method, this might be confusing for them.
So in my opinion, disabling it makes it less surprising than the previous one.
Also, just because Symfony does this doesn't mean it's automatically something we should follow.

Cheers

Stephen

Kind regards
Niels

1 year ago by Stephen Reay — view source

unread

Hi Stephen

Hi

Regarding the private constructor: I understand the issue with the old class being confusing - but your new class doesn't have that issue, because there are no "loadXXX" methods: as you said, if you're loading an existing document, you're forced to do it via the static factory methods. With that change alone, there's zero risk of confusion in allowing direct constructor calls, because once the object is instantiated there is no subsequent way to load a document and inadvertently change the encoding.
Having a regular constructor and then one or more factory methods for specific cases is already a concept in PHP (i.e. DateTime[Immutable]'s createFromXXX as well as a constructor), and IMO needing to call a factory method to get an "empty" document seems out of place in PHP - it seems a bit like a Java-ism - using a factory, where one just isn't required.

I was one of the persons who discussed this updated API with Niels in private and looking up the discussion it was me who proposed making the constructor private and just providing named constructors.

From the perspective of the user of the API, I like the symmetry between all the named constructors:

Whenever I want to create a new document, I use one of the fromXyz() methods. And when I use those, I get exactly what it says on the tin.

Making the regular constructor available for use would effectively make whatever that constructor does a special case / the default case. This makes sense for DateTimeImmutable, because the __construct() variant is likely much more often used than the various createFromXyz() variants. For the HtmlDocument I find it less obvious that creating an empty document would be the default case compared to loading an existing document from a file.

We should probably rename the named constructors to include the "create" prefix for consistency with DTI though.

Best regards
Tim Düsterhus

Hi Tim,

Thanks for providing context on where this idea came from.

The constructor + specialised factory methods pattern is not exactly new in PHP (e.g. it took about 3 minutes to find multiple Symphony classes doing this), and disabling the public constructor for purely cosmetic reasons sounds like a very weird, and ironic choice to me, when the stated goal is to make the API less surprising than the previous one.

Besides the points that have been mentioned by Tim and Larry, there's also the expectation of the programmer that migrates to the new classes to take into account.
They're used to calling the constructor on the old class and calling a load method afterwards.
As there is still a constructor, but no load method, this might be confusing for them.
So in my opinion, disabling it makes it less surprising than the previous one.
Also, just because Symfony does this doesn't mean it's automatically something we should follow.

Cheers

Stephen

Kind regards
Niels

--

To unsubscribe, visit: https://www.php.net/unsub.php

Hi Niels, Larry

To clarify my earlier message about having the factory methods on the faux-interface - when I said meta programming that was probably the wrong term to use, so I apologise for the confusion.
It's true that the developer needs to know the expected type of the string being parsed. It's not true that the code directly invoking the 'fromString' method necessarily needs to know - passing classnames as a string and calling static methods on the string is also an established concept in PHP, which works really well when the method being called is defined on a single top-level class or interface.
A real world example I had in mind, is a Request (e.g. a curl request, or some other http-fetching) class that can decode responses into native PHP objects, with options for the callee to define the object type that's wanted; with just the two provided classes there's already double the checks required to know if the specified class name is one the library knows how to use. If a hypothetical class to handle SVG, RSS, ODF (or any other specialised use of XML) class uses the faux-interface from this RFC as it's base, there's no way for the Request class to know that it can instantiate an instance from a string, without falling back to method_exists() checks.

The only reason I mentioned the DateTime or Symphony classes at all, is to highlight that people have been using classes with a regular constructor, and one or more static factory methods, for years, and if there has been any confusion about it thus far, it's not something I've ever heard complaints about, nor seen efforts to address.

Calling arbitrarily named static methods "named constructors" doesn't really make a convincing argument when someone suddenly realises that the names should be something else halfway through the same conversation. That doesn't mean there's something wrong specifically with having static factory methods when they're needed, but that's my whole point: the default way to instantiate an object in PHP is through its constructor, and classes that don't provide a public constructor are much less common, usually with an obvious reason why (e.g. they're a singleton, or they're objects migrated from resource handles).

I've had enough of these conversations to realise that you've made a decision and you're sticking to it, regardless of what I might write (well except for changing the completely arbitrary names, I guess), so I won't continue trying to make my point to you.

Please at least do one thing though: make it clear in the RFC that the public constructor is removed because ... well I don't know, pick one of the reasons you've thrown up I guess. But right now the only reason even vaguely mentioned in the RFC is that it prevents the "changed encoding" issue, but it doesn't do that, the lack of a load method achieves that, and IMO the lack of a regular constructor is a significant enough change from most PHP classes, that people should be aware of that when reading and voting on your proposal.

Cheers

Stephen

1 year ago by Niels Dossche — view source

unread

Hi Stephen

Hi Stephen

Hi

Regarding the private constructor: I understand the issue with the old class being confusing - but your new class doesn't have that issue, because there are no "loadXXX" methods: as you said, if you're loading an existing document, you're forced to do it via the static factory methods. With that change alone, there's zero risk of confusion in allowing direct constructor calls, because once the object is instantiated there is no subsequent way to load a document and inadvertently change the encoding.
Having a regular constructor and then one or more factory methods for specific cases is already a concept in PHP (i.e. DateTime[Immutable]'s createFromXXX as well as a constructor), and IMO needing to call a factory method to get an "empty" document seems out of place in PHP - it seems a bit like a Java-ism - using a factory, where one just isn't required.

I was one of the persons who discussed this updated API with Niels in private and looking up the discussion it was me who proposed making the constructor private and just providing named constructors.

From the perspective of the user of the API, I like the symmetry between all the named constructors:

Whenever I want to create a new document, I use one of the fromXyz() methods. And when I use those, I get exactly what it says on the tin.

Making the regular constructor available for use would effectively make whatever that constructor does a special case / the default case. This makes sense for DateTimeImmutable, because the __construct() variant is likely much more often used than the various createFromXyz() variants. For the HtmlDocument I find it less obvious that creating an empty document would be the default case compared to loading an existing document from a file.

We should probably rename the named constructors to include the "create" prefix for consistency with DTI though.

Best regards
Tim Düsterhus

Hi Tim,

Thanks for providing context on where this idea came from.

The constructor + specialised factory methods pattern is not exactly new in PHP (e.g. it took about 3 minutes to find multiple Symphony classes doing this), and disabling the public constructor for purely cosmetic reasons sounds like a very weird, and ironic choice to me, when the stated goal is to make the API less surprising than the previous one.

Besides the points that have been mentioned by Tim and Larry, there's also the expectation of the programmer that migrates to the new classes to take into account.
They're used to calling the constructor on the old class and calling a load method afterwards.
As there is still a constructor, but no load method, this might be confusing for them.
So in my opinion, disabling it makes it less surprising than the previous one.
Also, just because Symfony does this doesn't mean it's automatically something we should follow.

Cheers

Stephen

Kind regards
Niels

--

To unsubscribe, visit: https://www.php.net/unsub.php https://www.php.net/unsub.php

Hi Niels, Larry

To clarify my earlier message about having the factory methods on the faux-interface - when I said meta programming that was probably the wrong term to use, so I apologise for the confusion.
It's true that the developer needs to know the expected type of the string being parsed. It's not true that the code directly invoking the 'fromString' method necessarily needs to know - passing classnames as a string and calling static methods on the string is also an established concept in PHP, which works really well when the method being called is defined on a single top-level class or interface.
A real world example I had in mind, is a Request (e.g. a curl request, or some other http-fetching) class that can decode responses into native PHP objects, with options for the callee to define the object type that's wanted; with just the two provided classes there's already double the checks required to know if the specified class name is one the library knows how to use. If a hypothetical class to handle SVG, RSS, ODF (or any other specialised use of XML) class uses the faux-interface from this RFC as it's base, there's no way for the Request class to know that it can instantiate an instance from a string, without falling back to method_exists() checks.

You'd still have to know the signature, and what options it accepts and affect the parser to be able to use it in a meaningful way.
Anyway, you can still do this use-case without a method_exists check or something alike by using union types for example.
You'll likely still want to put some kind of limitations on, so limiting via type seems sensible.

The only reason I mentioned the DateTime or Symphony classes at all, is to highlight that people have been using classes with a regular constructor, and one or more static factory methods, for years, and if there has been any confusion about it thus far, it's not something I've ever heard complaints about, nor seen efforts to address.

Calling arbitrarily named static methods "named constructors" doesn't really make a convincing argument when someone suddenly realises that the names should be something else halfway through the same conversation. That doesn't mean there's something wrong specifically with having static factory methods when they're needed, but that's my whole point: the default way to instantiate an object in PHP is through its constructor, and classes that don't provide a public constructor are much less common, usually with an obvious reason why (e.g. they're a singleton, or they're objects migrated from resource handles).

I've had enough of these conversations to realise that you've made a decision and you're sticking to it, regardless of what I might write (well except for changing the completely arbitrary names, I guess), so I won't continue trying to make my point to you.

Please at least do one thing though: make it clear in the RFC that the public constructor is removed because ... well I don't know, pick one of the reasons you've thrown up I guess. But right now the only reason even vaguely mentioned in the RFC is that it prevents the "changed encoding" issue, but it doesn't do that, the lack of a load method achieves that, and IMO the lack of a regular constructor is a significant enough change from most PHP classes, that people should be aware of that when reading and voting on your proposal.

I will list the reasons more clearly in the RFC.

Cheers

Stephen

Kind regards
Niels

1 year ago by Larry Garfield — view source

unread

Hi

Regarding the private constructor: I understand the issue with the old class being confusing - but your new class doesn't have that issue, because there are no "loadXXX" methods: as you said, if you're loading an existing document, you're forced to do it via the static factory methods. With that change alone, there's zero risk of confusion in allowing direct constructor calls, because once the object is instantiated there is no subsequent way to load a document and inadvertently change the encoding.

Having a regular constructor and then one or more factory methods for specific cases is already a concept in PHP (i.e. DateTime[Immutable]'s createFromXXX as well as a constructor), and IMO needing to call a factory method to get an "empty" document seems out of place in PHP - it seems a bit like a Java-ism - using a factory, where one just isn't required.

I was one of the persons who discussed this updated API with Niels in
private and looking up the discussion it was me who proposed making the
constructor private and just providing named constructors.

From the perspective of the user of the API, I like the symmetry
between all the named constructors:

Whenever I want to create a new document, I use one of the fromXyz()
methods. And when I use those, I get exactly what it says on the tin.

Making the regular constructor available for use would effectively make
whatever that constructor does a special case / the default case. This
makes sense for DateTimeImmutable, because the __construct() variant is
likely much more often used than the various createFromXyz() variants.
For the HtmlDocument I find it less obvious that creating an empty
document would be the default case compared to loading an existing
document from a file.

We should probably rename the named constructors to include the "create"
prefix for consistency with DTI though.

Best regards
Tim Düsterhus

I agree with Tim here. If you have named constructors that are all "equally common," then it's non-obvious which one should be the "unnamed default." I know that fromX() requires an X, fromY() requires a Y, but what does __construct() require? I cannot tell from the call site. If __construct() is just an alias of on of the fromX() methods, then what purpose does it serve other than confusion?

In addition, although it's probably not super relevant here, static factory methods are chainable in a way that new is not. No extra parens are necessary. For classes I'm creating at hoc (rather than services that are always managed via DI), I often prefer a named constructor just for convenience.

I don't see it as a Java-ism at all. Java is (in)famous for excessive layers of factory classes, which is something different than what is going on here. (How fair that reputation is in 2023, I don't know.) This is just "named constructors," which has been a reasonably common pattern in PHP land for a long time.

--Larry Garfield

1 year ago by Niels Dossche — view source

unread

Hi Stephen

On 19 Sep 2023, at 01:00, Niels Dossche

Cheers
Niels

--

To unsubscribe, visit: https://www.php.net/unsub.php

(Resending with history removed due to apparent size limit)

Hi Niels,

Obviously, a method with different signatures (fromEmptyDocument) can't be in the parent, and I'll discuss that below, but having the other factory methods that do have the same signature in parent (even if only an abstract method, forcing concrete implementations to define it) means that it's possible to safely do meta-programming with these classes, where you load a string/file, using a classname from a variable. This use-case doesn't necessarily need to know which specific instance it's getting back, so long as it's an instance of the base class (or interface, if it had been one), but that information can still be presented, via the return type (either static, or self on the parent and the actual class name on each of the implementations).

I don't really understand this part. What kind of meta-programming are you thinking of? Can you give an example?
Calling the create methods on the abstract type does not seem sensible, because the input data is not interchangeable between the two subclasses.
E.g. passing an XML input into HTMLDocument will not go well (and vice versa).
So it seems to me that you need to know the type to be able to give sensible input.

Regarding the private constructor: I understand the issue with the old class being confusing - but your new class doesn't have that issue, because there are no "loadXXX" methods: as you said, if you're loading an existing document, you're forced to do it via the static factory methods. With that change alone, there's zero risk of confusion in allowing direct constructor calls, because once the object is instantiated there is no subsequent way to load a document and inadvertently change the encoding.

Having a regular constructor and then one or more factory methods for specific cases is already a concept in PHP (i.e. DateTime[Immutable]'s createFromXXX as well as a constructor), and IMO needing to call a factory method to get an "empty" document seems out of place in PHP - it seems a bit like a Java-ism - using a factory, where one just isn't required.

I agree with what Tim and Larry have said here.
I do agree too that it should be called createFromString and createFromFile to be consistent with the DateTime classes.
And perhaps instead of createFromEmpty it should be createEmpty because you're not creating an instance "from" something here, you're creating it with "nothing".

Cheers

Stephen

Kind regards
Niels

1 year ago by tim@bastelstu.be — view source

unread

Hi

[…]

Thank you for the fruitful discussion. This updated API is much better.

When reading the updated RFC yesterday, I initially wanted to complain
about the "The options argument" section still existing, but the
$options argument nowhere to be seen. But I see that you now added it.

One thing I now note that could also be improved, now that we make some
larger improvements to the API:

The $options argument still takes an integer and expects the developer
to use the bitwise-or operator to combine multiple constants to set the
appropriate flags. Despite this being a commonly used pattern in the
standard library, I think it provides for a bad developer experience and
makes it harder to extend the list of options for non-boolean options. I
would never design an API like this when creating a userland library.

I have two suggestions that might or might not be better:

Make the options explicit arguments with a default value and expect
the user to leverage named arguments to override specific arguments, e.g.

HtmlDocument::fromString(
string $string,
bool $ignoreErrors = false, // LIBXML_NOERROR
bool $reducedMemoryUsage = false, // LIBXML_COMPACT
bool $impliedTags = true, // LIBXML_HTML_NOIMPLIED
);

This would provide for improved discoverability and the options checking
would be implicit and easily available in IDEs.

Make the option parameter an array that takes an array of flags. This
could either be an associative array (like used with unserialize,
setcookie or password_hash) or a list of self-descriptive options. The
latter would allow for a clean(er) migration to tagged-unions if/when
they make it into PHP.

Example usage:

 HtmlDocument::fromString(
   '<html></html>',
   [
     LIBXML_NOERROR,
     LIBXML_COMPACT,
     LIBXML_HTML_NOIMPLIED,
   ],
 );

With tagged-unions it could look like this in a future version, note the
HtmlVersion option that takes a parameter.

 HtmlDocument::fromString(
   '<html></html>',
   [
     Dom\LoadingOption::IgnoreError,
     Dom\LoadingOption::ReducedMemoryUsage,
     Dom\LoadingOption::NoImpliedTags,
     Dom\LoadingOption::HtmlVersion(5),
   ],
 );

Even if not yet reaping the full benefits of the "list of options", I
think it more naturally splits across multiple lines compared to
'bitwise-or'ing and it can also be more useful in stack traces for error
handlers that are able to handle arrays, because there won't be some
non-descript integer you have to decode manually.

What do you think?

Best regards
Tim Düsterhus

1 year ago by Niels Dossche — view source

unread

Hello internals

I'm opening the discussion for my RFC "DOM HTML5 parsing and serialization support".
https://wiki.php.net/rfc/domdocument_html5_parser

Kind regards
Niels

Hi internals

After the discussion here, I have made the following changes to the RFC.
The RFC's version is now 0.6.0, previously it was 0.5.3.

Clarified the reasoning to use factory methods.
Changed the factory method names to be consistent with DateTimeImmutable etc (i.e. fromString -> createFromString, fromFile -> createFromFile, createEmptyDocument -> createEmpty).
Marked HTMLDocument & XMLDocument classes as final. This allows us to extend them in the future without BC breakage. This is also motivated by the fact that extending them would be problematic anyway (see https://www.php.net/manual/en/domdocument.registernodeclass for how to do it with DOMDocument and the problems it has).
Clarified that we also need to change the $document property of DOMXPath from DOMDocument to DOM\Document, and the constructor too. You can read about this in the BC section.

Kind regards
Niels

1 year ago by Niels Dossche — view source

unread

Hi internals

Hello internals

I'm opening the discussion for my RFC "DOM HTML5 parsing and serialization support".
https://wiki.php.net/rfc/domdocument_html5_parser

Kind regards
Niels

Some minor changes after a discussion with Tim:

The old class names are now written with a leading backslash to emphasize they are in the global namespace. Hopefully this will resolve confusion around them.
Fixed unclear wording, i.e. "type hint" -> "type declaration"
The alias for DOMException will be DOM\DOMException. Because that's the official name per DOM spec, and otherwise importing it and using it would be confusing with the global namespace <php>Exception</php> class (see also https://github.com/php/php-src/pull/9071#issuecomment-1193162754 for precedent).

Kind regards
Niels

1 year ago by Derick Rethans — view source

unread

I'm opening the discussion for my RFC "DOM HTML5 parsing and
serialization support".
https://wiki.php.net/rfc/domdocument_html5_parser

Some minor changes after a discussion with Tim:

The old class names are now written with a leading backslash to
emphasize they are in the global namespace. Hopefully this will resolve
confusion around them.

Fixed unclear wording, i.e. "type hint" -> "type declaration"

The alias for DOMException will be DOM\DOMException. Because that's the
official name per DOM spec, and otherwise importing it and using it
would be confusing with the global namespace <php>Exception</php> class
(see also
https://github.com/php/php-src/pull/9071#issuecomment-1193162754 for
precedent).

I have just read the RFC, and I find it both well written and compelling
to accept. +1 from me.

cheers,
Derick

--
https://derickrethans.nl | https://xdebug.org | https://dram.io

Author of Xdebug. Like it? Consider supporting me: https://xdebug.org/support
Host of PHP Internals News: https://phpinternals.news

mastodon: @derickr@phpc.social @xdebug@phpc.social
twitter: @derickr and @xdebug

1 year ago by Niels Dossche — view source

unread

Hello internals

I'm opening the discussion for my RFC "DOM HTML5 parsing and serialization support".
https://wiki.php.net/rfc/domdocument_html5_parser

Kind regards
Niels

Hi internals

Discussion seems to have died down.
Today, it's been 14 days since the last major change was done to the RFC (i.e. the class hierarchy update).
And it's also been close to 4 weeks since I first announced the RFC it on the mailing list.
I'd like to start the vote on Monday (20:00 PM GMT+2) and I intend to let it run for 2 weeks.
Any final complaints should be raised now.

Kind regards
Niels

1 year ago by Marc Bennewitz — view source

unread

Hi Niels,

Hi internals

Discussion seems to have died down.
Today, it's been 14 days since the last major change was done to the RFC (i.e. the class hierarchy update).
And it's also been close to 4 weeks since I first announced the RFC it on the mailing list.
I'd like to start the vote on Monday (20:00 PM GMT+2) and I intend to let it run for 2 weeks.
Any final complaints should be raised now.

Not much to complain but a question - not sure if it was discussed before.

Naming: XMLDocument::fromEmpty vs. HTMLDocument::createEmpty in the
PHP code section.

For both, XMLDocument::fromEmpty and HTMLDocument::createEmpty there
is an argument available to define the encoding but none of the other
createFrom* methods have this argument.

As far as I understand, in the these other cases the encoding gets
detected from the content of the passed source but what happens is the
source does not contain any information about the encoding?. E.g. you
load an XML/HTML document over HTTP, the encoding is defined via HTTP
header but the content itself doesn't contain it.

Kind regards
Niels

Best,
Marc

1 year ago by Niels Dossche — view source

unread

Hi Marc

Hi Niels,

Hi internals

Discussion seems to have died down.
Today, it's been 14 days since the last major change was done to the RFC (i.e. the class hierarchy update).
And it's also been close to 4 weeks since I first announced the RFC it on the mailing list.
I'd like to start the vote on Monday (20:00 PM GMT+2) and I intend to let it run for 2 weeks.
Any final complaints should be raised now.

Not much to complain but a question - not sure if it was discussed before.

Naming: XMLDocument::fromEmpty vs. HTMLDocument::createEmpty in the PHP code section.

Oops. Well spotted! This should be createEmpty everywhere.
I just checked and only in that class definition I used fromEmpty accidentally.
I fixed this now in the RFC text.
This happened when updating the method names, the emails from back then do refer to the right name though.

For both, XMLDocument::fromEmpty and HTMLDocument::createEmpty there is an argument available to define the encoding but none of the other createFrom* methods have this argument.

As far as I understand, in the these other cases the encoding gets detected from the content of the passed source but what happens is the source does not contain any information about the encoding?. E.g. you load an XML/HTML document over HTTP, the encoding is defined via HTTP header but the content itself doesn't contain it.

Right, we follow the HTML spec in this regard. Roughly speaking we determine the charset in the following order of priorities.
If one option fails, it will fall through to the next one.

The Content-Type HTTP header from which you loaded the document.
BOM sniffing in the content. I.e. UTF-8 with BOM and UTF-16 LE/BE prepend the content with byte markers. This is used to detect encoding.
Meta tag in the content.

If it could not be determined at all, UTF-8 will be assumed as it's the default in HTML.

Kind regards
Niels

Best,
Marc

Kind regards
Niels

1 year ago by tim@bastelstu.be — view source

unread

Hi

Right, we follow the HTML spec in this regard. Roughly speaking we determine the charset in the following order of priorities.
If one option fails, it will fall through to the next one.

The Content-Type HTTP header from which you loaded the document.

How would the new document classes make use of that? The HTTP header is
transmitted out-of-band with regard to the actual payload.

Is this referring to passing a http:// path to
HTMLDocument::createFromFile()? This would be unusable for everyone who
manually downloads the document, e.g. using a PSR-18 HTTP Client.

It might actually be necessary to add an encoding parameter to these
functions, but it would need to take priority over anything implicit.
The current $encoding of the global \DOMDocument has the problem that it
doesn't take priority/is ignored entirely. Manually converting the
document to UTF-8 before passing it to \DOMDocument has the problem that
the meta tag in the document takes priority.

In fact I've run into this issue before for the implementation of a rich
embed feature. We're downloading the websites using Guzzle and attempt
to make sense of them with \DOMDocument. However we can't reliably force
the encoding given within the 'content-type' response header, so in some
cases we obtain mojibake.

This encoding parameter would likely need to be ?string $encoding = null with everything non-null overwriting implicit detection and null
meaning implicit detection in the order of priorities you mentioned.

BOM sniffing in the content. I.e. UTF-8 with BOM and UTF-16 LE/BE prepend the content with byte markers. This is used to detect encoding.

Meta tag in the content.

Best regards
Tim Düsterhus

1 year ago by Niels Dossche — view source

unread

Hi Tim

Hi

Right, we follow the HTML spec in this regard. Roughly speaking we determine the charset in the following order of priorities.
If one option fails, it will fall through to the next one.

The Content-Type HTTP header from which you loaded the document.

How would the new document classes make use of that? The HTTP header is transmitted out-of-band with regard to the actual payload.

Is this referring to passing a http:// path to HTMLDocument::createFromFile()? This would be unusable for everyone who manually downloads the document, e.g. using a PSR-18 HTTP Client.

When the stream wrapper contains header information that information is used indeed.
That would unfortunately indeed mean it's unusable when manually passed in.

It might actually be necessary to add an encoding parameter to these functions, but it would need to take priority over anything implicit. The current $encoding of the global \DOMDocument has the problem that it doesn't take priority/is ignored entirely. Manually converting the document to UTF-8 before passing it to \DOMDocument has the problem that the meta tag in the document takes priority.

In fact I've run into this issue before for the implementation of a rich embed feature. We're downloading the websites using Guzzle and attempt to make sense of them with \DOMDocument. However we can't reliably force the encoding given within the 'content-type' response header, so in some cases we obtain mojibake.

This encoding parameter would likely need to be ?string $encoding = null with everything non-null overwriting implicit detection and null meaning implicit detection in the order of priorities you mentioned.

I agree.
I'll add the optional arguments ?string $override_encoding = null to XML/HTMLDocument::createFromString and XML/HTMLDocument::createFromFile.
I'd call it override_encoding to emphasize it's about overriding the behaviour.

BOM sniffing in the content. I.e. UTF-8 with BOM and UTF-16 LE/BE prepend the content with byte markers. This is used to detect encoding.

Meta tag in the content.

Best regards
Tim Düsterhus

Kinds regards
Niels

1 year ago by Dennis Snell via internals — view source

unread

For both, XMLDocument::fromEmpty and HTMLDocument::createEmpty there is an argument available to define the encoding but none of the other createFrom* methods have this argument.

As far as I understand, in the these other cases the encoding gets detected from the content of the passed source but what happens is the source does not contain any information about the encoding?. E.g. you load an XML/HTML document over HTTP, the encoding is defined via HTTP header but the content itself doesn't contain it.

Right, we follow the HTML spec in this regard. Roughly speaking we determine the charset in the following order of priorities.
If one option fails, it will fall through to the next one.

The Content-Type HTTP header from which you loaded the document.

BOM sniffing in the content. I.e. UTF-8 with BOM and UTF-16 LE/BE prepend the content with byte markers. This is used to detect encoding.

Meta tag in the content.

If it could not be determined at all, UTF-8 will be assumed as it's the default in HTML.

It may sound meticulous, but I’ve tried to emphasize createFragment() in what’s being built in WordPress because almost everything being done on HTML within WordPress, and I think within many frameworks, is processing fragments (and usually short ones at that). Formerly I didn’t realize there was much of a difference, but text encoding is one of those differences. It’s my understanding that when parsing a fragment we have to assume an encoding, unless the fragment is starting at a spot in the document before that’s discovered, presumably only if we’ve constructed a Document with a still-unknown encoding.

So manually setting the encoding of a fragment constructor is not so much overriding as it is supplying, or at least, that’s one of two normative situations. If we create a fragment with a context node carrying an encoding already, then we need to ignore any meta tag that specifies otherwise; likewise if the context node doesn’t carry that encoding we do need to heed it.

I know there’s a huge difference in needs here between people writing scripts to scrape full HTML documents, but it’s not a small fraction of cases where people want to use DOMDocument without having the full HTML from start to finish. In the world I work in it’s usually either for parsing a small fragment to add some attributes or replace a wrapping tag, or for constructing HTML programmatically to avoid escaping issues and make nesting easy. In both of these cases the text encoding is implicit unless the function signature makes it explicit. At this stage in development, we only support some of the “in body” parsing and only support UTF-8, but I thought that it was important enough to add these as arguments to the creator function so that there’s an awareness that these values govern how the parse occurs.

Surely for createFromString() and createEmpty() we can make the assumption that no character encoding is set, but I also suspect that a possible majority of the times people use these functions they are likely calling them when createFragment() is more appropriate, that they aren’t supplying HTML documents with in-band text encoding information, and so there’s a chance that de-emphasizing the parameter may be technically more accurate and practically less helpful.

Love seeing all the continued work on this!
Thank you so much for your dedication to it.

Dennis Snell

1 year ago by Niels Dossche — view source

unread

Hi Dennis

For both, XMLDocument::fromEmpty and HTMLDocument::createEmpty there is an argument available to define the encoding but none of the other createFrom* methods have this argument.

As far as I understand, in the these other cases the encoding gets detected from the content of the passed source but what happens is the source does not contain any information about the encoding?. E.g. you load an XML/HTML document over HTTP, the encoding is defined via HTTP header but the content itself doesn't contain it.

Right, we follow the HTML spec in this regard. Roughly speaking we determine the charset in the following order of priorities.
If one option fails, it will fall through to the next one.

The Content-Type HTTP header from which you loaded the document.

BOM sniffing in the content. I.e. UTF-8 with BOM and UTF-16 LE/BE prepend the content with byte markers. This is used to detect encoding.

Meta tag in the content.

If it could not be determined at all, UTF-8 will be assumed as it's the default in HTML.

It may sound meticulous, but I’ve tried to emphasize createFragment() in what’s being built in WordPress because almost everything being done on HTML within WordPress, and I think within many frameworks, is processing fragments (and usually short ones at that). Formerly I didn’t realize there was much of a difference, but text encoding is one of those differences. It’s my understanding that when parsing a fragment we have to assume an encoding, unless the fragment is starting at a spot in the document before that’s discovered, presumably only if we’ve constructed a Document with a still-unknown encoding.

Just chiming in here to say that while we don't offer a createFragment() in this proposal, it's possible to parse fragments by passing the LIBXML_HTML_NOIMPLIED option. Alternatively, in the future I plan to offer innerHTML which you could use then in conjunction with createDocumentFragment().

So manually setting the encoding of a fragment constructor is not so much overriding as it is supplying, or at least, that’s one of two normative situations. If we create a fragment with a context node carrying an encoding already, then we need to ignore any meta tag that specifies otherwise; likewise if the context node doesn’t carry that encoding we do need to heed it.

Sure, I agree it's not overriding in that specific case. In other cases it can be.
There may not be an ideal naming that works for all cases.

I know there’s a huge difference in needs here between people writing scripts to scrape full HTML documents, but it’s not a small fraction of cases where people want to use DOMDocument without having the full HTML from start to finish. In the world I work in it’s usually either for parsing a small fragment to add some attributes or replace a wrapping tag, or for constructing HTML programmatically to avoid escaping issues and make nesting easy. In both of these cases the text encoding is implicit unless the function signature makes it explicit. At this stage in development, we only support some of the “in body” parsing and only support UTF-8, but I thought that it was important enough to add these as arguments to the creator function so that there’s an awareness that these values govern how the parse occurs.

Surely for createFromString() and createEmpty() we can make the assumption that no character encoding is set, but I also suspect that a possible majority of the times people use these functions they are likely calling them when createFragment() is more appropriate, that they aren’t supplying HTML documents with in-band text encoding information, and so there’s a chance that de-emphasizing the parameter may be technically more accurate and practically less helpful.

Thanks for the insight.
To be honest, I don't have hard feelings about the naming of the parameter.
In a way you could say that override_encoding is still accurate, because the fallback default is UTF-8, so you override the fallback in a sense. As the documentation also emphasizes that the DOM extension works internally with UTF-8, this may align with expectations of programmers, but I'm not sure.
I think we should really document the parameter very well in the docs.

Love seeing all the continued work on this!
Thank you so much for your dedication to it.

Dennis Snell

Kind regards
Niels

1 year ago by Dennis Snell via internals — view source

unread

Just chiming in here to say that while we don't offer a createFragment() in this proposal, it's possible to parse fragments by passing the LIBXML_HTML_NOIMPLIED option. Alternatively, in the future I plan to offer innerHTML which you could use then in conjunction with createDocumentFragment().

It’s not my understanding that this is right here, because fragment parsing implies more than having or not having the HTML and BODY elements implicitly.

Sets HTML_PARSE_NOIMPLIED flag, which turns off the automatic adding of implied html/body... elements.

The HTML5 spec defines fragment parsing as starting within a context node which exists within a broader document. For example, many people will parse a string of HTML that should form the contents of an LI element. They are grabbing that HTML from a database somewhere, from user input. If that HTML contains “</li>” then our behavior diverges. In a fragment parser it would close out the list we started with but in full document parsing mode the end tag would be ignored, a parse error. If the goal is to ensure that user input doesn’t break out and change the page, then it’s important to use fragment parsing and grab the inner contents of that LI context node.

This can be valuable to have as a tool to guard against injection attacks or against accidentally breaking the page someone is building, because the fragment parser is aware of its environment. It becomes even more important when parsing within RCDATA or RAWTEXT sections. For example, if wanting to parse and analyze or manipulate a web page’s title then the parser should treat everything as plaintext until it reaches the end or encounters a closing TITLE tag. If trying to do this with createFromString() then it’s up to the caller to remember to prepend and then remove the environment, createFromString( ‘<title>’ . $page_title . ‘</title>’ ). The fragment parser would be similar in practice, but more explicit and hard to misunderstand in these circumstances.

This is complicated stuff. I understand that the spec provides for a wide variety of use-cases and needs, and that it’s hard to pin down exactly what a spec-compliant parser is supposed to do in all situations (it depends), so I’m only wanting to share from the perspective of people doing a lot of small HTML manipulation. There’s not much code out there using the fragment parser, but I can’t help but think that part of the reason is because it’s not exposed where it ought to be.

Have a great weekend!
Dennis Snell

1 year ago by Niels Dossche — view source

unread

Hi Dennis

Just chiming in here to say that while we don't offer a createFragment() in this proposal, it's possible to parse fragments by passing the LIBXML_HTML_NOIMPLIED option. Alternatively, in the future I plan to offer innerHTML which you could use then in conjunction with createDocumentFragment().

It’s not my understanding that this is right here, because fragment parsing implies more than having or not having the HTML and BODY elements implicitly.

Right. I plan on adding innerHTML/outerHTML in the near future. This RFC is a prerequisite for that. As those properties invoke the html fragment parser this somewhat accomplishes what you'd like.
Additionally in the future we might also expose the fragment parser in a more low-level API. Depends on the demand of users and other feature requests that come in.

Sets HTML_PARSE_NOIMPLIED flag, which turns off the automatic adding of implied html/body... elements.

The HTML5 spec defines fragment parsing as starting within a context node which exists within a broader document. For example, many people will parse a string of HTML that should form the contents of an LI element. They are grabbing that HTML from a database somewhere, from user input. If that HTML contains “</li>” then our behavior diverges. In a fragment parser it would close out the list we started with but in full document parsing mode the end tag would be ignored, a parse error. If the goal is to ensure that user input doesn’t break out and change the page, then it’s important to use fragment parsing and grab the inner contents of that LI context node.

This can be valuable to have as a tool to guard against injection attacks or against accidentally breaking the page someone is building, because the fragment parser is aware of its environment. It becomes even more important when parsing within RCDATA or RAWTEXT sections. For example, if wanting to parse and analyze or manipulate a web page’s title then the parser should treat everything as plaintext until it reaches the end or encounters a closing TITLE tag. If trying to do this with createFromString() then it’s up to the caller to remember to prepend and then remove the environment, createFromString( ‘<title>’ . $page_title . ‘</title>’ ). The fragment parser would be similar in practice, but more explicit and hard to misunderstand in these circumstances.

You're right, it is dangerous indeed to place the burden of dealing with wrapping and unwrapping on the user, as mistakes are bound to happen and they could result in very bad injection attacks.
innerHTML would help, a low-level fragment parser API maybe even more. I'd have to think about that, but that's for future work.

This is complicated stuff. I understand that the spec provides for a wide variety of use-cases and needs, and that it’s hard to pin down exactly what a spec-compliant parser is supposed to do in all situations (it depends), so I’m only wanting to share from the perspective of people doing a lot of small HTML manipulation. There’s not much code out there using the fragment parser, but I can’t help but think that part of the reason is because it’s not exposed where it ought to be.

Have a great weekend!
Dennis Snell

Thanks for the discussion and sharing your insight.

Likewise, have a great weekend.

Kind regards
Niels

1 year ago by Larry Garfield — view source

unread

Hello internals

I'm opening the discussion for my RFC "DOM HTML5 parsing and serialization support".
https://wiki.php.net/rfc/domdocument_html5_parser

Kind regards
Niels

Hi internals

Discussion seems to have died down.
Today, it's been 14 days since the last major change was done to the
RFC (i.e. the class hierarchy update).
And it's also been close to 4 weeks since I first announced the RFC it
on the mailing list.
I'd like to start the vote on Monday (20:00 PM GMT+2) and I intend to
let it run for 2 weeks.
Any final complaints should be raised now.

Kind regards
Niels

From the RFC:

\DOMDocument will also use DOM\Document as a base class to make it interchangeable with the new classes. We're only adding XMLDocument for completeness and API parity. It's a drop-in replacement for \DOMDocument, and behaves the exact same. The difference is that the API is on par with HTMLDocument, and the construction is designed to be more misuse-resistant. \DOMDocument will NOT change, and remains for the foreseeable future.

Would it make sense then for one of \DOMDocument and DOM\XMLDocument to extend the other, then? So that, eg, we can type against DOM\XMLDocument and then support both old and new classes? Or are the construction et al differences enough that is not viable?

Similarly, the constants would lose their DOM_ prefix in the namespace version, e.g. DOM\INDEX_SIZE_ERR will be an alias for DOM_INDEX_SIZE_ERR. For constants that begin with XML_ I propose to keep the prefix.

Unclear to me: Would the XML constants also be aliased into the namespace verbatim, or left globally?

Did you consider making the new classes throw exceptions rather than forcing people to remember to call another "was there an error" global function like it's still 1996? :-)

Otherwise looks good to me. Thanks!

--Larry Garfield

1 year ago by tim@bastelstu.be — view source

unread

Hi

\DOMDocument will also use DOM\Document as a base class to make it interchangeable with the new classes. We're only adding XMLDocument for completeness and API parity. It's a drop-in replacement for \DOMDocument, and behaves the exact same. The difference is that the API is on par with HTMLDocument, and the construction is designed to be more misuse-resistant. \DOMDocument will NOT change, and remains for the foreseeable future.

Would it make sense then for one of \DOMDocument and DOM\XMLDocument to extend the other, then? So that, eg, we can type against DOM\XMLDocument and then support both old and new classes? Or are the construction et al differences enough that is not viable?

As one of the persons who convinced Niels of this new API design:

\DOM\XMLDocument extends \DOMDocument doesn't work, because it would
bring in the crappy non-static load*() methods that the new API wants to
avoid, because their behavior is extremely surprising.

\DOMDocument extends \DOM\XMLDocument would probably work, but would not
provide an improvement in (backwards) compatibility.

If you want to support everything, update your type declarations to
either \DOM\Document (which is the new abstract base class) or use union
types.

Please also keep in mind that this is a new API that needs to live for
10+ years. It makes sense to make it as clean as possible and fully
opt-in, while not unnecessarily breaking compatibility. Any "extends"
hierarchy will likely result in sadness a few years down the road.
Better make a clean cut.

Best regards
Tim Düsterhus

1 year ago by Larry Garfield — view source

unread

Hi

\DOMDocument will also use DOM\Document as a base class to make it interchangeable with the new classes. We're only adding XMLDocument for completeness and API parity. It's a drop-in replacement for \DOMDocument, and behaves the exact same. The difference is that the API is on par with HTMLDocument, and the construction is designed to be more misuse-resistant. \DOMDocument will NOT change, and remains for the foreseeable future.

Would it make sense then for one of \DOMDocument and DOM\XMLDocument to extend the other, then? So that, eg, we can type against DOM\XMLDocument and then support both old and new classes? Or are the construction et al differences enough that is not viable?

As one of the persons who convinced Niels of this new API design:

\DOM\XMLDocument extends \DOMDocument doesn't work, because it would
bring in the crappy non-static load*() methods that the new API wants to
avoid, because their behavior is extremely surprising.

\DOMDocument extends \DOM\XMLDocument would probably work, but would not
provide an improvement in (backwards) compatibility.

If you want to support everything, update your type declarations to
either \DOM\Document (which is the new abstract base class) or use union
types.

Please also keep in mind that this is a new API that needs to live for
10+ years. It makes sense to make it as clean as possible and fully
opt-in, while not unnecessarily breaking compatibility. Any "extends"
hierarchy will likely result in sadness a few years down the road.
Better make a clean cut.

Fair enough, thanks for clarifying. I guess union types are "good enough" for this sort of case.

--Larry Garfield

1 year ago by Niels Dossche — view source

unread

Hi Larry

Hello internals

I'm opening the discussion for my RFC "DOM HTML5 parsing and serialization support".
https://wiki.php.net/rfc/domdocument_html5_parser

Kind regards
Niels

Hi internals

Discussion seems to have died down.
Today, it's been 14 days since the last major change was done to the
RFC (i.e. the class hierarchy update).
And it's also been close to 4 weeks since I first announced the RFC it
on the mailing list.
I'd like to start the vote on Monday (20:00 PM GMT+2) and I intend to
let it run for 2 weeks.
Any final complaints should be raised now.

Kind regards
Niels

From the RFC:

\DOMDocument will also use DOM\Document as a base class to make it interchangeable with the new classes. We're only adding XMLDocument for completeness and API parity. It's a drop-in replacement for \DOMDocument, and behaves the exact same. The difference is that the API is on par with HTMLDocument, and the construction is designed to be more misuse-resistant. \DOMDocument will NOT change, and remains for the foreseeable future.

Would it make sense then for one of \DOMDocument and DOM\XMLDocument to extend the other, then? So that, eg, we can type against DOM\XMLDocument and then support both old and new classes? Or are the construction et al differences enough that is not viable?

I agree with Tim's answer here :) (Thanks Tim!)

Similarly, the constants would lose their DOM_ prefix in the namespace version, e.g. DOM\INDEX_SIZE_ERR will be an alias for DOM_INDEX_SIZE_ERR. For constants that begin with XML_ I propose to keep the prefix.

Unclear to me: Would the XML constants also be aliased into the namespace verbatim, or left globally?

I'll clarify this.
The intention is to alias them verbatim.

Did you consider making the new classes throw exceptions rather than forcing people to remember to call another "was there an error" global function like it's still 1996? :-)

I did think about it.
Using exceptions for the parser is not viable. This is because parse errors in HTML aren't actually hard errors.
The errors are recoverable, i.e. the parser spec tells us how to proceed when an error occurred. So in a sense, they're closer to warnings. Using an exception would abort parsing.
As a side note, a good amount of the web pages out there violates at least one parsing rule, but browsers know by-spec how to proceed in that case (which is probably also why they're often not fixed).

I thought about other options as well. E.g. providing a getParseErrors() method or letting the factory methods return parse errors optionally as well.
However, I think they're not significantly better than what we have now. Furthermore, I think overhauling how the parse errors are handled in ext/dom (and maybe by extension ext/simplexml to keep consistency) is a bit of a feature creep. See also the motivation in the RFC text.
Therefore, I would keep the error handling as it is described now in the RFC.

If accepted, this RFC would land early in the 8.4 development cycle.
Therefore, we can gather feedback very early on. If we do notice a major problem in how these things are handled, they can be changed by a hypothetical future RFC in the same development cycle. That would also require thinking about the other XML extensions though.

Otherwise looks good to me. Thanks!

--Larry Garfield

Cheers
Niels

1 year ago by Larry Garfield — view source

unread

Unclear to me: Would the XML constants also be aliased into the namespace verbatim, or left globally?

I'll clarify this.
The intention is to alias them verbatim.

:thumbs up emoji:

Did you consider making the new classes throw exceptions rather than forcing people to remember to call another "was there an error" global function like it's still 1996? :-)

I did think about it.
Using exceptions for the parser is not viable. This is because parse
errors in HTML aren't actually hard errors.
The errors are recoverable, i.e. the parser spec tells us how to
proceed when an error occurred. So in a sense, they're closer to
warnings. Using an exception would abort parsing.
As a side note, a good amount of the web pages out there violates at
least one parsing rule, but browsers know by-spec how to proceed in
that case (which is probably also why they're often not fixed).

I thought about other options as well. E.g. providing a
getParseErrors() method or letting the factory methods return parse
errors optionally as well.
However, I think they're not significantly better than what we have
now. Furthermore, I think overhauling how the parse errors are handled
in ext/dom (and maybe by extension ext/simplexml to keep consistency)
is a bit of a feature creep. See also the motivation in the RFC text.
Therefore, I would keep the error handling as it is described now in
the RFC.

If accepted, this RFC would land early in the 8.4 development cycle.
Therefore, we can gather feedback very early on. If we do notice a
major problem in how these things are handled, they can be changed by a
hypothetical future RFC in the same development cycle. That would also
require thinking about the other XML extensions though.

All valid, thanks. It would be helpful to include some mention of that in the Errors section to make it clear why exceptions right now are not an option, given the push elsewhere to get rid of the old error handling approaches in favor of exceptions.

--Larry Garfield

1 year ago by Niels Dossche — view source

unread

Unclear to me: Would the XML constants also be aliased into the namespace verbatim, or left globally?

I'll clarify this.
The intention is to alias them verbatim.

:thumbs up emoji:

Did you consider making the new classes throw exceptions rather than forcing people to remember to call another "was there an error" global function like it's still 1996? :-)

I did think about it.
Using exceptions for the parser is not viable. This is because parse
errors in HTML aren't actually hard errors.
The errors are recoverable, i.e. the parser spec tells us how to
proceed when an error occurred. So in a sense, they're closer to
warnings. Using an exception would abort parsing.
As a side note, a good amount of the web pages out there violates at
least one parsing rule, but browsers know by-spec how to proceed in
that case (which is probably also why they're often not fixed).

I thought about other options as well. E.g. providing a
getParseErrors() method or letting the factory methods return parse
errors optionally as well.
However, I think they're not significantly better than what we have
now. Furthermore, I think overhauling how the parse errors are handled
in ext/dom (and maybe by extension ext/simplexml to keep consistency)
is a bit of a feature creep. See also the motivation in the RFC text.
Therefore, I would keep the error handling as it is described now in
the RFC.

If accepted, this RFC would land early in the 8.4 development cycle.
Therefore, we can gather feedback very early on. If we do notice a
major problem in how these things are handled, they can be changed by a
hypothetical future RFC in the same development cycle. That would also
require thinking about the other XML extensions though.

All valid, thanks. It would be helpful to include some mention of that in the Errors section to make it clear why exceptions right now are not an option, given the push elsewhere to get rid of the old error handling approaches in favor of exceptions.

Ack, will do! :)

--Larry Garfield

Cheers
Niels