Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:121186
Message-ID: <574563c3-6970-43ce-ad4d-860975a8602f@gmail.com>
Date: Fri, 29 Sep 2023 23:18:08 +0200
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Content-Language: en-US
To: Dennis Snell <dennis.snell@automattic.com>
Cc: internals@lists.php.net
References: <0c2d7e79-2c74-4263-81ec-ac8832ca50bb@gmail.com>
 <5F88257F-58CB-45AA-B2CC-ABEA19AD7474@automattic.com>
In-Reply-To: <5F88257F-58CB-45AA-B2CC-ABEA19AD7474@automattic.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Subject: Re: [PHP-DEV] Re: [RFC] [Discussion] DOM HTML5 parsing and
 serialization support
From: dossche.niels@gmail.com (Niels Dossche)

Hi Dennis

On 9/29/23 20:20, Dennis Snell wrote:
>>
>>>
>>> For both, `XMLDocument::fromEmpty` and `HTMLDocument::createEmpty` there is an argument available to define the encoding but none of the other `createFrom*` methods have this argument.
>>>
>>> As far as I understand, in these other cases the encoding gets detected from the content of the passed source, but what happens if the source does not contain any information about the encoding? E.g. you load an XML/HTML document over HTTP, the encoding is defined via HTTP header but the content itself doesn't contain it.
>>>
>>
>> Right, we follow the HTML spec in this regard. Roughly speaking we determine the charset in the following order of priorities.
>> If one option fails, it will fall through to the next one.
>> 1. The Content-Type HTTP header from which you loaded the document.
>> 2. BOM sniffing in the content. I.e. UTF-8 with BOM and UTF-16 LE/BE prefix the content with a byte order mark, which is used to detect the encoding.
>> 3. Meta tag in the content.
>>
>> If it could not be determined at all, UTF-8 will be assumed as it's the default in HTML.
> 
> It may sound meticulous, but I’ve tried to emphasize `createFragment()` in what’s being built in WordPress because almost everything being done on HTML within WordPress, and I think within many frameworks, is processing fragments (and usually short ones at that). Formerly I didn’t realize there was much of a difference, but text encoding is one of those differences. It’s my understanding that when parsing a fragment we have to assume an encoding, unless the fragment is starting at a spot in the document before that’s discovered, presumably only if we’ve constructed a Document with a still-unknown encoding.
> 

Just chiming in here to say that while we don't offer a createFragment() in this proposal, it's possible to parse fragments by passing the LIBXML_HTML_NOIMPLIED option. Alternatively, in the future I plan to offer innerHTML, which you could then use in conjunction with createDocumentFragment().

> So manually setting the encoding of a fragment constructor is not so much overriding as it is supplying, or at least, that’s one of two normative situations. If we create a fragment with a context node carrying an encoding already, then we need to ignore any meta tag that specifies otherwise; likewise if the context node doesn’t carry that encoding we do need to heed it.

Sure, I agree it's not overriding in that specific case. In other cases it can be.
There may not be an ideal naming that works for all cases.

> 
> I know there’s a huge difference in needs here between people writing scripts to scrape full HTML documents, but it’s not a small fraction of cases where people want to use DOMDocument without having the full HTML from start to finish. In the world I work in it’s usually either for parsing a small fragment to add some attributes or replace a wrapping tag, or for constructing HTML programmatically to avoid escaping issues and make nesting easy. In both of these cases the text encoding is implicit unless the function signature makes it explicit. At this stage in development, we only support some of the “in body” parsing and only support UTF-8, but I thought that it was important enough to add these as arguments to the creator function so that there’s an awareness that these values govern how the parse occurs.
> 
> Surely for `createFromString()` and `createEmpty()` we can make the assumption that no character encoding is set, but I also suspect that a possible majority of the times people use these functions they are likely calling them when `createFragment()` is more appropriate, that they aren’t supplying HTML documents with in-band text encoding information, and so there’s a chance that de-emphasizing the parameter may be technically more accurate and practically less helpful.

Thanks for the insight.
To be honest, I don't have strong feelings about the naming of the parameter.
In a way you could say that override_encoding is still accurate, because the fallback default is UTF-8, so you override the fallback in a sense. As the documentation also emphasizes that the DOM extension works internally with UTF-8, this may align with expectations of programmers, but I'm not sure.
I think we should document the parameter really well.
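
To make the fallback order concrete, here is a rough sketch of the priority list quoted earlier in this thread (Content-Type header, then BOM, then meta tag, then UTF-8). It is written in Python purely for illustration and is not the extension's implementation: the function name sniff_encoding, its parameters, the regexes, and the 1024-byte limit for the meta scan are all assumptions of the sketch, and real parsers handle many more edge cases.

```python
import re
from typing import Optional

# Byte order marks looked at in step 2 (illustrative subset).
_BOMS = (
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16BE"),
    (b"\xff\xfe", "UTF-16LE"),
)

# Simplified patterns; a real implementation follows the HTML spec's prescan.
_HEADER_CHARSET = re.compile(r"charset\s*=\s*\"?([\w.:-]+)", re.I)
_META_CHARSET = re.compile(rb"<meta[^>]*charset\s*=\s*[\"']?([\w.:-]+)", re.I)


def sniff_encoding(content: bytes, content_type: Optional[str] = None) -> str:
    """Return an encoding label using the priority order from this thread."""
    # 1. Charset from the Content-Type HTTP header, if one was supplied.
    if content_type:
        match = _HEADER_CHARSET.search(content_type)
        if match:
            return match.group(1)

    # 2. BOM sniffing at the very start of the content.
    for bom, name in _BOMS:
        if content.startswith(bom):
            return name

    # 3. Meta tag near the start of the content (assumed 1024-byte window).
    match = _META_CHARSET.search(content[:1024])
    if match:
        return match.group(1).decode("ascii")

    # Nothing found: fall back to UTF-8, the default in HTML.
    return "UTF-8"
```

A short fragment such as b"<p>hi</p>" carries none of the three hints, so it ends up on the UTF-8 fallback. That is the case where a caller-supplied encoding is supplying rather than overriding.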

> 
> Love seeing all the continued work on this!
> Thank you so much for your dedication to it.
> 
> Dennis Snell

Kind regards
Niels