Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:121188
Message-ID: <4a50ed71-ee63-4db3-86d6-e463b551aa34@gmail.com>
Date: Fri, 29 Sep 2023 23:55:00 +0200
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Content-Language: en-US
To: Dennis Snell <dennis.snell@automattic.com>
Cc: internals@lists.php.net
References: <0c2d7e79-2c74-4263-81ec-ac8832ca50bb@gmail.com>
 <5F88257F-58CB-45AA-B2CC-ABEA19AD7474@automattic.com>
 <574563c3-6970-43ce-ad4d-860975a8602f@gmail.com>
 <23E1FB16-8ED9-4EF5-B5E2-D9136AF638D2@automattic.com>
In-Reply-To: <23E1FB16-8ED9-4EF5-B5E2-D9136AF638D2@automattic.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Subject: Re: [PHP-DEV] Re: [RFC] [Discussion] DOM HTML5 parsing and
 serialization support
From: dossche.niels@gmail.com (Niels Dossche)

Hi Dennis

On 9/29/23 23:38, Dennis Snell wrote:
>> Just chiming in here to say that while we don't offer a createFragment() in this proposal, it's possible to parse fragments by passing the LIBXML_HTML_NOIMPLIED option. Alternatively, in the future I plan to offer innerHTML which you could use then in conjunction with createDocumentFragment().
> 
> 
> It’s not my understanding that this is right here, because fragment parsing implies more than having or not having the HTML and BODY elements implicitly.

Right. I plan on adding innerHTML/outerHTML in the near future, and this RFC is a prerequisite for that. Since those properties invoke the HTML fragment parser, they would go some way towards what you'd like.
Additionally, we might expose the fragment parser through a lower-level API in the future. That depends on user demand and on the other feature requests that come in.

> 
> 
>>  Sets HTML_PARSE_NOIMPLIED flag, which turns off the automatic adding of implied html/body... elements.
> 
> 
> The HTML5 spec defines fragment parsing as starting within a context node which exists within a broader document. For example, many people will parse a string of HTML that should form the contents of an LI element. They are grabbing that HTML from a database somewhere, from user input. If that HTML contains “</li>” then our behavior diverges. In a fragment parser it would close out the list we started with but in full document parsing mode the end tag would be ignored, a parse error. If the goal is to ensure that user input doesn’t break out and change the page, then it’s important to use fragment parsing and grab the inner contents of that LI context node.
> 
> 
> This can be valuable to have as a tool to guard against injection attacks or against accidentally breaking the page someone is building, because the fragment parser is aware of its environment. It becomes even more important when parsing within RCDATA or RAWTEXT sections. For example, if wanting to parse and analyze or manipulate a web page’s title then the parser should treat everything as plaintext until it reaches the end or encounters a closing TITLE tag. If trying to do this with `createFromString()` then it’s up to the caller to remember to prepend and then remove the environment, `createFromString( '<title>' . $page_title . '</title>' )`. The fragment parser would be similar in practice, but more explicit and hard to misunderstand in these circumstances.
> 

You're right, it is indeed dangerous to place the burden of wrapping and unwrapping on the user: mistakes are bound to happen, and they could result in serious injection vulnerabilities.
innerHTML would help, and a low-level fragment parser API perhaps even more. I'd have to think about that, but it's future work.
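
To make the hazard concrete, here is a minimal sketch of the wrapping approach going wrong. It assumes the class from this RFC is available as `DOM\HTMLDocument` with the `createFromString()` method and `LIBXML_NOERROR` option as proposed; `parse_wrapped_title()` is a hypothetical helper made up for illustration, not part of any API.

```php
<?php
// Sketch only: assumes the class proposed in this RFC is available as
// DOM\HTMLDocument. parse_wrapped_title() is a hypothetical helper.
function parse_wrapped_title(string $page_title): DOM\HTMLDocument
{
    // The caller has to remember to add the <title> environment by hand.
    // LIBXML_NOERROR silences parse error warnings (e.g. the missing doctype).
    return DOM\HTMLDocument::createFromString(
        '<title>' . $page_title . '</title>',
        LIBXML_NOERROR
    );
}

// Markup inside <title> is RCDATA, so this input stays plain text.
$benign = parse_wrapped_title('Hello <b>world</b>');
// A closing tag in the input ends the hand-written wrapper early.
$hostile = parse_wrapped_title('Hi</title><script>alert(1)</script>');

$benign_scripts  = $benign->getElementsByTagName('script')->length;
$hostile_scripts = $hostile->getElementsByTagName('script')->length;
$hostile_title   = $hostile->getElementsByTagName('title')->item(0)->textContent;

printf("benign: %d script element(s)\n", $benign_scripts);
printf(
    "hostile: %d script element(s), title = %s\n",
    $hostile_scripts,
    var_export($hostile_title, true)
);
```

Under the HTML5 tree construction rules I would expect the first input to produce no script element, while the second one ends the title at "Hi" and leaves a real script element in the document. That is the kind of breakout that is easy to miss when the wrapping is done by hand.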

> 
> This is complicated stuff. I understand that the spec provides for a wide variety of use-cases and needs, and that it’s hard to pin down exactly what a spec-compliant parser is supposed to do in all situations (it depends), so I’m only wanting to share from the perspective of people doing a lot of small HTML manipulation. There’s not much code out there using the fragment parser, but I can’t help but think that part of the reason is because it’s not exposed where it ought to be.
> 
> 
> Have a great weekend!
> Dennis Snell
>>
> 

Thanks for the discussion and sharing your insight.

Likewise, have a great weekend.

Kind regards
Niels