Newsgroups: php.internals Path: news.php.net Xref: news.php.net php.internals:122060 Return-Path: Delivered-To: mailing list internals@lists.php.net Received: (qmail 19958 invoked from network); 29 Dec 2023 16:58:35 -0000 Received: from unknown (HELO php-smtp4.php.net) (45.112.84.5) by pb1.pair.com with SMTP; 29 Dec 2023 16:58:35 -0000 Received: from php-smtp4.php.net (localhost [127.0.0.1]) by php-smtp4.php.net (Postfix) with ESMTP id C0A8918006E for ; Fri, 29 Dec 2023 08:59:01 -0800 (PST) X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on php-smtp4.php.net X-Spam-Level: X-Spam-Status: No, score=-2.8 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,DMARC_MISSING,RCVD_IN_DNSWL_LOW, RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL,SPF_HELO_PASS,SPF_NONE autolearn=no autolearn_force=no version=4.0.0 X-Spam-Virus: No X-Envelope-From: Received: from out1-smtp.messagingengine.com (out1-smtp.messagingengine.com [66.111.4.25]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by php-smtp4.php.net (Postfix) with ESMTPS for ; Fri, 29 Dec 2023 08:59:01 -0800 (PST) Received: from compute1.internal (compute1.nyi.internal [10.202.2.41]) by mailout.nyi.internal (Postfix) with ESMTP id 80F5A5C00CB for ; Fri, 29 Dec 2023 11:58:32 -0500 (EST) Received: from imap50 ([10.202.2.100]) by compute1.internal (MEProxy); Fri, 29 Dec 2023 11:58:32 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= garfieldtech.com; h=cc:content-type:content-type:date:date:from :from:in-reply-to:in-reply-to:message-id:mime-version:references :reply-to:subject:subject:to:to; s=fm2; t=1703869112; x= 1703955512; bh=04wQkPtDwdGJI3TjO5iaw/84KG+jRq3QDDnFn5sSc3Q=; b=m txbs2Vu1j90gMoSMDzRGiFxm+fEMYPLlMerUV7uSfo0zoENLkMQtRdZ6KiW22DFg 5MZHAZXJNbPRnweyRIxyeOZ5IULazp+WUd6wngeBsE1NbSQ63nyh2iRi+VqBhw3K +d5DisLjNYT3+trdOkBU6PYbDSIyhYwPkZ8HqUYL9CtN15bTurjpt1ZLGT6tvVOD tVNPshL91Q0DtWuW/FiMi+oCxq7AGYuu0ooIyTniDYzCEtvYQuTwxIOoKULiyKLq xKuhqXF7/j89cEcfirs/JJzVIwwV0rPSHn5PMSUJJCmSU8EV22SJaFom3Jmiwp47 XfKd5RaQyLXQOGbAB2PrQ== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:content-type:content-type:date:date :feedback-id:feedback-id:from:from:in-reply-to:in-reply-to :message-id:mime-version:references:reply-to:subject:subject:to :to:x-me-proxy:x-me-proxy:x-me-sender:x-me-sender:x-sasl-enc; s= fm2; t=1703869112; x=1703955512; bh=04wQkPtDwdGJI3TjO5iaw/84KG+j Rq3QDDnFn5sSc3Q=; b=EsgYnRMHi5Xp9lfKmeiHHenGKP2tUxwI2S2ZhsNN2l0P mSWEdC50qVvuKitNdI6hEceP2pr0v3cYHm+Eufzm7cH7yiYMDYH3ianEL9ADvC5X LeD2fKr5zpnXKmOOSEW9BdQxksz525vAdG15o3TmHYPcNZj2h5/JTW7uV7pz7p/n U57sWkdRwfLH4KtycBiV8tWbRB256YFBhVdVsGamFl6BvZlM+k+xOc11r6qOh7vL pH72ZYZVmI/NTbd4mIt5jhzmyb7RKEecmWWvMHRA6VCjstwbn7KPQhFXTBaDN64e M1NUmO86vZ422/NnVAvaO0T3IBQ4GrAD+kxvKp04fQ== X-ME-Sender: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgedvkedrvdeffedgleeiucetufdoteggodetrfdotf fvucfrrhhofhhilhgvmecuhfgrshhtofgrihhlpdfqfgfvpdfurfetoffkrfgpnffqhgen uceurghilhhouhhtmecufedttdenucesvcftvggtihhpihgvnhhtshculddquddttddmne cujfgurhepofgfggfkjghffffhvffutgesthdtredtreertdenucfhrhhomhepfdfnrghr rhihucfirghrfhhivghlugdfuceolhgrrhhrhiesghgrrhhfihgvlhguthgvtghhrdgtoh hmqeenucggtffrrghtthgvrhhnpeehfffgudehjeeglefhgeetkeevjeeigefhhfdvgfdt tdegkeeljeekveeklefhgeenucffohhmrghinhepghhithhhuhgsrdgtohhmpdgsuhhgsh drmhgupdhphhhprdhnvghtpdgvgihtvghrnhgrlhhsrdhiohdpmhgvughirgifihhkihdr ohhrghenucevlhhushhtvghrufhiiigvpedtnecurfgrrhgrmhepmhgrihhlfhhrohhmpe hlrghrrhihsehgrghrfhhivghlughtvggthhdrtghomh X-ME-Proxy: Feedback-ID: i8414410d:Fastmail Received: by mailuser.nyi.internal (Postfix, from userid 501) id 34EE61700093; Fri, 29 Dec 2023 11:58:31 -0500 (EST) X-Mailer: MessagingEngine.com Webmail Interface User-Agent: Cyrus-JMAP/3.9.0-alpha0-1364-ga51d5fd3b7-fm-20231219.001-ga51d5fd3 MIME-Version: 1.0 Message-ID: <8632ff2a-0169-4cbb-b5d8-3bafb841f1ee@app.fastmail.com> In-Reply-To: <756bcf2b-f98d-4203-9004-1cbfd402337a@gmail.com> References: <756bcf2b-f98d-4203-9004-1cbfd402337a@gmail.com> Date: Fri, 29 Dec 2023 10:58:11 -0600 To: "php internals" Content-Type: text/plain Subject: Re: [PHP-DEV] Pre-RFC: Fixing spec bugs in the DOM extension From: larry@garfieldtech.com ("Larry Garfield") On Tue, Dec 26, 2023, at 3:45 PM, Niels Dossche wrote: > Hi internals > > The DOM extension in PHP is used to parse, query and manipulate > XML/HTML documents. The DOM extension is based on the DOM specification. > Originally this was the DOM Core Level 3 specification, but nowadays, > that specification has evolved into the current "Living Specification" > maintained by WHATWG. > > Unfortunately, there are many bugs in PHP's DOM extension. Most of > those bugs are related to namespace and attribute handling. This leads > to people trying to work around those bugs by relying on more bugs, or > on undocumented side-effects of incorrect behaviour, leading to even > more issues in the end. Furthermore, some of these bugs may have > security implications [1]. > > Some of these bugs are caused because the method or property was > implemented incorrectly back in the day, or because the original > specification used to be unclear. A smaller part of this is because the > specification has made breaking changes when HTML 5 first came along > and the specification creators had to unify what browsers implemented > into a single specification that everyone agreed on. > > It's not possible to "just fix" these bugs because people actually > _rely_ on these bugs. They are also often unaware that what they're > doing is actually incorrect or causes the internal document state to be > inconsistent. We therefore have to fix this in a backwards-compatible > way: i.e. a hard requirement is that all code written for the current > DOM extension keeps working without requiring changes. > In short: the main problem is that 20 years of buggy behaviour means > that the bugs have become ingrained into the system. > > Some people have implemented userland DOM libraries on top of the > existing DOM extension. However, even userland solutions can't fully > work around issues caused by PHP's DOM extension. The real solution is > to provide a BC-preserving fix at PHP's side. > > Roughly 1.5 months ago I merged my HTML 5 RFC [2] into the PHP 8.4 > development branch. This RFC introduced new document classes: > DOM\HTMLDocument and DOM\XMLDocument. The idea here was to preserve > backwards compatibility: if the user wants to keep using HTML 4, they > can keep using the DOMDocument class. Also, when the user wants to work > with HTML 5 and are currently using workarounds, they can migrate on > their own pace (without deprecations or anything) to the new classes. > New code can use DOM\{HTML,XML}Document from the start without touching > the old classes. > > The HTML 5 RFC has left us with an interesting opportunity to also > introduce the spec bugfixes in a BC-preserving way. The idea is that > when the new DOM\{HTML,XML}Document classes are used, then the DOM > extension will follow the DOM specification and therefore get rid of > bugs. When you are using the DOMDocument class, the old implementations > will be used. This means that backwards compatibility is kept. > > For the past 2.5 weeks I've been working on getting all spec bugs that > I know of fixed. The full list of bugs that this proposal fixes can be > found here: > https://github.com/nielsdos/php-src/blob/dom-spec-compliance-pub/bugs.md. > I also found some discussion [3] from some years ago where C. Scott > shared a list of problems they encountered at Wikimedia [4]. All > behavioural issues are fixed in my PR [5], although my PR could always > use more testing. Currently I have tested that existing DOM code does > not break (I have tested veewee's XML library, Mensbeam library, some > SimpleSAML libraries). I have added tests to test the new > spec-compliant behaviour. I also ported some of the WHATWG's WPT DOM > tests (DOM spec-compliance testsuite) to PHP and those that I've ported > all pass [6]. > > Implementation PR can be found here: https://github.com/php/php-src/pull/13031 > > Note that this is not a new extension, but an improvement to the > existing DOM extension. As for "why not an entirely new extension?", > please see the reasoning in my HTML 5 RFC. All interactions with > SimpleXML, XSL, XPath etc will remain possible like you are used to. > Implementation-wise, a lot of code internally is shared between the > spec-compliant and old implementations. > > I intend to put this up for RFC. There is however one last detail that > needs to be cleared up: what about "type issues"? > To give an example of a "type issue": there is a `string > DOMNode::$prefix` property. DOM spec tells us that this should be > nullable: when there is no prefix for a node, the prefix should return > NULL. However, because the property is a string, this currently returns > an empty string instead in PHP. Not a big deal maybe, but there's many > of these subtle inconsistencies: null vs false return value, arguments > that should accept `?string` instead of `string`, etc. > Sadly, it's not possible to fix the typing issues for properties and > methods for DOMNode, DOMElement, ... because of BC: properties and > methods can be overridden. > Or is it? > > Currently, as a result of the HTML 5 RFC, the new > DOM\{HTML,XML}Document classes keep using the DOMNode, DOMElement, ... > classes. > For consistency, the DOMNode etc class were aliased to the DOM > namespace, i.e. DOM\Node is an alias for DOMNode, DOM\Element an alias > for DOMElement etc. > Being an alias, this means that fixing types for DOM\Node is not > possible because it's really just another name for DOMNode, so changing > it for DOM\Node means changing it for DOMNode. > _Unless_ we no longer alias the classes but make them proper classes > instead. This means we can fix the typing for DOM\Node while keeping > DOMNode untouched, preserving BC. The downside is that it becomes more > difficult for interoperability. One of the reasons the HTML 5 RFC > introduced aliases instead of proper classes is so that code taking a > DOMNode as an argument could also be passed a DOM\Node. However, if we > make it a proper class instead, such code has to either transition > fully to the new DOM classes _or_ use a type union, e.g. > DOMNode|DOM\Node. > In my opinion, having them become proper classes instead of aliases has > my preference: either we fix everything in one go now while we have the > opportunity, or never. > > Let me know what you think, especially regarding the type issues. > > Kind regards > Niels > > [1] https://github.com/php/php-src/issues/8388 > [2] https://wiki.php.net/rfc/domdocument_html5_parser > [3] https://externals.io/message/104687 > [4] https://www.mediawiki.org/wiki/Parsoid/PHP/Help_wanted > [5] https://github.com/php/php-src/pull/13031 > [6] https://github.com/nielsdos/wpt/tree/master/dom/php-out (yes, this > is a dirty port) I am also on team "yes, let's just do it right." If that means the new classes are only 99% drop ins for the old ones, I'm OK with that. People can switch over when they're ready and do all the clean up at once. I'm not sure about making things final. I don't know the domain space well enough to have a strong opinion at the moment, but my main concern would be ensuring that it's still extensible in reasonable ways. Eg, if I wanted to add a Web Component element to a page, I want to do that without fugly workarounds. I don't have a strong opinion at this point on what the right way to do that is. --Larry Garfield