Newsgroups: php.internals Path: news.php.net Xref: news.php.net php.internals:122059 Return-Path: Delivered-To: mailing list internals@lists.php.net Received: (qmail 12317 invoked from network); 29 Dec 2023 14:41:01 -0000 Received: from unknown (HELO php-smtp4.php.net) (45.112.84.5) by pb1.pair.com with SMTP; 29 Dec 2023 14:41:01 -0000 Received: from php-smtp4.php.net (localhost [127.0.0.1]) by php-smtp4.php.net (Postfix) with ESMTP id A817818005B for ; Fri, 29 Dec 2023 06:41:27 -0800 (PST) X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on php-smtp4.php.net X-Spam-Level: X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,DMARC_PASS,FREEMAIL_FROM, HTML_MESSAGE,RCVD_IN_DNSWL_NONE,RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL, SPF_HELO_NONE,SPF_PASS autolearn=no autolearn_force=no version=4.0.0 X-Spam-Virus: No X-Envelope-From: Received: from mail-pj1-f43.google.com (mail-pj1-f43.google.com [209.85.216.43]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by php-smtp4.php.net (Postfix) with ESMTPS for ; Fri, 29 Dec 2023 06:41:27 -0800 (PST) Received: by mail-pj1-f43.google.com with SMTP id 98e67ed59e1d1-28c0536806fso4909606a91.0 for ; Fri, 29 Dec 2023 06:40:59 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1703860858; x=1704465658; darn=lists.php.net; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=wiUAwWYoy7BQX9LKeWOsXvUgTl01Nnulvwu3BzIT4LE=; b=dyGfswYh+C5Ii/pfL+KQvVBWbjzbMPPLq95+2zs8VBWkd6nC8No8ckuDWLnF8aFe67 yu1QSzEeLGA6w/A0bgtgDQrQZztk0xUHoSjYgGKTGfzqAPYx137KDuDGoKdvrMZYU1tl PDXGx4aHyD+7JYbPENSs4iROSQRQ7O5K0Gws6/Dxzy6YBzv53Agj7kShrfEM0ubV864m 8390SrQi6co3nJF2qmmEKZ/jMhjTTF2UvMCth5rxzOR//yih+2sqRx7t1QGZ1jSDFUke uEelQ5iPQUWc0CUVodU3RfyqtQ/cK0KeldeYUZl14Eoj/OGClxQ8suXMTfB1aYBMt3Ef HgKw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1703860858; x=1704465658; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=wiUAwWYoy7BQX9LKeWOsXvUgTl01Nnulvwu3BzIT4LE=; b=CbvWgDeCkTpq8YWOqsmZN9RBtn+xyx/XeksPhTbMcQETYbqQ7REU9vMdRL65Jr9wSH Dbi5wyNTMvRKD5tKQrS3gPfzWlIyDAvINgBXEVJp2QfjWMe45GFI4/CJ71bPT+7BfKz1 x49/JkvnoIuwUHkWV6RzQ0NbZ7GYayLftTjCJDHoJkO/ct+8OSVdst7ZWLSsfnjemSyL a8osnOTSl19f72wnWFfdr2+SHza2OxJQ4UKUxux+Qg5DzWjD4ZwFPbWUWtXd+h+Jtq90 luK0Tj5hV4S+TI6nnmcPDPM/A5+jHf6xnjgq93gCkSU4vKUxFfWiS/ajVRjZMg9MOLX6 UWtQ== X-Gm-Message-State: AOJu0YxhIYuz6xDbfnznBoib3MizkXuPud91F+frsCjZ0g26h9C8eO6Z KLamHcBCvYE1PRuoN5f6+9kkafQ+T3Y3hyP4ULE= X-Google-Smtp-Source: AGHT+IET+Lw2m/9HGaUdekf+Rk/HXPum0fnNnQ02TSvHr7GWwwu+V4Vsnqkk5zZC30z5Kgz4K30am4Dw/0trpsjMCx8= X-Received: by 2002:a17:90a:6585:b0:28c:a437:5ce6 with SMTP id k5-20020a17090a658500b0028ca4375ce6mr1545022pjj.73.1703860858196; Fri, 29 Dec 2023 06:40:58 -0800 (PST) MIME-Version: 1.0 References: <756bcf2b-f98d-4203-9004-1cbfd402337a@gmail.com> In-Reply-To: <756bcf2b-f98d-4203-9004-1cbfd402337a@gmail.com> Date: Fri, 29 Dec 2023 14:40:46 +0000 Message-ID: To: Niels Dossche Cc: PHP internals Content-Type: multipart/alternative; boundary="000000000000331731060da705da" Subject: Re: [PHP-DEV] Pre-RFC: Fixing spec bugs in the DOM extension From: george.banyard@gmail.com ("G. P. B.") --000000000000331731060da705da Content-Type: text/plain; charset="UTF-8" On Tue, 26 Dec 2023 at 21:45, Niels Dossche wrote: > Hi internals > > The DOM extension in PHP is used to parse, query and manipulate XML/HTML > documents. The DOM extension is based on the DOM specification. > Originally this was the DOM Core Level 3 specification, but nowadays, that > specification has evolved into the current "Living Specification" > maintained by WHATWG. > > Unfortunately, there are many bugs in PHP's DOM extension. Most of those > bugs are related to namespace and attribute handling. This leads to people > trying to work around those bugs by relying on more bugs, or on > undocumented side-effects of incorrect behaviour, leading to even more > issues in the end. Furthermore, some of these bugs may have security > implications [1]. > > Some of these bugs are caused because the method or property was > implemented incorrectly back in the day, or because the original > specification used to be unclear. A smaller part of this is because the > specification has made breaking changes when HTML 5 first came along and > the specification creators had to unify what browsers implemented into a > single specification that everyone agreed on. > > It's not possible to "just fix" these bugs because people actually _rely_ > on these bugs. They are also often unaware that what they're doing is > actually incorrect or causes the internal document state to be > inconsistent. We therefore have to fix this in a backwards-compatible way: > i.e. a hard requirement is that all code written for the current DOM > extension keeps working without requiring changes. > In short: the main problem is that 20 years of buggy behaviour means that > the bugs have become ingrained into the system. > > Some people have implemented userland DOM libraries on top of the existing > DOM extension. However, even userland solutions can't fully work around > issues caused by PHP's DOM extension. The real solution is to provide a > BC-preserving fix at PHP's side. > > Roughly 1.5 months ago I merged my HTML 5 RFC [2] into the PHP 8.4 > development branch. This RFC introduced new document classes: > DOM\HTMLDocument and DOM\XMLDocument. The idea here was to preserve > backwards compatibility: if the user wants to keep using HTML 4, they can > keep using the DOMDocument class. Also, when the user wants to work with > HTML 5 and are currently using workarounds, they can migrate on their own > pace (without deprecations or anything) to the new classes. New code can > use DOM\{HTML,XML}Document from the start without touching the old classes. > > The HTML 5 RFC has left us with an interesting opportunity to also > introduce the spec bugfixes in a BC-preserving way. The idea is that when > the new DOM\{HTML,XML}Document classes are used, then the DOM extension > will follow the DOM specification and therefore get rid of bugs. When you > are using the DOMDocument class, the old implementations will be used. This > means that backwards compatibility is kept. > > For the past 2.5 weeks I've been working on getting all spec bugs that I > know of fixed. The full list of bugs that this proposal fixes can be found > here: > https://github.com/nielsdos/php-src/blob/dom-spec-compliance-pub/bugs.md. > I also found some discussion [3] from some years ago where C. Scott shared > a list of problems they encountered at Wikimedia [4]. All behavioural > issues are fixed in my PR [5], although my PR could always use more > testing. Currently I have tested that existing DOM code does not break (I > have tested veewee's XML library, Mensbeam library, some SimpleSAML > libraries). I have added tests to test the new spec-compliant behaviour. I > also ported some of the WHATWG's WPT DOM tests (DOM spec-compliance > testsuite) to PHP and those that I've ported all pass [6]. > > Implementation PR can be found here: > https://github.com/php/php-src/pull/13031 > > Note that this is not a new extension, but an improvement to the existing > DOM extension. As for "why not an entirely new extension?", please see the > reasoning in my HTML 5 RFC. All interactions with SimpleXML, XSL, XPath etc > will remain possible like you are used to. Implementation-wise, a lot of > code internally is shared between the spec-compliant and old > implementations. > > I intend to put this up for RFC. There is however one last detail that > needs to be cleared up: what about "type issues"? > To give an example of a "type issue": there is a `string DOMNode::$prefix` > property. DOM spec tells us that this should be nullable: when there is no > prefix for a node, the prefix should return NULL. However, because the > property is a string, this currently returns an empty string instead in > PHP. Not a big deal maybe, but there's many of these subtle > inconsistencies: null vs false return value, arguments that should accept > `?string` instead of `string`, etc. > Sadly, it's not possible to fix the typing issues for properties and > methods for DOMNode, DOMElement, ... because of BC: properties and methods > can be overridden. > Or is it? > > Currently, as a result of the HTML 5 RFC, the new DOM\{HTML,XML}Document > classes keep using the DOMNode, DOMElement, ... classes. > For consistency, the DOMNode etc class were aliased to the DOM namespace, > i.e. DOM\Node is an alias for DOMNode, DOM\Element an alias for DOMElement > etc. > Being an alias, this means that fixing types for DOM\Node is not possible > because it's really just another name for DOMNode, so changing it for > DOM\Node means changing it for DOMNode. > _Unless_ we no longer alias the classes but make them proper classes > instead. This means we can fix the typing for DOM\Node while keeping > DOMNode untouched, preserving BC. The downside is that it becomes more > difficult for interoperability. One of the reasons the HTML 5 RFC > introduced aliases instead of proper classes is so that code taking a > DOMNode as an argument could also be passed a DOM\Node. However, if we make > it a proper class instead, such code has to either transition fully to the > new DOM classes _or_ use a type union, e.g. DOMNode|DOM\Node. > In my opinion, having them become proper classes instead of aliases has my > preference: either we fix everything in one go now while we have the > opportunity, or never. > > Let me know what you think, especially regarding the type issues. > > Kind regards > Niels > > [1] https://github.com/php/php-src/issues/8388 > [2] https://wiki.php.net/rfc/domdocument_html5_parser > [3] https://externals.io/message/104687 > [4] https://www.mediawiki.org/wiki/Parsoid/PHP/Help_wanted > [5] https://github.com/php/php-src/pull/13031 > [6] https://github.com/nielsdos/wpt/tree/master/dom/php-out (yes, this is > a dirty port) > Thank you for the work! I agree that making them proper classes instead of aliases is the better proposition here. I'm not fully informed about the DOM spec, and I don't know if the current class/interface hierarchy is in the best shape, but maybe we should also consider having a look a this? About making those new classes finals, this would require reconsidering the class hierarchy anyway, as nearly everything inherits from DOMNode, and other classes (namely Comment/Text/CData nodes) extend other classes. However, I would not necessarily be against it, especially if we add the required interfaces, as the current mechanism of registering a custom class is not very powerful and rather cumbersome to use as the constructor is never called. As such, I'm not sure if I would support adding the current mechanism to customize the node classes returned by the extension. Indeed, the current mechanism doesn't play nicely at all with static analysis and this is something I stopped trying to integrate when writing my DocBook renderer project. [1] Best regards, Gina P. Banyard [1] https://gitlab.com/Girgias/docbook-renderer --000000000000331731060da705da--