Newsgroups: php.internals
Xref: news.php.net php.internals:121073
Date: Sat, 16 Sep 2023 01:17:06 +0200
To: PHP Internals
Subject: Re: [RFC] [Discussion] DOM HTML5 parsing and serialization support
From: dossche.niels@gmail.com (Niels Dossche)

On 9/2/23 21:41, Niels Dossche wrote:
> Hello internals
>
> I'm opening the discussion for my RFC "DOM HTML5 parsing and serialization support".
> https://wiki.php.net/rfc/domdocument_html5_parser
>
> Kind regards
> Niels

Hi internals

I'd like to announce a change to the RFC. The new RFC version is 0.5.1, the old one was 0.4.0.
The diff can be viewed via the revision history button on the right of the RFC page.

I had a productive discussion with Tim and Arne about the class hierarchy.
Here's a summary of the changes and the rationale.

Until now, the RFC specified that DOM\HTML5Document extends DOMDocument.
However, as we're introducing a new class anyway, we believe we should take the opportunity to improve the API.
We have the following concerns:
a) It's a bit of an awkward class hierarchy. *If* we hypothetically wanted to get rid of DOMDocument in the far, far future, we couldn't easily do that.
b) The API is messy. Some methods are useless for HTML5Document, e.g. validate(), loadXML(), loadXMLFile(). They can be a source of confusion.
c) The fact that you can pass an HTML5Document to methods accepting DOMDocument may cause surprises when the method relies on DOMDocument-specific behaviour. It would be better if developers could "opt in" to accepting both DOMDocument and HTML5Document in a method by using a common base class.
d) The properties set by DOMDocument's constructor are overridden by the load methods, which is surprising. That's even mentioned in the second top comment on https://www.php.net/manual/en/domdocument.loadxml.php. Furthermore, the XML version argument of the constructor is useless for HTML5 documents.

So we propose the following changes to the RFC.

We'll add a common abstract base class DOM\Document (name taken from the DOM spec & JavaScript world).
DOM\Document contains the properties and abstract methods common to both HTML and XML documents.
Examples of what it includes/excludes:
* includes: firstElementChild, lastElementChild, ...
* excludes: xmlStandalone, xmlVersion, validate(), ...

Then we'll have two subclasses: DOM\HTMLDocument (previously we called this DOM\HTML5Document) and DOM\XMLDocument.
We dropped the 5 from the name to be more resilient to version changes and to match the DOM spec name.
DOMDocument will also use DOM\Document as a base class to make it interchangeable with the new classes.

The above would solve points a, b, and c.

To solve point d, we can use "factory methods":
This means HTMLDocument's constructor will be made private, and instead we'll have three static methods that create a new instance:
- HTMLDocument::fromHTMLString(string $source): HTMLDocument;
- HTMLDocument::fromHTMLFile(string $filename): HTMLDocument;
- HTMLDocument::fromEmptyDocument(string $encoding="UTF-8"): HTMLDocument;

Or to put it in PHP code:

```
namespace DOM {
    // The base abstract document class
    abstract class Document extends Node implements ParentNode {
        /* all properties and methods that are common and sensible for both XML & HTML documents */
    }

    class XMLDocument extends Document {
        /* insert specific XML methods and properties (e.g. xmlVersion, validate(), ...) here */

        private function __construct() {}

        public static function fromEmptyDocument(string $version = "1.0", string $encoding = "UTF-8");
        public static function fromFile(string $path);
        public static function fromString(string $source);
    }

    class HTMLDocument extends Document {
        /* insert specific HTML methods and properties here */

        private function __construct() {}

        public static function fromEmptyDocument(string $encoding = "UTF-8");
        public static function fromFile(string $path);
        public static function fromString(string $source);
    }
}

namespace {
    class DOMDocument extends DOM\Document {
        /* Keep methods, properties, and constructor the same as they are now */
    }
}
```

We're only adding XMLDocument for completeness and API parity.
It's a drop-in replacement for DOMDocument, and behaves exactly the same.
The difference is that the API is on par with HTMLDocument, and the construction is designed to be more misuse-resistant.
DOMDocument will NOT change, and remains for the foreseeable future.

We also have to change the $ownerDocument field in DOMNode to have type ?DOM\Document instead of ?DOMDocument.
The problem is that this breaks BC (but it's only a minor break): https://3v4l.org/El7Ve.
Property types are invariant, so a subclass that redeclares $ownerDocument with the old type will no longer compile.
Overriding properties like this is kind of useless, but if someone does it, the compiler will complain loudly during compilation and it should be easy to fix.

Of course, these changes mean that the discussion period will run a bit longer than originally foreseen.

Kind regards
Niels
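
For illustration only: the factory-method construction proposed for point d, together with the "opt in via a common base class" idea from point c, can be sketched in plain userland PHP. The names SketchDocument, SketchHtmlDocument and describe() are hypothetical stand-ins and not the ext/dom classes from the RFC, and no real HTML parsing happens here. It is a minimal sketch, assuming PHP 8.0 or later.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for a common abstract base class (NOT DOM\Document).
abstract class SketchDocument
{
    abstract public function encoding(): string;
}

// Hypothetical stand-in for an HTML document class (NOT DOM\HTMLDocument).
final class SketchHtmlDocument extends SketchDocument
{
    // Private constructor: callers cannot set properties up front that a
    // later load call would silently override; they must pick a factory.
    private function __construct(
        private string $encoding,
        private string $source,
    ) {}

    public static function fromEmptyDocument(string $encoding = "UTF-8"): self
    {
        return new self($encoding, "");
    }

    public static function fromString(string $source): self
    {
        // Assumption: a real parser would detect the encoding from the
        // input. This sketch simply fixes it to UTF-8.
        return new self("UTF-8", $source);
    }

    public function encoding(): string
    {
        return $this->encoding;
    }

    public function source(): string
    {
        return $this->source;
    }
}

// A function opts in to every document kind by type-hinting the base class.
function describe(SketchDocument $doc): string
{
    return get_class($doc) . " (" . $doc->encoding() . ")";
}

$empty  = SketchHtmlDocument::fromEmptyDocument("ISO-8859-1");
$parsed = SketchHtmlDocument::fromString("<p>Hello</p>");

echo describe($empty), "\n";
echo describe($parsed), "\n";
```

Because the constructor is private, writing `new SketchHtmlDocument(...)` outside the class throws an Error, so an instance can only come into existence through one of the named factories, each of which decides all of the initial state at once.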