Message-ID: <487c61eb-1a4b-4698-a892-b3f69155feca@gmail.com>
Date: Mon, 4 Sep 2023 22:15:59 +0200
From: dossche.niels@gmail.com (Niels Dossche)
To: Dennis Snell
Cc: PHP Internals
Subject: Re: [RFC] [Discussion] DOM HTML5 parsing and serialization support
In-Reply-To: <3B406FC1-DEC9-4CB3-80F7-CB90B2F5AA71@automattic.com>

Hi Dennis

On 04/09/2023 21:54, Dennis Snell wrote:
> Thanks for the proposal Niels,
>
> I’ve dealt with my own grief working through issues in DOMDocument and wanting it to work but finding it inadequate.
>
>> HTML5
>
> This would be a great starting point; I would love it if we took the opportunity to fix named character reference decoding, as PHP has (to my knowledge) never respected (at least in HTML5) that they decode differently inside attributes as they do inside markup, considering rules such as the ambiguous ampersand and decode errors.
>
> It’s also been frustrating that DOMDocument parses tags in RCDATA sections where they don’t exist, such as in TITLE or TEXTAREA elements, escapes certain types of invalid comments so that they appear rendered in the saved document, and misses basic semantic rules (e.g. creating a BUTTON element as a child of a BUTTON element instead of closing out the already-open BUTTON).

With this proposal, which adds a real HTML5 parser, the problems mentioned above will fortunately be a thing of the past :)

> I’d like to share some of what a few of us have been working on inside WordPress, which is to build a conformant streaming HTML5 parser:
>  - https://developer.wordpress.org/reference/classes/wp_html_tag_processor/
>  - https://make.wordpress.org/core/2023/08/19/progress-report-html-api/
>
> It’s just food for thought right now because adding HTML5 support to DOMDocument would benefit everyone, but we decided we had a common need in PHP to work with HTML not in a DOM, but in a streaming fashion, one with very little runtime overhead. My long-term plan has been to get a good grasp of the interface needs and thoroughly test it within the WordPress community and then propose its inclusion into PHP. It’s been incredibly handy so far, and on my laptop runs at around 20 MB/s, which is not great, but good enough for many needs. My naive C port runs on the same laptop at around 80 MB/s and I believe that we can likely triple or quadruple that speed again if any of us working on it knew how to take advantage of SIMD intrinsics.
>
> It tries to accomplish a few goals:
>  - be fast enough
>  - interpret HTML as an HTML5-compliant browser will
>  - find specific locations within an HTML document and then read or modify them
>  - pass through any invalid HTML it encounters for the browser to resolve/fix unless modifying the part of the document containing those invalid constructions
>

I saw someone link this on Reddit today; it's a really nice project! It reminds me of Cloudflare's lol-html, which is also a streaming parser used to modify and sanitize documents linearly. I believe this could be a great complementary addition, since it solves a different problem than the one ext/dom solves.

> I only bring up this different interface because once we started digging deep into DOMDocument we found that the problems with it were far from superficial; that there is a host of problems and a mismatched interface to our common needs. It has surprised me that PHP, the language of the web, has had such trouble handling HTML, the language of the web, and we wanted to completely resolve this issue once and for all within WordPress so we can clean up decades-old problems with encoding, decoding, security, and sanitization.

Yes, I was also quite surprised by the lack of support for modern web features, and by the problems with spec compliance. I only recently got into maintaining ext/dom, so there's still a lot of work to do. I already started adding more DOM APIs in the 8.3 release cycle and plan to continue that effort in 8.4. Another major project I want to do for 8.4, besides HTML5 support, is fixing the spec compliance issues in an opt-in manner. This would help with the security and sanitization problems (HTML5 should help with the encoding and decoding).

>
> Warmly,
> Dennis Snell

Kind regards
Niels

>
>> On Sep 2, 2023, at 12:41 PM, Niels Dossche wrote:
>>
>> I'm opening the discussion for my RFC "DOM HTML5 parsing and serialization support".
>> https://wiki.php.net/rfc/domdocument_html5_parser
>>
>> Kind regards
>> Niels
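
To illustrate the point raised in the thread that named character references decode differently inside attributes than inside text, here is a minimal sketch. It is written in Python rather than PHP, and it is not the RFC's implementation or any ext/dom API. The function names `decode_text` and `decode_attribute` are hypothetical, numeric references are left out of the attribute path for brevity, and the attribute rule is a simplified reading of the HTML Standard's "named character reference state" (a reference without a trailing semicolon is left alone in an attribute when the next character is `=` or an ASCII alphanumeric).

```python
import html
import re
from html.entities import html5  # name -> replacement, e.g. "not" and "not;"

# Hypothetical helper names; this is an illustration, not a library API.
_NAMED_REF = re.compile(r"&([A-Za-z0-9]+;?)")


def decode_text(data: str) -> str:
    """Decode references as in ordinary text content (HTML5 rules)."""
    return html.unescape(data)


def decode_attribute(value: str) -> str:
    """Decode named references as in an attribute value (simplified)."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        # Longest-prefix match against the named character reference table.
        for end in range(len(name), 0, -1):
            candidate = name[:end]
            if candidate not in html5:
                continue
            rest = name[end:]
            if not candidate.endswith(";"):
                # Character that follows the matched reference.
                nxt = rest[:1] or value[match.end():match.end() + 1]
                if nxt == "=" or (nxt.isascii() and nxt.isalnum()):
                    # Historical rule: leave the text untouched in attributes.
                    return match.group(0)
            return html5[candidate] + rest
        return match.group(0)  # not a known reference

    return _NAMED_REF.sub(replace, value)


print(decode_text("&notit;"))           # ¬it;
print(decode_attribute("&notit;"))      # &notit;
print(decode_text("?a=1&copy=2"))       # ?a=1©=2
print(decode_attribute("?a=1&copy=2"))  # ?a=1&copy=2
```

The second pair is the practical case: a query string such as `?a=1&copy=2` inside an `href` keeps its `&copy` parameter, whereas the same characters in text content turn into a copyright sign.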