Newsgroups: php.internals
From: internals@lists.php.net ("Dennis Snell via internals")
Reply-To: Dennis Snell
To: Niels Dossche
Cc: PHP Internals
Date: Mon, 4 Sep 2023 12:54:47 -0700
Subject: Re: [RFC] [Discussion] DOM HTML5 parsing and serialization support
Message-ID: <3B406FC1-DEC9-4CB3-80F7-CB90B2F5AA71@automattic.com>

Thanks for the proposal Niels,

I’ve dealt with my own grief working through issues in DOMDocument and wanting it to work but finding it inadequate.

> HTML5

This would be a great starting point; I would love it if we took the opportunity to fix named character reference decoding, as PHP has (to my knowledge) never respected (at least in HTML5) that they decode differently inside attributes than they do inside markup, considering rules such as the ambiguous ampersand and decode errors.

It’s also been frustrating that DOMDocument parses tags in RCDATA sections where they don’t exist, such as in TITLE or TEXTAREA elements; escapes certain types of invalid comments so that they appear rendered in the saved document; and misses basic semantic rules (e.g. creating a BUTTON element as a child of a BUTTON element instead of closing out the already-open BUTTON).
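
To make the attribute-versus-text difference in character reference decoding concrete, here is a minimal sketch. It is written in Python only because its standard library already ships the HTML5 named-reference table (`html.entities.html5`) and a text-context decoder (`html.unescape`); it is not PHP's or DOMDocument's implementation. The function `decode_named_refs_in_attribute` is a hypothetical helper that applies the one extra rule the HTML5 specification gives for attribute values: a named reference without a trailing semicolon is left untouched when the next character is `=` or an ASCII alphanumeric. Numeric references are deliberately left alone to keep the sketch short.

```python
from html import unescape
from html.entities import html5  # maps names such as "not" and "not;" to characters

MAX_NAME_LENGTH = 32  # assumption: same cap on name length that html.unescape scans for


def decode_named_refs_in_attribute(raw: str) -> str:
    """Decode named character references using the attribute-value rule (sketch only)."""
    out = []
    i = 0
    while i < len(raw):
        if raw[i] != "&":
            out.append(raw[i])
            i += 1
            continue

        # Find the longest name in the HTML5 table that starts right after "&".
        name = None
        for end in range(min(len(raw), i + 1 + MAX_NAME_LENGTH), i + 1, -1):
            if raw[i + 1:end] in html5:
                name = raw[i + 1:end]
                break

        if name is None:
            out.append("&")  # not a known named reference: keep the ampersand as text
            i += 1
            continue

        after = i + 1 + len(name)
        next_char = raw[after:after + 1]
        missing_semicolon = not name.endswith(";")
        if missing_semicolon and (
            next_char == "=" or (next_char.isascii() and next_char.isalnum())
        ):
            out.append("&" + name)  # attribute-only rule: leave the text as written
        else:
            out.append(html5[name])
        i = after
    return "".join(out)


url = "?a=1&notit=2"
print(unescape(url))                        # text rules: "&not" is decoded
print(decode_named_refs_in_attribute(url))  # attribute rule: the string is unchanged
```

This is why a query string such as `?a=1&notit=2` survives intact inside an `href` attribute but is altered when the same characters are decoded with the text rules.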

I’d like to share some of what a few of us have been working on inside WordPress, which is to build a conformant streaming HTML5 parser:

 - https://developer.wordpress.org/reference/classes/wp_html_tag_processor/
 - https://make.wordpress.org/core/2023/08/19/progress-report-html-api/

It’s just food for thought right now because adding HTML5 support to DOMDocument would benefit everyone, but we decided we had a common need in PHP to work with HTML not in a DOM, but in a streaming fashion, one with very little runtime overhead. My long-term plan has been to get a good grasp of the interface needs, thoroughly test it within the WordPress community, and then propose its inclusion into PHP. It’s been incredibly handy so far, and on my laptop runs at around 20 MB/s, which is not great, but good enough for many needs. My naive C port runs on the same laptop at around 80 MB/s, and I believe that we could likely triple or quadruple that speed again if any of us working on it knew how to take advantage of SIMD intrinsics.

It tries to accomplish a few goals:
 - be fast enough
 - interpret HTML as an HTML5-compliant browser will
 - find specific locations within an HTML document and then read or modify them
 - pass through any invalid HTML it encounters for the browser to resolve/fix unless modifying the part of the document containing those invalid constructions

I only bring up this different interface because once we started digging deep into DOMDocument we found that the problems with it were far from superficial: there is a host of problems, and the interface is mismatched to our common needs. It has surprised me that PHP, the language of the web, has had such trouble handling HTML, the language of the web, and we wanted to completely resolve this issue once and for all within WordPress so we can clean up decades-old problems with encoding, decoding, security, and sanitization.
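
The "find a location, modify it, pass everything else through" goals listed above can be illustrated with a small sketch. This is not the WordPress HTML API. It is Python, using the standard library's `html.parser.HTMLParser` as a stand-in scanner (which is not a conformant HTML5 parser), and the names `FirstTagFinder` and `add_attribute_to_first` are hypothetical. The point is only the shape of the approach: scan once, remember an offset, and splice the change into the original string so that the rest of the document, including malformed markup, is never re-serialized.

```python
from html import escape
from html.parser import HTMLParser


class FirstTagFinder(HTMLParser):
    """Scans a document and remembers where the first matching start tag begins."""

    def __init__(self, source: str, tag_name: str):
        super().__init__()
        self.tag_name = tag_name.lower()
        self.start = None
        # Offsets at which each line begins, to turn getpos() into a string index.
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(source) if ch == "\n"]
        self.feed(source)
        self.close()

    def handle_starttag(self, tag, attrs):
        if self.start is None and tag == self.tag_name:
            line, column = self.getpos()  # position of the "<" that opens this tag
            self.start = self.line_starts[line - 1] + column


def add_attribute_to_first(source: str, tag_name: str, name: str, value: str) -> str:
    """Insert name="value" into the first <tag_name>, leaving all other text as-is."""
    finder = FirstTagFinder(source, tag_name)
    if finder.start is None:
        return source  # nothing to modify: pass the document through unchanged
    insert_at = finder.start + 1 + len(tag_name)  # just past "<" and the tag name
    attribute = f' {name}="{escape(value, quote=True)}"'
    return source[:insert_at] + attribute + source[insert_at:]


# The unclosed <p> and <b> tags are left exactly as written.
broken = '<p>unclosed <b>bold\n<img src="a.png"><p>next'
print(add_attribute_to_first(broken, "img", "loading", "lazy"))
```

The sketch ignores cases a real implementation has to handle, such as an attribute of the same name already being present on the tag.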

Warmly,
Dennis Snell

> On Sep 2, 2023, at 12:41 PM, Niels Dossche wrote:
>
> I'm opening the discussion for my RFC "DOM HTML5 parsing and serialization support".
> https://wiki.php.net/rfc/domdocument_html5_parser
>
> Kind regards
> Niels