Newsgroups: php.internals
From: cmbecker69@gmx.de ("Christoph M. Becker")
Date: Sat, 24 Aug 2024 14:47:43 +0200
Subject: Re: [PHP-DEV] [RFC] Decoding HTML and the Ambiguous Ampersand
To: Dennis Snell, Niels Dossche
Cc: Internals

On 23.08.2024 at 01:02, Dennis Snell wrote:

>> If we could have a single implementation, that would be great. I do understand of course your concern that DOM is not a required extension, and therefore basing the internals on Lexbor makes it tied to the DOM extension which may not be available. I however suspect that a large chunk of people needing a function like this have DOM available (as DOM is required by many HTML-processing-related packages). I can also look into it sometime soon if you want; anyway feel free to ping me.
>
> I’m also very open to lexbor-based approaches but I’ve so far found it more complicated than I expected. In some part this is because it involves setting up the parser and state machine for the HTML specification and much of the actual decoding can be safely done without this.
>
> The other part is the extension aspect. I hear you, that you would expect calling code to have the DOM extensions available, but that’s simply not the case when developing a platform like WordPress, which I do. We don’t have control over the servers or environments where people are deploying this, and the availability of the DOM extensions is low enough that WordPress code simply cannot use `DOMDocument` (even though it shouldn’t because of the wild problems that has for attempting to parse HTML).
>
> People resort to `html_entity_decode()` because that’s the only option. In WordPress we now have a spec-compliant decoder, but as it’s in user-space PHP its performance is far below what’s possible.
>
> I’d love your help in setting up lexbor’s state machine to decode text nodes. I’d love it even more if this could be part of the PHP language. It constantly surprises me that _the language of the web_ (PHP) doesn’t have the tools to speak _the language of the web_ (HTML). This RFC is all about taking a step towards ensuring that PHP developers can rely on PHP to be a reliable middle-man between the HTML domain and the PHP domain.
>
> In other words, requiring the DOM extension or `DOM\HtmlDocument` would be such a non-starter for WordPress (accounting for 43% of the web today) that it would be completely unavailable.

Well, I don't think it would be a big deal to move the bundled lexbor to somewhere where it is always available. I mean, so far it's only used by ext/dom, so it's bundled there, but if other parts of the php-src code base were to use it, we could put it elsewhere.

Christoph