Newsgroups: php.internals
Content-Type: text/plain;
	charset=utf-8
Precedence: bulk
list-help: <mailto:internals+help@lists.php.net>
list-unsubscribe: <mailto:internals+unsubscribe@lists.php.net>
list-post: <mailto:internals@lists.php.net>
List-Id: internals.lists.php.net
x-ms-reactions: disallow
Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3776.700.51\))
Subject: Re: [PHP-DEV] [RFC] Decoding HTML and the Ambiguous Ampersand
In-Reply-To: <48fa132e-3511-4503-8523-b59972bcfd53@gmx.de>
Date: Sat, 24 Aug 2024 15:34:40 -0500
Cc: Niels Dossche <dossche.niels@gmail.com>,
 Internals <internals@lists.php.net>
Content-Transfer-Encoding: quoted-printable
Message-ID: <87F1748F-BE78-4B3E-989C-293A8F5CE2E0@automattic.com>
References: <CAA8B983-FD39-4AB1-B314-62608B0C0EF4@a8c.com>
 <76D9E1DA-57CE-45C3-8E3E-B08A0B70FB60@a8c.com>
 <efaf4c62-a552-4232-8a22-410578c13b8d@gmail.com>
 <7ED2EE07-D7C6-43A4-A4E1-E9928E8B8D31@automattic.com>
 <48fa132e-3511-4503-8523-b59972bcfd53@gmx.de>
To: "Christoph M. Becker" <cmbecker69@gmx.de>
X-Mailer: Apple Mail (2.3776.700.51)
From: dennis.snell@automattic.com (Dennis Snell)



> On Aug 24, 2024, at 7:47 AM, Christoph M. Becker <cmbecker69@gmx.de> wrote:
>
> On 23.08.2024 at 01:02, Dennis Snell wrote:
>
>>> If we could have a single implementation, that would be great. I do understand of course your concern that DOM is not a required extension, and therefore basing the internals on Lexbor makes it tied to the DOM extension which may not be available. I however suspect that a large chunk of people needing a function like this have DOM available (as DOM is required by many HTML-processing-related packages). I can also look into it sometime soon if you want; anyway feel free to ping me.
>>
>> I’m also very open to lexbor-based approaches but I’ve so-far found it more complicated than I expected. In some part this is because it involves setting up the parser and state machine for the HTML specification and much of the actual decoding can be safely done without this.
>>
>> The other part is the extension aspect. I hear you, that you would expect calling code to have the DOM extensions available, but that’s simply not the case when developing a platform like WordPress, which I do. We don’t have control over the servers or environments where people are deploying this, and the availability of the DOM extensions is low enough that WordPress code simply cannot use `DOMDocument` (even though it shouldn’t because of the wild problems that has for attempting to parse HTML).
>>
>> People resort to `html_entity_decode()` because that’s the only option. In WordPress we now have a spec-compliant decoder, but as it’s in user-space PHP its performance is far below what’s possible.
>>
>> I’d love your help in setting up lexbor’s state machine to decode text nodes. I’d love it even more if this could be part of the PHP language. It constantly surprises me that _the language of the web_ (PHP) doesn’t have the tools to speak _the language of the web_ (HTML). This RFC is all about taking a step towards ensuring that PHP developers can rely on PHP to be a reliable middle-man between the HTML domain and the PHP domain.
>>
>> In other words, requiring the DOM extension or `DOM\HtmlDocument` would be such a non-starter for WordPress (accounting for 43% of the web today) that it would be completely unavailable.
>
> Well, I don't think it would be a big deal to move the bundled lexbor to
> somewhere where it is always available.  I mean, so far it's only used
> by ext/dom so it's bundled there, but if other parts of the php-src code
> base would use it, we could put it elsewhere.

Having a DOM parser for HTML in PHP itself without requiring an extension would open up many new possibilities. For example, WordPress test suites don’t have any functional “assertEquivalentMarkup()” functions because there’s no spec-compliant parser in PHP. We finally wrote our own user-space HTML parser, but relying on `DOM\HtmlDocument` would be much easier.
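
To illustrate, here is a minimal sketch of what such a comparison helper might look like. It assumes PHP 8.4 or newer with ext/dom loaded, so that `Dom\HTMLDocument` exists. The function names `normalize_markup()` and `markup_is_equivalent()` are hypothetical and are not part of WordPress or PHP. It only compares the re-serialized trees, so it would not treat reordered attributes as equivalent.

```php
<?php
// Hypothetical helpers for illustration only; not WordPress's test API.
// Assumes PHP 8.4+ with ext/dom, which provides Dom\HTMLDocument.

function normalize_markup(string $html): string
{
    // LIBXML_NOERROR suppresses parse warnings, e.g. for input
    // that has no doctype or relies on implied tags.
    $doc = Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR);

    // Serializing the parsed tree normalizes tag-name case,
    // attribute quoting, and implied end tags.
    return $doc->saveHtml();
}

function markup_is_equivalent(string $expected, string $actual): bool
{
    // Two inputs count as equivalent when they parse to trees
    // that serialize to the same string.
    return normalize_markup($expected) === normalize_markup($actual);
}
```

A test could then call `markup_is_equivalent('<P class=a>Hi', '<p class="a">Hi</p>')` and expect the two inputs to compare as equal, since both should parse to the same tree.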

These test suites need to run on a variety of environments and PHP versions, so there’s little point in thinking we could adopt a native class quickly. But if it remains locked inside an optional extension, it may be borderline impossible to ever migrate to it.

>
> Christoph
>

Dennis Snell