Newsgroups: php.internals
From: Dennis Snell <dennis.snell@automattic.com>
To: "Christoph M. Becker" <cmbecker69@gmx.de>
Cc: Niels Dossche <dossche.niels@gmail.com>, Internals <internals@lists.php.net>
Subject: Re: [PHP-DEV] [RFC] Decoding HTML and the Ambiguous Ampersand
Date: Sat, 24 Aug 2024 15:34:40 -0500
Message-ID: <87F1748F-BE78-4B3E-989C-293A8F5CE2E0@automattic.com>
In-Reply-To: <48fa132e-3511-4503-8523-b59972bcfd53@gmx.de>

> On Aug 24, 2024, at 7:47 AM, Christoph M. Becker <cmbecker69@gmx.de> wrote:
>
> On 23.08.2024 at 01:02, Dennis Snell wrote:
>
>>> If we could have a single implementation, that would be great. I do understand of course your concern that DOM is not a required extension, and therefore basing the internals on Lexbor makes it tied to the DOM extension which may not be available. I however suspect that a large chunk of people needing a function like this have DOM available (as DOM is required by many HTML-processing-related packages).
>>> I can also look into it sometime soon if you want; anyway feel free to ping me.
>>
>> I’m also very open to lexbor-based approaches but I’ve so-far found it more complicated than I expected. In some part this is because it involves setting up the parser and state machine for the HTML specification and much of the actual decoding can be safely done without this.
>>
>> The other part is the extension aspect. I hear you, that you would expect calling code to have the DOM extensions available, but that’s simply not the case when developing a platform like WordPress, which I do. We don’t have control over the servers or environments where people are deploying this, and the availability of the DOM extensions is low enough that WordPress code simply cannot use `DOMDocument` (even though it shouldn’t because of the wild problems that has for attempting to parse HTML).
>>
>> People resort to `html_entity_decode()` because that’s the only option. In WordPress we now have a spec-compliant decoder, but as it’s in user-space PHP its performance is far below what’s possible.
>>
>> I’d love your help in setting up lexbor’s state machine to decode text nodes. I’d love it even more if this could be part of the PHP language. It constantly surprises me that _the language of the web_ (PHP) doesn’t have the tools to speak _the language of the web_ (HTML). This RFC is all about taking a step towards ensuring that PHP developers can rely on PHP to be a reliable middle-man between the HTML domain and the PHP domain.
>>
>> In other words, requiring the DOM extension or `DOM\HtmlDocument` would be such a non-starter for WordPress (accounting for 43% of the web today) that it would be completely unavailable.
>
> Well, I don't think it would be a big deal to move the bundled lexbor to
> somewhere where it is always available.
> I mean, so far it's only used
> by ext/dom so it's bundled there, but if other parts of the php-src code
> base would use it, we could put it elsewhere.

Having a DOM parser for HTML in PHP itself without requiring an extension would open up many new possibilities. For example, WordPress test suites don’t have any functional “assertEquivalentMarkup()” functions because there’s no spec-compliant parser in PHP. We finally wrote our own user-space HTML parser, but relying on `DOM\HtmlDocument` would be much easier.

These test suites need to run on a variety of environments and PHP versions, so it’s moot thinking we could hasten the use of a native class to get the job done, but if it remains locked inside an optional extension, it may be borderline impossible to ever migrate to it.

>
> Christoph
>

Dennis Snell