Newsgroups: php.internals Path: news.php.net Xref: news.php.net php.internals:125215 X-Original-To: internals@lists.php.net Delivered-To: internals@lists.php.net Received: from php-smtp4.php.net (php-smtp4.php.net [45.112.84.5]) by qa.php.net (Postfix) with ESMTPS id B06931A00BD for ; Sun, 25 Aug 2024 15:25:22 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=php.net; s=mail; t=1724599634; bh=PAOLIAHL+vU/u9OB5jhQCynsJiQB2R7PEl2GSJp3oZU=; h=From:Subject:Date:In-Reply-To:Cc:To:References:From; b=TNWSH0TJH6cW1tB+Mh2g1uni/egLvTBFFr+HR3abq1P0AZjsGNVhdrO9TbFJcnpsn jAuQeGa2snoTRsFPc/oWmOfwUrBJ+pM/kjOClaZ1k2cyJXnrknMO8i11qUzzdAH03e Rc/RRaD2GrbRlFy1DhSQ+Mc6tfgZ0E/Jv+C7ANYDinJ+YZvdl5xrGXsGAxs3tZZKIi TnX6xzDxDRyqASZEbEqClXBqMEpUpqLlYmPuFhVh/qwlq3VClzQDnfkJ4Qw8Z15gH7 My6S5NoZnNPJvmxg+MELSVCy8SKfkUqkVek9KiTiqfDnicmYQw9zan6+RrdGePNBzq 6kdGbhPVGym9A== Received: from php-smtp4.php.net (localhost [127.0.0.1]) by php-smtp4.php.net (Postfix) with ESMTP id 00D1B18006D for ; Sun, 25 Aug 2024 15:27:14 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on php-smtp4.php.net X-Spam-Level: X-Spam-Status: No, score=0.6 required=5.0 tests=BAYES_50,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,DMARC_PASS,HTML_MESSAGE, SPF_HELO_NONE,SPF_PASS autolearn=no autolearn_force=no version=4.0.0 X-Spam-Virus: No X-Envelope-From: Received: from mx1.dfw.automattic.com (mx1.dfw.automattic.com [192.0.84.151]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by php-smtp4.php.net (Postfix) with ESMTPS for ; Sun, 25 Aug 2024 15:27:13 +0000 (UTC) Received: from localhost (localhost.localdomain [127.0.0.1]) by mx1.dfw.automattic.com (Postfix) with ESMTP id D8D6634059A for ; Sun, 25 Aug 2024 15:25:20 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=automattic.com; h=x-mailer:references:in-reply-to:date:date:subject:subject 
:mime-version:content-type:content-type:message-id:from:from :received:received:received:received:received:received; s= automattic1; t=1724599520; bh=PAOLIAHL+vU/u9OB5jhQCynsJiQB2R7PEl 2GSJp3oZU=; b=hdZPmsJZyyF3hIWbjhAs9NvCEPceFVV3xEvHl0fdh9sub5FGqV /0ceLWSFSbU86fhnniIhOY9qx0n8Al+xmBfIXPqqLqoARxBk9ockTkv7Rgxote1h sJq2AgLczlUzhpMRuroLMTfiYHV/ECafX1ndqSn2JpKYzLe9yOtoAVzBXgQRBN8k F+7OLYKgr/CKHJcMSO6Qp/x6ZpSg6GdJl2NiwkI+E7ZE2p1dCSaToEtw3HaZRvJ4 aYseFc+NsonpyiBxxSb2bfqzKedDswCKp5oYd3SijPT4iw1o/iOXqZ5e5+tRHa6i Tt4mLxQkDASdeN8YSj7cgTyNmZ3MZIUUlzcg== X-Virus-Scanned: Debian amavisd-new at wordpress.com Received: from mx1.dfw.automattic.com ([127.0.0.1]) by localhost (mx1.dfw.automattic.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id eOneB1g6ob5n for ; Sun, 25 Aug 2024 15:25:20 +0000 (UTC) Received: from smtp-gw.dca.automattic.com (smtp-gw.dca.automattic.com [192.0.97.210]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx1.dfw.automattic.com (Postfix) with ESMTPS id 5F4C934053C for ; Sun, 25 Aug 2024 15:25:20 +0000 (UTC) Authentication-Results: mail.automattic.com; dkim=pass (2048-bit key; unprotected) header.d=automattic.com header.i=@automattic.com header.b="JagZ9Tk2"; dkim=pass (2048-bit key; unprotected) header.d=automattic.com header.i=@automattic.com header.b="MUpQdVQd"; dkim=fail reason="signature verification failed" (2048-bit key) header.d=automattic.com header.i=@automattic.com header.b="fWNCibN6"; dkim-atps=neutral Received: from smtp-gw.dca.automattic.com (localhost.localdomain [127.0.0.1]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by smtp-gw.dca.automattic.com (Postfix) with ESMTPS id 023BEA0959 for ; Sun, 25 Aug 2024 15:25:20 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; 
c=relaxed/simple; d=automattic.com; s=automattic2; t=1724599520; bh=PAOLIAHL+vU/u9OB5jhQCynsJiQB2R7PEl2GSJp3oZU=; h=From:Subject:Date:In-Reply-To:Cc:To:References:From; b=JagZ9Tk205BBSI4my19V+UvljkSIh7ksYPE2q9PqTQqo1bdUt8XRMB6HFvc81KL2B 1k3+I97WHPNZEgqkXFlZtjvppNblHmVobFSLVkl1+pjsSvhmpOaI+43hMElnFOAXfP bG0TEEmgvaEdhKPvYHasCtGqWVt8g6Zp8Ir9jI17Xk2yT2tdMsC53QV7eHDbNJVGAd N8/RbJ/8sCeToH71VvF93qLiW0sDg3VQUAoHIqbza1Ry1We/U7E6pD0uBn4zREz5Jh 8R8DrWBshziXt7ohmq33y7Od37hYKMUz2yaU054/K/wp/6qJUKIvWYONAvx3GHzTMk 3tcBmYrN7L6PQ== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=automattic.com; s=automattic1; t=1724599520; bh=PAOLIAHL+vU/u9OB5jhQCynsJiQB2R7PEl2GSJp3oZU=; h=From:Subject:Date:In-Reply-To:Cc:To:References:From; b=MUpQdVQdGUclH96RZtWxtNALne1bxh02Yqc8RaK6XoEM3ylfzlZ7tJ0bzG10hTeXL 0xMV1vcw0/1qDjI3bofrRWRVqpt5a6BrcniPaysUxJkvcyEECvis7Likmg9HgjXf++ 9LBu7B6IgHIYQPK4rQLq0afb6zaIXVVjFE0OuKEbhQ1ruUAm3wuGV+9w3dijYvlgDi e7QG+Jijmr9g2Ese4eCObKGZ3eqdkj7ECdpSsm5yUWEUp2EUus0/9y5lVKZDRlLgBZ wa3E1hi2g0ne0TXYQpHkmYlUz06wVpbLUUci4i/R0vO6USAiLPgG1n7NY6OYx69ffu X9M4NwaRSk6rA== Received: from mail-io1-f69.google.com (mail-io1-f69.google.com [209.85.166.69]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by smtp-gw.dca.automattic.com (Postfix) with ESMTPS id D278FA0713 for ; Sun, 25 Aug 2024 15:25:19 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=automattic.com; s=automattic2; t=1724599519; bh=PAOLIAHL+vU/u9OB5jhQCynsJiQB2R7PEl2GSJp3oZU=; h=From:Subject:Date:In-Reply-To:Cc:To:References:From; b=fWNCibN6JsMTTI3K61hXCc+OExYub+ykFoPFwPIHHlgq0uaHOnh93MOP21WksWAfM dvFq6O2Jr3kePA8DthhQJgnOoKHpDBTN/Qf0BZo9iyi2JtDjw9DyG8Teicf0o2FwSG N+0hKobes0xOi1X4rEkCz6pk+E2wu4w1Seue4QdtSRn+ZJlPc8mepBeXawlg7KySwu 9hA903qrwagzgeQ1G0b1pmQKI8txoxVPtjQDtPqsPhldOZoXVzIzK1MjzbuCq+A57N c8yOlpC+v6w7T8K/t8y8Mb/9By7JyPonzGx5W75ZpASdmBrIYnlVGe0p7RpZsltT2s 
c13bARi+BosSw== Received: by mail-io1-f69.google.com with SMTP id ca18e2360f4ac-81f99a9111fso371451539f.2 for ; Sun, 25 Aug 2024 08:25:19 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1724599519; x=1725204319; h=references:to:cc:in-reply-to:date:subject:mime-version:message-id :from:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=oiU56f4MKGiA14y7w7q4E1p7T7qEoGtVnzeUcts//7Q=; b=E1FWZ2/OUUDq5HChIEb94IkrVDHih7wn7iFJF8XIyNpmLgG12BJjiwBwu8a+B0dJJ+ +TOlPSWx2APemm0bamOSwyg35K2mOTSB2mzU7bxUpxLvt9R6DpcaKVYrVKQTfTBo4Bsa kiety2j+duMW2dNuOJA2+kjv3ZS1BviBwqcXl037+TR9E5hIPnnY3u8iGKrIQtu8CNgV oJ3Fu2TS8GSVY6LsrQACpXSvE/UwvFWAC17w5SBtaSNzHp0LDG13tgcXyZ1NT0AnhZ/2 hm/tfwvznU+ql3Ew3PnvA/YT+Bk+DLMBZwu1uw5DLwHCoEHb1XBoSbA9hwCwBC7PG8Xi oY/g== X-Forwarded-Encrypted: i=1; AJvYcCWweNcD9mDKStvZz96Gv+Ve4at5gzRQClQB4GcTwXP+aIpra8xQey6RD3awzQKKVgWLcDgX5GekRxI=@lists.php.net X-Gm-Message-State: AOJu0Yzn/hPK+PqUiLQp/CgvOFar4Ooq00kLfT1etF79SfqoDRAUXHnF Bmn9pyYE9Zm9+G4Cdfl8tEhKqohssdRWGVsQ1igabb1KoDTMKk1SYXO6XgcE2cYanD5DshP02K+ DA82WiCDQv9oPmi5rngYIphq5hi+XhjFQovPfMxxeHiHKFDKNjSXBHyE= X-Received: by 2002:a05:6602:1404:b0:822:2584:2f73 with SMTP id ca18e2360f4ac-827881aec80mr1154346339f.11.1724599519239; Sun, 25 Aug 2024 08:25:19 -0700 (PDT) X-Google-Smtp-Source: AGHT+IHHMkqHE+UKufGmOutsveYV4hpqm3nzN60DpyYXjFvnSz9pqVpJ8eHH7y1vNwI+zMiIXryySg== X-Received: by 2002:a05:6602:1404:b0:822:2584:2f73 with SMTP id ca18e2360f4ac-827881aec80mr1154343839f.11.1724599518721; Sun, 25 Aug 2024 08:25:18 -0700 (PDT) Received: from smtpclient.apple (ip70-171-161-83.om.om.cox.net. 
[70.171.161.83]) by smtp.gmail.com with ESMTPSA id 8926c6da1cb9f-4ce71152fc8sm1779818173.174.2024.08.25.08.25.18 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Sun, 25 Aug 2024 08:25:18 -0700 (PDT) X-Google-Original-From: Dennis Snell Message-ID: Content-Type: multipart/alternative; boundary="Apple-Mail=_B276F957-ED5F-4B85-84F3-3DD52172687D" Precedence: bulk list-help: list-post: List-Id: internals.lists.php.net x-ms-reactions: disallow Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3776.700.51\)) Subject: Re: [PHP-DEV] [RFC] Decoding HTML and the Ambiguous Ampersand Date: Sun, 25 Aug 2024 10:25:07 -0500 In-Reply-To: Cc: Niels Dossche , Internals To: Jakob Givoni References: <76D9E1DA-57CE-45C3-8E3E-B08A0B70FB60@a8c.com> <7ED2EE07-D7C6-43A4-A4E1-E9928E8B8D31@automattic.com> <90B08F35-06D5-4D5C-BA7B-B7116EE18769@automattic.com> X-Mailer: Apple Mail (2.3776.700.51) From: dennis.snell@automattic.com (Dennis Snell) --Apple-Mail=_B276F957-ED5F-4B85-84F3-3DD52172687D Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=utf-8 > On Aug 25, 2024, at 3:15=E2=80=AFAM, Jakob Givoni = wrote: >=20 >=20 > On Sat, Aug 24, 2024 at 10:31=E2=80=AFPM Dennis Snell = > = wrote: >> On Aug 24, 2024, at 2:56=E2=80=AFPM, Jakob Givoni > wrote: >>>=20 >>> Hi Dennis, >>>=20 >>> Overall it sounds like a reasonable RFC. >>> =20 >>> > Dennis: >>> > >>> > > Niels: >>> > > >>> > > I'm not so sure that the name "decode_html" is self-descriptive = enough, it sounds very generic. >>> > >>> > The name is not very important to me. For the sake of history, the = reason I have chosen =E2=80=9Cdecode HTML=E2=80=9D is because, unlike an = HTML parser, this is focused on taking a snippet of HTML =E2=80=9Ctext=E2=80= =9D content and decoding it into a =E2=80=9Cplain PHP string.=E2=80=9D >>>=20 >>> Why not make it two methods called "decode_html_text" and = "decode_html_attribute"? >>> Consider the following reasons: >>> 1. 
The function doesn't actually decode html as such, it decodes = either an html text node string or an html attribute string. >>=20 >> Thanks Jakob. In WordPress I did just this. >> https://developer.wordpress.org/reference/classes/wp_html_decoder/ >>=20 >> Part of the reason for that was the inability to require something = like an enum (due to PHP version support requirements). The Enum = solution feels very nice too. >>=20 >>> 2. Saves the $context parameter and the constants/enums, making the = call significantly shorter.=20 >>=20 >> In my PR I=E2=80=99ve actually expanded the Enum to include a few = other contexts. I feel like there=E2=80=99s a balance we have to do if = we want to ride the line between fully reliable and fully convenient. On = one hand, we could say =E2=80=9Cdon=E2=80=99t send the text content of a = SCRIPT element to this function!=E2=80=9D But on the other hand, that = kind of forces people to expect that SCRIPT content is different. >>=20 >> With the Enum there is that in-built training material when someone = looks and finds `Attribute | BodyText | ForeignText | Script | Style` = (the contexts I=E2=80=99ve explored in my PR).=20 >>=20 >> We could make the same argument for `decode_html_script()` and = `decode_foreign_text_node()` and `decode_html_style()`. Somehow the = context feels cleaner to me, and like a single entry point for learning = instead of five. >>=20 >=20 > Yes. With 5 different contexts it's starting to shift in favor of a = single function :-) > I only saw the RFC which from what I can tell still only features 2 of = them. I haven't seen the PR (RFC Implementation section says "Yet to = come").=20 Oops, I=E2=80=99ll get to this! >>> 3. It feels like decoding either text or attribute are two = significantly different things. I admit I could be wrong, if code like = decode_html($e->isAttritbute() ? HtmlContext::Attribute : = HtmlContext::Text, $e->getContent()) is likely to be seen. 
>>=20 >> None of these contexts are significantly different, which is one of = the major dangers of using `html_entity_decode()`. The results will look = just about right most of the time. It=E2=80=99s the subtle differences = that matter most, I suppose. >=20 > Well, that was kind of what I meant - even if the differences are = usually absent or subtle, they are significant (i.e. not necessarily = big, but meaningful), meaning using it wrong would give the wrong = result, right? Saying that they are not significantly different to me = means that the result would just be a little less good sometimes, not = directly wrong. In hindsight I think I misunderstood what you were saying and got it = backwards. I meant that the algorithms are subtly different, but as you = point out, yes, the outcomes can be significant. ln the better cases we = get data corruption, but these do lead to misidentification of unsafe = content. For example, =E2=80=9Cja\x00avascript=E2=80=9D should decode as = =E2=80=9Cjavascript=E2=80=9D when rendered by a browser when found = inside the BODY of a page, but an attribute should read = =E2=80=9Cja=EF=BF=BDvascript.=E2=80=9D >>=20 >> The lesson I have drawn is that people frequently have what they = understand to be a text node or an attribute value, but they aren=E2=80=99= t aware that they are supposed to decode differently, and they also = aren=E2=80=99t reaching to interact with a full parser to get these = values. If PHP could train people as they use these functions, purely = through their interfaces, I think that could help elevate the level of = reliability out there in the wild, as long as they aren=E2=80=99t too = cumbersome (hence explicitly no default context argument _or_ using = separately-named functions). >>=20 >> Having the Enum I think enhances the ease with which people can = reliably also decode things like SCRIPT and STYLE nodes. 
=E2=80=9CI know = `html_decode_text()` but I don=E2=80=99t know what the rules for SCRIPT = are or if they=E2=80=99re different so I=E2=80=99ll just stick with = that.=E2=80=9D vs =E2=80=9CMy IDE suggests that `Script` is a different = context, that=E2=80=99s interesting, I=E2=80=99ll try that and see how = it=E2=80=99s different." >>=20 >=20 > That is a good point and using enums favours that learning push since = they are inherently grouped together. >>>=20 >>> Best, >>> Jakob >>> =20 >>=20 >> Thanks for your input. I=E2=80=99m grateful for the discussions and = that people are sharing. >>=20 >=20 > Cheers! Warmly, Dennis Snell --Apple-Mail=_B276F957-ED5F-4B85-84F3-3DD52172687D Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=utf-8
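
The NUL-byte difference between body text and attribute values mentioned in the message can be sketched in a few lines of PHP. This is only an illustration, assuming PHP 8.1 or later: the names `SketchContext` and `sketch_handle_nul()` are hypothetical and are not the RFC's API, and the sketch models only a literal NUL byte, leaving character references such as `&#x6a` untouched.

```php
<?php
declare(strict_types=1);

// Hypothetical enum for illustration; the RFC's own enum may differ.
enum SketchContext
{
    case BodyText;
    case Attribute;
}

// Hypothetical helper: applies only the NUL-byte rule described in the
// message. It does not decode character references.
function sketch_handle_nul(SketchContext $context, string $text): string
{
    return match ($context) {
        // In body text the NUL byte is dropped.
        SketchContext::BodyText  => str_replace("\x00", '', $text),
        // In an attribute value it becomes U+FFFD REPLACEMENT CHARACTER.
        SketchContext::Attribute => str_replace("\x00", "\u{FFFD}", $text),
    };
}

$raw = "ja\x00vascript";

$asBody      = sketch_handle_nul(SketchContext::BodyText, $raw);
$asAttribute = sketch_handle_nul(SketchContext::Attribute, $raw);
```

A filter that compares the raw bytes against "javascript" sees neither form, yet in body text the decoded result is exactly that string, which is the kind of misidentification the message warns about.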