Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:125204
Precedence: bulk
MIME-Version: 1.0
References: <CAA8B983-FD39-4AB1-B314-62608B0C0EF4@a8c.com> <76D9E1DA-57CE-45C3-8E3E-B08A0B70FB60@a8c.com>
 <efaf4c62-a552-4232-8a22-410578c13b8d@gmail.com> <7ED2EE07-D7C6-43A4-A4E1-E9928E8B8D31@automattic.com>
 <CAJmjfvXvBpxc-t2SkrK587Cm+=iOY9tNmpt-SC-pgg7pALJ9XA@mail.gmail.com> <90B08F35-06D5-4D5C-BA7B-B7116EE18769@automattic.com>
In-Reply-To: <90B08F35-06D5-4D5C-BA7B-B7116EE18769@automattic.com>
Date: Sun, 25 Aug 2024 10:15:26 +0200
Message-ID: <CAJmjfvWCz3SW858N+wAxGF0mLc1c7r3bR3n8hxLfr+72YuyAEg@mail.gmail.com>
Subject: Re: [PHP-DEV] [RFC] Decoding HTML and the Ambiguous Ampersand
To: Dennis Snell <dennis.snell@automattic.com>
Cc: Niels Dossche <dossche.niels@gmail.com>, Internals <internals@lists.php.net>
Content-Type: multipart/alternative; boundary="000000000000044bd706207d9de4"
From: jakob@givoni.dk (Jakob Givoni)

--000000000000044bd706207d9de4
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Sat, Aug 24, 2024 at 10:31 PM Dennis Snell <dennis.snell@automattic.com>
wrote:

> On Aug 24, 2024, at 2:56 PM, Jakob Givoni <jakob@givoni.dk> wrote:
>
>
> Hi Dennis,
>
> Overall it sounds like a reasonable RFC.
>
> > Dennis:
> >
> > > Niels:
> > >
> > > I'm not so sure that the name "decode_html" is self-descriptive
> > > enough, it sounds very generic.
> >
> > The name is not very important to me. For the sake of history, the
> > reason I have chosen “decode HTML” is because, unlike an HTML parser,
> > this is focused on taking a snippet of HTML “text” content and decoding
> > it into a “plain PHP string.”
>
> Why not make it two methods called "decode_html_text" and
> "decode_html_attribute"?
> Consider the following reasons:
> 1. The function doesn't actually decode html as such, it decodes either an
> html text node string or an html attribute string.
>
>
> Thanks Jakob. In WordPress I did just this.
> https://developer.wordpress.org/reference/classes/wp_html_decoder/
>
> Part of the reason for that was the inability to require something like an
> enum (due to PHP version support requirements). The Enum solution feels
> very nice too.
>
> 2. Saves the $context parameter and the constants/enums, making the call
> significantly shorter.
>
>
> In my PR I’ve actually expanded the Enum to include a few other contexts.
> I feel like there’s a balance we have to strike if we want to ride the line
> between *fully reliable* and *fully convenient*. On one hand, we could
> say “don’t send the text content of a SCRIPT element to this function!” But
> on the other hand, that kind of forces people to expect that SCRIPT content
> is different.
>
> With the Enum there is that in-built training material when someone looks
> and finds `Attribute | BodyText | ForeignText | Script | Style` (the
> contexts I’ve explored in my PR).
>
> We could make the same argument for `decode_html_script()` and
> `decode_foreign_text_node()` and `decode_html_style()`. Somehow the context
> feels cleaner to me, and like a single entry point for learning instead of
> five.
>
>
Yes. With 5 different contexts it's starting to shift in favor of a single
function :-)
I only saw the RFC, which from what I can tell still features only 2 of
them. I haven't seen the PR (the RFC's Implementation section says "Yet to
come").
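
To make the comparison concrete, here is a rough sketch of the two API
shapes being discussed: a single entry point taking a context enum, and
separately named wrappers around it. The enum cases are the five contexts
Dennis listed. The function names `decode_html_sketch()`,
`decode_html_text_sketch()` and `decode_html_attribute_sketch()` are made
up for illustration, and the body only delegates to `html_entity_decode()`
as a placeholder, so it does not implement the RFC's decoding rules.

```php
<?php
declare(strict_types=1);

// Hypothetical enum mirroring the five contexts named in the thread.
enum HtmlContext
{
    case Attribute;
    case BodyText;
    case ForeignText;
    case Script;
    case Style;
}

/**
 * Illustrative single entry point. NOT the RFC's implementation:
 * the decoding itself is a placeholder.
 */
function decode_html_sketch(HtmlContext $context, string $html): string
{
    return match ($context) {
        // Assumption: SCRIPT and STYLE content is raw text, so character
        // references are left untouched in this sketch.
        HtmlContext::Script, HtmlContext::Style => $html,

        // Placeholder: html_entity_decode() does not distinguish between
        // these contexts, which is exactly the gap the RFC is about.
        HtmlContext::Attribute,
        HtmlContext::BodyText,
        HtmlContext::ForeignText
            => html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8'),
    };
}

// The alternative shape: one named function per context (hypothetical names).
function decode_html_text_sketch(string $html): string
{
    return decode_html_sketch(HtmlContext::BodyText, $html);
}

function decode_html_attribute_sketch(string $html): string
{
    return decode_html_sketch(HtmlContext::Attribute, $html);
}
```

With two contexts the named wrappers read shorter at the call site; with
five, the enum keeps everything behind one function and lets
`HtmlContext::cases()` (or an IDE) list the options.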

> 3. It feels like decoding either text or attribute are two significantly
> different things. I admit I could be wrong, if code like
> decode_html($e->isAttribute() ? HtmlContext::Attribute :
> HtmlContext::Text, $e->getContent()) is likely to be seen.
>
>
> None of these contexts are *significantly* different, which is one of the
> major dangers of using `html_entity_decode()`. The results will look just
> about right most of the time. It’s the subtle differences that matter most,
> I suppose.
>

Well, that was kind of what I meant - even if the differences are usually
absent or subtle, they are significant (i.e. not necessarily big, but
meaningful), meaning that using the wrong context would give the wrong
result, right? Saying that they are not *significantly different* to me
means that the result would just be a little less good sometimes, not
outright wrong.
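
For what it's worth, here is a toy illustration of one such
subtle-but-significant difference, the ambiguous ampersand from the RFC
title. As I understand the HTML spec, a legacy named reference without a
trailing semicolon (like `&not`) is left alone inside an attribute when it
is followed by `=` or an ASCII alphanumeric, but is decoded in text. The
function name and its five-name table are assumptions for illustration
only; a real decoder needs the full entity table and longest-match lookup.

```php
<?php
declare(strict_types=1);

/**
 * Toy decoder for a tiny subset of legacy named character references.
 * Hypothetical helper, not part of PHP or of the RFC.
 */
function decode_legacy_refs_sketch(string $input, bool $inAttribute): string
{
    // A few of the names HTML allows without a trailing semicolon.
    $legacy = [
        'amp'  => '&',
        'lt'   => '<',
        'gt'   => '>',
        'quot' => '"',
        'not'  => "\u{AC}", // NOT SIGN
    ];

    $decoded = preg_replace_callback(
        '/&(amp|lt|gt|quot|not)(;?)/',
        function (array $m) use ($input, $legacy, $inAttribute): string {
            [$whole, $offset] = $m[0];
            $name         = $m[1][0];
            $hasSemicolon = $m[2][0] === ';';

            // Byte that follows the reference ('' at end of input).
            $next = $input[$offset + strlen($whole)] ?? '';
            $nextIsAmbiguous = $next === '='
                || preg_match('/^[A-Za-z0-9]$/', $next) === 1;

            // Attribute-only rule: without a semicolon, and followed by
            // "=" or an alphanumeric, the text is kept as written.
            if ($inAttribute && !$hasSemicolon && $nextIsAmbiguous) {
                return $whole;
            }

            return $legacy[$name];
        },
        $input,
        flags: PREG_OFFSET_CAPTURE
    );

    return $decoded ?? $input;
}
```

Given `?a=1&not=2`, the attribute mode returns the input unchanged, while
the text mode replaces `&not` with the NOT SIGN character. Same input, two
different and equally "correct" results depending on the context.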


>
> The lesson I have drawn is that people frequently have what they
> understand to be a text node or an attribute value, but they aren’t aware
> that they are supposed to decode differently, and they also aren’t reaching
> to interact with a full parser to get these values. If PHP could train
> people as they use these functions, purely through their interfaces, I
> think that could help elevate the level of reliability out there in the
> wild, as long as they aren’t *too* cumbersome (hence explicitly no
> default context argument _or_ using separately-named functions).
>
> Having the Enum I think enhances the ease with which people can reliably
> also decode things like SCRIPT and STYLE nodes. “I know
> `html_decode_text()` but I don’t know what the rules for SCRIPT are or if
> they’re different so I’ll just stick with that.” vs “My IDE suggests that
> `Script` is a different context, that’s interesting, I’ll try that and see
> how it’s different.”
>
>
That is a good point and using enums favours that learning push since they
are inherently grouped together.

>
> Best,
> Jakob
>
>
>
> Thanks for your input. I’m grateful for the discussions and that people
> are sharing.
>
>
Cheers!

--000000000000044bd706207d9de4--