Newsgroups: php.internals
From: jakob@givoni.dk (Jakob Givoni)
Date: Sun, 25 Aug 2024 10:15:26 +0200
Subject: Re: [PHP-DEV] [RFC] Decoding HTML and the Ambiguous Ampersand
To: Dennis Snell
Cc: Niels Dossche, Internals
References: <76D9E1DA-57CE-45C3-8E3E-B08A0B70FB60@a8c.com> <7ED2EE07-D7C6-43A4-A4E1-E9928E8B8D31@automattic.com> <90B08F35-06D5-4D5C-BA7B-B7116EE18769@automattic.com>
In-Reply-To: <90B08F35-06D5-4D5C-BA7B-B7116EE18769@automattic.com>

On Sat, Aug 24, 2024 at 10:31 PM Dennis Snell wrote:
> On Aug 24, 2024, at 2:56 PM, Jakob Givoni wrote:
>
> Hi Dennis,
>
> Overall it sounds like a reasonable RFC.
>
> > Dennis:
> >
> > > Niels:
> > >
> > > I'm not so sure that the name "decode_html" is self-descriptive enough, it sounds very generic.
> >
> > The name is not very important to me. For the sake of history, the reason I have chosen “decode HTML” is because, unlike an HTML parser, this is focused on taking a snippet of HTML “text” content and decoding it into a “plain PHP string.”
>
> Why not make it two methods called "decode_html_text" and "decode_html_attribute"?
> Consider the following reasons:
> 1. The function doesn't actually decode html as such, it decodes either an html text node string or an html attribute string.
>
> Thanks Jakob. In WordPress I did just this.
> https://developer.wordpress.org/reference/classes/wp_html_decoder/
>
> Part of the reason for that was the inability to require something like an enum (due to PHP version support requirements). The Enum solution feels very nice too.
>
> 2. Saves the $context parameter and the constants/enums, making the call significantly shorter.
>
> In my PR I’ve actually expanded the Enum to include a few other contexts. I feel like there’s a balance we have to do if we want to ride the line between *fully reliable* and *fully convenient*. On one hand, we could say “don’t send the text content of a SCRIPT element to this function!” But on the other hand, that kind of forces people to expect that SCRIPT content is different.
>
> With the Enum there is that in-built training material when someone looks and finds `Attribute | BodyText | ForeignText | Script | Style` (the contexts I’ve explored in my PR).
>
> We could make the same argument for `decode_html_script()` and `decode_foreign_text_node()` and `decode_html_style()`. Somehow the context feels cleaner to me, and like a single entry point for learning instead of five.
>

Yes. With 5 different contexts it's starting to shift in favor of a single function :-)
I only saw the RFC, which from what I can tell still only features 2 of them. I haven't seen the PR (RFC Implementation section says "Yet to come").

> 3. It feels like decoding either text or attribute are two significantly different things. I admit I could be wrong, if code like decode_html($e->isAttribute() ? HtmlContext::Attribute : HtmlContext::Text, $e->getContent()) is likely to be seen.
>
> None of these contexts are *significantly* different, which is one of the major dangers of using `html_entity_decode()`. The results will look just about right most of the time. It’s the subtle differences that matter most, I suppose.
>

Well, that was kind of what I meant - even if the differences are usually absent or subtle, they are significant (i.e. not necessarily big, but meaningful), meaning using it wrong would give the wrong result, right? Saying that they are not *significantly different* to me means that the result would just be a little less good sometimes, not directly wrong.

>
> The lesson I have drawn is that people frequently have what they understand to be a text node or an attribute value, but they aren’t aware that they are supposed to decode differently, and they also aren’t reaching to interact with a full parser to get these values.
> If PHP could train people as they use these functions, purely through their interfaces, I think that could help elevate the level of reliability out there in the wild, as long as they aren’t *too* cumbersome (hence explicitly no default context argument _or_ using separately-named functions).
>
> Having the Enum I think enhances the ease with which people can reliably also decode things like SCRIPT and STYLE nodes. “I know `html_decode_text()` but I don’t know what the rules for SCRIPT are or if they’re different so I’ll just stick with that.” vs “My IDE suggests that `Script` is a different context, that’s interesting, I’ll try that and see how it’s different.”
>

That is a good point and using enums favours that learning push since they are inherently grouped together.

> Best,
> Jakob
>
> Thanks for your input. I’m grateful for the discussions and that people are sharing.
>
> Cheers!

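The two API shapes weighed in this thread can be sketched in PHP roughly as follows. This is a minimal illustration, not the RFC's or the PR's implementation. The `HtmlContext` enum borrows the five context names quoted above, and `decode_html()`, `decode_html_text()` and `decode_html_attribute()` are the names floated in the discussion; their bodies here are assumptions. In particular, `html_entity_decode()` is used only as a stand-in and does not apply the per-context rules the RFC is about, and returning SCRIPT and STYLE content unchanged is an assumption based on HTML treating those elements as raw text.

```php
<?php
declare(strict_types=1);

// Hypothetical enum using the context names mentioned in the thread (requires PHP 8.1+).
enum HtmlContext
{
    case Attribute;
    case BodyText;
    case ForeignText;
    case Script;
    case Style;
}

// Shape 1: a single entry point where the caller must name the context.
// Hypothetical signature; the body is a placeholder, not the proposed decoder.
function decode_html(HtmlContext $context, string $html): string
{
    return match ($context) {
        // Assumption: SCRIPT and STYLE contents are raw text in HTML,
        // so character references in them are left untouched.
        HtmlContext::Script, HtmlContext::Style => $html,
        // Stand-in only: html_entity_decode() does not distinguish these
        // contexts, which is exactly the gap the RFC wants to close.
        HtmlContext::Attribute,
        HtmlContext::BodyText,
        HtmlContext::ForeignText => html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8'),
    };
}

// Shape 2: one named function per context, here as thin wrappers.
function decode_html_text(string $html): string
{
    return decode_html(HtmlContext::BodyText, $html);
}

function decode_html_attribute(string $html): string
{
    return decode_html(HtmlContext::Attribute, $html);
}

echo decode_html(HtmlContext::BodyText, 'Fish &amp; chips'), "\n";
echo decode_html(HtmlContext::Script, 'a &amp;&amp; b'), "\n";
```

With the enum, an IDE's completion on `HtmlContext::` lists all five contexts in one place, which is the "training through the interface" effect described above; with separate functions, each context has to be discovered by name.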