From: jakob@givoni.dk (Jakob Givoni)
Date: Sat, 24 Aug 2024 21:56:21 +0200
Subject: Re: [PHP-DEV] [RFC] Decoding HTML and the Ambiguous Ampersand
To: Dennis Snell
Cc: Niels Dossche, Internals

Hi Dennis,

Overall it sounds like a reasonable RFC.

> Dennis:
>
> > Niels:
> >
> > I'm not so sure that the name "decode_html" is self-descriptive enough, it sounds very generic.
>
> The name is not very important to me. For the sake of history, the reason I have chosen “decode HTML” is because, unlike an HTML parser, this is focused on taking a snippet of HTML “text” content and decoding it into a “plain PHP string.”

Why not make it two functions called "decode_html_text" and "decode_html_attribute"?
Consider the following reasons:
1. The function doesn't actually decode HTML as such; it decodes either an HTML text node string or an HTML attribute string.
2. It saves the $context parameter and the constants/enums, making the call significantly shorter.
3. Decoding text and decoding an attribute feel like two significantly different things. I admit I could be wrong if code like decode_html($e->isAttribute() ? HtmlContext::Attribute : HtmlContext::Text, $e->getContent()) is likely to be common. But I don't foresee many situations where text and attribute strings end up in the same code path.

A couple of other options that would address any objection to implicitly favouring UTF-8:
html_text_to_utf8 and html_attribute_to_utf8

Best,
Jakob
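
The two-function idea above can be sketched as thin wrappers over a single context-aware decoder. This is a hypothetical illustration only: decode_html() and HtmlContext are names taken from the discussion and do not exist in PHP today, and html_entity_decode() is used purely as a stand-in that ignores the context, so it does not model the text/attribute differences the RFC is about.

```php
<?php
// Requires PHP 8.1+ (enums). Hypothetical sketch: decode_html() and HtmlContext
// are names from this thread, not functions or types that exist in PHP today.

enum HtmlContext
{
    case Text;
    case Attribute;
}

// Stand-in for the proposed decoder. html_entity_decode() ignores $context,
// so the text/attribute differences the RFC addresses are NOT modelled here.
function decode_html(HtmlContext $context, string $html): string
{
    return html_entity_decode($html, ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, 'UTF-8');
}

// The two shorter entry points suggested above.
function decode_html_text(string $html): string
{
    return decode_html(HtmlContext::Text, $html);
}

function decode_html_attribute(string $html): string
{
    return decode_html(HtmlContext::Attribute, $html);
}

// Call sites: one argument, no enum to import or spell out.
$text = decode_html_text('Fish &amp; chips');
$href = decode_html_attribute('/search?q=a&amp;page=2');
```

With wrappers like these, the context is fixed by the function name at each call site, which is the shorter call shape argued for in point 2.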