Newsgroups: php.internals
From: dennis.snell@automattic.com (Dennis Snell)
To: internals@lists.php.net
Date: Tue, 09 Jul 2024 17:00:24 -0700 (PDT)
Subject: [PHP-DEV] Decoding HTML and the Ambiguous Ampersand
Message-ID: <80EA6CA9-E14E-4672-A88A-46EFE9E2F3F0@automattic.com>

Greetings all,

The `html_entity_decode( … ENT_HTML5 … )` function has a number of issues that I’d like to correct.

 - It’s missing 720 of HTML5’s specified named character references.
 - 106 of these are named character references which do not require a trailing semicolon, such as `&acute`.
 - It’s unaware of the ambiguous ampersand rule, which governs whether these 106 are decoded when the semicolon is missing.
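
To make the semicolon issue concrete, here is a minimal check of the current behavior. It assumes UTF-8 output and uses only the existing `html_entity_decode()`; the expected results rest on the function requiring a terminating semicolon.

```php
<?php
$flags = ENT_QUOTES | ENT_HTML5;

// With the semicolon, the reference decodes to U+00B4 (acute accent).
var_dump(html_entity_decode('&acute;', $flags, 'UTF-8') === "\u{B4}"); // bool(true)

// Without it, the text comes back untouched, although HTML5 accepts `&acute`.
var_dump(html_entity_decode('&acute', $flags, 'UTF-8') === '&acute'); // bool(true)
```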

HTML5 asserts that the list of named character references will not expand in the future. It can be found authoritatively at the following URL:

https://html.spec.whatwg.org/entities.json
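
As a rough way to check the count of semicolon-less names, the sketch below tallies keys without a trailing semicolon in a decoded copy of that file. The helper name `sketch_count_unterminated()` is hypothetical, and the three-entry sample only imitates the published layout, in which each key is the full reference text mapped to its code points and characters.

```php
<?php
// Hypothetical helper: counts names that lack a trailing semicolon.
function sketch_count_unterminated(array $entities): int
{
    $count = 0;
    foreach (array_keys($entities) as $name) {
        if (';' !== substr((string) $name, -1)) {
            $count++;
        }
    }
    return $count;
}

// A tiny excerpt in the same shape as entities.json (not the real file).
$sample = json_decode(
    '{"&not":{"codepoints":[172],"characters":"\u00AC"},'
    . '"&not;":{"codepoints":[172],"characters":"\u00AC"},'
    . '"&notin;":{"codepoints":[8713],"characters":"\u2209"}}',
    true
);

echo sketch_count_unterminated($sample), "\n"; // prints 1
```

Passing `json_decode( file_get_contents( 'entities.json' ), true )` for a local copy of the file would give the figure for the full list.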

The ambiguous ampersand rule smooths over legacy behavior from before HTML5 where ampersands were not properly encoded in attribute values, specifically in URL values. For example, in a query string for a search, one might find `?q=dog&not=cat`. The `&not` in that value would otherwise decode to U+00AC (¬), but because it appears in an attribute value and is immediately followed by `=`, it is left as plaintext. Inside normal HTML markup it would transform into `?q=dog¬=cat`. There are related nuances when numeric character references are found at the end of a string or boundary without the semicolon.
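
The context dependence can be sketched in userland as follows. Everything here is an assumption for illustration: `sketch_decode_legacy()` is a hypothetical name, the table holds only three of the 106 legacy names, and the sketch ignores longest-match lookups (for example `&notin;`) and numeric references. The rule it applies is that, inside an attribute, a reference without a semicolon is left alone when the next character is `=` or an ASCII alphanumeric.

```php
<?php
// Hypothetical, deliberately tiny table; the real list has 106 such names.
const SKETCH_LEGACY_REFS = ['acute' => "\u{B4}", 'amp' => '&', 'not' => "\u{AC}"];

function sketch_decode_legacy(string $context, string $raw_text): string
{
    // Group 2 is the optional semicolon; group 3 peeks at a following
    // "=" or ASCII alphanumeric without consuming it.
    $pattern = '/&(acute|amp|not)(;?)(?=([=A-Za-z0-9]?))/';

    return (string) preg_replace_callback(
        $pattern,
        static function (array $m) use ($context): string {
            $unterminated = '' === $m[2];
            $blocked      = 'attribute' === $context
                && $unterminated
                && '' !== ($m[3] ?? '');

            // Blocked references stay as plain text; others are decoded.
            return $blocked ? $m[0] : SKETCH_LEGACY_REFS[$m[1]];
        },
        $raw_text
    );
}

var_dump(sketch_decode_legacy('attribute', '?q=dog&not=cat') === '?q=dog&not=cat');  // bool(true)
var_dump(sketch_decode_legacy('data', '?q=dog&not=cat') === "?q=dog\u{AC}=cat");     // bool(true)
```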

The function signature of `html_entity_decode()` does not currently allow for correcting this behavior. I’d like to propose an RFC or a bug fix which either extends the function (perhaps by adding a new flag like `ENT_AMBIGUOUS_AMPERSAND`) or preferably creates a new function. For the missing character references I wonder if it would be enough to add them to the list of default translatable references.

One challenge with the existing function is that the concept of the translation table stands in contrast with the fixed and static nature of HTML5’s replacement tables. A new function or set of functions could open up spec-compliant decoding while providing helpful methods that are necessary in many common server-side operations:

  - `html_decode( 'attribute' | 'data', $raw_text, $input_encoding = 'utf-8' )`
  - `html_text_contains( 'attribute' | 'data', $raw_haystack, $needle, $input_encoding = 'utf-8' )`
  - `html_text_starts_with( 'attribute' | 'data', $raw_haystack, $needle, $input_encoding = 'utf-8' )`

These methods are handy for inspecting things like encoded attribute values in a memory-efficient and processing-efficient way, when it’s not necessary to decode the entire value. In common situations, one encounters data URIs holding potentially megabytes of image data, and processing only the first few bytes, or the first few tens of bytes, can save a lot of overhead.
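
A rough userland sketch of the prefix check shows where the savings come from. It is not the proposed API: `sketch_html_text_starts_with()` is a hypothetical name, it omits the context and encoding parameters, and it only decodes semicolon-terminated references by handing them to the existing `html_entity_decode()`. The point is that decoding stops once enough bytes exist to settle the comparison.

```php
<?php
function sketch_html_text_starts_with(string $raw_haystack, string $needle): bool
{
    $decoded = '';
    $at      = 0;
    $end     = strlen($raw_haystack);
    $want    = strlen($needle);

    // Decode only until the comparison can be decided.
    while (strlen($decoded) < $want && $at < $end) {
        if ('&' === $raw_haystack[$at]
            && 1 === preg_match('/\G&#?[A-Za-z0-9]+;/', $raw_haystack, $m, 0, $at)
        ) {
            // Unknown names come back unchanged from html_entity_decode().
            $decoded .= html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
            $at      += strlen($m[0]);
        } else {
            $decoded .= $raw_haystack[$at++];
        }

        // Bail out as soon as the decoded prefix diverges from the needle.
        $seen = min(strlen($decoded), $want);
        if (0 !== strncmp($decoded, $needle, $seen)) {
            return false;
        }
    }

    return strlen($decoded) >= $want;
}

// Only the first few bytes of the long value are ever examined.
$uri = 'data&#x3A;image/png;base64,' . str_repeat('A', 1000);
var_dump(sketch_html_text_starts_with($uri, 'data:')); // bool(true)
```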

We’re exploring pure-PHP solutions to these problems in WordPress in attempts to improve the reliability and safety of handling HTML. I’d love to hear your thoughts and know if anyone is willing to work with me to create an RFC or directly propose patches. We’ve created a step function which allows finding the next character reference and decoding it separately, enabling some novel features like highlighting the character references in source text.
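
The shape of such a step function might look like the sketch below. This is not the WordPress implementation; both function names are hypothetical, and the matcher recognizes reference-shaped tokens by syntax alone without checking names against the HTML5 table.

```php
<?php
// Hypothetical step function: returns the next reference-shaped token at or
// after $offset as ['at' => byte offset, 'token' => text], or null if none.
function sketch_next_reference(string $text, int $offset = 0): ?array
{
    $pattern = '/&(?:#[xX][0-9A-Fa-f]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);?/';
    if (1 !== preg_match($pattern, $text, $m, PREG_OFFSET_CAPTURE, $offset)) {
        return null;
    }
    return ['at' => $m[0][1], 'token' => $m[0][0]];
}

// Example consumer: wraps each token in brackets to highlight it.
function sketch_highlight(string $text): string
{
    $out = '';
    $at  = 0;
    while (null !== ($ref = sketch_next_reference($text, $at))) {
        $out .= substr($text, $at, $ref['at'] - $at) . '[' . $ref['token'] . ']';
        $at   = $ref['at'] + strlen($ref['token']);
    }
    return $out . substr($text, $at);
}

echo sketch_highlight('a &lt; b &amp c'), "\n"; // prints: a [&lt;] b [&amp] c
```

Because the caller drives the loop, each reference can be inspected, decoded, or skipped on its own.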

Should I propose an RFC for this?

Warmly,
Dennis Snell
Automattic Inc.
