Newsgroups: php.internals
From: Dennis Snell <dennis.snell@automattic.com>
To: Internals <internals@lists.php.net>
Date: Thu, 15 Aug 2024 17:59:11 -0700 (PDT)
Subject: [PHP-DEV] Re: Decoding HTML and the Ambiguous Ampersand
Message-ID: <1FD3A9B0-D46F-4589-A803-3CC2347EC7DF@automattic.com>

On Jul 9, 2024, at 4:55 PM, Dennis Snell <dennis.snell@a8c.com> wrote:

Greetings all,

The `html_entity_decode( … ENT_HTML5 … )` function has a number of issues that I’d like to correct.

 - It’s missing 720 of HTML5’s specified named character references.
 - 106 of these are named character references which do not require a trailing semicolon, such as `&acute`
 - It’s unaware of the ambiguous ampersand rule, which allows these 106 in special circumstances.
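
A quick way to see the semicolon requirement is to call the current function with and without the trailing semicolon. This is only a sketch of the behavior described in the list; the flags and charset passed here are choices made for illustration.

```php
<?php
// `&acute;` has been a named reference since HTML 4, so it is decoded.
$with_semicolon = html_entity_decode('&acute;', ENT_QUOTES | ENT_HTML5, 'UTF-8');

// The same name without its semicolon is expected to come back unchanged,
// even though HTML5 permits this legacy form.
$without_semicolon = html_entity_decode('&acute', ENT_QUOTES | ENT_HTML5, 'UTF-8');

var_dump($with_semicolon === "\u{00B4}", $without_semicolon === '&acute');
```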

HTML5 asserts that the list of named character references will not expand in the future. It can be found authoritatively at the following URL:

https://html.spec.whatwg.org/entities.json
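
The published `entities.json` file is a JSON object keyed by reference name (for example `"&AElig"` and `"&AElig;"`), so the legacy names can be counted by looking for keys that lack the trailing semicolon. The helper below is a hypothetical sketch, not part of any proposal; its name and the inline sample are assumptions made for illustration.

```php
<?php
// Hypothetical helper: count the names in an entities.json document that
// do not end with a semicolon (the legacy, semicolon-optional forms).
function sketch_count_legacy_references(string $entities_json): int
{
    $entities = json_decode($entities_json, true, 512, JSON_THROW_ON_ERROR);

    $legacy = array_filter(
        array_keys($entities),
        fn (string $name): bool => !str_ends_with($name, ';')
    );

    return count($legacy);
}

// Tiny inline sample in the same shape as the real file.
$sample = '{"&AElig":{"codepoints":[198],"characters":"\u00C6"},'
        . '"&AElig;":{"codepoints":[198],"characters":"\u00C6"},'
        . '"&Abreve;":{"codepoints":[258],"characters":"\u0102"}}';

var_dump(sketch_count_legacy_references($sample));
```

Feeding it the contents of the real file should give the count of semicolon-optional names quoted in this message, assuming that figure is accurate.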


The ambiguous ampersand rule smooths over legacy behavior from before HTML5 where ampersands were not properly encoded in attribute values, specifically in URL values. For example, in a query string for a search, one might find `?q=dog&not=cat`. The `&not` in that value would decode to U+00AC (¬), but since it’s in an attribute value and is followed by `=`, it will be left as plain text. Inside normal HTML markup it would transform into `?q=dog¬=cat`. There are related nuances when numeric character references are found at the end of a string or boundary without the semicolon.
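
The difference between the two contexts can be sketched in a few lines of PHP. Everything here is an assumption made for illustration: the function name, the `'attribute'`/`'data'` strings, and the four-entry table, which stands in for the full list of semicolon-optional names. It also ignores numeric references and every name outside the table.

```php
<?php
// Hypothetical, deliberately tiny table of legacy names that HTML allows
// without a trailing semicolon. The real list is much longer.
const SKETCH_LEGACY_REFS = [
    'amp'   => '&',
    'not'   => "\u{00AC}",
    'acute' => "\u{00B4}",
    'copy'  => "\u{00A9}",
];

function sketch_decode_legacy(string $context, string $text): string
{
    $pattern = '/&(' . implode('|', array_keys(SKETCH_LEGACY_REFS)) . ')(;?)/';

    $decoded = preg_replace_callback(
        $pattern,
        function (array $m) use ($context, $text): string {
            [$match, $offset] = $m[0];
            $name             = $m[1][0];
            $has_semicolon    = $m[2][0] === ';';
            $next             = $text[$offset + strlen($match)] ?? '';

            // Ambiguous ampersand: inside an attribute value, a reference
            // that lacks its semicolon and is followed by "=" or an ASCII
            // alphanumeric is left as literal text.
            $blocks_decoding = $next === '=' || preg_match('/^[A-Za-z0-9]$/', $next) === 1;
            if ('attribute' === $context && !$has_semicolon && $blocks_decoding) {
                return $match;
            }

            return SKETCH_LEGACY_REFS[$name];
        },
        $text,
        -1,
        $count,
        PREG_OFFSET_CAPTURE
    );

    return $decoded ?? $text;
}

var_dump(sketch_decode_legacy('attribute', '?q=dog&not=cat'));
var_dump(sketch_decode_legacy('data', '?q=dog&not=cat'));
```

The first call leaves the query string alone, while the second replaces `&not` with `¬`, matching the example in the paragraph above.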

The function signature of `html_entity_decode()` does not currently allow for correcting this behavior. I’d like to propose an RFC or a bug fix which either extends the function (perhaps by adding a new flag like `ENT_AMBIGUOUS_AMPERSAND`) or preferably creates a new function. For the missing character references I wonder if it would be enough to add them to the list of default translatable references.

One challenge with the existing function is that the concept of the translation table stands in contrast with the fixed and static nature of HTML5’s replacement tables. A new function or set of functions could open up spec-compliant decoding while providing helpful methods that are necessary in many common server-side operations:

  - `html_decode( 'attribute' | 'data', $raw_text, $input_encoding = 'utf-8' )`
  - `html_text_contains( 'attribute' | 'data', $raw_haystack, $needle, $input_encoding = 'utf-8' )`
  - `html_text_starts_with( 'attribute' | 'data', $raw_haystack, $needle, $input_encoding = 'utf-8' )`

These methods are handy for inspecting things like encoded attribute values in a memory-efficient and processing-efficient way, when it’s not necessary to decode the entire value. In common situations, one encounters data URIs with potentially megabytes of image data, and processing only the first few or tens of bytes can save a lot of overhead.
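
To make the early-exit idea concrete, here is a rough userland sketch of a prefix check that decodes only as much of the haystack as the needle requires. It is not the proposed implementation: the function name is hypothetical, the context and encoding parameters are omitted, only semicolon-terminated references are handled, and it leans on the existing `html_entity_decode()` with all of its limitations.

```php
<?php
// Hypothetical userland stand-in for the proposed html_text_starts_with().
function sketch_html_text_starts_with(string $raw_haystack, string $needle): bool
{
    $reference = '/\G&(?:#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);/';
    $at        = 0; // byte offset into the still-encoded haystack
    $matched   = 0; // bytes of $needle confirmed so far

    while ($matched < strlen($needle)) {
        if ($at >= strlen($raw_haystack)) {
            return false; // haystack ended before the needle did
        }

        // Default: one literal byte. If a known reference starts here,
        // decode just that reference and step over its raw form.
        $chunk    = $raw_haystack[$at];
        $consumed = 1;
        if ('&' === $chunk && preg_match($reference, $raw_haystack, $m, 0, $at) === 1) {
            $decoded = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
            if ($decoded !== $m[0]) {
                $chunk    = $decoded;
                $consumed = strlen($m[0]);
            }
        }

        // Compare only as many bytes as the needle still needs.
        $length = min(strlen($chunk), strlen($needle) - $matched);
        if (substr_compare($needle, $chunk, $matched, $length) !== 0) {
            return false;
        }

        $matched += $length;
        $at      += $consumed;
    }

    return true;
}

var_dump(sketch_html_text_starts_with('java&#x73;cript&#58;alert(1)', 'javascript:'));
```

Because the loop stops as soon as the needle is matched or contradicted, a multi-megabyte data URI costs no more to test than its first few bytes.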

We’re exploring pure-PHP solutions to these problems in WordPress in attempts to improve the reliability and safety of handling HTML. I’d love to hear your thoughts and know if anyone is willing to work with me to create an RFC or directly propose patches. We’ve created a step function which allows finding the next character reference and decoding it separately, enabling some novel features like highlighting the character references in source text.
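
A step function of that general shape might look like the sketch below. This is not the WordPress code: both function names are hypothetical, only semicolon-terminated references are recognized, and decoding is delegated to the existing `html_entity_decode()`, so it inherits the gaps described earlier.

```php
<?php
// Hypothetical step function: find the next decodable character reference
// at or after byte offset $at, or return null when none remains.
function sketch_next_character_reference(string $text, int $at = 0): ?array
{
    $pattern = '/&(?:#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);/';

    while (preg_match($pattern, $text, $m, PREG_OFFSET_CAPTURE, $at) === 1) {
        [$raw, $offset] = $m[0];
        $decoded        = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

        if ($decoded !== $raw) {
            return ['offset' => $offset, 'length' => strlen($raw), 'decoded' => $decoded];
        }

        $at = $offset + 1; // looked like a reference but is not one; keep scanning
    }

    return null;
}

// Example use: wrap every recognized reference in square brackets.
function sketch_highlight_references(string $text): string
{
    $out = '';
    $at  = 0;

    while (($ref = sketch_next_character_reference($text, $at)) !== null) {
        $out .= substr($text, $at, $ref['offset'] - $at);
        $out .= '[' . substr($text, $ref['offset'], $ref['length']) . ']';
        $at   = $ref['offset'] + $ref['length'];
    }

    return $out . substr($text, $at);
}

var_dump(sketch_highlight_references('Fish &amp; chips &#38; &bogus; peas'));
```

Returning the offset and length alongside the decoded value is what lets a caller either decode in place or, as here, leave the source text intact and only mark it up.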

Should I propose an RFC for this?

Warmly,
Dennis Snell
Automattic Inc.

All,

I have submitted an RFC draft for including the proposed feature from this issue. Thanks to everyone who helped me in this process. It’s my first RFC, so I apologize in advance for any mistakes I’ve made in the process.

https://wiki.php.net/rfc/decode_html


This is proposed for a future PHP version after 8.4.

Warmly,
Dennis Snell