From: rob@bottled.codes ("Rob Landers")
Date: Fri, 16 Aug 2024 16:38:15 +0200
To: internals@lists.php.net
Newsgroups: php.internals
Message-ID: <798084be-9ad3-4fee-92e8-9c9b90814b64@app.fastmail.com>
In-Reply-To: <1FD3A9B0-D46F-4589-A803-3CC2347EC7DF@automattic.com>
References: <1FD3A9B0-D46F-4589-A803-3CC2347EC7DF@automattic.com>
Subject: Re: [PHP-DEV] Re: Decoding HTML and the Ambiguous Ampersand

On Fri, Aug 16, 2024, at 02:59, Dennis Snell wrote:
>
>> On Jul 9, 2024, at 4:55 PM, Dennis Snell <dennis.snell@a8c.com> wrote:
>>
>> Greetings all,
>>
>> The `html_entity_decode( … ENT_HTML5 … )` function has a number of issues that I’d like to correct.
>>
>>  - It’s missing 720 of HTML5’s specified named character references.
>>  - 106 of these are named character references which do not require a trailing semicolon, such as `&acute`
>>  - It’s unaware of the ambiguous ampersand rule, which allows these 106 in special circumstances.
>>
>> HTML5 asserts that the list of named character references will not expand in the future. It can be found authoritatively at the following URL:
>>
>> https://html.spec.whatwg.org/entities.json
>>
>> The ambiguous ampersand rule smoothes over legacy behavior from before HTML5 where ampersands were not properly encoded in attribute values, specifically in URL values. For example, in a query string for a search, one might find `?q=dog&not=cat`. The `&not` in that value would decode to U+AC (¬), but since it’s in an attribute value it will be left as plaintext. Inside normal HTML markup it would transform into `?q=dog¬=cat`. There are related nuances when numeric character references are found at the end of a string or boundary without the semicolon.
>>
>> The function signature of `html_entity_decode()` does not currently allow for correcting this behavior. I’d like to propose an RFC or a bug fix which either extends the function (perhaps by adding a new flag like `ENT_AMBIGUOUS_AMPERSAND`) or preferably creates a new function. For the missing character references I wonder if it would be enough to add them to the list of default translatable references.
>>
>> One challenge with the existing function is that the concept of the translation table stands in contrast with the fixed and static nature of HTML5’s replacement tables. A new function or set of functions could open up spec-compliant decoding while providing helpful methods that are necessary in many common server-side operations:
>>
>>  - `html_decode( 'attribute' | 'data', $raw_text, $input_encoding = 'utf-8' )`
>>  - `html_text_contains( 'attribute' | 'data', $raw_haystack, $needle, $input_encoding = 'utf-8' )`
>>  - `html_text_starts_with( 'attribute' | 'data', $raw_haystack, $needle, $input_encoding = 'utf-8' )`
>>
>> These methods are handy for inspecting things like encoded attribute values in a memory-efficient and processing-efficient way, when it’s not necessary to decode the entire value. In common situations, one encounters data-URIs with potentially megabytes of image data and processing only the first few or tens of bytes can save a lot of overhead.
>>
>> We’re exploring pure-PHP solutions to these problems in WordPress in attempts to improve the reliability and safety of handling HTML. I’d love to hear your thoughts and know if anyone is willing to work with me to create an RFC or directly propose patches. We’ve created a step function which allows finding the next character reference and decoding it separately, enabling some novel features like highlighting the character references in source text.
>>
>> Should I propose an RFC for this?
>>
>> Warmly,
>> Dennis Snell
>> Automattic Inc.
>
> All,
>
> I have submitted an RFC draft for including the proposed feature from this issue. Thanks to everyone who helped me in this process. It’s my first RFC, so I apologize in advance for any mistakes I’ve made in the process.
>
> https://wiki.php.net/rfc/decode_html
>
> This is proposed for a future PHP version after 8.4.
>
> Warmly,
> Dennis Snell

Hey Dennis,

The RFC mentions that encoding must be utf-8. How are programmers supposed to work with this if the php file itself isn’t utf-8 or the input is meaningless in utf-8 or if changing it to utf-8 and back would result in invalid text?

— Rob

