Newsgroups: php.internals
From: dennis.snell@automattic.com (Dennis Snell)
To: Internals
Subject: [PHP-DEV] Re: [RFC] Decoding HTML and the Ambiguous Ampersand
Date: Mon, 19 Aug 2024 15:45:53 -0700
Message-ID: <76D9E1DA-57CE-45C3-8E3E-B08A0B70FB60@a8c.com>

On Jul 9, 2024, at 4:55 PM, Dennis Snell <dennis.snell@a8c.com> wrote:

Greetings all,

The `html_entity_decode( … ENT_HTML5 … )` function has a number of issues that I’d like to correct.

 - It’s missing 720 of HTML5’s specified named character references.
 - 106 of these are named character references which do not require a trailing semicolon, such as `&acute`.
 - It’s unaware of the ambiguous ampersand rule, which allows these 106 in special circumstances.
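
A quick check of the semicolon behavior described in the list above, assuming a stock PHP build. The variable names are only for illustration, and the expected results in the comments reflect that `html_entity_decode()` decodes only references that end in a semicolon:

```php
<?php
// With the trailing semicolon the reference decodes to U+00B4 (acute accent).
$with_semicolon = html_entity_decode('&acute;', ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Without the semicolon the input comes back unchanged, although HTML5
// allows `&acute` as one of the semicolon-less named references.
$without_semicolon = html_entity_decode('&acute', ENT_QUOTES | ENT_HTML5, 'UTF-8');

// The same limitation means a raw query string passes through untouched.
$query = html_entity_decode('?q=dog&not=cat', ENT_QUOTES | ENT_HTML5, 'UTF-8');

var_dump($with_semicolon === "\u{B4}");      // bool(true)
var_dump($without_semicolon === '&acute');   // bool(true)
var_dump($query === '?q=dog&not=cat');       // bool(true)
```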

HTML5 asserts that the list of named character references will not expand in the future. It can be found authoritatively at the following URL:

https://html.spec.whatwg.org/entities.json
The ambiguous ampersand rule smooths over legacy behavior from before HTML5, where ampersands were not properly encoded in attribute values, specifically in URL values. For example, in a query string for a search, one might find `?q=dog&not=cat`. The `&not` in that value would decode to U+00AC (¬), but since it’s in an attribute value and is followed by `=`, it will be left as plain text. Inside normal HTML markup it would transform into `?q=dog¬=cat`. There are related nuances when numeric character references are found at the end of a string or boundary without the semicolon.
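
The attribute-versus-markup decision can be sketched as a small predicate. This is only an illustration of the rule as the HTML specification states it, not code from PHP or from any proposal; the function name `sketch_should_decode_without_semicolon()` and the `'attribute'` / `'data'` strings are assumptions made for the example:

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, not part of PHP.
 *
 * $after is the byte offset just past a named character reference that
 * matched WITHOUT a trailing semicolon (for example, just past `&not`).
 */
function sketch_should_decode_without_semicolon(string $context, string $text, int $after): bool
{
    // Outside of attributes such a match is always decoded.
    if ('attribute' !== $context) {
        return true;
    }

    // At the end of an attribute value there is no next character to block it.
    if ($after >= strlen($text)) {
        return true;
    }

    // In attributes, `=` or an ASCII alphanumeric right after the match
    // suggests a raw URL separator, so the text is left as-is.
    return 1 !== preg_match('/\G[=A-Za-z0-9]/', $text, $m, 0, $after);
}

$value = '?q=dog&not=cat';
$after = strpos($value, '&not') + strlen('&not'); // offset of the `=` after `&not`

var_dump(sketch_should_decode_without_semicolon('attribute', $value, $after)); // bool(false)
var_dump(sketch_should_decode_without_semicolon('data', $value, $after));      // bool(true)
```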

The function signature of `html_entity_decode()` does not currently allow for correcting this behavior. I’d like to propose an RFC or a bug fix which either extends the function (perhaps by adding a new flag like `ENT_AMBIGUOUS_AMPERSAND`) or preferably creates a new function. For the missing character references I wonder if it would be enough to add them to the list of default translatable references.

One challenge with the existing function is that the concept of the translation table stands in contrast with the fixed and static nature of HTML5’s replacement tables. A new function or set of functions could open up spec-compliant decoding while providing helpful methods that are necessary in many common server-side operations:

  - `html_decode( 'attribute' | 'data', $raw_text, $input_encoding = 'utf-8' )`
  - `html_text_contains( 'attribute' | 'data', $raw_haystack, $needle, $input_encoding = 'utf-8' )`
  - `html_text_starts_with( 'attribute' | 'data', $raw_haystack, $needle, $input_encoding = 'utf-8' )`

These methods are handy for inspecting things like encoded attribute values in a memory-efficient and processing-efficient way, when it’s not necessary to decode the entire value. In common situations, one encounters data-URIs with potentially megabytes of image data and processing only the first few or tens of bytes can save a lot of overhead.
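
To make the early-exit idea concrete, here is a rough userland sketch of a prefix check that stops as soon as the answer is known. It is not the proposed implementation: the name `sketch_html_text_starts_with()` is hypothetical, the context argument is omitted, and `html_entity_decode()` stands in for a spec-compliant decoder, so semicolon-less references are treated as plain text here.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: does the decoded form of $raw start with $needle?
 * Only as much of $raw is decoded as the comparison needs.
 */
function sketch_html_text_starts_with(string $raw, string $needle): bool
{
    $matched       = 0; // bytes of $needle confirmed so far
    $at            = 0; // current byte offset into $raw
    $needle_length = strlen($needle);
    $raw_length    = strlen($raw);

    while ($matched < $needle_length) {
        if ($at >= $raw_length) {
            return false; // input ended before the needle did
        }

        // Default: take one literal byte.
        $chunk = $raw[$at];
        $step  = 1;

        // \G anchors the pattern at $at, so only a reference starting here counts.
        $pattern = '/\G&(?:#[xX][0-9A-Fa-f]+|#[0-9]+|[A-Za-z0-9]+);/';
        if ('&' === $chunk && 1 === preg_match($pattern, $raw, $m, 0, $at)) {
            // Unknown names come back unchanged and are compared as literal text.
            $chunk = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
            $step  = strlen($m[0]);
        }

        // Compare only the part of the chunk that the needle still needs.
        $length = min(strlen($chunk), $needle_length - $matched);
        if (substr($needle, $matched, $length) !== substr($chunk, 0, $length)) {
            return false; // first mismatch: the rest of $raw is never touched
        }

        $matched += $length;
        $at      += $step;
    }

    return true;
}

// The payload after the prefix is never decoded, however long it is.
var_dump(sketch_html_text_starts_with('data&#58;image/png;base64,AAAA', 'data:')); // bool(true)
var_dump(sketch_html_text_starts_with('javascript&#58;alert(1)', 'data:'));        // bool(false)
```

The loop touches at most a handful of bytes for a short needle, which is the saving described above for multi-megabyte data-URIs.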

We’re exploring pure-PHP solutions to these problems in WordPress in attempts to improve the reliability and safety of handling HTML. I’d love to hear your thoughts and know if anyone is willing to work with me to create an RFC or directly propose patches. We’ve created a step function which allows finding the next character reference and decoding it separately, enabling some novel features like highlighting the character references in source text.

Should I propose an RFC for this?

Warmly,
Dennis Snell
Automattic Inc.

Thanks everyone for your feedback so far on the `decode_html()` RFC [https://wiki.php.net/rfc/decode_html].

I’ve updated it, replacing the new constants with a new `HtmlContext` enum, and the interface seems much nicer this way. I particularly like how PHP enforces passing a valid value, vs. hoping that the right flag is used.
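
As a small illustration of that enforcement (PHP 8.1+), the following uses a stand-in enum. `SketchHtmlContext`, its case names, and `sketch_describe_context()` are assumptions made up for this example; the RFC defines the actual `HtmlContext` enum.

```php
<?php
declare(strict_types=1);

// Stand-in enum with assumed case names; see the RFC for the real `HtmlContext`.
enum SketchHtmlContext
{
    case Attribute;
    case Text;
}

function sketch_describe_context(SketchHtmlContext $context): string
{
    return match ($context) {
        SketchHtmlContext::Attribute => 'attribute value',
        SketchHtmlContext::Text      => 'text node',
    };
}

echo sketch_describe_context(SketchHtmlContext::Attribute), "\n"; // attribute value

$rejected = false;
try {
    // An arbitrary integer, such as a mistakenly passed ENT_* flag, is
    // refused by the type system before the function body ever runs.
    sketch_describe_context(ENT_QUOTES);
} catch (TypeError $e) {
    $rejected = true;
}

var_dump($rejected); // bool(true)
```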

Additionally, I added a section that I previously forgot, which highlights the source of the infamous mojibake/gremlins. HTML has special rules for remapping the C1 control characters, as if they had been stored or recorded for Windows-1252.
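
For a sense of what that remapping looks like, here is a partial sketch. In the HTML specification, numeric character references in the range 0x80 to 0x9F are replaced using a table that matches the Windows-1252 code page. The constant and function names below are hypothetical, and only a few entries are shown; the specification lists the full table.

```php
<?php
declare(strict_types=1);

// Partial excerpt for illustration; the HTML specification has every entry.
const SKETCH_C1_REMAP = [
    0x80 => 0x20AC, // EURO SIGN
    0x85 => 0x2026, // HORIZONTAL ELLIPSIS
    0x91 => 0x2018, // LEFT SINGLE QUOTATION MARK
    0x92 => 0x2019, // RIGHT SINGLE QUOTATION MARK
    0x93 => 0x201C, // LEFT DOUBLE QUOTATION MARK
    0x94 => 0x201D, // RIGHT DOUBLE QUOTATION MARK
    0x96 => 0x2013, // EN DASH
    0x97 => 0x2014, // EM DASH
    0x99 => 0x2122, // TRADE MARK SIGN
];

// Returns the code point a numeric character reference should produce,
// leaving anything outside the excerpt untouched.
function sketch_remap_c1(int $code_point): int
{
    return SKETCH_C1_REMAP[$code_point] ?? $code_point;
}

printf("U+%04X\n", sketch_remap_c1(0x80)); // U+20AC, so `&#x80;` renders as a euro sign
printf("U+%04X\n", sketch_remap_c1(0x41)); // U+0041, unchanged
```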

Warmly,
Dennis Snell
