Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:125102
Message-ID: <7ED2EE07-D7C6-43A4-A4E1-E9928E8B8D31@automattic.com>
Content-Type: multipart/alternative;
	boundary="Apple-Mail=_7F078510-A5C0-4280-B32C-4EFB47B329A3"
Precedence: bulk
Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3776.700.51\))
Subject: Re: [PHP-DEV] [RFC] Decoding HTML and the Ambiguous Ampersand
Date: Thu, 22 Aug 2024 18:02:13 -0500
In-Reply-To: <efaf4c62-a552-4232-8a22-410578c13b8d@gmail.com>
Cc: Internals <internals@lists.php.net>
To: Niels Dossche <dossche.niels@gmail.com>
References: <CAA8B983-FD39-4AB1-B314-62608B0C0EF4@a8c.com>
 <76D9E1DA-57CE-45C3-8E3E-B08A0B70FB60@a8c.com>
 <efaf4c62-a552-4232-8a22-410578c13b8d@gmail.com>
From: dennis.snell@automattic.com (Dennis Snell)


--Apple-Mail=_7F078510-A5C0-4280-B32C-4EFB47B329A3
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=utf-8


> On Aug 22, 2024, at 5:01=E2=80=AFPM, Niels Dossche =
<dossche.niels@gmail.com> wrote:
>=20
> On 20/08/2024 00:45, Dennis Snell wrote:
>>=20
>>> On Jul 9, 2024, at 4:55=E2=80=AFPM, Dennis Snell =
<dennis.snell@a8c.com> wrote:
>>>=20
>>> Greetings all,
>>>=20
>>> The `html_entity_decode( =E2=80=A6 ENT_HTML5 =E2=80=A6 )` function =
has a number of issues that I=E2=80=99d like to correct.
>>>=20
>>>  - It=E2=80=99s missing 720 of HTML5=E2=80=99s specified named =
character references.
>>>  - 106 of these are named character references which do not require =
a trailing semicolon, such as `&acute`
>>>  - It=E2=80=99s unaware of the ambiguous ampersand rule, which =
allows these 106 in special circumstances.
>>>=20
>>> HTML5 asserts that the list of named character references will not =
expand in the future. It can be found authoritatively at the following =
URL:
>>>=20
>>> https://html.spec.whatwg.org/entities.json
>>>=20
>>> The ambiguous ampersand rule smoothes over legacy behavior from =
before HTML5 where ampersands were not properly encoded in attribute =
values, specifically in URL values. For example, in a query string for a =
search, one might find `?q=3Ddog&not=3Dcat`. The `&not` in that value =
would decode to U+AC (=C2=AC), but since it=E2=80=99s in an attribute =
value it will be left as plaintext. Inside normal HTML markup it would =
transform into `?q=3Ddog=C2=AC=3Dcat`. There are related nuances when =
numeric character references are found at the end of a string or =
boundary without the semicolon.
>>>=20
>>> The function signature of `html_entity_decode()` does not currently =
allow for correcting this behavior. I=E2=80=99d like to propose an RFC =
or a bug fix which either extends the function (perhaps by adding a new =
flag like `ENT_AMBIGUOUS_AMPERSAND`) or preferably creates a new =
function. For the missing character references I wonder if it would be =
enough to add them to the list of default translatable references.
>>>=20
>>> One challenge with the existing function is that the concept of the =
translation table stands in contrast with the fixed and static nature of =
HTML5=E2=80=99s replacement tables. A new function or set of functions =
could open up spec-compliant decoding while providing helpful methods =
that are necessary in many common server-side operations:
>>>=20
>>>   - `html_decode( 'attribute' | 'data', $raw_text, =
$input_encoding =3D 'utf-8' )`
>>>   - `html_text_contains( 'attribute' | 'data', $raw_haystack, =
$needle, $input_encoding =3D 'utf-8' )`
>>>   - `html_text_starts_with( 'attribute' | 'data', $raw_haystack, =
$needle, $input_encoding =3D 'utf-8' )`
>>>=20
>>> These methods are handy for inspecting things like encoded attribute =
values in a memory-efficient and processing-efficient way, when it=E2=80=99=
s not necessary to decode the entire value. In common situations, one =
encounters data-URIs with potentially megabytes of image data and =
processing only the first few or tens of bytes can save a lot of =
overhead.
>>>=20
>>> We=E2=80=99re exploring pure-PHP solutions to these problems in =
WordPress in attempts to improve the reliability and safety of handling =
HTML. I=E2=80=99d love to hear your thoughts and know if anyone is =
willing to work with me to create an RFC or directly propose patches. =
We=E2=80=99ve created a step function which allows finding the next =
character reference and decoding it separately, enabling some novel =
features like highlighting the character references in source text.
>>>=20
>>> Should I propose an RFC for this?
>>>=20
>>> Warmly,
>>> Dennis Snell
>>> Automattic Inc.
>>=20
>> Thanks everyone for your feedback so far on the `decode_html()` RFC =
[https://wiki.php.net/rfc/decode_html]
>>=20
>> I=E2=80=99ve updated it replacing the new constants with a new =
`HtmlContext` enum, and the interface seems much nicer this way. I =
particularly like how PHP enforces passing a valid value, vs. hoping =
that the right flag is used.
>>=20
>> Additionally I added a section that I previously forgot, which =
highlights the source of the infamous mojibake/gremlins. HTML has =
special rules for remapping the C1 control characters, as if they had =
been stored or recorded for Windows-1252.
>>=20
>> Warmly,
>> Dennis Snell
>>=20
>=20
> Hi Dennis
>=20
> +1 on the concept.
> I just have two concerns:

Thanks Niels. I appreciate the help you=E2=80=99ve already provided on =
this process, and the work you=E2=80=99ve done with lexbor.

>=20
> 1) I'm not so sure that the name "decode_html" is self-descriptive =
enough, it sounds very generic.

The name is not very important to me. For the record, I chose =
=E2=80=9Cdecode HTML=E2=80=9D because, unlike an HTML parser, this =
function focuses on taking a snippet of HTML =E2=80=9Ctext=E2=80=9D =
content and decoding it into a =E2=80=9Cplain PHP string.=E2=80=9D

The existing `html_entity_decode()` is very close in naming but ties =
this concept into _entities_, and overlooks other basic text decoding =
concerns (newline normalization and NULL byte handling).
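
To make the context-dependent part concrete, here is a rough sketch of
the ambiguous ampersand rule from my original message quoted above. It
is not the code from my PR: it is in Python only because the standard
library ships the WHATWG table as `html.entities.html5`, the names
`decode_named_references` and `in_attribute` are hypothetical, and it
handles named character references only (no numeric references, newline
normalization, or NULL bytes).

```python
from html.entities import html5  # WHATWG names; legacy ones lack ";"

# Longest name in the table, used to bound the lookahead.
MAX_NAME = max(len(name) for name in html5)


def decode_named_references(text, in_attribute):
    """Decode named character references only (hypothetical helper)."""
    out = []
    i = 0
    while i < len(text):
        if text[i] != "&":
            out.append(text[i])
            i += 1
            continue
        # Find the longest known name starting right after the "&".
        name = None
        for size in range(min(MAX_NAME, len(text) - i - 1), 0, -1):
            candidate = text[i + 1 : i + 1 + size]
            if candidate in html5:
                name = candidate
                break
        if name is None:
            out.append("&")
            i += 1
            continue
        end = i + 1 + len(name)
        follower = text[end : end + 1]
        is_alnum = follower.isascii() and follower.isalnum()
        # Attribute rule: a semicolon-less match followed by "=" or an
        # ASCII alphanumeric is left as plain text.
        keep_literal = (
            in_attribute
            and not name.endswith(";")
            and (follower == "=" or is_alnum)
        )
        out.append("&" + name if keep_literal else html5[name])
        i = end
    return "".join(out)
```

Under those assumptions, the query string from the earlier example
should come back unchanged when `in_attribute` is true, and with `&not`
replaced by U+AC when it is false.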

Originally I had =E2=80=9Cutf8=E2=80=9D in the name but someone else =
thought it was too long and specific. I want the name to educate =
developers and also be terse. Naming is hard.

> 2) I would strongly suggest to explore an implementation based on =
Lexbor. I'm pretty confident that it can be done by reusing the internal =
APIs. The advantage is that it will be less code to maintain. You pull =
off some fancy tricks in your implementation for performance reasons, =
but that also adds to complexity and maintenance burden. Also since this =
is C, we must be extra careful when implementing tricks.

Yeah, I agree, and I=E2=80=99ll share more below. The tricks in my PR =
implementing the RFC are there partly to propose adoption into PHP and =
partly to get a real sense of how my algorithm compares with those in =
Chrome, Firefox, Safari, and lexbor. I=E2=80=99ve attempted to build a =
search algorithm for named character references that optimizes for =
cache locality rather than for algorithmic complexity under the =
assumption that RAM access is free.

My code isn=E2=80=99t currently well documented and doesn=E2=80=99t =
meet the php-src coding standards, but the algorithm is pretty basic =
and easy to explain. It=E2=80=99s also mostly =
=E2=80=9Cunoptimized=E2=80=9D for C. I think there are still large =
gains to be made that so far I=E2=80=99ve been unable to see how to =
incorporate into the lexbor parser. For example, `decode_html()` =
assumes we already start with a span of HTML text, so it never has to =
decide whether the next byte produces a token that escapes out of the =
text parsing mode.

> If we could have a single implementation, that would be great. I do =
understand of course your concern that DOM is not a required extension, =
and therefore basing the internals on Lexbor makes it tied to the DOM =
extension which may not be available. I however suspect that a large =
chunk of people needing a function like this have DOM available (as DOM =
is required by many HTML-processing-related packages). I can also look =
into it sometime soon if you want; anyway feel free to ping me.

I=E2=80=99m also very open to lexbor-based approaches, but so far =
I=E2=80=99ve found them more complicated than I expected. In part this =
is because they involve setting up the parser and state machine for =
the HTML specification, while much of the actual decoding can be =
safely done without either.

The other part is the extension aspect. I hear you that you would =
expect calling code to have the DOM extension available, but =
that=E2=80=99s simply not the case when developing a platform like =
WordPress, which I do. We don=E2=80=99t control the servers or =
environments where people deploy it, and the DOM extension is missing =
often enough that WordPress code simply cannot rely on `DOMDocument` =
(and it shouldn=E2=80=99t anyway, given the wild problems `DOMDocument` =
has when parsing HTML).

People resort to `html_entity_decode()` because that=E2=80=99s the only =
option. In WordPress we now have a spec-compliant decoder, but as it=E2=80=
=99s in user-space PHP its performance is far below what=E2=80=99s =
possible.

I=E2=80=99d love your help in setting up lexbor=E2=80=99s state machine =
to decode text nodes. I=E2=80=99d love it even more if this could be =
part of the PHP language. It constantly surprises me that _the language =
of the web_ (PHP) doesn=E2=80=99t have the tools to speak _the language =
of the web_ (HTML). This RFC is all about taking a step towards ensuring =
that PHP developers can rely on PHP to be a reliable middle-man between =
the HTML domain and the PHP domain.

In other words, requiring the DOM extension or `DOM\HtmlDocument` would =
be such a non-starter for WordPress (accounting for 43% of the web =
today) that the function would be completely unavailable there.

>=20
> And I do have the following thoughts:
> 1) We should amend the ENT_HTML5 related docs already that it's not =
compliant.
> 2) Perhaps ENT_HTML5 should be deprecated. E.g. you could say in your =
RFC that ENT_HTML5 will be deprecated in the release after the version =
that will have decode_html(). The reason I suggest the release _after_ =
and not the _same_ release is because I strongly believe that we should =
have at least one version where the proper alternative is available =
without forcing a deprecation to users already.

I love this suggestion. For reference, since I=E2=80=99ve looked before =
and not found it: can someone point me to where the source of the PHP =
function documentation lives? There are a number of updates I would =
love to propose, but I don=E2=80=99t know where to find the content =
that appears in =
https://www.php.net/manual/en/function.html-entity-decode.php, for =
instance.

>=20
> Kind regards
> Niels

Mad respect to the work you=E2=80=99ve brought to lexbor and to PHP. =
I=E2=80=99m excited to start relying on `\DOM\HtmlDocument` and have =
started using it in my benchmarks and HTML analysis as we develop the =
WordPress HTML API (a streaming, low memory-overhead, reentrant HTML =
parsing and manipulation framework in user-space PHP).

Dennis Snell


--Apple-Mail=_7F078510-A5C0-4280-B32C-4EFB47B329A3--