Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:125019
Feedback-ID: ifab94697:Fastmail
Precedence: bulk
MIME-Version: 1.0
Date: Sat, 17 Aug 2024 03:53:27 +0200
To: internals@lists.php.net
Message-ID: <6f421f13-d363-4b75-9d55-9d61ef4806c9@app.fastmail.com>
In-Reply-To: <47CB3E5E-1246-471A-B3BE-CE23BEFDDDF7@automattic.com>
References: <CAA8B983-FD39-4AB1-B314-62608B0C0EF4@a8c.com>
 <1FD3A9B0-D46F-4589-A803-3CC2347EC7DF@automattic.com>
 <47CB3E5E-1246-471A-B3BE-CE23BEFDDDF7@automattic.com>
Subject: Re: [PHP-DEV] [RFC] Re: Decoding HTML and the Ambiguous Ampersand
Content-Type: multipart/alternative;
 boundary=1959572ecc994dad86e6b1463873ef4b
From: rob@bottled.codes ("Rob Landers")

--1959572ecc994dad86e6b1463873ef4b
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

Hey Dennis,

This looks like top posting because you’ve given us a lot to read (and it’s well written), but I want to reply to some points inline.

On Fri, Aug 16, 2024, at 20:43, Dennis Snell wrote:
> >On Fri, Aug 16, 2024, at 02:59, Dennis Snell wrote
>
> Thanks for the question, Rob, I hope this finds you well!
>
> >The RFC mentions that encoding must be utf-8. How are programmers supposed to work with this if the php file itself isn’t utf-8
>
> From my experience it’s the opposite case that is more important to consider. That is, what happens when we mix UTF-8 source code with latin1 or UTF-8 source HTML with the system-set locale. I tried to hint at this scenario in the “Character encodings and UTF-8” section.
>
> Let’s examine the fundamental breakdown case:
>
> ```php
> "é" === decode_html( "&#xe9;" );
> ```
>
> If the source is UTF-8 there’s no problem. If the source is ISO-8859-1 this will fail because xE9 is on the left while xC3 xA9 is on the right. _Except_ if `zend.multibyte=1` and (`zend.script_encoding=iso-8859-1` _or_ if `declare(encoding='iso-8859-1')` is set). The source code may or may not be converted into a different encoding based on configurations that most developers won’t have access to, or won’t examine.
>
> Even with source code in ISO-8859-1, and with `zend.script_encoding` and `zend.multibyte` set, `html_entity_decode()` _still_ reports UTF-8 unless `zend.default_charset` is set _or_ one of the `iconv` or `mbstring` internal charsets is set.

I just want to pause here and say, “holy crap.” That is quite complex and those edges seem sharp!
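
A minimal sketch of that byte mismatch, for anyone who wants to see it. `decode_html()` is only proposed, so this assumes `html_entity_decode()` as a stand-in, and the bytes are written as escapes so the encoding of the file itself doesn’t matter:

```php
<?php
// Sketch only: html_entity_decode() stands in for the proposed decode_html().
// Bytes are written as escapes so this file's own encoding is irrelevant.
$flags = ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5;

$utf8   = html_entity_decode('&#xe9;', $flags, 'UTF-8');      // U+00E9 as UTF-8
$latin1 = html_entity_decode('&#xe9;', $flags, 'ISO-8859-1'); // U+00E9 as Latin-1

var_dump(bin2hex($utf8));   // string(4) "c3a9"
var_dump(bin2hex($latin1)); // string(2) "e9"

// An "é" literal saved in an ISO-8859-1 file is the single byte 0xE9,
// so comparing it with the UTF-8 result fails.
var_dump("\xE9" === $utf8); // bool(false)
```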

>
> The point I’m trying to make is that the situation today is a minefield due to a dizzying array of system-dependent settings. Most modern code will either be running UTF-8 source code or will be converting source code _to_ UTF-8, or many other things will already be helplessly broken beyond this one issue.

Unfortunately, we don’t always get to choose the code we work on. There is someone on this list using SHIFT_JIS. They probably know more about the ins and outs of dealing with UTF-8-centric systems from that encoding. Hopefully they can comment more about why this would or would not be a bad idea.

>
> UTF-8 is the unifier that lets us escape this by having a defined and explicit encoding at the input and output.

UTF-8 is pretty good right now, but I don’t think we should marry the language to it. Will it be “the standard” in 10 years, 20 years, 100 years? Languages change, cultures change. Some people I know use a font to change triple equals from a literal === to ≡. How long until PHP recognizes that as a literal operator?

But anyway, to get back on topic: I, personally, would rather see something more flexible, with sane defaults for UTF-8.
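
As a rough sketch of what “flexible, with sane defaults” could look like: a hypothetical wrapper that defaults to UTF-8 but lets the caller name another encoding. The name `decode_html_as()` and its signature are made up for illustration and are not part of the RFC; it assumes `html_entity_decode()` as the underlying decoder.

```php
<?php
// Hypothetical helper, not part of the RFC: UTF-8 unless the caller says otherwise.
function decode_html_as(string $html, string $encoding = 'UTF-8'): string
{
    return html_entity_decode($html, ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, $encoding);
}

$default = decode_html_as('caf&eacute;');               // UTF-8 bytes
$legacy  = decode_html_as('caf&eacute;', 'ISO-8859-1'); // Latin-1 bytes

var_dump(bin2hex($default)); // string(10) "636166c3a9"
var_dump(bin2hex($legacy));  // string(8) "636166e9"
```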

>
> > or the input is meaningless in utf-8 or if changing it to utf-8 and back would result in invalid text?
>
> There shouldn’t be input that’s meaningless in UTF-8 if it’s valid in any other encoding. Indeed, I have placed the burden on the calling code to convert into UTF-8 beforehand, but that’s not altogether different than asking someone to declare into what encoding the character references ought to be decoded.

There’s a huge performance difference between converting a string to and from different encodings and telling a function which encoding to parse in. The latter would also be useful when the page itself is not UTF-8.
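
To illustrate the two paths being compared, here is a small sketch. It assumes `html_entity_decode()` as a stand-in for the proposed `decode_html()` and needs the mbstring extension; the sample string is made up.

```php
<?php
// Sketch: html_entity_decode() stands in for the proposed decode_html().
// Requires the mbstring extension.
$flags  = ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5;
$latin1 = "caf&eacute; cr\xE8me"; // ISO-8859-1 bytes: 0xE8 is "è"

// One pass: decode directly in the document's own encoding.
$direct = html_entity_decode($latin1, $flags, 'ISO-8859-1');

// Three passes: convert to UTF-8, decode, convert back.
$utf8      = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
$decoded   = html_entity_decode($utf8, $flags, 'UTF-8');
$roundTrip = mb_convert_encoding($decoded, 'ISO-8859-1', 'UTF-8');

var_dump($direct === $roundTrip); // bool(true)
```

Both paths agree on this input; the difference is the two extra passes over the string. How much those passes cost in practice would need measuring on real content.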

>
> ```diff
> -html_entity_decode( $html, ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, 'ISO-8859-1' );
> +$html = mb_convert_encoding( $html, 'UTF-8', 'ISO-8859-1' );
> +$html = decode_html( HTML_TEXT, $html );
> +$html = mb_convert_encoding( $html, 'ISO-8859-1', 'UTF-8' );
> ```
>
> If an encoding can go into UTF-8 (which it should) then it should also be able to return for all supported inputs. That is, we cannot convert into UTF-8 and produce a character that is unrepresentable in the source encoding, because that would imply it was there in the source to begin with. Furthermore, if the HTML decodes into a code point unsupported in the destination encoding, it would be invalid either directly via decoding, or indirectly via conversion.
>
> ```diff
> -"\x1A" === html_entity_decode( "&#x1f170;", ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, 'ISO-8859-1' );
> +"?" === mb_convert_encoding( decode_html( HTML_TEXT, "&#x1f170;" ), 'ISO-8859-1', 'UTF-8' );
> ```
>
> This gets really confusing because neither of these outputs is a proper decoding, as character encodings that don’t support the full Unicode code space cannot adequately represent all valid HTML inputs. HTML is a Unicode decoding by specification, so even in a browser with `<meta charset="ISO-8859-1">&#x1f170;` the text content will still be `🅰`, not `?` or the invisible ASCII control code SUB.

I was of the understanding that meta charset was too late to set the encoding (but it’s been a while since I’ve read the HTML5 spec) and the charset needed to be set in the html tag itself. I suppose browsers simply rewind upon hitting meta charset, but browsers have to deal with all kinds of shenanigans.

That being said, there is nothing in the spec (that I remember seeing) stating it was Unicode only; just that it was the default.

Further, HTML may be stored in a database with a certain encoding (as with content systems like WordPress or Drupal) where it may not be straightforward (or desirable) to convert to UTF-8.

>
> --
>
> I’m sorry for being long-winded but I think it’s necessary to frame these questions in the context of the problem today. We have very frequent errors that result from having the wrong defaults and a confusion of text encodings. I’ve seen far more problems from source code being UTF-8 and assuming the input is, rather than being anything else (likely ISO-8859-1 if not UTF-8) and assuming the input isn’t.



>
>   * It should be possible to convert any string into UTF-8 regardless of its origin character set, and then transitively, if it originated there, it should be able to convert back if the HTML represents text that is representable in the original character set.

There are a number of scripts/languages not yet supported (especially on=
 older machines) that would result in =E2=80=9C=EF=BF=BD=E2=80=9D and ca=
nnot be transcribed back to its original encoding. For example, there ar=
e still new scripts being added as late as two years ago: https://www.un=
icode.org/standard/supported.html
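
One way to check whether a particular input survives the trip is simply to try it. The sketch below assumes mbstring is available; `round_trips()` is a hypothetical helper name. It does not reproduce an unmapped script. It only shows how loss can be detected, plus the reverse case where a decoded reference does not fit the original encoding.

```php
<?php
// Sketch only; requires mbstring. round_trips() is a hypothetical helper.
function round_trips(string $bytes, string $encoding): bool
{
    $utf8 = mb_convert_encoding($bytes, 'UTF-8', $encoding);
    return mb_convert_encoding($utf8, $encoding, 'UTF-8') === $bytes;
}

var_dump(round_trips("caf\xE9", 'ISO-8859-1')); // bool(true)

// The other direction: a decoded reference may not fit the original encoding.
mb_substitute_character(0x3F); // substitute "?" for unrepresentable characters
$decoded = html_entity_decode('&#x1f170;', ENT_QUOTES | ENT_HTML5, 'UTF-8');
$back    = mb_convert_encoding($decoded, 'ISO-8859-1', 'UTF-8');

var_dump(bin2hex($decoded)); // string(8) "f09f85b0"
var_dump($back);             // string(1) "?"
```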

>
>   * Converting at the boundaries of the application is the way to escape the confusion of wrestling an arbitrary number of different character sets.

I totally agree with this statement, but we should provide tools instead of dictating a policy.

>
>   * Proper HTML decoding requires a character set capable of representing all of Unicode, as the code points in numeric character references refer to Unicode Code Points and _not_ any particular code units or byte sequences in any particular encoding.
>
>   * Almost every other character set is ASCII compatible, including UTF-8, making the domain of problems where this arises even smaller than it might otherwise seem. For example, `&` is `&` in all of the common character sets.
>
> Have a lovely weekend! And sorry for the potentially mis-threaded reply. I couldn’t figure out how to reply to your message directly because the digest emails were still stuck in 2020 for my account and I didn’t switch subscriptions until after your email went out, meaning I didn’t have a copy of your email.
>
> >
> >-- Rob

-- Rob
--1959572ecc994dad86e6b1463873ef4b--