Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:125025
Message-ID: <D6A2BCCF-4883-4508-9652-3B17B1EB9714@automattic.com>
Content-Type: multipart/alternative;
	boundary="Apple-Mail=_54182BAB-AD14-4FA0-9327-6083FA6595C1"
Precedence: bulk
Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3774.600.62\))
Subject: Re: [PHP-DEV] [RFC] Re: Decoding HTML and the Ambiguous Ampersand
Date: Fri, 16 Aug 2024 22:40:21 -0700
In-Reply-To: <6f421f13-d363-4b75-9d55-9d61ef4806c9@app.fastmail.com>
Cc: internals@lists.php.net
To: Rob Landers <rob@bottled.codes>
References: <CAA8B983-FD39-4AB1-B314-62608B0C0EF4@a8c.com>
 <1FD3A9B0-D46F-4589-A803-3CC2347EC7DF@automattic.com>
 <47CB3E5E-1246-471A-B3BE-CE23BEFDDDF7@automattic.com>
 <6f421f13-d363-4b75-9d55-9d61ef4806c9@app.fastmail.com>
From: dennis.snell@automattic.com (Dennis Snell)


--Apple-Mail=_54182BAB-AD14-4FA0-9327-6083FA6595C1
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=utf-8


> On Aug 16, 2024, at 6:53=E2=80=AFPM, Rob Landers <rob@bottled.codes> =
wrote:
>=20
> Hey Dennis,
>=20
> This looks like top posting because you=E2=80=99ve got a lot to read =
=E2=80=94 and well written =E2=80=94 but I want to reply to some points =
inline.=20

Rob, no worries! I love your questions and I love being able to work =
together again, even in some limited fashion. Let me prefix this for you =
and for everyone on the list: this is really hairy stuff, and can at =
times require concentrated focus. When I started down this path a long =
time ago I knew very little about it. I=E2=80=99ve been knee-deep in it =
for years and now I feel like I learn something new every day that I =
didn=E2=80=99t know before.

>=20
> On Fri, Aug 16, 2024, at 20:43, Dennis Snell wrote:
>> >On Fri, Aug 16, 2024, at 02:59, Dennis Snell wrote
>>=20
>> Thanks for the question, Rob, I hope this finds you well!
>>=20
>> >The RFC mentions that encoding must be utf-8. How are programmers =
supposed to work with this if the php file itself isn=E2=80=99t utf-8
>>=20
>> =46rom my experience it=E2=80=99s the opposite case that is more =
important to consider. That is, what happens when we mix UTF-8 source =
code with latin1 or UTF-8 source HTML with the system-set locale. I =
tried to hint at this scenario in the "Character encodings and UTF-8=E2=80=
=9D section.
>>=20
>> Let=E2=80=99s examine the fundamental breakdown case:
>>=20
>> ```php
>> =E2=80=9C=C3=A9=E2=80=9D =3D=3D=3D decode_html( =E2=80=9C&#xe9;=E2=80=9D=
 );
>> ```
>>=20
>> If the source is UTF-8 there=E2=80=99s no problem. If the source is =
ISO-8859-1 this will fail because xE9 is on the left while xC3 xA9 is on =
the right. _Except_ if `zend.multibyte=3D1` and =
(`zend.script_encoding=3Diso-8859-1` _or_ if =
`declare(encoding=3D=E2=80=98iso-8859-1=E2=80=99)` is set). The source =
code may or may not be converted into a different encoding based on =
configurations that most developers won=E2=80=99t have access to, or =
won=E2=80=99t examine.
>>=20
>> Even with source code in ISO-8859-1, the `zend.script_encoding` and =
`zend.multibyte` set, `html_entity_decode()` _still_ reports UTF-8 =
unless `zend.default_charset` is set _or_ one of the `iconv` or =
`mbstring` internal charsets is set.
>=20
> I just want to pause here and say, =E2=80=9Choly crap.=E2=80=9D That =
is quite complex and those edges seem sharp!
>=20
>>=20
>> My point I=E2=80=99m trying to make is that the current situation =
today is a minefield due to a dizzying array of system-dependent =
settings. Most modern code will either be running UTF-8 source code or =
will be converting source code _to_ UTF-8 or many other things will =
already be helplessly broken beyond this one issue.
>=20
> Unfortunately, we don=E2=80=99t always get to choose the code we work =
on. There is someone on this list using SHIFT_JIS. They probably know =
more about the ins and outs of dealing with utf-8 centric systems from =
that encoding. Hopefully they can comment more about why this would or =
would not be a bad idea.=20

This is, in fact, one of my primary motivations for standardizing on =
UTF-8. Keep in mind that HTML not only has a set of character encodings =
that must be supported, but also a requirement that =
parsers=C2=A0not=C2=A0support additional encodings =
<https://html.spec.whatwg.org/#character-encodings> outside of that =
list. This is based on security grounds, for good and even more =
complicated reasons.

Every one of the required character encodings round-trips through =
UTF-8 as long as the text isn=E2=80=99t modified in between. In fact, =
almost every character set out there should round-trip in this way, =
because the Unicode Consortium=E2=80=99s goal, as far as I understand =
it, is to capture every possible written character in a single =
universal character set. Unicode appears first in the introduction to =
the HTML specification =
<https://html.spec.whatwg.org/#suggested-reading> and is reiterated =
throughout the document: HTML requires the use of UTF-8 =
<https://html.spec.whatwg.org/#charset>, though it allows legacy =
encodings (there really are no =E2=80=9Cinvalid=E2=80=9D HTML documents =
because parse errors have deterministic resolutions).
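
A minimal sketch of that round-trip, assuming the mbstring extension is
loaded and using ISO-8859-1 as a stand-in for a legacy source encoding
(the variable names are only illustrative):

```php
<?php
// All 256 single-byte values, treated as ISO-8859-1 text.
$legacy = implode('', array_map('chr', range(0, 255)));

// Into UTF-8 and straight back, with no modification in between.
$utf8 = mb_convert_encoding($legacy, 'UTF-8', 'ISO-8859-1');
$back = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

// True when every byte survived the trip unchanged.
var_dump($back === $legacy);
```

The sketch only covers the unmodified case; once the UTF-8 text gains
characters the legacy encoding lacks, the trip back becomes lossy.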

>=20
>>=20
>> UTF-8 is the unifier that lets us escape this by having a defined and =
explicit encoding at the input and output.
>=20
> Utf-8 is pretty good, right now, but I don=E2=80=99t think we should =
marry the language to it. Will it be =E2=80=9Cthe standard=E2=80=9D in =
10 years, 20 years, 100 years? Languages change, cultures change. Some =
people I know use a font to change triple equals from a literal =3D=3D=3D =
to =E2=89=A1. How long until php recognizes that as a literal operator?
>=20
> But anyway, to get back on topic; I, personally, would rather see =
something more flexible, with sane defaults for utf-8.

Guarding against a future where UTF-8 is replaced means planning for =
an extremely unlikely scenario. UTF-8 is the most universal standard =
for interchange of text content, prevalent in software, systems, and =
programming languages, even those with UTF-16 internals.

It=E2=80=99s a good moment to remind ourselves, however, that Unicode =
defines a table of character =E2=80=9Ccode points=E2=80=9D, which map a =
natural number to a character. UTF-8 is an algorithm for storing those =
natural numbers in byte sequences.
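
To make the distinction concrete, here is a rough sketch of that
algorithm. The function name is hypothetical and the sketch skips
validation (surrogates and values above U+10FFFF are not rejected):

```php
<?php
// Hypothetical helper for illustration only; not part of the RFC.
// Stores one Unicode code point (a natural number) as UTF-8 bytes.
function encode_code_point_as_utf8(int $cp): string {
    if ($cp < 0x80) {
        return chr($cp); // One byte, identical to ASCII.
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6))
            . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(encode_code_point_as_utf8(0xE9)), "\n";    // c3a9
echo bin2hex(encode_code_point_as_utf8(0x1F170)), "\n"; // f09f85b0
```

The code point is the same number in every Unicode encoding; only the
byte sequence used to store it differs.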

We can absolutely over-plan for extensibility, and this is what =
I=E2=80=99ve seen happen with the existing HTML functions in PHP (with =
options to choose what to decode, which entities to use, into which =
encoding to decode, etc). There=E2=80=99s an appearance of an awareness =
of text encoding, but the design of the function interfaces leads =
people to make decisions that open up all sorts of doors to corruption =
and security exploits.

So it wouldn=E2=80=99t matter to my RFC if another encoding were =
standardized as long as one encoding is standardized. Today, I see no =
legitimate competition to UTF-8. The only encodings that come close are =
the two UTF-16 variants because of their prevalence in Java, JavaScript, =
and Objective-C strings, but the UTF-16 variable-width encoding suffers a =
number of shortcomings compared to UTF-8 without providing much value in =
exchange.

When the day comes that UTF-8 is deprecated or replaced, major swaths of =
the internet will need overhaul far beyond PHP. Or at least, I have a =
hard time imagining that going any other way.

>=20
>>=20
>> > or the input is meaningless in utf-8 or if changing it to utf-8 and =
back would result in invalid text?
>>=20
>> There shouldn't be input that=E2=80=99s meaningless in UTF-8 if =
it=E2=80=99s valid in any other encoding. Indeed, I have placed the =
burden on the calling code to convert into UTF-8 beforehand, but =
that=E2=80=99s not altogether different than asking someone to declare =
into what encoding the character references ought to be decoded.
>=20
> There=E2=80=99s a huge performance difference between converting a =
string from/to different encodings and instructing a function what to =
parse in the current encoding and also be useful when the page itself is =
not utf8.=20

It definitely seems this way when examining a single function in =
isolation, but I would challenge folks to look at how these functions =
are used in the wild. Typically I see strings transcoded multiple times, =
and usually based on the wrong encoding. For example, WordPress =
currently looks at its defined =E2=80=9Cblog_charset=E2=80=9D to perform =
decoding, but most of the time the HTML input it gets isn=E2=80=99t =
encoded in the blog charset.

What would be a performance win would be to decode and encode text at =
application boundaries so it can be converted once, processed in a =
pipeline where everything agrees on the encoding, and finally once more =
on output. In a UTF-8 world this requires no conversion at all, and =
UTF-8 is the overwhelming majority of code in web applications today.

---

We can keep in mind too that there are two encodings in the picture. The =
HTML source document may be encoded in one encoding while the output =
might need to appear in another. Consider HTML stored in a database as =
latin1/ISO-8859-1. It stores =E2=80=9C=C3=A9 is &#xE9;=E2=80=9D, except =
that unlike in this email, the leading character =C3=A9 is the single =
byte xE9.

This output likely should be sent to a browser as UTF-8. It=E2=80=99s =
acceptable to send latin1, but most pages will have characters =
unrepresentable in latin1. The backend then in decoding the HTML must go =
ahead and internally convert the input character encoding so that the =C3=A9=
 becomes the two byte sequence xC3 xA9 and then decode &#xE9; as xC3 =
xA9.
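
A small sketch of that flow, assuming the mbstring extension is loaded.
Since `decode_html()` is only proposed, `html_entity_decode()` with an
explicit UTF-8 charset stands in for it here:

```php
<?php
// Bytes as they would sit in the latin1 database column.
$stored = "\xE9 is &#xE9;";

// Convert once at the boundary so everything downstream is UTF-8.
$utf8 = mb_convert_encoding($stored, 'UTF-8', 'ISO-8859-1');

// Stand-in for the proposed decode_html().
$text = html_entity_decode($utf8, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo bin2hex($text), "\n"; // c3a920697320c3a9
```

Both the literal byte and the character reference end up as the same
two-byte sequence, which is what a browser would show for this input.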

Were the input =E2=80=9C&#x1f170;=E2=80=9D it simply could not decode =
into any single-byte encoding, so the HTML cannot be fully decoded. =
`html_entity_decode()` simply leaves that character reference in place. =
This kind of behavior tends to lead to double-encoding of the character =
references, and what the browser gets is `&amp;#x1f170;` instead of =
`=F0=9F=85=B0`.
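
This can be observed with a short script. It is only a sketch: the
variable names are illustrative and the output is dumped rather than
asserted, so it can be checked against a given PHP build:

```php
<?php
$reference = '&#x1f170;';
$flags = ENT_QUOTES | ENT_HTML5;

// A UTF-8 target can represent U+1F170.
$as_utf8 = html_entity_decode($reference, $flags, 'UTF-8');

// A single-byte target has no byte for U+1F170.
$as_latin1 = html_entity_decode($reference, $flags, 'ISO-8859-1');

// Escaping the leftover text again is where double-encoding creeps in.
$escaped = htmlspecialchars($as_latin1, $flags, 'ISO-8859-1');

var_dump(bin2hex($as_utf8), $as_latin1, $escaped);
```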

>=20
>>=20
>> ```diff
>> -html_entity_decode( $html, ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, =
=E2=80=98ISO-8859-1=E2=80=99 );
>> +$html =3D mb_convert_encoding( $html, =E2=80=98UTF-8=E2=80=99, =
=E2=80=98ISO-8859-1=E2=80=99 );
>> +$html =3D decode_html( HTML_TEXT, $html );
>> +$html =3D mb_convert_encoding( $html, =E2=80=98ISO-8859-1=E2=80=99, =
=E2=80=98UTF-8=E2=80=99 );
>> ```
>>=20
>> If an encoding can go into UTF-8 (which it should) then it should =
also be able to return for all supported inputs. That is, we cannot =
convert into UTF-8 and produce a character that is unrepresentable in =
the source encoding, because that would imply it was there in the source =
to begin with. Furthermore, if the HTML decodes into a code point =
unsupported in the destination encoding, it would be invalid either =
directly via decoding, or indirectly via conversion.
>>=20
>> ```diff
>> -=E2=80=9C\x1A=E2=80=9D =3D=3D=3D html_entity_decode( =
=E2=80=9C&#x1f170;=E2=80=9D, ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, =
=E2=80=98ISO-8859-1=E2=80=99 );
>> +=E2=80=9D?=E2=80=9D =3D=3D=3D mb_convert_encoding( decode_html( =
HTML_TEXT, =E2=80=9C&#x1f170;=E2=80=9D ), =E2=80=98ISO-8859-1=E2=80=99, =
=E2=80=98UTF-8=E2=80=99 );
>> ```
>>=20
>> This gets really confusing because neither of these outputs is a =
proper decoding, as character encodings that don=E2=80=99t support the =
full Unicode code space cannot adequately represent all valid HTML =
inputs. HTML is a Unicode decoding by specification, so even in a =
browser with `<meta charset=3D=E2=80=9CISO-8859-1=E2=80=9D>&#x1f170;` =
the text content will still be `=F0=9F=85=B0`, not `?` or the invisible =
ASCII control code SUB.
>=20
> I was of the understanding that meta charset was too late to set the =
encoding (but it=E2=80=99s been awhile since I=E2=80=99ve read the html5 =
spec) and the charset needed to be set in the html tag itself. I suppose =
browsers simply rewind upon hitting meta charset, but browsers have to =
deal with all kinds of shenanigans.=20

The algorithm for determining a document character set is =
straightforward, albeit with many steps. META elements within the first =
kilobyte of a document may determine the inferred encoding if one is not =
provided externally or from an HTTP header. Fun fact, if you find `<meta =
charset=3D=E2=80=9Cutf16=E2=80=9D>` then a browser will properly set the =
document encoding to UTF-8 (just as it ignores DOCTYPE declarations and =
treats all `text/html` content as HTML5).

A while ago I ran some analysis on roughly 300,000 pages from a list of =
top-ranked domains. You can examine the raw data =
<https://github.com/WordPress/wordpress-develop/files/15171826/charset-det=
ections.csv> and find some interesting bits in there. For instance, many =
HTML documents claim multiple incompatible character sets. Thankfully =
the HTML specification is clear on how to handle these situations. There =
really aren=E2=80=99t any shenanigans since 2008 when HTML5 formalized =
the parsing error modes.

>=20
> That being said, there is nothing in the spec (that I remember seeing) =
stating it was Unicode only; just that it was the default.

See the above note on character encoding vs. Unicode character set. =
Unicode is in the introduction to the HTML spec and mentioned =
throughout. It=E2=80=99s in the =E2=80=9Cbig picture=E2=80=9D at the =
start. Even encodings like ISO-2022-JP and GB18030 map to Unicode code =
points and are simply different ways to store those in sequences of =
bytes.

>=20
> Further, html may be stored in the database of a certain encoding =
(such as content systems like WordPress or Drupal) where it may not be =
straightforward (or desirable) to convert to utf8.=20

See above again: this is actually one of the most dangerous parts of =
suggesting in a function signature that a developer pick a character =
encoding, particularly since it invites incompatible decoding of the =
source document. It=E2=80=99s completely fine to store content in a =
database in another encoding, and many legacy systems do. Those are best =
served by converting when reading from the database into UTF-8 and then =
encoding from UTF-8 when saving into the database. The database =
character set confusions make HTML=E2=80=99s look simple, but those are =
out of scope for this RFC. Stating clearly that the function expects =
UTF-8 is about the best way I=E2=80=99ve seen in practice to partner =
with application developers both to educate them and help them =
accomplish their goals.

The primary point to consider here is that these legacy systems =
unintentionally oversimplify the state of encoded text. Typically they =
are running UTF-8 source code matching against a mixture of encodings =
from various inputs, only one of which is the database. For example, =
these systems will often assume that the encoding of the content in the =
database is the same encoding outbound to a browser, inbound in `$_POST` =
parameters, and escaped in `$_GET` query arguments. If the database is =
not using UTF-8 these assumptions are almost always wrong, and thus =
security issues abound.

>=20
>>=20
>> =E2=80=94
>>=20
>> I=E2=80=99m sorry for being long-winded but I think it=E2=80=99s =
necessary to frame these questions in the context of the problem today. =
We have very frequent errors that result from having the wrong defaults =
and a confusion of text encodings. I=E2=80=99ve seen far more problems =
from source code being UTF-8 and assuming the input is, rather than =
being anything else (likely ISO-8859-1 if not UTF-8) assuming the =
input isn=E2=80=99t.
>=20
>=20
>=20
>>=20
>>   * It should be possible to convert any string into UTF-8 regardless =
of its origin character set, and then transitively, if it originated =
there, it should be able to convert back if the HTML represents text =
that is representable in the original character set.
>=20
> There are a number of scripts/languages not yet supported (especially =
on older machines) that would result in =E2=80=9C=EF=BF=BD=E2=80=9D and =
cannot be transcribed back to its original encoding. For example, there =
are still new scripts being added as late as two years ago: =
https://www.unicode.org/standard/supported.html
>=20

It=E2=80=99s absolutely true that new scripts are added, and someone =
else can confirm or correct me, but typically these appear first in =
Unicode, since Unicode has attempted to already swallow up all recorded =
digital text. When new scripts appear, it=E2=80=99s usually because =
someone found evidence of their use in physical writings and there was =
no previous digital record of them.

Do you have examples of languages that have digital records which are =
supported in the HTML specification which would result in substitution =
when decoding? Since HTML only encodes Unicode code points, I think the =
problem is that HTML cannot represent these characters, if they exist.

This is unrelated to UTF-8 because new scripts and characters and emoji =
get assigned the natural numbers - the code points. It=E2=80=99s up to =
text encodings to represent those indices into the character database =
tables.

>>=20
>>   * Converting at the boundaries of the application is the way to =
escape the confusion of wrestling an arbitrary number of different =
character sets.
>=20
> I totally agree with this statement, but we should provide tools =
instead of dictating a policy.=20

It=E2=80=99s my intention never to take control from a developer. In =
this situation that freedom appears by converting before and after. For =
most situations, for most systems, the most reliable, safe, and =
convenient thing to do will be to assume UTF-8, or check if a string is =
UTF-8 and reject otherwise. In those situations doing nothing is the =
right behavior, and preserves the correct parse within the domain in =
which `decode_html()` operates. Again, that preservation, or any proper =
decoding, cannot happen if it attempts to decode into `latin1`, as HTML =
documents are entirely Unicode documents and no single-byte encoding =
can capture this.
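
The check-and-reject option could look roughly like this, assuming the
mbstring extension is loaded. The wrapper name is hypothetical, and
`html_entity_decode()` again stands in for the proposed function:

```php
<?php
// Hypothetical wrapper; html_entity_decode() stands in for the
// proposed decode_html(), which is not available in PHP today.
function decode_utf8_html_or_null(string $html): ?string {
    if (!mb_check_encoding($html, 'UTF-8')) {
        return null; // Reject rather than guess at the encoding.
    }
    return html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// Valid UTF-8 input is decoded.
var_dump(decode_utf8_html_or_null("caf\xC3\xA9 &amp; bar"));

// A lone latin1 byte is not valid UTF-8, so the input is rejected.
var_dump(decode_utf8_html_or_null("caf\xE9 &amp; bar"));
```

Callers holding text in another encoding would convert it before the
call and convert the result back afterwards.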

This is why I personally feel strongly about having safe defaults =
instead of dangerous ones, as is the unfortunate case with =
`html_entity_decode()`. All the better if we can educate each other =
through the function interfaces to clarify what is happening and what =
the expectations are or need to be.

>=20
>>=20
>>   * Proper HTML decoding requires a character set capable of =
representing all of Unicode, as the code points in numeric character =
references refer to Unicode Code Points and _not_ any particular code =
units or byte sequences in any particular encoding.
>>=20
>>   * Almost every other character set is ASCII compatible, including =
UTF-8, making the domain of problems where this arises even smaller than =
it might otherwise seem. For example, `&` is `&` in all of the common =
character sets.
>>=20
>> Have a lovely weekend! And sorry for the potentially mis-threaded =
reply. I couldn=E2=80=99t figure out how to reply to your message =
directly because the digest emails were still stuck in 2020 for my =
account and I didn=E2=80=99t switch subscriptions until after your email =
went out, meaning I didn=E2=80=99t have a copy of your email.
>>=20
>> >
>> >=E2=80=94 Rob
>=20
> =E2=80=94 Rob



Warmly,
Dennis Snell


--Apple-Mail=_54182BAB-AD14-4FA0-9327-6083FA6595C1
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html;
	charset=utf-8

<html><head><meta http-equiv=3D"content-type" content=3D"text/html; =
charset=3Dutf-8"></head><body style=3D"overflow-wrap: break-word; =
-webkit-nbsp-mode: space; line-break: =
after-white-space;"><br><div><blockquote type=3D"cite"><div>On Aug 16, =
2024, at 6:53=E2=80=AFPM, Rob Landers &lt;rob@bottled.codes&gt; =
wrote:</div><br class=3D"Apple-interchange-newline"><div><meta =
charset=3D"UTF-8"><div style=3D"caret-color: rgb(0, 0, 0); font-family: =
Helvetica; font-size: 12px; font-style: normal; font-variant-caps: =
normal; font-weight: 400; letter-spacing: normal; text-align: start; =
text-indent: 0px; text-transform: none; white-space: normal; =
word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: =
none;">Hey Dennis,<br></div><div style=3D"caret-color: rgb(0, 0, 0); =
font-family: Helvetica; font-size: 12px; font-style: normal; =
font-variant-caps: normal; font-weight: 400; letter-spacing: normal; =
text-align: start; text-indent: 0px; text-transform: none; white-space: =
normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; =
text-decoration: none;"><br></div><div style=3D"caret-color: rgb(0, 0, =
0); font-family: Helvetica; font-size: 12px; font-style: normal; =
font-variant-caps: normal; font-weight: 400; letter-spacing: normal; =
text-align: start; text-indent: 0px; text-transform: none; white-space: =
normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; =
text-decoration: none;">This looks like top posting because you=E2=80=99ve=
 got a lot to read =E2=80=94 and well written =E2=80=94 but I want to =
reply to some points =
inline.&nbsp;</div></div></blockquote><div><br></div><div>Rob, no =
worries! I love your questions and I love being able to work together =
again, even in some limited fashion. Let me prefix this for you and for =
everyone on the list: this is really hairy stuff, and can at times =
require concentrated focus. When I started down this path a long time =
ago I knew very little about it. I=E2=80=99ve been knee-deep in it for =
years and now I feel like I learn something new every day that I =
didn=E2=80=99t know before.</div><br><blockquote type=3D"cite"><div><div =
style=3D"caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: =
12px; font-style: normal; font-variant-caps: normal; font-weight: 400; =
letter-spacing: normal; text-align: start; text-indent: 0px; =
text-transform: none; white-space: normal; word-spacing: 0px; =
-webkit-text-stroke-width: 0px; text-decoration: none;"><br></div><div =
style=3D"caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: =
12px; font-style: normal; font-variant-caps: normal; font-weight: 400; =
letter-spacing: normal; text-align: start; text-indent: 0px; =
text-transform: none; white-space: normal; word-spacing: 0px; =
-webkit-text-stroke-width: 0px; text-decoration: none;">On Fri, Aug 16, =
2024, at 20:43, Dennis Snell wrote:<br></div><blockquote type=3D"cite" =
id=3D"qt" style=3D"font-family: Helvetica; font-size: 12px; font-style: =
normal; font-variant-caps: normal; font-weight: 400; letter-spacing: =
normal; orphans: auto; text-align: start; text-indent: 0px; =
text-transform: none; white-space: normal; widows: auto; word-spacing: =
0px; -webkit-text-stroke-width: 0px; text-decoration: none;"><div>&gt;On =
Fri, Aug 16, 2024, at 02:59, Dennis Snell =
wrote<br></div><div><br></div><div>Thanks for the question, Rob, I hope =
this finds you well!<br></div><div><br></div><div>&gt;The RFC mentions =
that encoding must be utf-8. How are programmers supposed to work with =
this if the php file itself isn=E2=80=99t =
utf-8<br></div><div><br></div><div>=46rom my experience it=E2=80=99s the =
opposite case that is more important to consider. That is, what happens =
when we mix UTF-8 source code with latin1 or UTF-8 source HTML with the =
system-set locale. I tried to hint at this scenario in the "Character =
encodings and UTF-8=E2=80=9D =
section.<br></div><div><br></div><div>Let=E2=80=99s examine the =
fundamental breakdown =
case:<br></div><div><br></div><div>```php<br></div><div>=E2=80=9C=C3=A9=E2=
=80=9D =3D=3D=3D decode_html( =E2=80=9C&amp;#xe9;=E2=80=9D =
);<br></div><div>```<br></div><div><br></div><div>If the source is UTF-8 =
there=E2=80=99s no problem. If the source is ISO-8859-1 this will fail =
because xE9 is on the left while xC3 xA9 is on the right. _Except_ if =
`zend.multibyte=3D1` and (`zend.script_encoding=3Diso-8859-1` _or_ if =
`declare(encoding=3D=E2=80=98iso-8859-1=E2=80=99)` is set). The source =
code may or may not be converted into a different encoding based on =
configurations that most developers won=E2=80=99t have access to, or =
won=E2=80=99t examine.<br></div><div><br></div><div>Even with source =
code in ISO-8859-1, the `zend.script_encoding` and `zend.multibyte` set, =
`html_entity_decode()` _still_ reports UTF-8 unless =
`zend.default_charset` is set _or_ one of the `iconv` or `mbstring` =
internal charsets is set.<br></div></blockquote><div style=3D"caret-color:=
 rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: =
normal; font-variant-caps: normal; font-weight: 400; letter-spacing: =
normal; text-align: start; text-indent: 0px; text-transform: none; =
white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; =
text-decoration: none;"><br></div><div style=3D"caret-color: rgb(0, 0, =
0); font-family: Helvetica; font-size: 12px; font-style: normal; =
font-variant-caps: normal; font-weight: 400; letter-spacing: normal; =
text-align: start; text-indent: 0px; text-transform: none; white-space: =
normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; =
text-decoration: none;">I just want to pause here and say, =E2=80=9Choly =
crap.=E2=80=9D That is quite complex and those edges seem =
sharp!</div><div style=3D"caret-color: rgb(0, 0, 0); font-family: =
Helvetica; font-size: 12px; font-style: normal; font-variant-caps: =
normal; font-weight: 400; letter-spacing: normal; text-align: start; =
text-indent: 0px; text-transform: none; white-space: normal; =
word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: =
none;"><br></div><blockquote type=3D"cite" id=3D"qt" style=3D"font-family:=
 Helvetica; font-size: 12px; font-style: normal; font-variant-caps: =
normal; font-weight: 400; letter-spacing: normal; orphans: auto; =
text-align: start; text-indent: 0px; text-transform: none; white-space: =
normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; =
text-decoration: none;"><div><br></div><div>My point I=E2=80=99m trying =
to make is that the current situation today is a minefield due to a =
dizzying array of system-dependent settings. Most modern code will =
either be running UTF-8 source code or will be converting source code =
_to_ UTF-8 or many other things will already be helplessly broken beyond =
this one issue.<br></div></blockquote><div style=3D"caret-color: rgb(0, =
0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; =
font-variant-caps: normal; font-weight: 400; letter-spacing: normal; =
text-align: start; text-indent: 0px; text-transform: none; white-space: =
normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; =
text-decoration: none;"><br></div><div style=3D"caret-color: rgb(0, 0, =
0); font-family: Helvetica; font-size: 12px; font-style: normal; =
font-variant-caps: normal; font-weight: 400; letter-spacing: normal; =
text-align: start; text-indent: 0px; text-transform: none; white-space: =
normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; =
text-decoration: none;">Unfortunately, we don=E2=80=99t always get to =
choose the code we work on. There is someone on this list using =
SHIFT_JIS. They probably know more about the ins and outs of dealing =
with utf-8 centric systems from that encoding. Hopefully they can =
comment more about why this would or would not be a bad =
idea.&nbsp;</div></div></blockquote><div><br></div><div>This is, in =
fact, one of my primary motivations for standardizing on UTF-8. Keep in =
mind that HTML not only has a set of character encodings that must be =
supported, but also&nbsp;<a =
href=3D"https://html.spec.whatwg.org/#character-encodings">a requirement =
that parsers&nbsp;<i>not</i>&nbsp;support additional =
encodings</a>&nbsp;outside of that list. This is based on security =
grounds, for good and even more complicated =
reasons.</div><div><br></div><div>Of all of the required supported =
character sets, all roundtrip through UTF-8 as long as they aren=E2=80=99t=
 modified. In fact, almost every character set out there should =
round-trip in this way, because the Unicode Consortium=E2=80=99s goal as =
far as I understand it is to capture every possible character in writing =
in a single universal character set. This appears first in the&nbsp;<a =
href=3D"https://html.spec.whatwg.org/#suggested-reading">introduction to =
the HTML specification</a>&nbsp;and is reiterated throughout the =
document:&nbsp;<a href=3D"https://html.spec.whatwg.org/#charset">HTML =
requires the use of UTF-8</a>, though allows legacy encodings (there =
really are no =E2=80=9Cinvalid=E2=80=9D HTML documents because parse =
errors have deterministic resolutions).</div><br><blockquote =
type=3D"cite"><div><div style=3D"caret-color: rgb(0, 0, 0); font-family: =
Helvetica; font-size: 12px; font-style: normal; font-variant-caps: =
normal; font-weight: 400; letter-spacing: normal; text-align: start; =
text-indent: 0px; text-transform: none; white-space: normal; =
word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: =
none;"><br></div><blockquote type=3D"cite" id=3D"qt" style=3D"font-family:=
 Helvetica; font-size: 12px; font-style: normal; font-variant-caps: =
normal; font-weight: 400; letter-spacing: normal; orphans: auto; =
text-align: start; text-indent: 0px; text-transform: none; white-space: =
normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; =
text-decoration: none;"><div><br></div><div>UTF-8 is the unifier that =
lets us escape this by having a defined and explicit encoding at the =
input and output.<br></div></blockquote><div style=3D"caret-color: =
rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: =
normal; font-variant-caps: normal; font-weight: 400; letter-spacing: =
normal; text-align: start; text-indent: 0px; text-transform: none; =
white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; =
text-decoration: none;"><br></div><div style=3D"caret-color: rgb(0, 0, =
0); font-family: Helvetica; font-size: 12px; font-style: normal; =
font-variant-caps: normal; font-weight: 400; letter-spacing: normal; =
text-align: start; text-indent: 0px; text-transform: none; white-space: =
normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; =
text-decoration: none;">Utf-8 is pretty good, right now, but I don=E2=80=99=
t think we should marry the language to it. Will it be =E2=80=9Cthe =
standard=E2=80=9D in 10 years, 20 years, 100 years? Languages change, =
cultures change. Some people I know use a font to change triple equals =
from a literal =3D=3D=3D to&nbsp;=E2=89=A1. How long until php =
recognizes that as a literal operator?<br></div><div style=3D"caret-color:=
 rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: =
normal; font-variant-caps: normal; font-weight: 400; letter-spacing: =
normal; text-align: start; text-indent: 0px; text-transform: none; =
white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; =
text-decoration: none;"><br></div><div style=3D"caret-color: rgb(0, 0, =
0); font-family: Helvetica; font-size: 12px; font-style: normal; =
font-variant-caps: normal; font-weight: 400; letter-spacing: normal; =
text-align: start; text-indent: 0px; text-transform: none; white-space: =
normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; =
text-decoration: none;">But anyway, to get back on topic; I, personally, =
would rather see something more flexible, with sane defaults for =
utf-8.</div></div></blockquote><div><br></div><div>Guarding against a =
future where UTF-8 is replaced means planning for an extremely =
unlikely scenario. UTF-8 is the most universal standard for interchange =
of text content, prevalent in software, systems, and programming =
languages, even those with UTF-16 =
internals.</div><div><br></div><div>It=E2=80=99s a good moment to remind =
ourselves, however, that Unicode defines tables of character =E2=80=9Ccode =
points=E2=80=9D which are a mapping from a natural number to a =
character. UTF-8 is an algorithm for storing those natural numbers in =
byte sequences.</div><div><br></div><div>We absolutely can over-plan for =
extensibility, and this is what I=E2=80=99ve seen happen with the =
existing HTML functions in PHP (with options to choose what to decode, =
which entities to use, into which encoding to decode, etc). There=E2=80=99=
s an appearance of an awareness of text encoding, but the design of the =
function interfaces leads people to make decisions that open up all sorts =
of doors to corruption and security =
exploits.</div><div><br></div><div>So it wouldn=E2=80=99t matter to my =
RFC if another encoding were standardized as long as =
<i>one</i>&nbsp;encoding is standardized. Today, I see no legitimate =
competition to UTF-8. The only encodings that come close are the two =
UTF-16 variants because of their prevalence in Java, JavaScript, and =
Objective-C strings, but the UTF-16 variable-width encoding suffers a =
number of shortcomings compared to UTF-8 without providing much value in =
exchange.</div><div><br></div><div>When the day comes that UTF-8 is =
deprecated or replaced, major swaths of the internet will need overhaul =
far beyond PHP. Or at least, I have a hard time imagining that going any =
other way.</div><br><blockquote type=3D"cite"><div><div =
style=3D"caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: =
12px; font-style: normal; font-variant-caps: normal; font-weight: 400; =
letter-spacing: normal; text-align: start; text-indent: 0px; =
text-transform: none; white-space: normal; word-spacing: 0px; =
-webkit-text-stroke-width: 0px; text-decoration: =
none;"><br></div><blockquote type=3D"cite" id=3D"qt" style=3D"font-family:=
 Helvetica; font-size: 12px; font-style: normal; font-variant-caps: =
normal; font-weight: 400; letter-spacing: normal; orphans: auto; =
text-align: start; text-indent: 0px; text-transform: none; white-space: =
normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; =
text-decoration: none;"><div><br></div><div>&gt; or the input is =
meaningless in utf-8 or if changing it to utf-8 and back would result in =
invalid text?<br></div><div><br></div><div>There shouldn't be input =
that=E2=80=99s meaningless in UTF-8 if it=E2=80=99s valid in any other =
encoding. Indeed, I have placed the burden on the calling code to =
convert into UTF-8 beforehand, but that=E2=80=99s not altogether =
different than asking someone to declare into what encoding the =
character references ought to be decoded.<br></div></blockquote><div =
style=3D"caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: =
12px; font-style: normal; font-variant-caps: normal; font-weight: 400; =
letter-spacing: normal; text-align: start; text-indent: 0px; =
text-transform: none; white-space: normal; word-spacing: 0px; =
-webkit-text-stroke-width: 0px; text-decoration: none;"><br></div><div =
style=3D"caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: =
12px; font-style: normal; font-variant-caps: normal; font-weight: 400; =
letter-spacing: normal; text-align: start; text-indent: 0px; =
text-transform: none; white-space: normal; word-spacing: 0px; =
-webkit-text-stroke-width: 0px; text-decoration: none;">There=E2=80=99s =
a huge performance difference between converting a string from/to =
different encodings and instructing a function what to parse in the =
current encoding and also be useful when the page itself is not =
utf8.&nbsp;<br></div></div></blockquote><div><br></div><div>It =
definitely seems this way when examining a single function in isolation, =
but I would challenge folks to look at how these functions are used in =
practice, out in the wild. Typically I see strings transcoded multiple =
times and usually based on the wrong encoding. For example, WordPress =
currently looks at its defined =E2=80=9Cblog_charset=E2=80=9D to perform =
decoding, but most of the time it gets HTML input that =
<i>isn=E2=80=99t</i>&nbsp;encoded in the blog =
charset.<br></div><div><br></div><div>What <i>would</i>&nbsp;be a =
performance win would be to decode and encode text at application =
boundaries so it can be converted once, processed in a pipeline where =
everything agrees on the encoding, and finally once more on output. In a =
UTF-8 world this requires no conversion at all, and UTF-8 is the =
overwhelming majority of code in web applications =
today.</div><div><br></div><div>---</div><div><br></div><div>We can keep =
in mind too that there are <i>two</i>&nbsp;encodings in the picture. The =
HTML source document may be encoded in one encoding while the output =
might need to appear in another. Consider HTML stored in a database as =
latin1/ISO-8859-1. It stores =E2=80=9C=C3=A9 is &amp;#xE9;=E2=80=9D, =
except that unlike in this email, the leading character =C3=A9 is the =
single byte xE9.</div><div><br></div><div>This output likely should be =
sent to a browser as UTF-8. It=E2=80=99s acceptable to send latin1, but =
most pages will have characters unrepresentable in latin1. The backend =
then in decoding the HTML must go ahead and internally convert the input =
character encoding so that the =C3=A9 becomes the two byte sequence xC3 =
xA9 <i>and then</i>&nbsp;decode &amp;#xE9; as xC3 =
xA9.</div><div><br></div><div>Were the input =E2=80=9C&amp;#x1f170;=
=E2=80=9D it simply could not be decoded into any single-byte encoding, =
so the HTML cannot be decoded. `html_entity_decode()` simply leaves =
that character reference in place. This kind of behavior tends to lead to =
double-encoding of the character references, and what the browser gets =
is `&amp;amp;#x1f170;` instead of `=F0=9F=85=B0`.</div><br><blockquote =
type=3D"cite"><div><div style=3D"caret-color: rgb(0, 0, 0); font-family: =
Helvetica; font-size: 12px; font-style: normal; font-variant-caps: =
normal; font-weight: 400; letter-spacing: normal; text-align: start; =
text-indent: 0px; text-transform: none; white-space: normal; =
word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: =
none;"><br></div><blockquote type=3D"cite" id=3D"qt" style=3D"font-family:=
 Helvetica; font-size: 12px; font-style: normal; font-variant-caps: =
normal; font-weight: 400; letter-spacing: normal; orphans: auto; =
text-align: start; text-indent: 0px; text-transform: none; white-space: =
normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; =
text-decoration: =
none;"><div><br></div><div>```diff<br></div><div>-html_entity_decode( =
$html, ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, =E2=80=98ISO-8859-1=E2=80=99=
 );<br></div><div>+$html =3D mb_convert_encoding( $html, =E2=80=98UTF-8=E2=
=80=99, =E2=80=98ISO-8859-1=E2=80=99 );<br></div><div>+$html =3D =
decode_html( HTML_TEXT, $html );<br></div><div>+$html =3D =
mb_convert_encoding( $html, =E2=80=98ISO-8859-1=E2=80=99, =E2=80=98UTF-8=E2=
=80=99 );<br></div><div>```<br></div><div><br></div><div>If an encoding =
can go into UTF-8 (which it should) then it should also be able to =
return for all supported inputs. That is, we cannot convert into UTF-8 =
and produce a character that is unrepresentable in the source encoding, =
because that would imply it was there in the source to begin with. =
Furthermore, if the HTML decodes into a code point unsupported in the =
destination encoding, it would be invalid either directly via decoding, =
or indirectly via =
conversion.<br></div><div><br></div><div>```diff<br></div><div>-=E2=80=9C\=
x1A=E2=80=9D =3D=3D=3D html_entity_decode( =E2=80=9C&amp;#x1f170;=E2=80=9D=
, ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, =E2=80=98ISO-8859-1=E2=80=99 =
);<br></div><div>+=E2=80=9D?=E2=80=9D =3D=3D=3D mb_convert_encoding( =
decode_html( HTML_TEXT, =E2=80=9C&amp;#x1f170;=E2=80=9D ), =
=E2=80=98ISO-8859-1=E2=80=99, =E2=80=98UTF-8=E2=80=99 =
);<br></div><div>```<br></div><div><br></div><div>This gets really =
confusing because neither of these outputs is a proper decoding, as =
character encodings that don=E2=80=99t support the full Unicode code =
space cannot adequately represent all valid HTML inputs. HTML is a =
Unicode decoding by specification, so even in a browser with `&lt;meta =
charset=3D=E2=80=9CISO-8859-1=E2=80=9D&gt;&amp;#x1f170;` the text =
content will still be `=F0=9F=85=B0`, not `?` or the invisible ASCII =
control code SUB.<br></div></blockquote><div style=3D"caret-color: =
rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: =
normal; font-variant-caps: normal; font-weight: 400; letter-spacing: =
normal; text-align: start; text-indent: 0px; text-transform: none; =
white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; =
text-decoration: none;"><br></div><div style=3D"caret-color: rgb(0, 0, =
0); font-family: Helvetica; font-size: 12px; font-style: normal; =
font-variant-caps: normal; font-weight: 400; letter-spacing: normal; =
text-align: start; text-indent: 0px; text-transform: none; white-space: =
normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; =
text-decoration: none;">I was of the understanding that meta charset was =
too late to set the encoding (but it=E2=80=99s been awhile since I=E2=80=99=
ve read the html5 spec) and the charset needed to be set in the html tag =
itself. I suppose browsers simply rewind upon hitting meta charset, but =
browsers have to deal with all kinds of =
shenanigans.&nbsp;<br></div></div></blockquote><div><br></div><div>The =
algorithm for determining a document character set is straightforward, =
albeit with many steps. META elements within the first kilobyte of a =
document may determine the inferred encoding if one is not provided =
externally or from an HTTP header. Fun fact, if you find `&lt;meta =
charset=3D=E2=80=9Cutf-16=E2=80=9D&gt;` then a browser will properly set =
the document encoding to UTF-8 (just as it ignores DOCTYPE declarations =
and treats all `text/html` content as HTML5).</div><div><br></div><div>A =
while ago I ran some analysis on roughly 300,000 pages from a list of =
top-ranked domains. You can&nbsp;<a =
href=3D"https://github.com/WordPress/wordpress-develop/files/15171826/char=
set-detections.csv">examine the raw data</a>&nbsp;and find some =
interesting bits in there. For instance, many HTML documents claim =
multiple incompatible character sets. Thankfully the HTML specification =
is clear on how to handle these situations. There really haven=E2=80=99t =
been any shenanigans since 2008, when HTML5 formalized the parsing error =
modes.</div><br><blockquote type=3D"cite"><div><div style=3D"caret-color: =
rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: =
normal; font-variant-caps: normal; font-weight: 400; letter-spacing: =
normal; text-align: start; text-indent: 0px; text-transform: none; =
white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; =
text-decoration: none;"><br></div><div style=3D"caret-color: rgb(0, 0, =
0); font-family: Helvetica; font-size: 12px; font-style: normal; =
font-variant-caps: normal; font-weight: 400; letter-spacing: normal; =
text-align: start; text-indent: 0px; text-transform: none; white-space: =
normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; =
text-decoration: none;">That being said, there is nothing in the spec =
(that I remember seeing) stating it was Unicode only; just that it was =
the default.<br></div></div></blockquote><div><br></div><div>See the =
above note on character encoding vs. Unicode character set. Unicode is =
in the introduction to the HTML spec and mentioned throughout. It=E2=80=99=
s in the =E2=80=9Cbig picture=E2=80=9D at the start. Even encodings like =
ISO-2022-JP and GB18030 map to Unicode code points and are simply =
different ways to represent those in sequences of =
bytes.</div><br><blockquote type=3D"cite"><div><div style=3D"caret-color: =
rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: =
normal; font-variant-caps: normal; font-weight: 400; letter-spacing: =
normal; text-align: start; text-indent: 0px; text-transform: none; =
white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; =
text-decoration: none;"><br></div><div style=3D"caret-color: rgb(0, 0, =
0); font-family: Helvetica; font-size: 12px; font-style: normal; =
font-variant-caps: normal; font-weight: 400; letter-spacing: normal; =
text-align: start; text-indent: 0px; text-transform: none; white-space: =
normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; =
text-decoration: none;">Further, html may be stored in the database of a =
certain encoding (such as content systems like WordPress or Drupal) =
where it may not be straightforward (or desirable) to convert to =
utf8.&nbsp;</div></div></blockquote><div><br></div><div>See above again: =
this is actually one of the most dangerous parts of suggesting in a =
function signature that a developer pick a character encoding, =
particularly since it invites incompatible decoding of the source =
document. It=E2=80=99s completely fine to store content in a database in =
another encoding, and many legacy systems do. Those are best served by =
converting into UTF-8 when reading from the database and then encoding =
from UTF-8 when saving into the database. The database character set =
confusions make HTML=E2=80=99s look simple, but those are out of scope =
for this RFC. Stating clearly that the function expects UTF-8 is about =
the best way I=E2=80=99ve seen in practice to partner with application =
developers both to educate them and help them accomplish their =
goals.</div><div><br></div><div>The primary point to consider here is =
that these legacy systems unintentionally oversimplify the state of =
encoded text. Typically they are running UTF-8 source code matching =
against a mixture of encodings from various inputs, only one of which is =
the database. For example, these systems will often assume that the =
encoding of the content in the database is the same encoding outbound to =
a browser, inbound in `$_POST` parameters, and escaped in `$_GET` query =
arguments. If the database is not using UTF-8 these assumptions are =
almost always wrong, and thus security issues =
abound.</div><br><blockquote type=3D"cite"><div><div style=3D"caret-color:=
 rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: =
normal; font-variant-caps: normal; font-weight: 400; letter-spacing: =
normal; text-align: start; text-indent: 0px; text-transform: none; =
white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; =
text-decoration: none;"><br></div><blockquote type=3D"cite" id=3D"qt" =
style=3D"font-family: Helvetica; font-size: 12px; font-style: normal; =
font-variant-caps: normal; font-weight: 400; letter-spacing: normal; =
orphans: auto; text-align: start; text-indent: 0px; text-transform: =
none; white-space: normal; widows: auto; word-spacing: 0px; =
-webkit-text-stroke-width: 0px; text-decoration: =
none;"><div><br></div><div>=E2=80=94<br></div><div><br></div><div>I=E2=80=99=
m sorry for being long-winded but I think it=E2=80=99s necessary to =
frame these questions in the context of the problem today. We have very =
frequent errors that result from having the wrong defaults and a =
confusion of text encodings. I=E2=80=99ve seen far more problems from =
source code being UTF-8 and assuming the input is, rather than being =
anything else (likely ISO-8859-1 if not UTF-8) assuming the input =
isn=E2=80=99t.<br></div></blockquote><div style=3D"caret-color: rgb(0, =
0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; =
font-variant-caps: normal; font-weight: 400; letter-spacing: normal; =
text-align: start; text-indent: 0px; text-transform: none; white-space: =
normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; =
text-decoration: none;"><br></div><div style=3D"caret-color: rgb(0, 0, =
0); font-family: Helvetica; font-size: 12px; font-style: normal; =
font-variant-caps: normal; font-weight: 400; letter-spacing: normal; =
text-align: start; text-indent: 0px; text-transform: none; white-space: =
normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; =
text-decoration: none;"><br></div><div style=3D"caret-color: rgb(0, 0, =
0); font-family: Helvetica; font-size: 12px; font-style: normal; =
font-variant-caps: normal; font-weight: 400; letter-spacing: normal; =
text-align: start; text-indent: 0px; text-transform: none; white-space: =
normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; =
text-decoration: none;"><br></div><blockquote type=3D"cite" id=3D"qt" =
style=3D"font-family: Helvetica; font-size: 12px; font-style: normal; =
font-variant-caps: normal; font-weight: 400; letter-spacing: normal; =
orphans: auto; text-align: start; text-indent: 0px; text-transform: =
none; white-space: normal; widows: auto; word-spacing: 0px; =
-webkit-text-stroke-width: 0px; text-decoration: =
none;"><div><br></div><div>&nbsp; * It should be possible to convert any =
string into UTF-8 regardless of its origin character set, and then =
transitively, if it originated there, it should be able to convert back =
if the HTML represents text that is representable in the original =
character set.<br></div></blockquote><div style=3D"caret-color: rgb(0, =
0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; =
font-variant-caps: normal; font-weight: 400; letter-spacing: normal; =
text-align: start; text-indent: 0px; text-transform: none; white-space: =
normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; =
text-decoration: none;"><br></div><div style=3D"caret-color: rgb(0, 0, =
0); font-family: Helvetica; font-size: 12px; font-style: normal; =
font-variant-caps: normal; font-weight: 400; letter-spacing: normal; =
text-align: start; text-indent: 0px; text-transform: none; white-space: =
normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; =
text-decoration: none;">There are a number of scripts/languages not yet =
supported (especially on older machines) that would result =
in&nbsp;=E2=80=9C=EF=BF=BD=E2=80=9D and cannot be transcribed back to =
its original encoding. For example, there are still new scripts being =
added as late as two years ago:&nbsp;<a =
href=3D"https://www.unicode.org/standard/supported.html">https://www.unico=
de.org/standard/supported.html</a><br></div><div style=3D"caret-color: =
rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: =
normal; font-variant-caps: normal; font-weight: 400; letter-spacing: =
normal; text-align: start; text-indent: 0px; text-transform: none; =
white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; =
text-decoration: =
none;"><br></div></div></blockquote><div><br></div><div>It=E2=80=99s =
absolutely true that new scripts are added, and someone else can confirm =
or correct me, but typically these appear <i>first</i>&nbsp;in Unicode, =
since Unicode has already attempted to swallow up all recorded digital =
text. When new scripts appear, it=E2=80=99s usually because someone =
found evidence of their use in physical writings and there was no =
previous digital record of them.</div><div><br></div><div>Do you have =
examples of languages that have digital records which are supported in =
the HTML specification which would result in substitution when decoding? =
Since HTML <i>only encodes Unicode code points</i>, I think the problem =
is that <i>HTML</i>&nbsp;cannot represent these characters, if they =
exist.</div><div><br></div><div>This is unrelated to UTF-8 because new =
scripts and characters and emoji get assigned the natural numbers - the =
code points. It=E2=80=99s up to text encodings to represent those =
indices into the character database tables.</div><br><blockquote =
type=3D"cite"><div><blockquote type=3D"cite" id=3D"qt" =
style=3D"font-family: Helvetica; font-size: 12px; font-style: normal; =
font-variant-caps: normal; font-weight: 400; letter-spacing: normal; =
orphans: auto; text-align: start; text-indent: 0px; text-transform: =
none; white-space: normal; widows: auto; word-spacing: 0px; =
-webkit-text-stroke-width: 0px; text-decoration: =
none;"><div><br></div><div>&nbsp; * Converting at the boundaries of the =
application is the way to escape the confusion of wrestling an arbitrary =
number of different character sets.<br></div></blockquote><div =
style=3D"caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: =
12px; font-style: normal; font-variant-caps: normal; font-weight: 400; =
letter-spacing: normal; text-align: start; text-indent: 0px; =
text-transform: none; white-space: normal; word-spacing: 0px; =
-webkit-text-stroke-width: 0px; text-decoration: none;"><br></div><div =
style=3D"caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: =
12px; font-style: normal; font-variant-caps: normal; font-weight: 400; =
letter-spacing: normal; text-align: start; text-indent: 0px; =
text-transform: none; white-space: normal; word-spacing: 0px; =
-webkit-text-stroke-width: 0px; text-decoration: none;">I totally agree =
with this statement, but we should provide tools instead of dictating a =
policy.&nbsp;</div></div></blockquote><div><br></div><div>It=E2=80=99s =
my intention never to take control from a developer. In this situation =
that freedom appears by converting before and after. For most =
situations, for most systems, the most reliable, safe, and convenient =
thing to do will be to assume UTF-8, or check if a string is UTF-8 and =
reject otherwise. In those situations doing nothing is the right =
behavior, and preserves the correct parse within the domain in which =
`decode_html()` operates (again, neither preservation nor proper decoding =
can happen if it attempts to decode into `latin1`, as HTML documents =
are entirely Unicode documents and <i>every</i>&nbsp;single-byte =
encoding is unable to capture this).</div><div><br></div><div>This is =
why I personally feel strongly about having safe defaults instead of =
dangerous ones, as is the unfortunate case with `html_entity_decode()`. =
All the better if we can educate each other through the function =
interfaces to clarify what is happening and what the expectations are or =
need to be.</div><br><blockquote type=3D"cite"><div><div =
style=3D"caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: =
12px; font-style: normal; font-variant-caps: normal; font-weight: 400; =
letter-spacing: normal; text-align: start; text-indent: 0px; =
text-transform: none; white-space: normal; word-spacing: 0px; =
-webkit-text-stroke-width: 0px; text-decoration: =
none;"><br></div><blockquote type=3D"cite" id=3D"qt" style=3D"font-family:=
 Helvetica; font-size: 12px; font-style: normal; font-variant-caps: =
normal; font-weight: 400; letter-spacing: normal; orphans: auto; =
text-align: start; text-indent: 0px; text-transform: none; white-space: =
normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; =
text-decoration: none;"><div><br></div><div>&nbsp; * Proper HTML =
decoding requires a character set capable of representing all of =
Unicode, as the code points in numeric character references refer to =
Unicode Code Points and _not_ any particular code units or byte =
sequences in any particular =
encoding.<br></div><div><br></div><div>&nbsp; * Almost every other =
character set is ASCII compatible, including UTF-8, making the domain of =
problems where this arises even smaller than it might otherwise seem. =
For example, `&amp;` is `&amp;` in all of the common character =
sets.<br></div><div><br></div><div>Have a lovely weekend! And sorry for =
the potentially mis-threaded reply. I couldn=E2=80=99t figure out how to =
reply to your message directly because the digest emails were still =
stuck in 2020 for my account and I didn=E2=80=99t switch subscriptions =
until after your email went out, meaning I didn=E2=80=99t have a copy of =
your email.<br></div><div><br></div><div>&gt;<br></div><div>&gt;=E2=80=94 =
Rob<br></div></blockquote><div style=3D"caret-color: rgb(0, 0, 0); =
font-family: Helvetica; font-size: 12px; font-style: normal; =
font-variant-caps: normal; font-weight: 400; letter-spacing: normal; =
text-align: start; text-indent: 0px; text-transform: none; white-space: =
normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; =
text-decoration: none;"><br></div><div id=3D"sig121229152" =
style=3D"caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: =
12px; font-style: normal; font-variant-caps: normal; font-weight: 400; =
letter-spacing: normal; text-align: start; text-indent: 0px; =
text-transform: none; white-space: normal; word-spacing: 0px; =
-webkit-text-stroke-width: 0px; text-decoration: none;">=E2=80=94 =
Rob</div></div></blockquote></div><div><br></div><div><br></div><div>Warml=
y,</div><div>Dennis Snell</div><br></body></html>=

--Apple-Mail=_54182BAB-AD14-4FA0-9327-6083FA6595C1--