Newsgroups: php.internals
From: rob@bottled.codes ("Rob Landers")
To: internals@lists.php.net
Date: Sat, 17 Aug 2024 03:53:27 +0200
Subject: Re: [PHP-DEV] [RFC] Re: Decoding HTML and the Ambiguous Ampersand
Message-ID: <6f421f13-d363-4b75-9d55-9d61ef4806c9@app.fastmail.com>
In-Reply-To: <47CB3E5E-1246-471A-B3BE-CE23BEFDDDF7@automattic.com>
References: <1FD3A9B0-D46F-4589-A803-3CC2347EC7DF@automattic.com> <47CB3E5E-1246-471A-B3BE-CE23BEFDDDF7@automattic.com>

Hey Dennis,

This looks like top posting because you’ve got a lot to read, and well written, but I want to reply to some points inline.

On Fri, Aug 16, 2024, at 20:43, Dennis Snell wrote:
> >On Fri, Aug 16, 2024, at 02:59, Dennis Snell wrote
>
> Thanks for the question, Rob, I hope this finds you well!
>
> >The RFC mentions that encoding must be utf-8. How are programmers supposed to work with this if the php file itself isn’t utf-8
>
> From my experience it’s the opposite case that is more important to consider. That is, what happens when we mix UTF-8 source code with latin1 or UTF-8 source HTML with the system-set locale. I tried to hint at this scenario in the "Character encodings and UTF-8" section.
>
> Let’s examine the fundamental breakdown case:
>
> ```php
> "é" === decode_html( "&#xe9;" );
> ```
>
> If the source is UTF-8 there’s no problem. If the source is ISO-8859-1 this will fail because xE9 is on the left while xC3 xA9 is on the right. _Except_ if `zend.multibyte=1` and (`zend.script_encoding=iso-8859-1` _or_ if `declare(encoding='iso-8859-1')` is set). The source code may or may not be converted into a different encoding based on configurations that most developers won’t have access to, or won’t examine.
>
> Even with source code in ISO-8859-1, the `zend.script_encoding` and `zend.multibyte` set, `html_entity_decode()` _still_ reports UTF-8 unless `zend.default_charset` is set _or_ one of the `iconv` or `mbstring` internal charsets is set.

I just want to pause here and say, “holy crap.” That is quite complex and those edges seem sharp!

>
> My point I’m trying to make is that the current situation today is a minefield due to a dizzying array of system-dependent settings. Most modern code will either be running UTF-8 source code or will be converting source code _to_ UTF-8 or many other things will already be helplessly broken beyond this one issue.

Unfortunately, we don’t always get to choose the code we work on. There is someone on this list using SHIFT_JIS. They probably know more about the ins and outs of dealing with utf-8 centric systems from that encoding. Hopefully they can comment more about why this would or would not be a bad idea.

>
> UTF-8 is the unifier that lets us escape this by having a defined and explicit encoding at the input and output.

Utf-8 is pretty good, right now, but I don’t think we should marry the language to it. Will it be “the standard” in 10 years, 20 years, 100 years? Languages change, cultures change. Some people I know use a font to change triple equals from a literal === to ≡. How long until php recognizes that as a literal operator?

But anyway, to get back on topic; I, personally, would rather see something more flexible, with sane defaults for utf-8.

>
> > or the input is meaningless in utf-8 or if changing it to utf-8 and back would result in invalid text?
>
> There shouldn't be input that’s meaningless in UTF-8 if it’s valid in any other encoding. Indeed, I have placed the burden on the calling code to convert into UTF-8 beforehand, but that’s not altogether different than asking someone to declare into what encoding the character references ought to be decoded.

There’s a huge performance difference between converting a string to and from a different encoding and simply telling a function which encoding to parse in. The latter is also useful when the page itself is not utf8.

>
> ```diff
> -html_entity_decode( $html, ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, 'ISO-8859-1' );
> +$html = mb_convert_encoding( $html, 'UTF-8', 'ISO-8859-1' );
> +$html = decode_html( HTML_TEXT, $html );
> +$html = mb_convert_encoding( $html, 'ISO-8859-1', 'UTF-8' );
> ```
>
> If an encoding can go into UTF-8 (which it should) then it should also be able to return for all supported inputs. That is, we cannot convert into UTF-8 and produce a character that is unrepresentable in the source encoding, because that would imply it was there in the source to begin with. Furthermore, if the HTML decodes into a code point unsupported in the destination encoding, it would be invalid either directly via decoding, or indirectly via conversion.
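
As a rough sketch of the two approaches being compared in the diff above (not the RFC's implementation): `html_entity_decode()` stands in for the proposed `decode_html()`, which is not available in released PHP, and the helper names `decode_latin1_via_utf8()` and `decode_latin1_directly()` are made up for illustration. It assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper: decode character references in ISO-8859-1 text by
// round-tripping through UTF-8. html_entity_decode() is used here only as a
// stand-in for the RFC's proposed decode_html().
function decode_latin1_via_utf8(string $latin1): string
{
    $utf8    = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
    $decoded = html_entity_decode($utf8, ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, 'UTF-8');
    return mb_convert_encoding($decoded, 'ISO-8859-1', 'UTF-8');
}

// Hypothetical helper: tell the decoder which encoding the bytes are already in.
function decode_latin1_directly(string $latin1): string
{
    return html_entity_decode($latin1, ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, 'ISO-8859-1');
}

// "café & crème" with a raw Latin-1 é (0xE9) and two character references.
$input = "caf\xE9 &amp; cr&egrave;me";

$viaUtf8 = decode_latin1_via_utf8($input);
$direct  = decode_latin1_directly($input);

printf("via UTF-8: %s\n", bin2hex($viaUtf8));
printf("direct:    %s\n", bin2hex($direct));
```

For input that stays within ISO-8859-1 both paths should produce the same bytes; the difference is the two extra conversions on the first path.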
>
> ```diff
> -"\x1A" === html_entity_decode( "&#x1f170;", ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, 'ISO-8859-1' );
> +"?" === mb_convert_encoding( decode_html( HTML_TEXT, "&#x1f170;" ), 'ISO-8859-1', 'UTF-8' );
> ```
>
> This gets really confusing because neither of these outputs is a proper decoding, as character encodings that don’t support the full Unicode code space cannot adequately represent all valid HTML inputs. HTML is a Unicode decoding by specification, so even in a browser with `<meta charset="ISO-8859-1">&#x1f170;` the text content will still be `🅰`, not `?` or the invisible ASCII control code SUB.

I was of the understanding that meta charset was too late to set the encoding (but it’s been a while since I’ve read the html5 spec) and the charset needed to be set in the html tag itself. I suppose browsers simply rewind upon hitting meta charset, but browsers have to deal with all kinds of shenanigans.

That being said, there is nothing in the spec (that I remember seeing) stating it was Unicode only; just that it was the default.

Further, html may be stored in the database of a certain encoding (such as content systems like WordPress or Drupal) where it may not be straightforward (or desirable) to convert to utf8.

>
> --
>
> I’m sorry for being long-winded but I think it’s necessary to frame these questions in the context of the problem today. We have very frequent errors that result from having the wrong defaults and a confusion of text encodings. I’ve seen far more problems from source code being UTF-8 and assuming the input is, rather than being anything else (likely ISO-8859-1 if not UTF-8) assuming the input isn’t.
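
To make the earlier `&#x1f170;` case concrete, here is a small sketch of why a non-Unicode target encoding is lossy. Again `html_entity_decode()` is only a stand-in for the proposed `decode_html()`, the variable names are illustrative, and mbstring is assumed to be loaded. The exact substitute that ends up in the ISO-8859-1 string depends on configuration, so the sketch only checks that the round trip does not preserve the character.

```php
<?php
$reference = '&#x1f170;'; // numeric character reference for U+1F170

// Decoding into UTF-8 works: the code point is representable as four bytes.
$utf8 = html_entity_decode($reference, ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, 'UTF-8');

// ISO-8859-1 has no slot for U+1F170, so converting substitutes something
// else, and converting back cannot recover the original character.
$latin1    = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
$roundTrip = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

printf("UTF-8 bytes: %s\n", bin2hex($utf8));
printf("Round trip preserved: %s\n", $roundTrip === $utf8 ? 'yes' : 'no');
```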
>=20 > * It should be possible to convert any string into UTF-8 regardless = of its origin character set, and then transitively, if it originated the= re, it should be able to convert back if the HTML represents text that i= s representable in the original character set. There are a number of scripts/languages not yet supported (especially on= older machines) that would result in =E2=80=9C=EF=BF=BD=E2=80=9D and ca= nnot be transcribed back to its original encoding. For example, there ar= e still new scripts being added as late as two years ago: https://www.un= icode.org/standard/supported.html >=20 > * Converting at the boundaries of the application is the way to esca= pe the confusion of wrestling an arbitrary number of different character= sets. I totally agree with this statement, but we should provide tools instead= of dictating a policy.=20 >=20 > * Proper HTML decoding requires a character set capable of represent= ing all of Unicode, as the code points in numeric character references r= efer to Unicode Code Points and _not_ any particular code units or byte = sequences in any particular encoding. >=20 > * Almost every other character set is ASCII compatible, including UT= F-8, making the domain of problems where this arises even smaller than i= t might otherwise seem. For example, `&` is `&` in all of the common cha= racter sets. >=20 > Have a lovely weekend! And sorry for the potentially mis-threaded repl= y. I couldn=E2=80=99t figure out how to reply to your message directly b= ecause the digest emails were still stuck in 2020 for my account and I d= idn=E2=80=99t switch subscriptions until after your email went out, mean= ing I didn=E2=80=99t have a copy of your email. >=20 > > > >=E2=80=94 Rob =E2=80=94 Rob --1959572ecc994dad86e6b1463873ef4b Content-Type: text/html; charset=utf-8 Content-Transfer-Encoding: quoted-printable
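
The ASCII-compatibility point in the last bullet above can be checked with a short sketch. The list of charsets is chosen for illustration only, and mbstring is assumed to be loaded. UTF-16LE is included as an example of an encoding that is not ASCII compatible, which is why the bullet says "almost".

```php
<?php
// The character reference "&amp;" as plain ASCII bytes.
$reference = '&amp;';

// For each charset, check whether the ASCII bytes survive conversion unchanged.
$compatible = [];
foreach (['UTF-8', 'ISO-8859-1', 'Windows-1252', 'SJIS'] as $charset) {
    $compatible[$charset] = mb_convert_encoding($reference, $charset, 'ASCII') === $reference;
}

// Counterexample: in UTF-16LE every ASCII character takes two bytes.
$utf16 = mb_convert_encoding($reference, 'UTF-16LE', 'ASCII');

var_dump($compatible);
printf("UTF-16LE length: %d bytes\n", strlen($utf16));
```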