Newsgroups: php.internals
From: dennis.snell@automattic.com (Dennis Snell)
To: Rob Landers
Cc: internals@lists.php.net
Subject: Re: [PHP-DEV] [RFC] Re: Decoding HTML and the Ambiguous Ampersand
Date: Fri, 16 Aug 2024 22:40:21 -0700

> On Aug 16, 2024, at 6:53 PM, Rob Landers wrote:
>
> Hey Dennis,
>
> This looks like top posting because you’ve got a lot to read, and well written, but I want to reply to some points inline.

Rob, no worries! I love your questions and I love being able to work together again, even in some limited fashion. Let me prefix this for you and for everyone on the list: this is really hairy stuff, and can at times require concentrated focus. When I started down this path a long time ago I knew very little about it. I’ve been knee-deep in it for years and now I feel like I learn something new every day that I didn’t know before.
>
> On Fri, Aug 16, 2024, at 20:43, Dennis Snell wrote:
>> > On Fri, Aug 16, 2024, at 02:59, Dennis Snell wrote
>>
>> Thanks for the question, Rob, I hope this finds you well!
>>
>> > The RFC mentions that encoding must be utf-8. How are programmers supposed to work with this if the php file itself isn’t utf-8
>>
>> From my experience it’s the opposite case that is more important to consider. That is, what happens when we mix UTF-8 source code with latin1 or UTF-8 source HTML with the system-set locale. I tried to hint at this scenario in the “Character encodings and UTF-8” section.
>>
>> Let’s examine the fundamental breakdown case:
>>
>> ```php
>> "é" === decode_html( "&#xe9;" );
>> ```
>>
>> If the source is UTF-8 there’s no problem. If the source is ISO-8859-1 this will fail because xE9 is on the left while xC3 xA9 is on the right. _Except_ if `zend.multibyte=1` and (`zend.script_encoding=iso-8859-1` _or_ if `declare(encoding='iso-8859-1')` is set). The source code may or may not be converted into a different encoding based on configurations that most developers won’t have access to, or won’t examine.
>>
>> Even with source code in ISO-8859-1, the `zend.script_encoding` and `zend.multibyte` set, `html_entity_decode()` _still_ reports UTF-8 unless `default_charset` is set _or_ one of the `iconv` or `mbstring` internal charsets is set.
>
> I just want to pause here and say, “holy crap.” That is quite complex and those edges seem sharp!
>
>>
>> The point I’m trying to make is that the current situation is a minefield due to a dizzying array of system-dependent settings.
>> Most modern code will either be running UTF-8 source code or will be converting source code _to_ UTF-8, or many other things will already be helplessly broken beyond this one issue.
>
> Unfortunately, we don’t always get to choose the code we work on. There is someone on this list using SHIFT_JIS. They probably know more about the ins and outs of dealing with utf-8 centric systems from that encoding. Hopefully they can comment more about why this would or would not be a bad idea.

This is, in fact, one of my primary motivations for standardizing on UTF-8. Keep in mind that HTML not only has a set of character encodings that must be supported, but also a requirement that parsers not support additional encodings outside of that list. This is based on security grounds, for good and even more complicated reasons.

Of all of the required supported character sets, all roundtrip through UTF-8 as long as they aren’t modified. In fact, almost every character set out there should round-trip in this way, because the Unicode Consortium’s goal as far as I understand it is to capture every possible character in writing in a single universal character set. This appears first in the introduction to the HTML specification and is reiterated throughout the document: HTML requires the use of UTF-8, though allows legacy encodings (there really are no “invalid” HTML documents because parse errors have deterministic resolutions).

>
>>
>> UTF-8 is the unifier that lets us escape this by having a defined and explicit encoding at the input and output.
>
> Utf-8 is pretty good, right now, but I don’t think we should marry the language to it. Will it be “the standard” in 10 years, 20 years, 100 years? Languages change, cultures change. Some people I know use a font to change triple equals from a literal === to ≡.
> How long until php recognizes that as a literal operator?
>
> But anyway, to get back on topic; I, personally, would rather see something more flexible, with sane defaults for utf-8.

Guarding against a future where UTF-8 is replaced is planning for an extremely unlikely scenario. UTF-8 is the most universal standard for interchange of text content, prevalent in software, systems, and programming languages, even those with UTF-16 internals.

It’s a good moment to remind ourselves, however, that Unicode defines a table of character “code points”, which is a mapping from a natural number to a character. UTF-8 is an algorithm for storing those natural numbers in byte sequences.

We absolutely can plan for over-extensibility, and this is what I’ve seen happen with the existing HTML functions in PHP (with options to choose what to decode, which entities to use, into which encoding to decode, etc). There’s an appearance of an awareness of text encoding, but the design of the function interfaces leads people to make decisions that open up all sorts of doors to corruption and security exploits.

So it wouldn’t matter to my RFC if another encoding were standardized as long as one encoding is standardized. Today, I see no legitimate competition to UTF-8. The only encodings that come close are the two UTF-16 variants because of their prevalence in Java, JavaScript, and Objective-C strings, but the UTF-16 variable-width encoding suffers a number of shortcomings compared to UTF-8 without providing much value in exchange.

When the day comes that UTF-8 is deprecated or replaced, major swaths of the internet will need overhaul far beyond PHP. Or at least, I have a hard time imagining that going any other way.

>
>>
>> > or the input is meaningless in utf-8 or if changing it to utf-8 and back would result in invalid text?
>>
>> There shouldn’t be input that’s meaningless in UTF-8 if it’s valid in any other encoding. Indeed, I have placed the burden on the calling code to convert into UTF-8 beforehand, but that’s not altogether different than asking someone to declare into what encoding the character references ought to be decoded.
>
> There’s a huge performance difference between converting a string from/to different encodings and instructing a function what to parse in the current encoding and also be useful when the page itself is not utf8.

It definitely seems this way when examining a single function in isolation, but I would challenge folks to look at how these functions are used in practice, out in the wild. Typically I see strings transcoded multiple times and usually based on the wrong encoding. For example, WordPress currently looks at its defined “blog_charset” to perform decoding, but most of the time it gets HTML input that isn’t encoded in the blog charset.

What would be a performance win would be to decode and encode text at application boundaries so it can be converted once, processed in a pipeline where everything agrees on the encoding, and finally once more on output. In a UTF-8 world this requires no conversion at all, and UTF-8 is the overwhelming majority of code in web applications today.

---

We can keep in mind too that there are two encodings in the picture. The HTML source document may be encoded in one encoding while the output might need to appear in another. Consider HTML stored in a database as latin1/ISO-8859-1. It stores “é is &#xe9;”, except that unlike in this email, the leading character é is the single byte xE9.

This output likely should be sent to a browser as UTF-8. It’s acceptable to send latin1, but most pages will have characters unrepresentable in latin1.
The backend, in decoding the HTML, must then go ahead and internally convert the input character encoding so that the é becomes the two-byte sequence xC3 xA9, and then decode &#xe9; as xC3 xA9. Were the input “&#x1f170;” it simply could not decode into any single-byte encoding, failing to be able to decode the HTML. `html_entity_decode()` simply leaves that character reference in place. This kind of behavior tends to lead to double-encoding of the character references, and what the browser gets is `&amp;#x1f170;` instead of `🅰`.
>
>>
>> ```diff
>> -html_entity_decode( $html, ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, 'ISO-8859-1' );
>> +$html = mb_convert_encoding( $html, 'UTF-8', 'ISO-8859-1' );
>> +$html = decode_html( HTML_TEXT, $html );
>> +$html = mb_convert_encoding( $html, 'ISO-8859-1', 'UTF-8' );
>> ```
>>
>> If an encoding can go into UTF-8 (which it should) then it should also be able to return for all supported inputs. That is, we cannot convert into UTF-8 and produce a character that is unrepresentable in the source encoding, because that would imply it was there in the source to begin with. Furthermore, if the HTML decodes into a code point unsupported in the destination encoding, it would be invalid either directly via decoding, or indirectly via conversion.
>>
>> ```diff
>> -"\x1A" === html_entity_decode( "&#x1f170;", ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, 'ISO-8859-1' );
>> +"?" === mb_convert_encoding( decode_html( HTML_TEXT, "&#x1f170;" ), 'ISO-8859-1', 'UTF-8' );
>> ```
>>
>> This gets really confusing because neither of these outputs is a proper decoding, as character encodings that don’t support the full Unicode code space cannot adequately represent all valid HTML inputs.
>> HTML is a Unicode decoding by specification, so even in a browser with `&#x1f170;` the text content will still be `🅰`, not `?` or the invisible ASCII control code SUB.
>
> I was of the understanding that meta charset was too late to set the encoding (but it’s been awhile since I’ve read the html5 spec) and the charset needed to be set in the html tag itself. I suppose browsers simply rewind upon hitting meta charset, but browsers have to deal with all kinds of shenanigans.

The algorithm for determining a document character set is straightforward, albeit with many steps. META elements within the first kilobyte of a document may determine the inferred encoding if one is not provided externally or from an HTTP header. Fun fact, if you find `` then a browser will properly set the document encoding to UTF-8 (just as it ignores DOCTYPE declarations and treats all `text/html` content as HTML5).

A while ago I ran some analysis on roughly 300,000 pages from a list of top-ranked domains. You can examine the raw data and find some interesting bits in there. For instance, many HTML documents claim multiple incompatible character sets. Thankfully the HTML specification is clear on how to handle these situations. There really aren’t any shenanigans since 2008 when HTML5 formalized the parsing error modes.

>
> That being said, there is nothing in the spec (that I remember seeing) stating it was Unicode only; just that it was the default.

See the above note on character encoding vs. Unicode character set. Unicode is in the introduction to the HTML spec and mentioned throughout. It’s in the “big picture” at the start. Even encodings like ISO-2022-JP and GB18030 map to Unicode code points and represent different ways to represent those in sequences of bytes.
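
To make the code point versus encoding distinction concrete, here is a minimal sketch that is not part of the RFC. It uses only existing functions (`bin2hex()` and `mb_convert_encoding()`), assumes the `mbstring` extension is loaded, and the variable names are purely illustrative. It shows the single code point U+00E9 stored as three different byte sequences, and a latin1 byte surviving a round trip through UTF-8.

```php
<?php
// Illustration only: U+00E9 is one code point, but each encoding stores it
// as a different byte sequence. Assumes the mbstring extension is loaded.
$e_acute = "\u{00E9}"; // PHP 7+ escape; always yields the UTF-8 bytes C3 A9

$utf8   = bin2hex( $e_acute );
$latin1 = bin2hex( mb_convert_encoding( $e_acute, 'ISO-8859-1', 'UTF-8' ) );
$utf16  = bin2hex( mb_convert_encoding( $e_acute, 'UTF-16BE', 'UTF-8' ) );

echo "UTF-8: {$utf8}, ISO-8859-1: {$latin1}, UTF-16BE: {$utf16}\n";
// UTF-8: c3a9, ISO-8859-1: e9, UTF-16BE: 00e9

// A latin1 byte converted into UTF-8 and back again is unchanged.
$round_trip = mb_convert_encoding(
    mb_convert_encoding( "\xE9", 'UTF-8', 'ISO-8859-1' ),
    'ISO-8859-1',
    'UTF-8'
);
var_dump( "\xE9" === $round_trip ); // bool(true)
```

The `\u{...}` escape is used so the result does not depend on the encoding of the source file itself, which is exactly the kind of hidden variable discussed earlier in this thread.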
>
> Further, html may be stored in the database of a certain encoding (such as content systems like WordPress or Drupal) where it may not be straightforward (or desirable) to convert to utf8.

See above again: this is actually one of the most dangerous parts of suggesting in a function signature that a developer pick a character encoding, particularly since it invites incompatible decoding of the source document. It’s completely fine to store content in a database in another encoding, and many legacy systems do. Those are best served by converting when reading from the database into UTF-8 and then encoding from UTF-8 when saving into the database. The database character set confusions make HTML’s look simple, but those are out of scope for this RFC. Stating clearly that the function expects UTF-8 is about the best way I’ve seen in practice to partner with application developers both to educate them and help them accomplish their goals.

The primary point to consider here is that these legacy systems unintentionally oversimplify the state of encoded text. Typically they are running UTF-8 source code matching against a mixture of encodings from various inputs, only one of which is the database. For example, these systems will often assume that the encoding of the content in the database is the same encoding outbound to a browser, inbound in `$_POST` parameters, and escaped in `$_GET` query arguments. If the database is not using UTF-8 these assumptions are almost always wrong, and thus security issues abound.

>
>>
>> ---
>>
>> I’m sorry for being long-winded but I think it’s necessary to frame these questions in the context of the problem today. We have very frequent errors that result from having the wrong defaults and a confusion of text encodings.
I=E2=80=99ve seen far more problems = from source code being UTF-8 and assuming the input is, rather than = being anything else (likely ISO-8859-1 if not UTF-8) assuming the the = input isn=E2=80=99t. >=20 >=20 >=20 >>=20 >> * It should be possible to convert any string into UTF-8 regardless = of its origin character set, and then transitively, if it originated = there, it should be able to convert back if the HTML represents text = that is representable in the original character set. >=20 > There are a number of scripts/languages not yet supported (especially = on older machines) that would result in =E2=80=9C=EF=BF=BD=E2=80=9D and = cannot be transcribed back to its original encoding. For example, there = are still new scripts being added as late as two years ago: = https://www.unicode.org/standard/supported.html >=20 It=E2=80=99s absolutely true that new scripts are added, and someone = else can confirm or correct me, but typically these appear first in = Unicode, since Unicode has attempted to already swallow up all recorded = digital text. When new scripts appear, it=E2=80=99s usually because = someone found evidence of their use in physical writings and there was = no previous digital record of them. Do you have examples of languages that have digital records which are = supported in the HTML specification which would result in substitution = when decoding? Since HTML only encodes Unicode code points, I think the = problem is that HTML cannot represent these characters, if they exist. This is unrelated to UTF-8 because new scripts and characters and emoji = get assigned the natural numbers - the code points. It=E2=80=99s up to = text encodings to represent those indices into the character database = tables. >>=20 >> * Converting at the boundaries of the application is the way to = escape the confusion of wrestling an arbitrary number of different = character sets. 
>
> I totally agree with this statement, but we should provide tools instead of dictating a policy.

It’s my intention never to take control from a developer. In this situation that freedom appears by converting before and after. For most situations, for most systems, the most reliable, safe, and convenient thing to do will be to assume UTF-8, or check if a string is UTF-8 and reject otherwise. In those situations doing nothing is the right behavior, and preserves the correct parse within the domain in which `decode_html()` operates (which again, preservation or proper decoding cannot happen if it attempts to decode into `latin1`, as HTML documents are entirely Unicode documents and every single-byte encoding is unable to capture this).

This is why I personally feel strongly about having safe defaults instead of dangerous ones, as is the unfortunate case with `html_entity_decode()`. All the better if we can educate each other through the function interfaces to clarify what is happening and what the expectations are or need to be.

>
>>
>> * Proper HTML decoding requires a character set capable of representing all of Unicode, as the code points in numeric character references refer to Unicode Code Points and _not_ any particular code units or byte sequences in any particular encoding.
>>
>> * Almost every other character set is ASCII compatible, including UTF-8, making the domain of problems where this arises even smaller than it might otherwise seem. For example, `&` is `&` in all of the common character sets.
>>
>> Have a lovely weekend! And sorry for the potentially mis-threaded reply. I couldn’t figure out how to reply to your message directly because the digest emails were still stuck in 2020 for my account and I didn’t switch subscriptions until after your email went out, meaning I didn’t have a copy of your email.
>>
>> >
>> > - Rob
>
> - Rob

Warmly,
Dennis Snell
On Aug 16, = 2024, at 6:53=E2=80=AFPM, Rob Landers <rob@bottled.codes> = wrote:

Hey Dennis,

This looks like top posting because you=E2=80=99ve= got a lot to read =E2=80=94 and well written =E2=80=94 but I want to = reply to some points = inline. 

Rob, no = worries! I love your questions and I love being able to work together = again, even in some limited fashion. Let me prefix this for you and for = everyone on the list: this is really hairy stuff, and can at times = require concentrated focus. When I started down this path a long time = ago I knew very little about it. I=E2=80=99ve been knee-deep in it for = years and now I feel like I learn something new every day that I = didn=E2=80=99t know before.


On Fri, Aug 16, = 2024, at 20:43, Dennis Snell wrote:
>On = Fri, Aug 16, 2024, at 02:59, Dennis Snell = wrote

Thanks for the question, Rob, I hope = this finds you well!

>The RFC mentions = that encoding must be utf-8. How are programmers supposed to work with = this if the php file itself isn=E2=80=99t = utf-8

=46rom my experience it=E2=80=99s the = opposite case that is more important to consider. That is, what happens = when we mix UTF-8 source code with latin1 or UTF-8 source HTML with the = system-set locale. I tried to hint at this scenario in the "Character = encodings and UTF-8=E2=80=9D = section.

Let=E2=80=99s examine the = fundamental breakdown = case:

```php
=E2=80=9C=C3=A9=E2= =80=9D =3D=3D=3D decode_html( =E2=80=9C&#xe9;=E2=80=9D = );
```

If the source is UTF-8 = there=E2=80=99s no problem. If the source is ISO-8859-1 this will fail = because xE9 is on the left while xC3 xA9 is on the right. _Except_ if = `zend.multibyte=3D1` and (`zend.script_encoding=3Diso-8859-1` _or_ if = `declare(encoding=3D=E2=80=98iso-8859-1=E2=80=99)` is set). The source = code may or may not be converted into a different encoding based on = configurations that most developers won=E2=80=99t have access to, or = won=E2=80=99t examine.

Even with source = code in ISO-8859-1, the `zend.script_encoding` and `zend.multibyte` set, = `html_entity_decode()` _still_ reports UTF-8 unless = `zend.default_charset` is set _or_ one of the `iconv` or `mbstring` = internal charsets is set.

I just want to pause here and say, =E2=80=9Choly = crap.=E2=80=9D That is quite complex and those edges seem = sharp!


My point I=E2=80=99m trying = to make is that the current situation today is a minefield due to a = dizzying array of system-dependent settings. Most modern code will = either be running UTF-8 source code or will be converting source code = _to_ UTF-8 or many other things will already be helplessly broken beyond = this one issue.

Unfortunately, we don=E2=80=99t always get to = choose the code we work on. There is someone on this list using = SHIFT_JIS. They probably know more about the ins and outs of dealing = with utf-8 centric systems from that encoding. Hopefully they can = comment more about why this would or would not be a bad = idea. 

This is, in = fact, one of my primary motivations for standardizing on UTF-8. Keep in = mind that HTML not only has a set of character encodings that must be = supported, but also a requirement = that parsers not support additional = encodings outside of that list. This is based on security = grounds, for good and even more complicated = reasons.

Of all of the required supported = character sets, all roundtrip through UTF-8 as long as they aren=E2=80=99t= modified. In fact, almost every character set out there should = round-trip in this way, because the Unicode Consortium=E2=80=99s goal as = far as I understand it is to capture every possible character in writing = in a single universal character set. This appears first in the introduction to = the HTML specification and is reiterated throughout the = document: HTML = requires the use of UTF-8, though allows legacy encodings (there = really are no =E2=80=9Cinvalid=E2=80=9D HTML documents because parse = errors have deterministic resolutions).



UTF-8 is the unifier that = lets us escape this by having a defined and explicit encoding at the = input and output.

Utf-8 is pretty good, right now, but I don=E2=80=99= t think we should marry the language to it. Will it be =E2=80=9Cthe = standard=E2=80=9D in 10 years, 20 years, 100 years? Languages change, = cultures change. Some people I know use a font to change triple equals = from a literal =3D=3D=3D to =E2=89=A1. How long until php = recognizes that as a literal operator?

But anyway, to get back on topic; I, personally, = would rather see something more flexible, with sane defaults for = utf-8.

To guard against a = future where UTF-8 is replaced is planning for the most extremely = unlikely scenario. UTF-8 is the most universal standard for interchange = of text content, prevalent in software, systems, and programming = languages, even those with UTF-16 = internals.

It=E2=80=99s a good moment to remind = ourselves, however, that Unicode defines a tables of character =E2=80=9Cco= de points=E2=80=9D which are a mapping from a natural number to a = character. UTF-8 is an algorithm for storing those natural numbers in = byte sequences.

We absolutely can plan for = over-extensibility, and this is what I=E2=80=99ve seen happen with the = existing HTML functions in PHP (with options to choose what to decode, = which entities to use, into which encoding to decode, etc). There=E2=80=99= s an appearance of an awareness of text encoding, but the design of the = function interfaces lead people to make decisions that open up all sorts = of doors to corruption and security = exploits.

So it wouldn’t matter to my RFC if another encoding were standardized as long as one encoding is standardized. Today, I see no legitimate competition to UTF-8. The only encodings that come close are the two UTF-16 variants because of their prevalence in Java, JavaScript, and Objective-C strings, but the UTF-16 variable-width encoding suffers a number of shortcomings compared to UTF-8 without providing much value in exchange.

When the day comes that UTF-8 is deprecated or replaced, major swaths of the internet will need an overhaul far beyond PHP. Or at least, I have a hard time imagining that going any other way.



> or the input is meaningless in utf-8 or if changing it to utf-8 and back would result in invalid text?

There shouldn't be input that’s meaningless in UTF-8 if it’s valid in any other encoding. Indeed, I have placed the burden on the calling code to convert into UTF-8 beforehand, but that’s not altogether different than asking someone to declare into what encoding the character references ought to be decoded.

> There’s a huge performance difference between converting a string from/to different encodings and instructing a function what to parse in the current encoding and also be useful when the page itself is not utf8.

It definitely seems this way when examining a single function in isolation, but I would challenge folks to look at how these functions are used in practice, out in the wild. Typically I see strings transcoded multiple times and usually based on the wrong encoding. For example, WordPress currently looks at its defined “blog_charset” to perform decoding, but most of the time it gets HTML input that isn’t encoded in the blog charset.

What would be a performance win would be to decode and encode text at application boundaries so it can be converted once on input, processed in a pipeline where everything agrees on the encoding, and finally converted once more on output. In a UTF-8 world this requires no conversion at all, and UTF-8 is the overwhelming majority of code in web applications today.
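A rough sketch of converting at the boundaries follows. The helper names `to_internal()` and `to_output()` are hypothetical and exist only for illustration; the sketch assumes the mbstring extension is available.

```php
<?php
// Hypothetical helper: convert inbound bytes to UTF-8 exactly once.
function to_internal(string $bytes, string $sourceEncoding): string {
    if ($sourceEncoding !== 'UTF-8') {
        $bytes = mb_convert_encoding($bytes, 'UTF-8', $sourceEncoding);
    }
    if (!mb_check_encoding($bytes, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }
    return $bytes;
}

// Hypothetical helper: convert outbound text exactly once.
function to_output(string $utf8, string $targetEncoding): string {
    return $targetEncoding === 'UTF-8'
        ? $utf8
        : mb_convert_encoding($utf8, $targetEncoding, 'UTF-8');
}

// Inbound boundary: a latin1 string read from legacy storage.
$text = to_internal("caf\xE9", 'ISO-8859-1');

// Everything in the middle agrees on UTF-8.
$text = mb_strtoupper($text, 'UTF-8');

// Outbound boundary: UTF-8 to the browser, so no conversion is needed.
$out = to_output($text, 'UTF-8');
echo bin2hex($out), "\n"; // 434146c389
```

When both boundaries are already UTF-8, the two helpers reduce to a validity check and a pass-through.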

---

We can keep in mind too that there are two encodings in the picture. The HTML source document may be encoded in one encoding while the output might need to appear in another. Consider HTML stored in a database as latin1/ISO-8859-1. It stores “é is &#xE9;”, except that unlike in this email, the leading character é is the single byte xE9.

This output likely should be sent to a browser as UTF-8. It’s acceptable to send latin1, but most pages will have characters unrepresentable in latin1. The backend, in decoding the HTML, must then internally convert the input character encoding so that the é becomes the two-byte sequence xC3 xA9, and then decode &#xE9; as xC3 xA9 as well.
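The two steps can be sketched as follows. Since `decode_html()` is only proposed in the RFC, this sketch substitutes `html_entity_decode()` with an explicit UTF-8 target as a stand-in, and assumes the mbstring extension is available.

```php
<?php
// The latin1 bytes as stored: a raw 0xE9 followed by a character reference.
$stored = "\xE9 is &#xE9;";

// Step 1: convert the document's own bytes into UTF-8.
$utf8 = mb_convert_encoding($stored, 'UTF-8', 'ISO-8859-1');

// Step 2: decode character references into UTF-8 as well.
// html_entity_decode() stands in here for the proposed decode_html().
$decoded = html_entity_decode($utf8, ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, 'UTF-8');

echo bin2hex($decoded), "\n"; // c3a920697320c3a9
```

Both the raw byte and the character reference end up as the same two-byte sequence.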

Were the input “&#x1f170;”, it simply could not decode into any single-byte encoding, and the HTML could not be decoded. `html_entity_decode()` simply leaves that character reference in place. This kind of behavior tends to lead to double-encoding of the character references, and what the browser gets is `&amp;#x1f170;` instead of `🅰`.
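A small illustration of the double-encoding path: it assumes an earlier step left the reference undecoded, as described above, and that a later step escapes the string for output.

```php
<?php
// Assumption: an earlier decode step left this reference untouched.
$undecoded = '&#x1f170;';

// A later escaping step sees a bare "&" and escapes it again.
$escaped = htmlspecialchars($undecoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');
echo $escaped, "\n"; // &amp;#x1f170;

// Decoding into UTF-8 first yields the character itself,
// which passes through escaping unchanged.
$decoded = html_entity_decode($undecoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');
$safe    = htmlspecialchars($decoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');
echo bin2hex($safe), "\n"; // f09f85b0
```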



```diff
-html_entity_decode( $html, ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, 'ISO-8859-1' );
+$html = mb_convert_encoding( $html, 'UTF-8', 'ISO-8859-1' );
+$html = decode_html( HTML_TEXT, $html );
+$html = mb_convert_encoding( $html, 'ISO-8859-1', 'UTF-8' );
```

If an encoding can go into UTF-8 (which it should) then it should also be able to return for all supported inputs. That is, we cannot convert into UTF-8 and produce a character that is unrepresentable in the source encoding, because that would imply it was there in the source to begin with. Furthermore, if the HTML decodes into a code point unsupported in the destination encoding, it would be invalid either directly via decoding, or indirectly via conversion.
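The round-trip property is easy to check for ISO-8859-1. This is a quick sketch, assuming the mbstring extension, that pushes every possible byte value through UTF-8 and back.

```php
<?php
// Build a string containing every possible ISO-8859-1 byte.
$allBytes = '';
for ($i = 0; $i < 256; $i++) {
    $allBytes .= chr($i);
}

// Into UTF-8 and back again.
$utf8 = mb_convert_encoding($allBytes, 'UTF-8', 'ISO-8859-1');
$back = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

var_dump($back === $allBytes); // bool(true)
var_dump(strlen($utf8));       // int(384): 128 one-byte plus 128 two-byte sequences
```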

```diff
-"\x1A" === html_entity_decode( "&#x1f170;", ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, 'ISO-8859-1' );
+"?" === mb_convert_encoding( decode_html( HTML_TEXT, "&#x1f170;" ), 'ISO-8859-1', 'UTF-8' );
```

This gets really confusing because neither of these outputs is a proper decoding, as character encodings that don’t support the full Unicode code space cannot adequately represent all valid HTML inputs. HTML decodes into Unicode by specification, so even in a browser with `<meta charset="ISO-8859-1">&#x1f170;` the text content will still be `🅰`, not `?` or the invisible ASCII control code SUB.
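To see why the conversion is lossy, here is a short sketch, again using `html_entity_decode()` with a UTF-8 target as a stand-in for the proposed `decode_html()`. The exact substitute character depends on the mbstring configuration, so the sketch only checks that the original text cannot be recovered.

```php
<?php
// Decode the reference into UTF-8: a four-byte sequence.
$utf8 = html_entity_decode('&#x1f170;', ENT_QUOTES | ENT_HTML5, 'UTF-8');
echo bin2hex($utf8), "\n"; // f09f85b0

// Force it into a single-byte encoding, then bring it back.
$latin1    = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
$roundTrip = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

// The code point U+1F170 is gone; only a substitute remains.
var_dump($roundTrip === $utf8); // bool(false)
```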

> I was of the understanding that meta charset was too late to set the encoding (but it’s been awhile since I’ve read the html5 spec) and the charset needed to be set in the html tag itself. I suppose browsers simply rewind upon hitting meta charset, but browsers have to deal with all kinds of shenanigans.

The algorithm for determining a document character set is straightforward, albeit with many steps. META elements within the first kilobyte of a document may determine the inferred encoding if one is not provided externally or from an HTTP header. Fun fact, if you find `<meta charset="utf16">` then a browser will properly set the document encoding to UTF-8 (just as it ignores DOCTYPE declarations and treats all `text/html` content as HTML5).

A while ago I ran some analysis on roughly 300,000 pages from a list of top-ranked domains. You can examine the raw data and find some interesting bits in there. For instance, many HTML documents claim multiple incompatible character sets. Thankfully the HTML specification is clear on how to handle these situations. There really haven’t been any shenanigans since 2008, when HTML5 formalized the parsing error modes.


> That being said, there is nothing in the spec (that I remember seeing) stating it was Unicode only; just that it was the default.

See the above note on character encoding vs. Unicode character set. Unicode is in the introduction to the HTML spec and mentioned throughout. It’s in the “big picture” at the start. Even encodings like ISO-2022-JP and GB18030 map to Unicode code points and are simply different ways to represent those code points as sequences of bytes.


> Further, html may be stored in the database of a certain encoding (such as content systems like WordPress or Drupal) where it may not be straightforward (or desirable) to convert to utf8.

See above again: this is actually one of the most dangerous parts of suggesting in a function signature that a developer pick a character encoding, particularly since it invites incompatible decoding of the source document. It’s completely fine to store content in a database in another encoding, and many legacy systems do. Those are best served by converting when reading from the database into UTF-8 and then encoding from UTF-8 when saving into the database. The database character set confusions make HTML’s look simple, but those are out of scope for this RFC. Stating clearly that the function expects UTF-8 is about the best way I’ve seen in practice to partner with application developers both to educate them and help them accomplish their goals.

The primary point to consider here is that these legacy systems unintentionally oversimplify the state of encoded text. Typically they are running UTF-8 source code matching against a mixture of encodings from various inputs, only one of which is the database. For example, these systems will often assume that the encoding of the content in the database is the same encoding outbound to a browser, inbound in `$_POST` parameters, and escaped in `$_GET` query arguments. If the database is not using UTF-8 these assumptions are almost always wrong, and thus security issues abound.



---

I’m sorry for being long-winded but I think it’s necessary to frame these questions in the context of the problem today. We have very frequent errors that result from having the wrong defaults and a confusion of text encodings. I’ve seen far more problems from source code being UTF-8 and assuming the input is, rather than being anything else (likely ISO-8859-1 if not UTF-8) and assuming the input isn’t.




  * It should be possible to convert any string into UTF-8 regardless of its origin character set, and then transitively, if it originated there, it should be able to convert back if the HTML represents text that is representable in the original character set.

There are a number of scripts/languages not yet = supported (especially on older machines) that would result = in =E2=80=9C=EF=BF=BD=E2=80=9D and cannot be transcribed back to = its original encoding. For example, there are still new scripts being = added as late as two years ago: https://www.unico= de.org/standard/supported.html


It’s absolutely true that new scripts are added, and someone else can confirm or correct me, but typically these appear first in Unicode, since Unicode has attempted to already swallow up all recorded digital text. When new scripts appear, it’s usually because someone found evidence of their use in physical writings and there was no previous digital record of them.

Do you have examples of languages that have digital records which are supported in the HTML specification which would result in substitution when decoding? Since HTML only encodes Unicode code points, I think the problem is that HTML cannot represent these characters, if they exist.

This is unrelated to UTF-8 because new scripts and characters and emoji get assigned the natural numbers - the code points. It’s up to text encodings to represent those indices into the character database tables.


  * Converting at the boundaries of the application is the way to escape the confusion of wrestling an arbitrary number of different character sets.

> I totally agree with this statement, but we should provide tools instead of dictating a policy.

It’s my intention never to take control from a developer. In this situation that freedom appears by converting before and after. For most situations, for most systems, the most reliable, safe, and convenient thing to do will be to assume UTF-8, or check if a string is UTF-8 and reject otherwise. In those situations doing nothing is the right behavior, and preserves the correct parse within the domain in which `decode_html()` operates (again, neither preservation nor proper decoding can happen if it attempts to decode into `latin1`, as HTML documents are entirely Unicode documents and every single-byte encoding is unable to capture this).
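The "check and reject" option could look something like the sketch below. The function name `require_utf8()` is hypothetical, and the sketch assumes the mbstring extension is available.

```php
<?php
// Hypothetical guard: accept only valid UTF-8, reject everything else.
function require_utf8(string $input): string {
    if (!mb_check_encoding($input, 'UTF-8')) {
        throw new InvalidArgumentException('Expected UTF-8 input.');
    }
    return $input;
}

// Valid UTF-8 passes through untouched: doing nothing is the right behavior.
$ok = require_utf8("caf\xC3\xA9");

// A latin1 byte sequence is not valid UTF-8 and is rejected.
$rejected = false;
try {
    require_utf8("caf\xE9");
} catch (InvalidArgumentException $e) {
    $rejected = true;
}
var_dump($rejected); // bool(true)
```

A caller holding latin1 text keeps full control: it converts before calling and, if it must, converts back afterwards.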

This is why I personally feel strongly about having safe defaults instead of dangerous ones, as is the unfortunate case with `html_entity_decode()`. All the better if we can educate each other through the function interfaces to clarify what is happening and what the expectations are or need to be.



  * Proper HTML decoding requires a character set capable of representing all of Unicode, as the code points in numeric character references refer to Unicode Code Points and _not_ any particular code units or byte sequences in any particular encoding.

  * Almost every other character set is ASCII compatible, including UTF-8, making the domain of problems where this arises even smaller than it might otherwise seem. For example, `&` is `&` in all of the common character sets.
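As a quick check of the ASCII-compatibility point, this sketch encodes the ASCII-only text of a character reference into a handful of encodings chosen only for illustration. It assumes the mbstring extension supports each of them.

```php
<?php
// Encodings picked purely as examples of ASCII-compatible character sets.
$encodings = ['UTF-8', 'ISO-8859-1', 'Windows-1252', 'SJIS', 'EUC-JP'];

$results = [];
foreach ($encodings as $encoding) {
    // The reference syntax itself is plain ASCII.
    $bytes = mb_convert_encoding('&#xE9;', $encoding, 'UTF-8');
    $results[$encoding] = bin2hex($bytes);
    printf("%-12s %s\n", $encoding, $results[$encoding]);
}
// Each line prints the same bytes: 26237845393b
```

The syntax of a character reference can therefore be located byte-for-byte in any of these encodings; it is only the decoded result that needs the full Unicode range.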

Have a lovely weekend! And sorry for the potentially mis-threaded reply. I couldn’t figure out how to reply to your message directly because the digest emails were still stuck in 2020 for my account and I didn’t switch subscriptions until after your email went out, meaning I didn’t have a copy of your email.

> -- Rob


Warmly,
Dennis Snell
