Newsgroups: php.internals
Xref: news.php.net php.internals:124988
Subject: [PHP-DEV] [RFC] Re: Decoding HTML and the Ambiguous Ampersand
Date: Fri, 16 Aug 2024 11:43:07 -0700
From: dennis.snell@automattic.com (Dennis Snell)
To: Internals
List-Id: internals.lists.php.net
Message-ID: <47CB3E5E-1246-471A-B3BE-CE23BEFDDDF7@automattic.com>
In-Reply-To: <1FD3A9B0-D46F-4589-A803-3CC2347EC7DF@automattic.com>
References: <1FD3A9B0-D46F-4589-A803-3CC2347EC7DF@automattic.com>

>On Fri, Aug 16, 2024, at 02:59, Dennis Snell wrote:
>>
>>> On Jul 9, 2024, at 4:55 PM, Dennis Snell wrote:
>>>
>>> Greetings all,
>>>
>>> The `html_entity_decode( … ENT_HTML5 … )` function has a number of issues that I’d like to correct.
>>>
>>> - It’s missing 720 of HTML5’s specified named character references.
>>> - 106 of these are named character references which do not require a trailing semicolon, such as `&acute`
>>> - It’s unaware of the ambiguous ampersand rule, which allows these 106 in special circumstances.
>>>
>>> HTML5 asserts that the list of named character references will not expand in the future. It can be found authoritatively at the following URL:
>>>
>>> https://html.spec.whatwg.org/entities.json
>>>
>>> The ambiguous ampersand rule smoothes over legacy behavior from before HTML5 where ampersands were not properly encoded in attribute values, specifically in URL values. For example, in a query string for a search, one might find `?q=dog&not=cat`.
>>> The `&not` in that value would decode to U+AC (¬), but since it’s in an attribute value it will be left as plaintext. Inside normal HTML markup it would transform into `?q=dog¬=cat`. There are related nuances when numeric character references are found at the end of a string or boundary without the semicolon.
>>>
>>> The function signature of `html_entity_decode()` does not currently allow for correcting this behavior. I’d like to propose an RFC or a bug fix which either extends the function (perhaps by adding a new flag like `ENT_AMBIGUOUS_AMPERSAND`) or preferably creates a new function. For the missing character references I wonder if it would be enough to add them to the list of default translatable references.
>>>
>>> One challenge with the existing function is that the concept of the translation table stands in contrast with the fixed and static nature of HTML5’s replacement tables. A new function or set of functions could open up spec-compliant decoding while providing helpful methods that are necessary in many common server-side operations:
>>>
>>> - `html_decode( 'attribute' | 'data', $raw_text, $input_encoding = 'utf-8' )`
>>> - `html_text_contains( 'attribute' | 'data', $raw_haystack, $needle, $input_encoding = 'utf-8' )`
>>> - `html_text_starts_with( 'attribute' | 'data', $raw_haystack, $needle, $input_encoding = 'utf-8' )`
>>>
>>> These methods are handy for inspecting things like encoded attribute values in a memory-efficient and processing-efficient way, when it’s not necessary to decode the entire value. In common situations, one encounters data-URIs with potentially megabytes of image data and processing only the first few or tens of bytes can save a lot of overhead.
>>>
>>> We’re exploring pure-PHP solutions to these problems in WordPress in attempts to improve the reliability and safety of handling HTML. I’d love to hear your thoughts and know if anyone is willing to work with me to create an RFC or directly propose patches. We’ve created a step function which allows finding the next character reference and decoding it separately, enabling some novel features like highlighting the character references in source text.
>>>
>>> Should I propose an RFC for this?
>>>
>>> Warmly,
>>> Dennis Snell
>>> Automattic Inc.
>>
>> All,
>>
>> I have submitted an RFC draft for including the proposed feature from this issue. Thanks to everyone who helped me in this process. It’s my first RFC, so I apologize in advance for any mistakes I’ve made in the process.
>>
>> https://wiki.php.net/rfc/decode_html
>>
>> This is proposed for a future PHP version after 8.4.
>>
>> Warmly,
>> Dennis Snell
>
>Hey Dennis,

Thanks for the question, Rob, I hope this finds you well!

>The RFC mentions that encoding must be utf-8. How are programmers supposed to work with this if the php file itself isn’t utf-8

From my experience it’s the opposite case that is more important to consider. That is, what happens when we mix UTF-8 source code with latin1, or UTF-8 source HTML with the system-set locale. I tried to hint at this scenario in the “Character encodings and UTF-8” section.

Let’s examine the fundamental breakdown case:

```php
"é" === decode_html( HTML_TEXT, "&eacute;" );
```

If the source is UTF-8 there’s no problem. If the source is ISO-8859-1 this will fail because `\xE9` is on the left while `\xC3\xA9` is on the right. _Except_ if `zend.multibyte=1` and (`zend.script_encoding=iso-8859-1` _or_ `declare(encoding='iso-8859-1')` is set).
The source code may or may not be converted into a different encoding based on configurations that most developers won’t have access to, or won’t examine.

Even with source code in ISO-8859-1, and with `zend.script_encoding` and `zend.multibyte` set, `html_entity_decode()` _still_ produces UTF-8 unless `default_charset` is set _or_ one of the `iconv` or `mbstring` internal charsets is set.

The point I’m trying to make is that the situation today is a minefield due to a dizzying array of system-dependent settings. Most modern code will either be running UTF-8 source code or will be converting source code _to_ UTF-8; otherwise many other things will already be hopelessly broken beyond this one issue.

UTF-8 is the unifier that lets us escape this by having a defined and explicit encoding at the input and output.

> or the input is meaningless in utf-8 or if changing it to utf-8 and back would result in invalid text?

There shouldn’t be input that’s meaningless in UTF-8 if it’s valid in any other encoding. Indeed, I have placed the burden on the calling code to convert into UTF-8 beforehand, but that’s not altogether different from asking someone to declare into what encoding the character references ought to be decoded.

```diff
-html_entity_decode( $html, ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, 'ISO-8859-1' );
+$html = mb_convert_encoding( $html, 'UTF-8', 'ISO-8859-1' );
+$html = decode_html( HTML_TEXT, $html );
+$html = mb_convert_encoding( $html, 'ISO-8859-1', 'UTF-8' );
```

If an encoding can go into UTF-8 (which it should) then it should also be able to return for all supported inputs. That is, we cannot convert into UTF-8 and produce a character that is unrepresentable in the source encoding, because that would imply it was there in the source to begin with.
Furthermore, if the HTML decodes into a code point unsupported in the destination encoding, it would be invalid either directly via decoding, or indirectly via conversion.

```diff
-"\x1A" === html_entity_decode( "&#x1F170;", ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, 'ISO-8859-1' );
+"?" === mb_convert_encoding( decode_html( HTML_TEXT, "&#x1F170;" ), 'ISO-8859-1', 'UTF-8' );
```

This gets really confusing because neither of these outputs is a proper decoding, as character encodings that don’t support the full Unicode code space cannot adequately represent all valid HTML inputs. HTML decodes into Unicode by specification, so even in a browser with `&#x1F170;` the text content will still be `🅰`, not `?` or the invisible ASCII control code SUB.

I’m sorry for being long-winded but I think it’s necessary to frame these questions in the context of the problem today. We have very frequent errors that result from having the wrong defaults and a confusion of text encodings. I’ve seen far more problems from source code being UTF-8 and assuming the input is, rather than being anything else (likely ISO-8859-1 if not UTF-8) and assuming the input isn’t.

* It should be possible to convert any string into UTF-8 regardless of its origin character set, and then transitively, if it originated there, it should be able to convert back if the HTML represents text that is representable in the original character set.
* Converting at the boundaries of the application is the way to escape the confusion of wrestling an arbitrary number of different character sets.
* Proper HTML decoding requires a character set capable of representing all of Unicode, as numeric character references refer to Unicode code points and _not_ to any particular code units or byte sequences in any particular encoding.
* Almost every character set in common use, including UTF-8, is ASCII-compatible, making the domain of problems where this arises even smaller than it might otherwise seem. For example, `&amp;` is `&` in all of the common character sets.

Have a lovely weekend! And sorry for the potentially mis-threaded reply. I couldn’t figure out how to reply to your message directly because the digest emails were still stuck in 2020 for my account and I didn’t switch subscriptions until after your email went out, meaning I didn’t have a copy of your email.

>
>-- Rob