Newsgroups: php.internals
From: dennis.snell@automattic.com (Dennis Snell)
To: Internals
Subject: [PHP-DEV] Re: [RFC] Decoding HTML and the Ambiguous Ampersand
Date: Fri, 6 Sep 2024 12:00:34 -0700
Message-ID: <2CC0B174-B27E-4278-9435-32A1A3885D01@automattic.com>

All,

I have updated the RFC document by adding a section on the proposed HtmlContext enum, with some extra contexts beyond those originally discussed (but which were added to the implementation).

As I’ve been a bit distracted, this has taken a bit of a backseat, but I am still interested in keeping it moving forward.

https://wiki.php.net/rfc/decode_html

Warmly,
Dennis Snell

> On Jul 9, 2024, at 4:55 PM, Dennis Snell wrote:
>
> Greetings all,
>
> The `html_entity_decode( … ENT_HTML5 … )` function has a number of issues that I’d like to correct.
>
> - It’s missing 720 of HTML5’s specified named character references.
> - 106 of these are named character references which do not require a trailing semicolon, such as `&acute`
> - It’s unaware of the ambiguous ampersand rule, which allows these 106 in special circumstances.
>
> HTML5 asserts that the list of named character references will not expand in the future.
> It can be found authoritatively at the following URL:
>
> https://html.spec.whatwg.org/entities.json
>
> The ambiguous ampersand rule smooths over legacy behavior from before HTML5 where ampersands were not properly encoded in attribute values, specifically in URL values. For example, in a query string for a search, one might find `?q=dog&not=cat`. The `&not` in that value would decode to U+00AC (¬), but since it’s in an attribute value it will be left as plaintext. Inside normal HTML markup it would transform into `?q=dog¬=cat`. There are related nuances when numeric character references are found at the end of a string or boundary without the semicolon.
>
> The function signature of `html_entity_decode()` does not currently allow for correcting this behavior. I’d like to propose an RFC or a bug fix which either extends the function (perhaps by adding a new flag like `ENT_AMBIGUOUS_AMPERSAND`) or, preferably, creates a new function. For the missing character references I wonder if it would be enough to add them to the list of default translatable references.
>
> One challenge with the existing function is that the concept of the translation table stands in contrast with the fixed and static nature of HTML5’s replacement tables.
> A new function or set of functions could open up spec-compliant decoding while providing helpful methods that are necessary in many common server-side operations:
>
> - `html_decode( 'attribute' | 'data', $raw_text, $input_encoding = 'utf-8' )`
> - `html_text_contains( 'attribute' | 'data', $raw_haystack, $needle, $input_encoding = 'utf-8' )`
> - `html_text_starts_with( 'attribute' | 'data', $raw_haystack, $needle, $input_encoding = 'utf-8' )`
>
> These methods are handy for inspecting things like encoded attribute values in a memory-efficient and processing-efficient way, when it’s not necessary to decode the entire value. In common situations, one encounters data-URIs with potentially megabytes of image data, and processing only the first few or tens of bytes can save a lot of overhead.
>
> We’re exploring pure-PHP solutions to these problems in WordPress in an attempt to improve the reliability and safety of handling HTML. I’d love to hear your thoughts and know if anyone is willing to work with me to create an RFC or directly propose patches. We’ve created a step function which allows finding the next character reference and decoding it separately, enabling some novel features like highlighting the character references in source text.
>
> Should I propose an RFC for this?
>
> Warmly,
> Dennis Snell
> Automattic Inc.
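
For readers who want to see the ambiguous ampersand rule from the quoted message in action, here is a minimal sketch. It is not the RFC's implementation: `sketch_html_decode()` and its boolean `$in_attribute` parameter are hypothetical names invented for this illustration, the lookup table holds only three of the legacy names, and numeric character references are ignored entirely.

```php
<?php
declare(strict_types=1);

/**
 * Illustrative only: a hypothetical helper, not part of PHP or the RFC.
 * Decodes a tiny subset of legacy named character references and applies
 * the ambiguous ampersand rule when $in_attribute is true.
 */
function sketch_html_decode(string $text, bool $in_attribute): string
{
    // Assumed subset of the names HTML allows without a trailing semicolon.
    $legacy = ['acute' => "\u{B4}", 'amp' => '&', 'not' => "\u{AC}"];

    $out = '';
    $len = strlen($text);
    $i   = 0;

    while ($i < $len) {
        if ($text[$i] !== '&') {
            $out .= $text[$i++];
            continue;
        }

        // Look for one of the known names directly after the "&".
        $name = null;
        foreach (array_keys($legacy) as $candidate) {
            if (substr($text, $i + 1, strlen($candidate)) === $candidate) {
                $name = $candidate;
                break;
            }
        }
        if ($name === null) {
            $out .= $text[$i++]; // not a known reference; keep the "&"
            continue;
        }

        $end           = $i + 1 + strlen($name);         // index just past the name
        $next          = $end < $len ? $text[$end] : ''; // '' at end of input
        $has_semicolon = $next === ';';

        // Ambiguous ampersand rule: inside an attribute value, a reference
        // without a semicolon that is followed by "=" or an ASCII
        // alphanumeric is left as literal text.
        $blocks_decoding = $next === '=' || preg_match('/^[A-Za-z0-9]$/', $next) === 1;
        if ($in_attribute && !$has_semicolon && $blocks_decoding) {
            $out .= substr($text, $i, $end - $i);
            $i = $end;
            continue;
        }

        $out .= $legacy[$name];
        $i = $end + ($has_semicolon ? 1 : 0);
    }

    return $out;
}
```

With the query string from the message, `sketch_html_decode('?q=dog&not=cat', true)` leaves the value unchanged, while passing `false` (normal markup) turns `&not` into U+00AC. A real decoder would need the full table from entities.json, longest-match lookup across all names, and the numeric reference handling the message mentions.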