Newsgroups: php.internals
From: dennis.snell@automattic.com (Dennis Snell)
To: Niels Dossche
Cc: Internals
Subject: Re: [PHP-DEV] [RFC] Decoding HTML and the Ambiguous Ampersand
Date: Thu, 22 Aug 2024 18:02:13 -0500
Message-ID: <7ED2EE07-D7C6-43A4-A4E1-E9928E8B8D31@automattic.com>
References: <76D9E1DA-57CE-45C3-8E3E-B08A0B70FB60@a8c.com>

On Aug 22, 2024, at 5:01 PM, Niels Dossche <dossche.niels@gmail.com> wrote:

> On 20/08/2024 00:45, Dennis Snell wrote:

>>> On Jul 9, 2024, at 4:55 PM, Dennis Snell <dennis.snell@a8c.com> wrote:

>>> Greetings all,

>>> The `html_entity_decode( … ENT_HTML5 … )` function has a number of issues that I’d like to correct.

>>>  - It’s missing 720 of HTML5’s specified named character references.
>>>  - 106 of these are named character references which do not require a trailing semicolon, such as `&acute`
>>>  - It’s unaware of the ambiguous ampersand rule, which allows these 106 in special circumstances.

>>> HTML5 asserts that the list of named character references will not expand in the future. It can be found authoritatively at the following URL:

>>> https://html.spec.whatwg.org/entities.json

>>> The ambiguous ampersand rule smooths over legacy behavior from before HTML5 where ampersands were not properly encoded in attribute values, specifically in URL values. For example, in a query string for a search, one might find `?q=dog&not=cat`. The `&not` in that value would decode to U+00AC (¬), but since it’s in an attribute value it will be left as plaintext. Inside normal HTML markup it would transform into `?q=dog¬=cat`. There are related nuances when numeric character references are found at the end of a string or boundary without the semicolon.
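
As a rough illustration of the attribute-versus-data difference described above, here is a minimal Python sketch (Python is used only for brevity; this is not the RFC's implementation and not code from php-src). The function name `decode_legacy_refs`, its `in_attribute` flag, and the three-entry table are assumptions made for the example. A real decoder would use the full list from entities.json and would also handle numeric references.

```python
# Illustrative subset of the named references that HTML accepts without a
# trailing semicolon. The authoritative list is in entities.json.
LEGACY_REFS = {"acute": "\u00b4", "amp": "&", "not": "\u00ac"}


def decode_legacy_refs(text: str, in_attribute: bool) -> str:
    """Decode the references in LEGACY_REFS, honoring the attribute exception."""
    out = []
    i = 0
    while i < len(text):
        # Prefer the longest known name that starts right after the "&".
        name = None
        if text[i] == "&":
            candidates = [n for n in LEGACY_REFS if text.startswith(n, i + 1)]
            name = max(candidates, key=len, default=None)
        if name is None:
            out.append(text[i])
            i += 1
            continue

        end = i + 1 + len(name)
        following = text[end:end + 1]  # empty string at end of input
        has_semicolon = following == ";"
        blocked = (
            in_attribute
            and not has_semicolon
            and (following == "=" or (following.isascii() and following.isalnum()))
        )
        if blocked:
            # Historical exception: keep the raw characters untouched.
            out.append(text[i:end])
            i = end
        else:
            out.append(LEGACY_REFS[name])
            i = end + (1 if has_semicolon else 0)
    return "".join(out)
```

Under these assumptions, `decode_legacy_refs("?q=dog&not=cat", in_attribute=True)` returns the input unchanged, while the same call with `in_attribute=False` yields `?q=dog¬=cat`.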

>>> The function signature of `html_entity_decode()` does not currently allow for correcting this behavior. I’d like to propose an RFC or a bug fix which either extends the function (perhaps by adding a new flag like `ENT_AMBIGUOUS_AMPERSAND`) or preferably creates a new function. For the missing character references I wonder if it would be enough to add them to the list of default translatable references.

>>> One challenge with the existing function is that the concept of the translation table stands in contrast with the fixed and static nature of HTML5’s replacement tables. A new function or set of functions could open up spec-compliant decoding while providing helpful methods that are necessary in many common server-side operations:

>>>   - `html_decode( 'attribute' | 'data', $raw_text, $input_encoding = 'utf-8' )`
>>>   - `html_text_contains( 'attribute' | 'data', $raw_haystack, $needle, $input_encoding = 'utf-8' )`
>>>   - `html_text_starts_with( 'attribute' | 'data', $raw_haystack, $needle, $input_encoding = 'utf-8' )`

>>> These methods are handy for inspecting things like encoded attribute values in a memory-efficient and processing-efficient way, when it’s not necessary to decode the entire value. In common situations, one encounters data-URIs with potentially megabytes of image data and processing only the first few or tens of bytes can save a lot of overhead.
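
To make the "decode only as much as needed" idea concrete, here is a hedged Python sketch of a lazy prefix check. It borrows the name `html_text_starts_with` from the list above, but the signature is simplified (no context or encoding argument), the helper `iter_decoded` is hypothetical, and only semicolon-terminated named references are recognized. It relies on the `html5` table in Python's standard `html.entities` module.

```python
import re
from html.entities import html5  # keys such as "lt;" and "colon;" map to text

# Only semicolon-terminated named references are recognized in this sketch.
_NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*;)")


def iter_decoded(raw: str):
    """Lazily yield decoded characters from raw HTML text."""
    i = 0
    while i < len(raw):
        match = _NAMED_REF.match(raw, i) if raw[i] == "&" else None
        if match and match.group(1) in html5:
            # A reference may expand to more than one character.
            yield from html5[match.group(1)]
            i = match.end()
        else:
            yield raw[i]
            i += 1


def html_text_starts_with(raw_haystack: str, needle: str) -> bool:
    """Return True if the decoded haystack begins with needle."""
    decoded = iter_decoded(raw_haystack)
    # Stops at the first mismatch or as soon as the needle is exhausted,
    # so the rest of the haystack is never decoded.
    return all(next(decoded, None) == expected for expected in needle)
```

Because the generator is consumed only as far as the needle requires, a multi-megabyte data URI such as `data&colon;image/png;base64,…` can be checked against `data:image/` after decoding just its first few characters.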

>>> We’re exploring pure-PHP solutions to these problems in WordPress in attempts to improve the reliability and safety of handling HTML. I’d love to hear your thoughts and know if anyone is willing to work with me to create an RFC or directly propose patches. We’ve created a step function which allows finding the next character reference and decoding it separately, enabling some novel features like highlighting the character references in source text.

>>> Should I propose an RFC for this?

>>> Warmly,
>>> Dennis Snell
>>> Automattic Inc.

>> Thanks everyone for your feedback so far on the `decode_html()` RFC [https://wiki.php.net/rfc/decode_html]

>> I’ve updated it, replacing the new constants with a new `HtmlContext` enum, and the interface seems much nicer this way. I particularly like how PHP enforces passing a valid value, vs. hoping that the right flag is used.

>> Additionally I added a section that I previously forgot, which highlights the source of the infamous mojibake/gremlins. HTML has special rules for remapping the C1 control characters, as if they had been stored or recorded for Windows-1252.
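
For readers unfamiliar with the C1 remapping mentioned above, the following Python sketch shows the idea for numeric character references: code points 0x80 through 0x9F are treated as Windows-1252 bytes wherever that encoding defines a character. The function name `remap_c1` is an assumption for illustration, and the sketch leaves out the other numeric-reference rules (null, surrogates, out-of-range values).

```python
def remap_c1(code_point: int) -> str:
    """Map a numeric character reference's code point to text.

    Values 0x80-0x9F are reinterpreted as Windows-1252 bytes where that
    encoding defines a character; everything else passes through unchanged.
    """
    if 0x80 <= code_point <= 0x9F:
        try:
            return bytes([code_point]).decode("cp1252")
        except UnicodeDecodeError:
            # 0x81, 0x8D, 0x8F, 0x90 and 0x9D have no Windows-1252 mapping.
            return chr(code_point)
    return chr(code_point)
```

For example, `&#x93;` would come out as a left double quotation mark (U+201C) rather than the invisible C1 control U+0093.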

>> Warmly,
>> Dennis Snell

> Hi Dennis

> +1 on the concept.
> I just have two concerns:

Thanks Niels. I appreciate the help you’ve already provided on this process, and the work you’ve done with lexbor.


> 1) I'm not so sure that the name "decode_html" is self-descriptive enough, it sounds very generic.

The name is not very important to me. For the sake of history, I chose “decode HTML” because, unlike an HTML parser, this is focused on taking a snippet of HTML “text” content and decoding it into a “plain PHP string.”

The existing `html_entity_decode()` is very close in naming but ties this concept into _entities_, and overlooks other basic text decoding concerns (newline normalization and NULL byte handling).

Originally I had “utf8” in the name but someone else thought it was too long and specific. I want the name to educate developers and also be terse. Naming is hard.

> 2) I would strongly suggest exploring an implementation based on Lexbor. I'm pretty confident that it can be done by reusing the internal APIs. The advantage is that it will be less code to maintain. You pull off some fancy tricks in your implementation for performance reasons, but that also adds to complexity and maintenance burden. Also since this is C, we must be extra careful when implementing tricks.

Yeah I agree and I’ll share more below. The tricks I’m using in my PR implementing the RFC are partly there to propose adoption into PHP and partly there to get a real sense of my algorithm vs. those found in Chrome, Firefox, Safari, and lexbor. I’ve attempted to build a search algorithm for named character references that optimizes for cache locality rather than for algorithmic complexity under the assumption that RAM access is free.

My code isn’t currently well documented and doesn’t meet the PHP-src coding standards, but the algorithm is pretty basic and easy to explain. It’s also “unoptimized” for C, mostly. I think there are still large gains to be made that so far I’ve been unable to visualize incorporating into the lexbor parser. For example, `decode_html()` assumes we’re starting already with a span of text that is HTML text. We’re not making conditional decisions on whether the next byte produces a token that escapes out of the text parsing mode.

> If we could have a single implementation, that would be great. I do understand of course your concern that DOM is not a required extension, and therefore basing the internals on Lexbor makes it tied to the DOM extension which may not be available. I however suspect that a large chunk of people needing a function like this have DOM available (as DOM is required by many HTML-processing-related packages). I can also look into it sometime soon if you want; anyway feel free to ping me.

I’m also very open to lexbor-based approaches but I’ve so far found it more complicated than I expected. In some part this is because it involves setting up the parser and state machine for the HTML specification and much of the actual decoding can be safely done without this.

The other part is the extension aspect. I hear you, that you would expect calling code to have the DOM extensions available, but that’s simply not the case when developing a platform like WordPress, which I do. We don’t have control over the servers or environments where people are deploying this, and the availability of the DOM extensions is low enough that WordPress code simply cannot use `DOMDocument` (and it shouldn’t anyway, because of the wild problems `DOMDocument` has when attempting to parse HTML).

People resort to `html_entity_decode()` because that’s the only option. In WordPress we now have a spec-compliant decoder, but as it’s in user-space PHP its performance is far below what’s possible.

I’d love your help in setting up lexbor’s state machine to decode text nodes. I’d love it even more if this could be part of the PHP language. It constantly surprises me that _the language of the web_ (PHP) doesn’t have the tools to speak _the language of the web_ (HTML). This RFC is all about taking a step towards ensuring that PHP developers can rely on PHP to be a reliable middle-man between the HTML domain and the PHP domain.

In other words, requiring the DOM extension or `DOM\HtmlDocument` would be such a non-starter for WordPress (accounting for 43% of the web today) that the function would be completely unavailable there.


> And I do have the following thoughts:
> 1) We should already amend the ENT_HTML5-related docs to state that it's not compliant.
> 2) Perhaps ENT_HTML5 should be deprecated. E.g. you could say in your RFC that ENT_HTML5 will be deprecated in the release after the version that will have decode_html(). The reason I suggest the release _after_ and not the _same_ release is because I strongly believe that we should have at least one version where the proper alternative is available without forcing a deprecation to users already.

I love this suggestion. Just for reference, since I’ve looked before and not found it: can someone indicate where to find the PHP function documentation? There are a number of updates I would love to propose but I don’t know where to find the content that appears in https://www.php.net/manual/en/function.html-entity-decode.php, for instance.


> Kind regards
> Niels

Mad respect to the work you’ve brought to lexbor and to PHP. I’m excited to start relying on `\DOM\HtmlDocument` and have started using it in my benchmarks and HTML analysis as we develop the WordPress HTML API (a streaming, low memory-overhead, reentrant HTML parsing and manipulation framework in user-space PHP).

Dennis Snell
