From: dennis.snell@automattic.com (Dennis Snell)
Subject: Re: [PHP-DEV] [RFC] Decoding HTML and the Ambiguous Ampersand
Date: Sat, 24 Aug 2024 15:31:17 -0500
To: Jakob Givoni
Cc: Niels Dossche, Internals
List-Id: internals.lists.php.net
Message-ID: <90B08F35-06D5-4D5C-BA7B-B7116EE18769@automattic.com>

On Aug 24, 2024, at 2:56 PM, Jakob Givoni <jakob@givoni.dk> wrote:

> Hi Dennis,
>
> Overall it sounds like a reasonable RFC.
>
> > Dennis:
> >
> > > Niels:
> > >
> > > I'm not so sure that the name "decode_html" is self-descriptive enough, it sounds very generic.
> >
> > The name is not very important to me. For the sake of history, the reason I have chosen “decode HTML” is because, unlike an HTML parser, this is focused on taking a snippet of HTML “text” content and decoding it into a “plain PHP string.”
>
> Why not make it two methods called "decode_html_text" and "decode_html_attribute"?
> Consider the following reasons:
> 1. The function doesn't actually decode html as such, it decodes either an html text node string or an html attribute string.

Thanks Jakob. In WordPress I did just this.
https://developer.wordpress.org/reference/classes/wp_html_decoder/

Part of the reason for that was the inability to require something like an enum (due to PHP version support requirements). The Enum solution feels very nice too.

> 2. Saves the $context parameter and the constants/enums, making the call significantly shorter.

In my PR I’ve actually expanded the Enum to include a few other contexts. I feel like there’s a balance we have to strike if we want to walk the line between fully reliable and fully convenient. On one hand, we could say “don’t send the text content of a SCRIPT element to this function!” But on the other hand, that kind of forces people to already know that SCRIPT content is different.

With the Enum there is that built-in training material when someone looks and finds `Attribute | BodyText | ForeignText | Script | Style` (the contexts I’ve explored in my PR).

We could make the same argument for `decode_html_script()` and `decode_foreign_text_node()` and `decode_html_style()`. Somehow the context feels cleaner to me, and like a single entry point for learning instead of five.
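
As a rough illustration of the single entry point, here is a minimal sketch that assumes PHP 8.1+ enums. The case names come from the list above; the helper `context_decodes_references()` is hypothetical and is not part of the PR. It only encodes what the HTML standard says about raw text elements (character references are not decoded inside SCRIPT and STYLE content). Whether the PR's `Script` and `Style` contexts behave exactly this way is an assumption.

```php
<?php
// Case names are taken from the message; everything else is illustrative.
enum HtmlContext
{
    case Attribute;
    case BodyText;
    case ForeignText;
    case Script;
    case Style;
}

// Hypothetical helper: does this context decode character references at all?
// SCRIPT and STYLE are raw text elements in HTML, so "&amp;" stays "&amp;".
function context_decodes_references(HtmlContext $context): bool
{
    return match ($context) {
        HtmlContext::Script, HtmlContext::Style => false,
        HtmlContext::Attribute, HtmlContext::BodyText, HtmlContext::ForeignText => true,
    };
}

// The full set of contexts is discoverable from the type itself.
$names = array_map(fn (HtmlContext $c) => $c->name, HtmlContext::cases());
```

Because `match` must be exhaustive, adding a sixth context later would surface every place that has not yet decided how to treat it.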

> 3. It feels like decoding either text or attribute are two significantly different things. I admit I could be wrong, if code like decode_html($e->isAttritbute() ? HtmlContext::Attribute : HtmlContext::Text, $e->getContent()) is likely to be seen.

None of these contexts are significantly different, which is one of the major dangers of using `html_entity_decode()`. The results will look just about right most of the time. It’s the subtle differences that matter most, I suppose. Thankfully, in most places I’ve seen them blurred together, the intent of the code makes clear which is which.

    preg_replace_callback(
        '~<a[^>]+href="([^"]+)"[^>]*>([^<]+)</a>~',
        function ( $m ) {
            // $m[2] is text-node content and $m[1] is an attribute value,
            // yet both go through the same context-unaware decoder.
            $title = str_replace( ']', '\]', html_entity_decode( $m[2] ) );
            $url = str_replace( ')', '\)', html_entity_decode( $m[1] ) );
            return "[{$title}]({$url})";
        },
        $post_content
    );

The lesson I have drawn is that people frequently have what they understand to be a text node or an attribute value, but they aren’t aware that they are supposed to decode differently, and they also aren’t reaching to interact with a full parser to get these values. If PHP could train people as they use these functions, purely through their interfaces, I think that could help elevate the level of reliability out there in the wild, as long as they aren’t too cumbersome (hence explicitly no default context argument _or_ using separately-named functions).
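
To make “decode differently” concrete, here is a minimal sketch, not the RFC's implementation. The function name `sketch_decode()` and its boolean flag are hypothetical, and it knows only four of the names HTML allows without a trailing semicolon. It shows the one rule in the HTML standard that separates the two contexts: inside an attribute value, an unterminated reference followed by `=` or an ASCII alphanumeric is left as literal text, while in a text node it is still decoded.

```php
<?php
// Hypothetical toy decoder: handles a tiny subset of named references only.
function sketch_decode(string $html, bool $is_attribute): string
{
    // A few legacy names that HTML accepts without a trailing semicolon.
    $legacy = ['amp' => '&', 'lt' => '<', 'gt' => '>', 'not' => "\u{AC}"];

    $decoded = preg_replace_callback(
        // name, optional semicolon, optional following "=" or alphanumeric
        '/&(amp|lt|gt|not)(;?)([=A-Za-z0-9]?)/',
        function (array $m) use ($legacy, $is_attribute): string {
            [$whole, $name, $semicolon, $next] = $m;

            // Attribute rule: the "ambiguous ampersand" stays literal.
            if ($is_attribute && $semicolon === '' && $next !== '') {
                return $whole;
            }

            // Otherwise decode, and put back the character we peeked at.
            return $legacy[$name] . $next;
        },
        $html
    );

    return $decoded ?? $html;
}
```

With this sketch, `&notit` becomes `¬it` as text but stays `&notit` as an attribute value, and `?a=1&lt=2` keeps its query string intact only in the attribute context. A real decoder would also match the longest name in the full table (`&notin;`, for example) and handle numeric references.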

Having the Enum I think enhances the ease with which people can reliably also decode things like SCRIPT and STYLE nodes. “I know `html_decode_text()` but I don’t know what the rules for SCRIPT are or if they’re different so I’ll just stick with that.” vs “My IDE suggests that `Script` is a different context, that’s interesting, I’ll try that and see how it’s different.”

> But I somehow don't foresee a lot of situations where text and attribute strings end up in the same code path?

The underlying reason I started this work was in support of building an HTML parser. We have a streaming parser which relies on a different parsing model than those built purely on the state machine in the specification, taking advantage of what we can to eke out performance in PHP code. For this, the strings are in the same path, and in this work I’ve come across a number of other common use-cases where the flow is the same but the decoder needs to know the context.

 - Normalizing HTML from “tag soup” to standard serialized form.
 - Sanitizing code wanting to inspect values from different parts of the markup.
 - Sanitizing rules engines providing configurations or DSLs for sanitization.
 - Live optimizers or analyzers to improve the output HTML leaving a server.

It’s one of those things that when it becomes trivial to start getting reliable transforms from the HTML syntax to the decoded text, more opportunities appear that never seemed practical before.


> A couple of other options that would silence anyone opposed to implicitly favouring utf-8:
> html_text_to_utf8 and html_attribute_to_utf8

The names started with these 😀. I do agree that it gets a bit excessive, though, to the point where it risks people not adopting them purely because they don’t want to type that long of a name every time they use it. Perhaps some of these 🙃

    str_from_html( HtmlContext $context, string $html ): string {}

    utf8_from_html( HtmlContext $context, string $html ): string {}

    html_to_utf8( HtmlContext $context, string $html ): string {}


> Best,
> Jakob

Thanks for your input. I’m grateful for the discussions and that people are sharing.

Dennis Snell
