Newsgroups: php.internals
From: dossche.niels@gmail.com (Niels Dossche)
Date: Fri, 23 Aug 2024 00:01:47 +0200
Subject: Re: [PHP-DEV] Re: [RFC] Decoding HTML and the Ambiguous Ampersand
To: internals@lists.php.net
References: <76D9E1DA-57CE-45C3-8E3E-B08A0B70FB60@a8c.com>
In-Reply-To: <76D9E1DA-57CE-45C3-8E3E-B08A0B70FB60@a8c.com>

On 20/08/2024 00:45, Dennis Snell wrote:
>
>> On Jul 9, 2024, at 4:55 PM, Dennis Snell wrote:
>>
>> Greetings all,
>>
>> The `html_entity_decode( … ENT_HTML5 … )` function has a number of issues that I’d like to correct.
>>
>>  - It’s missing 720 of HTML5’s specified named character references.
>>  - 106 of these are named character references which do not require a trailing semicolon, such as `&acute`
>>  - It’s unaware of the ambiguous ampersand rule, which allows these 106 in special circumstances.
>>
>> HTML5 asserts that the list of named character references will not expand in the future. It can be found authoritatively at the following URL:
>>
>> https://html.spec.whatwg.org/entities.json
>>
>> The ambiguous ampersand rule smooths over legacy behavior from before HTML5 where ampersands were not properly encoded in attribute values, specifically in URL values. For example, in a query string for a search, one might find `?q=dog&not=cat`. The `&not` in that value would decode to U+AC (¬), but since it’s in an attribute value it will be left as plaintext. Inside normal HTML markup it would transform into `?q=dog¬=cat`.
>> There are related nuances when numeric character references are found at the end of a string or boundary without the semicolon.
>>
>> The function signature of `html_entity_decode()` does not currently allow for correcting this behavior. I’d like to propose an RFC or a bug fix which either extends the function (perhaps by adding a new flag like `ENT_AMBIGUOUS_AMPERSAND`) or preferably creates a new function. For the missing character references I wonder if it would be enough to add them to the list of default translatable references.
>>
>> One challenge with the existing function is that the concept of the translation table stands in contrast with the fixed and static nature of HTML5’s replacement tables. A new function or set of functions could open up spec-compliant decoding while providing helpful methods that are necessary in many common server-side operations:
>>
>>   - `html_decode( 'attribute' | 'data', $raw_text, $input_encoding = 'utf-8' )`
>>   - `html_text_contains( 'attribute' | 'data', $raw_haystack, $needle, $input_encoding = 'utf-8' )`
>>   - `html_text_starts_with( 'attribute' | 'data', $raw_haystack, $needle, $input_encoding = 'utf-8' )`
>>
>> These methods are handy for inspecting things like encoded attribute values in a memory-efficient and processing-efficient way, when it’s not necessary to decode the entire value. In common situations, one encounters data-URIs with potentially megabytes of image data and processing only the first few or tens of bytes can save a lot of overhead.
>>
>> We’re exploring pure-PHP solutions to these problems in WordPress in attempts to improve the reliability and safety of handling HTML. I’d love to hear your thoughts and know if anyone is willing to work with me to create an RFC or directly propose patches. We’ve created a step function which allows finding the next character reference and decoding it separately, enabling some novel features like highlighting the character references in source text.
>>
>> Should I propose an RFC for this?
>>
>> Warmly,
>> Dennis Snell
>> Automattic Inc.
>
> Thanks everyone for your feedback so far on the `decode_html()` RFC [https://wiki.php.net/rfc/decode_html]
>
> I’ve updated it replacing the new constants with a new `HtmlContext` enum, and the interface seems much nicer this way. I particularly like how PHP enforces passing a valid value, vs. hoping that the right flag is used.
>
> Additionally I added a section that I previously forgot, which highlights the source of the infamous mojibake/gremlins. HTML has special rules for remapping the C1 control characters, as if they had been stored or recorded for Windows-1252.
>
> Warmly,
> Dennis Snell
>

Hi Dennis

+1 on the concept.

I just have two concerns:

1) I'm not so sure that the name "decode_html" is self-descriptive enough; it sounds very generic.
2) I would strongly suggest exploring an implementation based on Lexbor. I'm pretty confident that it can be done by reusing the internal APIs. The advantage is that it will be less code to maintain. You pull off some fancy tricks in your implementation for performance reasons, but that also adds to complexity and maintenance burden. Also, since this is C, we must be extra careful when implementing tricks. If we could have a single implementation, that would be great.

I do understand of course your concern that DOM is not a required extension, and that basing the internals on Lexbor therefore ties the function to the DOM extension, which may not be available. I suspect, however, that a large share of the people needing a function like this have DOM available (as DOM is required by many HTML-processing-related packages). I can also look into it sometime soon if you want; anyway, feel free to ping me.

And I do have the following thoughts:

1) We should already amend the ENT_HTML5-related docs to state that it's not compliant.
2) Perhaps ENT_HTML5 should be deprecated. E.g.
you could say in your RFC that ENT_HTML5 will be deprecated in the release after the version that will have decode_html(). The reason I suggest the release _after_ and not the _same_ release is that I strongly believe we should have at least one version where the proper alternative is available without already forcing a deprecation on users.

Kind regards
Niels
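
For readers following the thread, the ambiguous ampersand rule described in the quoted message can be illustrated with a minimal PHP sketch. Everything named here is an assumption made for illustration: `sketch_decode()` and `SKETCH_REFERENCES` are hypothetical, the table holds only a handful of entries instead of the full list from entities.json, and the code is neither the RFC's `decode_html()` nor a Lexbor-based implementation. It only shows why `&not` stays untouched in an attribute value but decodes in ordinary markup.

```php
<?php
// Hypothetical miniature table, ordered so that longer names are tried first
// and the first hit is the longest match. The authoritative list lives at
// https://html.spec.whatwg.org/entities.json.
const SKETCH_REFERENCES = [
    'notin;' => "\u{2209}",
    'not;'   => "\u{AC}",
    'amp;'   => '&',
    'not'    => "\u{AC}", // legacy name, valid without a semicolon
    'amp'    => '&',      // legacy name, valid without a semicolon
];

// $context is 'attribute' or 'data', mirroring the signatures quoted above.
function sketch_decode(string $context, string $text): string
{
    $out = '';
    $length = strlen($text);
    $i = 0;
    while ($i < $length) {
        if ($text[$i] === '&') {
            foreach (SKETCH_REFERENCES as $name => $replacement) {
                if (substr($text, $i + 1, strlen($name)) !== $name) {
                    continue;
                }
                // Character right after the matched name ('' at end of input).
                $next = $text[$i + 1 + strlen($name)] ?? '';
                // Ambiguous ampersand: inside an attribute, a reference with
                // no semicolon followed by '=' or an ASCII alphanumeric is
                // left as plain text.
                $ambiguous = substr($name, -1) !== ';'
                    && $context === 'attribute'
                    && preg_match('/^[A-Za-z0-9=]$/', $next) === 1;
                if ($ambiguous) {
                    break; // fall through and copy the '&' literally
                }
                $out .= $replacement;
                $i += 1 + strlen($name);
                continue 2;
            }
        }
        $out .= $text[$i];
        $i++;
    }
    return $out;
}

echo sketch_decode('attribute', '?q=dog&not=cat'), "\n"; // ?q=dog&not=cat
echo sketch_decode('data', '?q=dog&not=cat'), "\n";      // ?q=dog¬=cat
```

The context argument is the only thing that changes between the two calls, which is why a flag-free signature such as the `HtmlContext` enum mentioned above fits this problem well.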