Newsgroups: php.internals
Xref: news.php.net php.internals:121175
Date: Fri, 29 Sep 2023 18:06:52 +0200
To: Niels Dossche, internals@lists.php.net
References: <48c7bb29-a52c-416e-b855-be2746dc7a84@gmail.com> <39900ce4-56b1-2397-ee9c-c9b7086b33cb@mabe.berlin>
 <0c2d7e79-2c74-4263-81ec-ac8832ca50bb@gmail.com>
In-Reply-To: <0c2d7e79-2c74-4263-81ec-ac8832ca50bb@gmail.com>
Subject: Re: [PHP-DEV] Re: [RFC] [Discussion] DOM HTML5 parsing and serialization support
From: tim@bastelstu.be (Tim Düsterhus)

Hi

On 9/29/23 17:45, Niels Dossche wrote:
> Right, we follow the HTML spec in this regard. Roughly speaking we determine the charset in the following order of priorities.
> If one option fails, it will fall through to the next one.
> 1. The Content-Type HTTP header from which you loaded the document.

How would the new document classes make use of that? The HTTP header is transmitted out-of-band with regard to the actual payload. Is this referring to passing a `http://` path to HTMLDocument::createFromFile()? This would be unusable for everyone who manually downloads the document, e.g. using a PSR-18 HTTP Client.

It might actually be necessary to add an encoding parameter to these functions, but it would need to take priority over anything implicit. The current $encoding of the global \DOMDocument has the problem that it doesn't take priority/is ignored entirely. Manually converting the document to UTF-8 before passing it to \DOMDocument has the problem that the meta tag in the document takes priority.

In fact I've run into this issue before for the implementation of a rich embed feature. We're downloading the websites using Guzzle and attempt to make sense of them with \DOMDocument. However we can't reliably force the encoding given within the 'content-type' response header, so in some cases we obtain mojibake.

This encoding parameter would likely need to be `?string $encoding = null` with everything non-null overwriting implicit detection and null meaning implicit detection in the order of priorities you mentioned.

> 2. BOM sniffing in the content. I.e. UTF-8 with BOM and UTF-16 LE/BE prepend the content with byte markers.
> This is used to detect encoding.
> 3. Meta tag in the content.

Best regards
Tim Düsterhus
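
To make the proposed precedence concrete, here is a minimal sketch of the selection logic: an explicit, non-null encoding wins outright, and null falls back to implicit detection (BOM, then meta tag). It is written in Python purely for illustration and is not the PHP API. The function name `pick_encoding`, the `default` fallback, and the simplified meta-tag regex are all assumptions, and the regex is not the HTML spec's prescan algorithm.

```python
import codecs
import re
from typing import Optional

# Byte order marks for the encodings mentioned in the thread.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)

# Simplified assumption: only matches <meta charset="...">, not the
# http-equiv form, and does not follow the HTML spec's prescan rules.
_META_CHARSET = re.compile(
    rb"<meta\s+charset=[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def pick_encoding(
    payload: bytes,
    encoding: Optional[str] = None,
    default: str = "utf-8",  # hypothetical fallback, chosen for the sketch
) -> str:
    """Return the encoding to decode `payload` with (hypothetical helper)."""
    # A non-null encoding overrides every implicit signal in the payload.
    if encoding is not None:
        return encoding
    # Implicit detection, step 1: BOM sniffing.
    for bom, name in _BOMS:
        if payload.startswith(bom):
            return name
    # Implicit detection, step 2: meta tag near the start of the document.
    match = _META_CHARSET.search(payload[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    return default
```

A caller who downloaded the page with their own HTTP client would pass the charset taken from the Content-Type response header as `encoding`, so it can no longer be overridden by a conflicting meta tag in the body.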