Newsgroups: php.internals
Xref: news.php.net php.internals:121176
Message-ID: <1f3948c1-7b4c-430c-b686-031274dbd328@gmail.com>
Date: Fri, 29 Sep 2023 18:38:27 +0200
From: dossche.niels@gmail.com (Niels Dossche)
To: internals@lists.php.net
Subject: Re: [PHP-DEV] Re: [RFC] [Discussion] DOM HTML5 parsing and serialization support
References: <48c7bb29-a52c-416e-b855-be2746dc7a84@gmail.com> <39900ce4-56b1-2397-ee9c-c9b7086b33cb@mabe.berlin> <0c2d7e79-2c74-4263-81ec-ac8832ca50bb@gmail.com>

Hi Tim

On 29/09/2023 18:06, Tim Düsterhus wrote:
> Hi
>
> On 9/29/23 17:45, Niels Dossche wrote:
>> Right, we follow the HTML spec in this regard. Roughly speaking we determine the charset in the following order of priorities.
>> If one option fails, it will fall through to the next one.
>> 1. The Content-Type HTTP header from which you loaded the document.
>
> How would the new document classes make use of that? The HTTP header is transmitted out-of-band with regard to the actual payload.
>
> Is this referring to passing a `http://` path to HTMLDocument::createFromFile()? This would be unusable for everyone who manually downloads the document, e.g. using a PSR-18 HTTP Client.

When the stream wrapper contains header information that information is used indeed.
That would unfortunately indeed mean it's unusable when manually passed in.

>
> It might actually be necessary to add an encoding parameter to these functions, but it would need to take priority over anything implicit. The current $encoding of the global \DOMDocument has the problem that it doesn't take priority/is ignored entirely.
> Manually converting the document to UTF-8 before passing it to \DOMDocument has the problem that the meta tag in the document takes priority.
>
> In fact I've run into this issue before for the implementation of a rich embed feature. We're downloading the websites using Guzzle and attempt to make sense of them with \DOMDocument. However we can't reliably force the encoding given within the 'content-type' response header, so in some cases we obtain mojibake.
>
> This encoding parameter would likely need to be `?string $encoding = null` with everything non-null overwriting implicit detection and null meaning implicit detection in the order of priorities you mentioned.

I agree. I'll add the optional argument `?string $override_encoding = null` to XML/HTMLDocument::createFromString and XML/HTMLDocument::createFromFile.
I'd call it override_encoding to emphasize it's about overriding the behaviour.

>
>> 2. BOM sniffing in the content. I.e. UTF-8 with BOM and UTF-16 LE/BE prepend the content with byte markers. This is used to detect encoding.
>> 3. Meta tag in the content.
>
> Best regards
> Tim Düsterhus

Kind regards
Niels
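
For illustration only, the fall-through order discussed in this thread (explicit override, then Content-Type header, then BOM, then meta tag) could be sketched roughly like this in userland PHP. This is not the RFC's implementation: the function name `detectEncoding` and its parameters are hypothetical, the meta tag lookup is a simplified regex rather than the prescan the HTML spec describes, and it assumes PHP 8.0+ for `str_starts_with()`.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of the RFC): picks a charset label using the
 * priority order described in this thread. Returns null if nothing matches.
 */
function detectEncoding(
    string $content,
    ?string $contentTypeHeader = null,
    ?string $overrideEncoding = null
): ?string {
    // 0. A non-null override takes priority over all implicit detection.
    if ($overrideEncoding !== null) {
        return $overrideEncoding;
    }

    // 1. charset parameter of the Content-Type HTTP header, if one is known.
    if ($contentTypeHeader !== null
        && preg_match('/charset\s*=\s*"?([^\s;"]+)/i', $contentTypeHeader, $m) === 1
    ) {
        return $m[1];
    }

    // 2. BOM sniffing: byte markers prepended to the content.
    if (str_starts_with($content, "\xEF\xBB\xBF")) {
        return 'UTF-8';
    }
    if (str_starts_with($content, "\xFF\xFE")) {
        return 'UTF-16LE';
    }
    if (str_starts_with($content, "\xFE\xFF")) {
        return 'UTF-16BE';
    }

    // 3. Meta tag in the content (assumption: a simplified regex over the
    //    first 1024 bytes, not the full algorithm from the HTML spec).
    $head = substr($content, 0, 1024);
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([^\s"\'\/>]+)/i', $head, $m) === 1) {
        return $m[1];
    }

    // Nothing matched; the caller has to choose a fallback.
    return null;
}
```

With a helper like this, a manually downloaded body could be checked as `detectEncoding($body, 'text/html; charset=ISO-8859-1')`, which would pick the header's charset even when a meta tag in `$body` declares something else. That is the case the proposed `$override_encoding` argument is meant to cover.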