Newsgroups: php.internals
Message-ID: <574563c3-6970-43ce-ad4d-860975a8602f@gmail.com>
Date: Fri, 29 Sep 2023 23:18:08 +0200
To: Dennis Snell
Cc: internals@lists.php.net
References: <0c2d7e79-2c74-4263-81ec-ac8832ca50bb@gmail.com> <5F88257F-58CB-45AA-B2CC-ABEA19AD7474@automattic.com>
In-Reply-To: <5F88257F-58CB-45AA-B2CC-ABEA19AD7474@automattic.com>
Subject: Re: [PHP-DEV] Re: [RFC] [Discussion] DOM HTML5 parsing and serialization support
From: dossche.niels@gmail.com (Niels Dossche)

Hi Dennis

On 9/29/23 20:20, Dennis Snell wrote:
>>
>>>
>>> For both, `XMLDocument::fromEmpty` and `HTMLDocument::createEmpty` there is an argument available to define the encoding but none of the other `createFrom*` methods have this argument.
>>>
>>> As far as I understand, in these other cases the encoding gets detected from the content of the passed source but what happens if the source does not contain any information about the encoding? E.g. you load an XML/HTML document over HTTP, the encoding is defined via HTTP header but the content itself doesn't contain it.
>>>
>>
>> Right, we follow the HTML spec in this regard. Roughly speaking we determine the charset in the following order of priorities.
>> If one option fails, it will fall through to the next one.
>> 1. The Content-Type HTTP header from which you loaded the document.
>> 2. BOM sniffing in the content. I.e. UTF-8 with BOM and UTF-16 LE/BE prepend the content with byte markers. This is used to detect encoding.
>> 3. Meta tag in the content.
>>
>> If it could not be determined at all, UTF-8 will be assumed as it's the default in HTML.
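
The fall-through order quoted above can be sketched in a few lines. This is only a rough Python illustration of the priority list as written in this thread, not the DOM extension's actual implementation. The function name `sniff_encoding`, the regular expressions, and the 1024-byte scan window are assumptions made for the example, and a real sniffer following the HTML spec handles many more cases.

```python
import re
from typing import Optional

# Byte order marks checked in step 2 (illustrative subset).
_BOMS = (
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xff\xfe", "UTF-16LE"),
    (b"\xfe\xff", "UTF-16BE"),
)
# Simplified patterns; assumptions for this sketch, not spec-accurate parsing.
_HEADER_CHARSET = re.compile(r'charset\s*=\s*"?([^\s;"]+)', re.IGNORECASE)
_META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)


def sniff_encoding(body: bytes, content_type: Optional[str] = None) -> str:
    """Hypothetical helper: return an encoding label for `body`."""
    # 1. Charset parameter of the Content-Type HTTP header, if one was given.
    if content_type:
        match = _HEADER_CHARSET.search(content_type)
        if match:
            return match.group(1).upper()
    # 2. Byte order mark at the very start of the content.
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name
    # 3. A <meta> charset declaration near the start of the content.
    match = _META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").upper()
    # 4. Nothing was found: fall back to UTF-8.
    return "UTF-8"
```

For a short fragment such as `b"<p>hi</p>"` with no header, none of the first three steps match, so the sketch lands on the UTF-8 fallback. That is the situation discussed below, where the encoding has to be supplied rather than detected.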
>
> It may sound meticulous, but I’ve tried to emphasize `createFragment()` in what’s being built in WordPress because almost everything being done on HTML within WordPress, and I think within many frameworks, is processing fragments (and usually short ones at that). Formerly I didn’t realize there was much of a difference, but text encoding is one of those differences. It’s my understanding that when parsing a fragment we have to assume an encoding, unless the fragment is starting at a spot in the document before that’s discovered, presumably only if we’ve constructed a Document with a still-unknown encoding.
>

Just chiming in here to say that while we don't offer a createFragment() in this proposal, it's possible to parse fragments by passing the LIBXML_HTML_NOIMPLIED option.
Alternatively, in the future I plan to offer innerHTML, which you could then use in conjunction with createDocumentFragment().

> So manually setting the encoding of a fragment constructor is not so much overriding as it is supplying, or at least, that’s one of two normative situations. If we create a fragment with a context node carrying an encoding already, then we need to ignore any meta tag that specifies otherwise; likewise if the context node doesn’t carry that encoding we do need to heed it.

Sure, I agree it's not overriding in that specific case. In other cases it can be. There may not be an ideal name that works for all cases.

>
> I know there’s a huge difference in needs here between people writing scripts to scrape full HTML documents, but it’s not a small fraction of cases where people want to use DOMDocument without having the full HTML from start to finish. In the world I work in it’s usually either for parsing a small fragment to add some attributes or replace a wrapping tag, or for constructing HTML programmatically to avoid escaping issues and make nesting easy. In both of these cases the text encoding is implicit unless the function signature makes it explicit. At this stage in development, we only support some of the “in body” parsing and only support UTF-8, but I thought that it was important enough to add these as arguments to the creator function so that there’s an awareness that these values govern how the parse occurs.
>
> Surely for `createFromString()` and `createEmpty()` we can make the assumption that no character encoding is set, but I also suspect that a possible majority of the times people use these functions they are likely calling them when `createFragment()` is more appropriate, that they aren’t supplying HTML documents with in-band text encoding information, and so there’s a chance that de-emphasizing the parameter may be technically more accurate and practically less helpful.

Thanks for the insight. To be honest, I don't have strong feelings about the naming of the parameter.
In a way you could say that override_encoding is still accurate: the fallback default is UTF-8, so in a sense you override the fallback. As the documentation also emphasizes that the DOM extension works internally with UTF-8, this may align with programmers' expectations, but I'm not sure.
I think we should document the parameter very well in the docs.

>
> Love seeing all the continued work on this!
> Thank you so much for your dedication to it.
>
> Dennis Snell

Kind regards
Niels