From: Dennis Snell
To: Niels Dossche
Cc: internals@lists.php.net
Date: Fri, 29 Sep 2023 11:20:28 -0700
Subject: Re: [PHP-DEV] Re: [RFC] [Discussion] DOM HTML5 parsing and serialization support

>> For both, `XMLDocument::fromEmpty` and `HTMLDocument::createEmpty` there is an argument available to define the encoding but none of the other `createFrom*` methods have this argument.
>>
>> As far as I understand, in these other cases the encoding gets detected from the content of the passed source, but what happens if the source does not contain any information about the encoding? E.g. you load an XML/HTML document over HTTP, the encoding is defined via HTTP header but the content itself doesn't contain it.
>
> Right, we follow the HTML spec in this regard. Roughly speaking we determine the charset in the following order of priorities.
> If one option fails, it will fall through to the next one.
> 1. The Content-Type HTTP header from which you loaded the document.
> 2. BOM sniffing in the content. I.e. UTF-8 with BOM and UTF-16 LE/BE prepend the content with byte markers. This is used to detect encoding.
> 3. Meta tag in the content.
>
> If it could not be determined at all, UTF-8 will be assumed as it's the default in HTML.

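The fall-through order quoted above can be sketched roughly as follows. This is only an illustration in Python, not the RFC's implementation and not the full HTML encoding-sniffing algorithm: the function name `sniff_encoding`, the regular expressions, and the 1024-byte meta prescan window are all assumptions made for the example, and the steps simply follow the order as listed in the quoted message.

```python
import codecs
import re
from typing import Optional

# Byte-order marks checked in step 2, paired with the encoding they imply.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)

# Simplified patterns (assumptions for this sketch, not spec-accurate parsers).
_CHARSET_IN_HEADER = re.compile(r"charset\s*=\s*\"?([^\s;\"]+)", re.IGNORECASE)
_CHARSET_IN_META = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def sniff_encoding(body: bytes, content_type: Optional[str] = None) -> str:
    """Return an encoding label, trying each source in priority order."""
    # 1. Content-Type HTTP header, if the document was loaded over HTTP.
    if content_type:
        match = _CHARSET_IN_HEADER.search(content_type)
        if match:
            return match.group(1).lower()

    # 2. BOM sniffing at the very start of the content.
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name

    # 3. Meta tag near the start of the content.
    match = _CHARSET_IN_META.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()

    # Nothing found: assume UTF-8.
    return "utf-8"
```

A fragment such as `b"<p>hi</p>"` passed with no header falls all the way through to the UTF-8 default, which is exactly the situation the rest of this message is concerned with.
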
It may sound meticulous, but I’ve tried to emphasize `createFragment()` in what’s being built in WordPress because almost everything being done on HTML within WordPress, and I think within many frameworks, is processing fragments (and usually short ones at that). I didn’t formerly realize there was much of a difference, but text encoding is one of those differences. It’s my understanding that when parsing a fragment we have to assume an encoding, unless the fragment starts at a spot in the document before the encoding has been discovered, presumably only if we’ve constructed a Document with a still-unknown encoding.

So manually setting the encoding in a fragment constructor is not so much overriding as it is supplying, or at least, that’s one of two normative situations. If we create a fragment with a context node that already carries an encoding, then we need to ignore any meta tag that specifies otherwise; conversely, if the context node doesn’t carry an encoding, we do need to heed the meta tag.

I know there’s a huge difference in needs between people writing scripts to scrape full HTML documents and people working with fragments, but it’s not a small fraction of cases where people want to use DOMDocument without having the full HTML from start to finish. In the world I work in it’s usually either for parsing a small fragment to add some attributes or replace a wrapping tag, or for constructing HTML programmatically to avoid escaping issues and make nesting easy. In both of these cases the text encoding is implicit unless the function signature makes it explicit.

At this stage in development, we only support some of the “in body” parsing and only support UTF-8, but I thought it was important enough to add these as arguments to the creator function so that there’s an awareness that these values govern how the parse occurs.

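The context-node rule described above (a known context encoding wins over a meta tag, an unknown one defers to it) can be written down as a tiny decision function. This is a hedged sketch in Python rather than PHP; `resolve_fragment_encoding` and its parameters are hypothetical names for illustration and do not correspond to anything in the RFC or in WordPress. The final UTF-8 fallback follows the default mentioned in the quoted message.

```python
from typing import Optional


def resolve_fragment_encoding(
    context_encoding: Optional[str],
    meta_charset: Optional[str],
    default: str = "utf-8",
) -> str:
    """Pick the encoding to use when parsing an HTML fragment.

    context_encoding: encoding already carried by the context node, if any
                      (for example, one supplied by the caller).
    meta_charset:     charset declared by a meta tag inside the fragment, if any.
    """
    # The context already knows its encoding: a conflicting meta tag is ignored.
    if context_encoding is not None:
        return context_encoding

    # The context has no encoding yet: the meta tag must be heeded.
    if meta_charset is not None:
        return meta_charset

    # Neither source says anything: fall back to the assumed default.
    return default
```

Seen this way, an encoding argument on a fragment constructor fills the `context_encoding` slot rather than overriding something the parser would otherwise have detected.
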
Surely for `createFromString()` and `createEmpty()` we can assume that no character encoding is set. But I also suspect that, possibly in the majority of cases, people call these functions when `createFragment()` would be more appropriate, and that they aren’t supplying HTML documents with in-band text encoding information. So there’s a chance that de-emphasizing the parameter may be technically more accurate yet practically less helpful.

Love seeing all the continued work on this! Thank you so much for your dedication to it.

Dennis Snell