Newsgroups: php.internals
From: internals@lists.php.net ("Dennis Snell via internals")
To: Niels Dossche
Date: Fri, 29 Sep 2023 14:38:59 -0700 (PDT)
Subject: Re: [PHP-DEV] Re: [RFC] [Discussion] DOM HTML5 parsing and serialization support
Message-ID: <23E1FB16-8ED9-4EF5-B5E2-D9136AF638D2@automattic.com>
In-Reply-To: <574563c3-6970-43ce-ad4d-860975a8602f@gmail.com>

> Just chiming in here to say that while we don't offer a createFragment() in this proposal, it's possible to parse fragments by passing the LIBXML_HTML_NOIMPLIED option. Alternatively, in the future I plan to offer innerHTML which you could use then in conjunction with createDocumentFragment().

It’s not my understanding that this is right, because fragment parsing implies more than whether the HTML and BODY elements are added implicitly.

> Sets HTML_PARSE_NOIMPLIED flag, which turns off the automatic adding of implied html/body... elements.

The HTML5 spec defines fragment parsing as starting within a context node which exists within a broader document. For example, many people will parse a string of HTML that should form the contents of an LI element. They are grabbing that HTML from a database somewhere, or from user input.
If that HTML contains “” then our behavior diverges. In a fragment parser it would close out the list we started with, but in full document parsing mode the end tag would be ignored, a parse error. If the goal is to ensure that user input doesn’t break out and change the page, then it’s important to use fragment parsing and grab the inner contents of that LI context node.

This can be valuable to have as a tool to guard against injection attacks or against accidentally breaking the page someone is building, because the fragment parser is aware of its environment. It becomes even more important when parsing within RCDATA or RAWTEXT sections. For example, if you want to parse and analyze or manipulate a web page’s title, then the parser should treat everything as plaintext until it reaches the end or encounters a closing TITLE tag. If you try to do this with `createFromString()` then it’s up to the caller to remember to prepend and then remove the environment, `createFromString( '<title>' . $page_title . '</title>' )`. The fragment parser would be similar in practice, but more explicit and hard to misunderstand in these circumstances.

This is complicated stuff. I understand that the spec provides for a wide variety of use-cases and needs, and that it’s hard to pin down exactly what a spec-compliant parser is supposed to do in all situations (it depends), so I’m only wanting to share from the perspective of people doing a lot of small HTML manipulation. There’s not much code out there using the fragment parser, but I can’t help but think that part of the reason is that it’s not exposed where it ought to be.

Have a great weekend!
Dennis Snell
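
For readers who want to try the TITLE case described above, here is a minimal sketch of the "prepend and then remove the environment" workaround. It assumes PHP 8.4 or later, where the API proposed in this RFC is available as `Dom\HTMLDocument::createFromString()`. The helper name `parse_title_text()` and the sample inputs are hypothetical, chosen only for illustration. It is not a substitute for a real fragment parser.

```php
<?php
// Sketch only: assumes PHP 8.4+ (Dom\HTMLDocument from ext-dom).
// parse_title_text() is a hypothetical helper, not part of any library.

function parse_title_text(string $page_title): string
{
    // Prepend and append the TITLE "environment" by hand so the tokenizer
    // switches to RCDATA while it reads the caller's input.
    $html = '<title>' . $page_title . '</title>';

    // LIBXML_NOERROR silences parse-error warnings (for example the
    // missing doctype) that this tiny synthetic document would trigger.
    $doc = Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR);

    // Remove the environment again: keep only the TITLE contents.
    $title = $doc->getElementsByTagName('title')->item(0);

    return $title === null ? '' : $title->textContent;
}

// Markup-looking input stays literal text inside TITLE (RCDATA).
echo parse_title_text('Hello <b>world</b>'), "\n";

// A closing TITLE tag in the input ends the RCDATA section early, so
// whatever follows it is parsed as ordinary markup outside the title.
echo parse_title_text('a</title><p>escaped'), "\n";
```

The second call shows why the caller has to stay aware of the environment: once the input itself closes the TITLE, the rest of the string is no longer treated as title text. An explicit fragment-parsing entry point would make that context visible in the call instead of hiding it in string concatenation.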