Newsgroups: php.internals
Xref: news.php.net php.internals:121188
Message-ID: <4a50ed71-ee63-4db3-86d6-e463b551aa34@gmail.com>
Date: Fri, 29 Sep 2023 23:55:00 +0200
To: Dennis Snell
Cc: internals@lists.php.net
References: <0c2d7e79-2c74-4263-81ec-ac8832ca50bb@gmail.com> <5F88257F-58CB-45AA-B2CC-ABEA19AD7474@automattic.com> <574563c3-6970-43ce-ad4d-860975a8602f@gmail.com> <23E1FB16-8ED9-4EF5-B5E2-D9136AF638D2@automattic.com>
In-Reply-To: <23E1FB16-8ED9-4EF5-B5E2-D9136AF638D2@automattic.com>
Subject: Re: [PHP-DEV] Re: [RFC] [Discussion] DOM HTML5 parsing and serialization support
From: dossche.niels@gmail.com (Niels Dossche)

Hi Dennis

On 9/29/23 23:38, Dennis Snell wrote:
>> Just chiming in here to say that while we don't offer a createFragment() in this proposal, it's possible to parse fragments by passing the LIBXML_HTML_NOIMPLIED option. Alternatively, in the future I plan to offer innerHTML which you could use then in conjunction with createDocumentFragment().
>
>
> It’s not my understanding that this is right here, because fragment parsing implies more than having or not having the HTML and BODY elements implicitly.

Right. I plan on adding innerHTML/outerHTML in the near future. This RFC is a prerequisite for that.
As those properties invoke the html fragment parser this somewhat accomplishes what you'd like.
Additionally in the future we might also expose the fragment parser in a more low-level API. Depends on the demand of users and other feature requests that come in.

>
>
>> Sets HTML_PARSE_NOIMPLIED flag, which turns off the automatic adding of implied html/body... elements.
>
>
> The HTML5 spec defines fragment parsing as starting within a context node which exists within a broader document. For example, many people will parse a string of HTML that should form the contents of an LI element. They are grabbing that HTML from a database somewhere, from user input. If that HTML contains “” then our behavior diverges. In a fragment parser it would close out the list we started with but in full document parsing mode the end tag would be ignored, a parse error. If the goal is to ensure that user input doesn’t break out and change the page, then it’s important to use fragment parsing and grab the inner contents of that LI context node.
>
>
> This can be valuable to have as a tool to guard against injection attacks or against accidentally breaking the page someone is building, because the fragment parser is aware of its environment. It becomes even more important when parsing within RCDATA or RAWTEXT sections. For example, if wanting to parse and analyze or manipulate a web page’s title then the parser should treat everything as plaintext until it reaches the end or encounters a closing TITLE tag. If trying to do this with `createFromString()` then it’s up to the caller to remember to prepend and then remove the environment, `createFromString( ‘’ . $page_title . ‘’ )`. The fragment parser would be similar in practice, but more explicit and hard to misunderstand in these circumstances.
>

You're right, it is dangerous indeed to place the burden of dealing with wrapping and unwrapping on the user, as mistakes are bound to happen and they could result in very bad injection attacks.
innerHTML would help, a low-level fragment parser API maybe even more. I'd have to think about that, but that's for future work.

>
> This is complicated stuff.
> I understand that the spec provides for a wide variety of use-cases and needs, and that it’s hard to pin down exactly what a spec-compliant parser is supposed to do in all situations (it depends), so I’m only wanting to share from the perspective of people doing a lot of small HTML manipulation. There’s not much code out there using the fragment parser, but I can’t help but think that part of the reason is because it’s not exposed where it ought to be.
>
>
> Have a great weekend!
> Dennis Snell

Thanks for the discussion and sharing your insight.
Likewise, have a great weekend.

Kind regards
Niels