Newsgroups: php.internals
From: nyamsprod@gmail.com (Ignace Nyamagana Butera)
Date: Thu, 27 Mar 2025 23:49:39 +0100
Subject: Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API
To: Máté Kocsis
Cc: PHP Internals List


On 27/03/2025 22:04, Máté Kocsis wrote:
> Hi Ignace,
>
> > While implementing the polyfill, I am finding it easier, DX-wise, to make the constructor private and to use named constructors for instantiation instead. I would be in favor of `Uri::parse` and `Uri::tryParse`, as is currently done with Enum and its `from` and `tryFrom` named constructors.
> >
> > My reasoning is as follows:
> >
> > There is no right or wrong way to instantiate a URI; there are only contexts. While the parse method is all about parsing a string, one could legitimately use other named constructors like `Uri::fromComponents`, which would take, for instance, the result of parse_url to build a new URI. This can come in handy for RFC 3986 URIs if you need to create a new URI that is not related to the http scheme and that does not use all the components, as with the email, data or FTP schemes.
> >
> > Allowing URIs to be created from their respective component values makes the class easier for developers to use. It also means that if we want a balanced API, a `toComponents` method should come hand in hand with the named constructor.
> >
> > I would understand if the idea of adding both component-related methods is rejected, since they could be implemented in userland. But the main point was to show that, from the VO or the developer's POV, in the absence of a clearly defined instantiation process, a traditional constructor fails to convey all the different ways to create a URI.
>
> There are a few things which came to my mind:
> - Currently, the underlying C libraries don't support a `fromComponents` feature. How I could naively imagine this to work is that the components are recomposed to a URI string based on the relevant algorithm (for RFC 3986: https://datatracker.ietf.org/doc/html/rfc3986#section-5.3), and then this string is parsed and validated. Unfortunately, I recently realized that this approach may leave room for some kind of parsing confusion attack, namely when the scheme is for example "https", the authority is empty, and the path is "example.com". This will result in a https://example.com URI. I believe a similar bug is not possible with the rest of the components because they have their delimiters. So possibly some other solution will be needed, or maybe adding some additional validation (?).
>
> - Nicolas made me aware that if URIs didn't have a proper constructor, then one wouldn't be able to use URI objects as parameter default values, like below:
> function (Uri $foo = new Uri('blah'))
> I think this omission would cause some usability regression. For this reason, it may make sense to have a distinguished way of instantiating a Uri.
>
> - I have a similar feeling about a toComponents() method as about another named constructor instead of __construct(): I am not completely against it, but I'm not totally convinced about it.
>
> Máté

Hi Máté,

> for RFC 3986: https://datatracker.ietf.org/doc/html/rfc3986#section-5.3), and then this string is parsed and validated. Unfortunately, I recently realized that this approach may leave room for some kind of parsing confusion attack, namely when the scheme is for example "https", the authority is empty, and the path is "example.com". This will result in a https://example.com URI. I believe a similar bug is not possible with the rest of the components because they have their delimiters. So possibly some other solution will be needed, or maybe adding some additional validation (?).

This is not correct according to RFC 3986, https://datatracker.ietf.org/doc/html/rfc3986#section-3:

*When authority is present, the path must either be empty or begin with a slash ("/") character. When authority is not present, the path cannot begin with two slash characters ("//").*

So in your example it should throw a Uri\InvalidUriException 🙂 for RFC 3986, and in the case of the WhatwgUrl algorithm it should trigger a soft error and correct the behaviour for the http(s) schemes.
This is also one of the many reasons why, at least for RFC 3986, the path component can never be `null`, but that's another discussion. As I said, having a `fromComponents` named constructor would "remove" the need for a UriBuilder (in your future section) and would IMHO be useful outside the context of the http(s) schemes. I can understand it being left out of the current implementation, though; it might be brought back as a future improvement.
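
To make the rule above concrete, here is a minimal sketch of how a `fromComponents`-style recomposition could reject the problematic input before any re-parsing happens. Everything in it is an assumption for illustration: `recomposeUri()` is a hypothetical userland helper (not part of the RFC's API or of the underlying C libraries), `InvalidArgumentException` stands in for `Uri\InvalidUriException`, and only the two path rules quoted from RFC 3986 section 3 are checked before applying the section 5.3 recomposition steps.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: recompose URI components into a string following
 * RFC 3986 section 5.3, after enforcing the section 3 path/authority rules.
 * A null component means "undefined"; the path is never null.
 */
function recomposeUri(
    ?string $scheme,
    ?string $authority,
    string $path,
    ?string $query = null,
    ?string $fragment = null
): string {
    if ($authority !== null) {
        // Authority present (even if empty): path must be empty or start with "/".
        if ($path !== '' && $path[0] !== '/') {
            throw new InvalidArgumentException(
                'With an authority, the path must be empty or begin with "/".'
            );
        }
    } elseif (strncmp($path, '//', 2) === 0) {
        // No authority: the path must not start with "//".
        throw new InvalidArgumentException(
            'Without an authority, the path cannot begin with "//".'
        );
    }

    // RFC 3986 section 5.3 component recomposition.
    $uri = '';
    if ($scheme !== null) {
        $uri .= $scheme . ':';
    }
    if ($authority !== null) {
        $uri .= '//' . $authority;
    }
    $uri .= $path;
    if ($query !== null) {
        $uri .= '?' . $query;
    }
    if ($fragment !== null) {
        $uri .= '#' . $fragment;
    }

    return $uri;
}

// A non-http URI that uses only some of the components.
echo recomposeUri('mailto', null, 'user@example.com'), PHP_EOL;

// The example from the thread: empty authority plus a rootless path is rejected.
try {
    recomposeUri('https', '', 'example.com');
} catch (InvalidArgumentException $e) {
    echo 'rejected: ', $e->getMessage(), PHP_EOL;
}
```

With the check in place, the "https" scheme, empty authority and "example.com" path combination never reaches the string stage, so it cannot be re-parsed as `https://example.com`. A real implementation would also have to validate each component's characters, which this sketch leaves out.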

I have one last question regarding the URI implementations, raised by my polyfill:

Did you also take the delimiters into account when submitting data via the withers? In other words, is

```php
$uri->withQuery('?foo=bar');
//the same as 
$uri->withQuery('foo=bar');
```

I know it is the case in the WHATWG specification, but I do not know whether you kept this behaviour in your implementation for WhatWgUrl, for Rfc3986, or for both. I would lean toward not accepting this "normalization", but since this is not documented in the RFC, I wanted to know what the expected behaviour is.
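
The two possible answers can be sketched as follows. This is only an illustration with hypothetical helper names, not the behaviour of the proposed classes: the first function mirrors the WHATWG `search` setter, which removes a single leading "?" from its input, while the second keeps the input as given, in which case the leading "?" becomes part of the query (a "?" is a legal query character in RFC 3986).

```php
<?php
declare(strict_types=1);

// Hypothetical: WHATWG-style handling, drop a single leading "?" if present.
function queryWithDelimiterStripped(string $query): string
{
    return ($query !== '' && $query[0] === '?') ? substr($query, 1) : $query;
}

// Hypothetical: strict handling, keep the input untouched so that a leading
// "?" is treated as query data rather than as a delimiter.
function queryKeptAsIs(string $query): string
{
    return $query;
}

$base = 'https://example.com/path';

// The delimiter is added once during recomposition in both cases.
echo $base . '?' . queryWithDelimiterStripped('?foo=bar'), PHP_EOL; // https://example.com/path?foo=bar
echo $base . '?' . queryKeptAsIs('?foo=bar'), PHP_EOL;              // https://example.com/path??foo=bar
```

Under the strict variant the two `withQuery()` calls above would produce different URIs, which is why the expected behaviour matters for a polyfill.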

Thanks for the hard work 


