Newsgroups: php.internals
From: tim@bastelstu.be (Tim Düsterhus)
To: Máté Kocsis
Cc: Internals
Date: Sun, 30 Mar 2025 14:36:04 +0200
Subject: Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API
Message-ID: <8df04e01-deac-404b-beb7-cd982423db63@bastelstu.be>

Hi

Apologies for getting back to you just now.

On 3/2/25 23:00, Máté Kocsis wrote:
>> What happens for Rfc3986 when passing an invalid URI to the constructor?
>> Will an exception be thrown? What will the error array contain? Is it
>> perhaps necessary to subclass Uri\InvalidUriException for use with
>> WhatWgUrl, since `$errors` is not applicable for 3986?
>>
>
> […]
>
> The $errors property will contain an empty array though, as you supposed.
> I don't see much problem with using the same exception in both cases,
> however I'm also fine with making the $errors property nullable in order
> to indicate that returning errors is not supported by the implementation
> triggering the error.

I think I would prefer:

    namespace Uri {
        class InvalidUriException extends \Uri\UriException
        {
        }
    }

    namespace Uri\WhatWg {
        class InvalidUrlException extends \Uri\InvalidUriException
        {
            /** @var list */
            public readonly array $errors;
        }
    }

(note the use of Url in the name of the sub-exception)

While this would result in a little more boilerplate, it would make
static analysis tools more useful, since the `$errors` array could be
properly typed instead of being just `array`.

> 7.
>>
>> In the “Component retrieval” section: Please add even more examples of
>> what kind of percent-decoding will happen. For example, it's important
>> to know if `%26` is decoded to `&` in a query-string. Or if `%3D` is
>> decoded to `=`. This really is the same case as with `%2F` in a path.
>> The explanation
>>
> […]
> The relevant sections will give a little more reasoning why I went with
> these rules.

I've tested some of the examples against the implementation, but it does
not match the description. Is the implementation up to date?

    […]
    var_dump($url->getPath()); // /foo/bar%2Fbaz
    var_dump($url->getRawPath()); // /foo/bar%2Fbaz

results in:

    string(12) "/foo/bar/baz"
    string(14) "/foo/bar%2Fbaz"

The implementation for Rfc3986 appears to be correct.

>> "the URI is normalized (when applicable), and then the reserved
>> characters in the context of the given component are percent-decoded.
>> This means that only those reserved characters are percent-decoded that
>> are not allowed in a component. This behavior is needed to be able to
>> unambiguously retrieve components."
>>
>> alone is not clear to me. “reserved characters that are not allowed in a
>> component”. I assume this means that `%2F` (/) in a path will not be
>> decoded, but `%3F` (?) will, because a bare `?` can't appear in a path?
>>
>
> I hope that this question is also clear after my clarifications + the
> reconsidered logic.

Please also give an explicit example for `%3F` in a path.
I know that it is reserved from reading RFC 3986, but I think it's a
little unintuitive. You can adjust the last example in the component
retrieval section to make it show all cases. So:

    $uri = new Uri\Rfc3986\Uri("https://[2001:0db8:0001:0000:0000:0ab9:C0A8:0102]/foo/bar%3Fbaz?foo=bar%26baz%3Dqux");

    echo $uri->getHost();     // [2001:0db8:0001:0000:0000:0ab9:C0A8:0102]
    echo $uri->getRawHost();  // [2001:0db8:0001:0000:0000:0ab9:C0A8:0102]
    echo $uri->getPath();     // /foo/bar%3Fbaz
    echo $uri->getRawPath();  // /foo/bar%3Fbaz
    echo $uri->getQuery();    // foo=bar%26baz%3Dqux
    echo $uri->getRawQuery(); // foo=bar%26baz%3Dqux

During testing I also noticed that the Rfc3986 implementation removes
trailing slashes from the path when using the normalized version. This
was a little unexpected, because to me this is the difference between a
directory and a file. I don't think there are clear examples showing
that. So:

    $uri = new Uri\Rfc3986\Uri("https://example.com/foo/bar/");

    echo $uri->getPath();    // /foo/bar
    echo $uri->getRawPath(); // /foo/bar/

>>
>> 9.
>>
>> In the “Component Modification” section, the RFC states that WhatWgUrl
>> will automatically encode `?` and `#` as necessary. Will the same happen
>> for Rfc3986? Will the encoding of `#` also happen for the query-string
>> component? The RFC only mentions the path component.
>
>
> The above referenced sections will give a clear answer for this question
> as well.
> TLDR: after your message, I realized that automatic percent-encoding also
> triggers a (soft) error case for WHATWG, so I changed my mind with regards
> to Uri\Rfc3986\Uri, so it won't do any automatic percent-encoding. It's
> unfortunate, because this behavior is not consistent with WHATWG, but it's
> more consistent with the parsing rules of its own specification, where
> there are only hard errors, and there's no such thing as "automatic
> correction".
>
>

Is the implementation already up to date with this change?
When I try:

    var_dump(
        (new Uri\Rfc3986\Uri('https://example.com/foo/path'))
            ->withPath('some/path?foo=bar')
            ->toString()
    );

I get

    string(36) "https://example.comsome/path?foo=bar"

which is completely wrong.

-------

> It also surprised me, but IP address normalization is only performed by
> WHATWG during recomposition! But nowhere else...

I think this might be a misunderstanding of the WHATWG specification. It
seems to be also normalized during parsing. When I do the following in
my Google Chrome:

    (new URL('https://[0:0::1]')).host;

I get `[::1]`, which indicates that normalization is happening. And
likewise will:

    (new URL('https://[2001:db8:0:0:0:0:0:1]')).host;

result in `[2001:db8::1]`.

I've also tested this with the implementation to see if this is just
something that is not clear in the RFC text, but correctly handled in
the implementation, and noticed that the behavior is pretty broken.
Consider this script:

    […]
    var_dump((new Uri\Rfc3986\Uri($url))->getHost());
    var_dump((new Uri\WhatWg\Url($url))->getAsciiHost());

This outputs:

    string(20) "2001:db8:0:0:0:0:0:1"
    string(23) "[8193:3512:0:0:0:0:0:1]"

For Rfc3986: The square brackets are missing.
For WhatWg: The IPv6 is completely broken.

My expectation would be `[2001:db8:0:0:0:0:0:1]` for Rfc3986 and
`[2001:db8::1]` for WhatWg.

I have also tested the behavior of `withHost()` when leaving out the
square brackets. The Rfc3986 correctly throws an Exception, but WhatWg
silently does nothing:

    $url = 'https://example.com/foo/path';
    var_dump((new Uri\WhatWg\Url($url))->withHost('2001:db8:0:0:0:0:0:1')->toAsciiString());

results in

    string(28) "https://example.com/foo/path"

Best regards
Tim Düsterhus