Newsgroups: php.internals
Date: Tue, 15 Apr 2025 16:20:52 +0200
From: tim@bastelstu.be (Tim Düsterhus)
To: Máté Kocsis
Cc: Internals
Subject: Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API
References: <1BCB4144-231D-45EA-A914-98EE8F0F503A@automattic.com> <8E614C9C-BA85-45D8-9A4E-A30D69981C5D@automattic.com> <8df04e01-deac-404b-beb7-cd982423db63@bastelstu.be>
Message-ID: <33427cd03035ef084245c44290b56a55@bastelstu.be>

Hi

On 2025-04-13 14:10, Máté Kocsis wrote:
>>
>>      namespace Uri {
>>          class InvalidUriException extends \Uri\UriException
>>          {
>>          }
>>      }
>>
>>      namespace Uri\WhatWg {
>>          class InvalidUrlException extends \Uri\InvalidUriException {
>>              /** @var list */
>>              public readonly array $errors;
>>          }
>>      }
>>
>> (note the use of Url in the name of the sub-exception)
>>
>> While this would result in a little more boilerplate, it would make
>> static analysis tools more useful, since the `$errors` array could be
>> properly typed instead of being just `array`.
>>
>
> OK, this makes sense to me, and I've just implemented it.

Great. Don't forget to adjust the RFC text (that's the more important
part :-)).

> At last, when I changed the RFC so that only those characters were
> percent-decoded which were "URL code points", I didn't notice
> that the example you referred to above would go outdated: as "/" is a
> URL code point, it's currently percent-decoded by getPath().
> Unfortunately, I still don't know what the best approach would be.

I see, thank you. I did some tests myself and read the spec. I've also
checked https://github.com/whatwg/url/issues/565. Perhaps the correct
solution would be to offer only the non-raw methods for WHATWG URL and
to not attempt any additional percent-decoding there?

My reasoning is that the WHATWG URL is a living standard anyway, so
trying to add additional semantics on top will result in sadness. My
understanding is also that it is primarily intended for interaction with
web browsers or for embedding these URLs into HTML. For access control,
e.g. in your framework, the RFC 3986 URI should be used. It's what HTTP
uses internally and it supports well-defined normalization.

What do you think?

>
>> Please also give an explicit example for `%3F` in a path. I know that
>> it is reserved from reading the Rfc3986, but I think it's a little
>> unintuitive. You can adjust the last example in the component
>> retrieval section to make it show all cases. So:
>>
>>      $uri = new Uri\Rfc3986\Uri("https://[2001:0db8:0001:0000:0000:0ab9:C0A8:0102]/foo/bar%3Fbaz?foo=bar%26baz%3Dqux");
>>
>>      echo $uri->getHost();     // [2001:0db8:0001:0000:0000:0ab9:C0A8:0102]
>>      echo $uri->getRawHost();  // [2001:0db8:0001:0000:0000:0ab9:C0A8:0102]
>>      echo $uri->getPath();     // /foo/bar%3Fbaz
>>      echo $uri->getRawPath();  // /foo/bar%3Fbaz
>>      echo $uri->getQuery();    // foo=bar%26baz%3Dqux
>>      echo $uri->getRawQuery(); // foo=bar%26baz%3Dqux
>>
>
> Why is this behavior unintuitive? I think the already added examples
> should

Unintuitive probably is not the best word. But I expect users to
primarily interact with the path component of a URL (e.g. within their
framework’s router). So I think it makes sense to be extra explicit with
examples there.
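
To make the router concern concrete, here is a minimal Node.js sketch. It is not taken from the RFC or from any framework, and the helper names `splitThenDecode` and `decodeThenSplit` are hypothetical. It only illustrates why the order of splitting and percent-decoding matters once a path contains an encoded slash:

```javascript
// Hypothetical helpers, for illustration only.

// Split the raw path on "/" first, then percent-decode each segment.
// An encoded slash (%2F) stays inside its segment.
function splitThenDecode(rawPath) {
  return rawPath
    .split('/')
    .filter((segment) => segment !== '')
    .map(decodeURIComponent);
}

// Percent-decode the whole path first, then split on "/".
// An encoded slash becomes a real separator and adds a segment.
function decodeThenSplit(rawPath) {
  return decodeURIComponent(rawPath)
    .split('/')
    .filter((segment) => segment !== '');
}

const rawPath = '/test/foo%2fbar';
console.log(splitThenDecode(rawPath)); // [ 'test', 'foo/bar' ]
console.log(decodeThenSplit(rawPath)); // [ 'test', 'foo', 'bar' ]
```

A getter that already decodes `%2F` leaves the caller only with the second variant, which is why explicit examples for the path component seem worthwhile.
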
As an example, I recently learned that Symfony's router does not support
(encoded) slashes within a component:

     #[Route('/test/{message}', name: 'test')]

will work for http://localhost:8000/test/foo, but not for
http://localhost:8000/test/foo%2fbar, resulting in:

     No route found for "GET http://localhost:8000/test/foo%2fbar"

So if you would just extend the example under “Let's have a look at some
other tricky example with Uri\Rfc3986\Uri:” to match my suggestion, I
would be happy :-)

Note: I believe there is a small mistake in the example as you last
modified it. It says:

     echo $uri->getHost(); // [2001:0db8:0001:0000:0000:0ab9:C0a8:0102]

Should the 'C' in 'C0a8' also be lowercased?

>> >> In the “Component Modification” section, the RFC states that
>> >> WhatWgUrl will automatically encode `?` and `#` as necessary. Will
>> >> the same happen for Rfc3986? Will the encoding of `#` also happen
>> >> for the query-string component? The RFC only mentions the path
>> >> component.
>
>
> I think the question for RFC 3986 is answered in the PHP RFC by the
> following paragraph:
>
>> In order to offer consistent behavior with the parsing rules of RFC
>> 3986, withers of Uri\Rfc3986\Uri also only accept properly formatted
>> input, meaning characters that are not allowed to be present in a
>> component must be percent-encoded. Let's see what this means in
>> practice through the following example

Yes, thank you for pointing that out.

> Effectively, RFC 3986 has different behavior than what WHATWG does.

Understood, makes sense.

> The latter question ("Will the encoding of `#` also happen for the
> query-string component?")
> was supposed to be answered by the RFC, because of this sentence:
>
>> WHATWG algorithm automatically percent-encodes characters that fall
>> into the percent-encoding character set of the given component
>
> It may be possible that "the given" part is misleading, but the
> behavior actually follows the WHATWG spec for all components.
> In any case, I change a few words to make this clear.

Yes, that makes sense. It's also explained in the “Percent-encoding &
decoding” subsection of the “Important concepts” section, but I had
already forgotten about that by the time I got down to the “Component
recomposition” bit. My mistake! :-)

> I haven't completely implemented withers yet for RFC 3986 (first and
> foremost validation is missing), so that's why you experienced this
> behavior. I would fix this later, but only if the vote succeeds. I've
> already worked a lot on the implementation without having any promise
> of the RFC to succeed.

Understood.

>> My expectation would be `[2001:db8:0:0:0:0:0:1]` for Rfc3986 and
>> `[2001:db8::1]` for WhatWg. I have also tested the behavior of
>> `withHost()` when leaving out the square brackets. The Rfc3986
>> correctly throws an Exception, but WhatWg silently does nothing:
>>
>>      $url = 'https://example.com/foo/path';
>>
>>      var_dump((new Uri\WhatWg\Url($url))->withHost('2001:db8:0:0:0:0:0:1')->toAsciiString());
>>
>> results in
>>
>>      string(28) "https://example.com/foo/path"
>>
>
> This looks like this is the result of WHATWG's host setter algorithm
> (https://url.spec.whatwg.org/#dom-url-hostname).
> After debugging the behavior, I noticed that "new
> Uri\WhatWg\Url('2001:db8:0:0:0:0:0:1')" only fails when trying to parse
> the port after the first ":" character. However, the setter algorithm
> obviously doesn't reach this point, since it only tries to
> parse the host, and then it stops (because of the state override). So
> I'm not sure this gotcha can be cured.
>
> I tried to reproduce the problem in Chrome, but I realized that the URL
> properties are not validated at all when they are set
> ("url.hostname = "2001:db8:0:0:0:0:0:1";" will change the hostname no
> problem)...
I just tested it with node.js:

     > u = new URL('https://example.com/foo/path');
     URL {
       href: 'https://example.com/foo/path',
       origin: 'https://example.com',
       protocol: 'https:',
       username: '',
       password: '',
       host: 'example.com',
       hostname: 'example.com',
       port: '',
       pathname: '/foo/path',
       search: '',
       searchParams: URLSearchParams {},
       hash: ''
     }
     > u.hostname = '2001:db8:0:0:0:0:0:1'
     '2001:db8:0:0:0:0:0:1'
     > u
     URL {
       href: 'https://example.com/foo/path',
       origin: 'https://example.com',
       protocol: 'https:',
       username: '',
       password: '',
       host: 'example.com',
       hostname: 'example.com',
       port: '',
       pathname: '/foo/path',
       search: '',
       searchParams: URLSearchParams {},
       hash: ''
     }
     > u.toString()
     'https://example.com/foo/path'
     > u.hostname = '[2001:db8:0:0:0:0:0:1]'
     '[2001:db8:0:0:0:0:0:1]'
     > u
     URL {
       href: 'https://[2001:db8::1]/foo/path',
       origin: 'https://[2001:db8::1]',
       protocol: 'https:',
       username: '',
       password: '',
       host: '[2001:db8::1]',
       hostname: '[2001:db8::1]',
       port: '',
       pathname: '/foo/path',
       search: '',
       searchParams: URLSearchParams {},
       hash: ''
     }
     > u.toString()
     'https://[2001:db8::1]/foo/path'

So it indeed seems to be a limitation of the WHATWG specification and
your PHP implementation is consistent with node.js. That is a good thing
and when a user stumbles upon this, we can point them towards node.js /
the spec. Not great, but this is workable!

Best regards
Tim Düsterhus
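
For comparison, the tricky Rfc3986 example quoted earlier in this mail can be run through the same WHATWG `URL` class in Node.js. This is only a sketch of what node does, under the assumption that the global `URL` class is available. It says nothing about what the PHP getters will finally return. The `before` and `unchanged` variables are just illustration names:

```javascript
// WHATWG URL class, available as a global in Node.js and in browsers.
const url = new URL(
  'https://[2001:0db8:0001:0000:0000:0ab9:C0A8:0102]/foo/bar%3Fbaz?foo=bar%26baz%3Dqux'
);

// The IPv6 host is lowercased and compressed while parsing.
console.log(url.hostname); // [2001:db8:1::ab9:c0a8:102]

// Percent-encoded octets in the path and the query are left untouched.
console.log(url.pathname); // /foo/bar%3Fbaz
console.log(url.search);   // ?foo=bar%26baz%3Dqux

// The silent no-op of the hostname setter can be detected by comparing
// the serialized URL before and after the assignment.
const before = url.href;
url.hostname = '2001:db8:0:0:0:0:0:1'; // no square brackets: ignored
const unchanged = url.href === before;
console.log(unchanged); // true
```

A caller who needs an error instead of a silent no-op could build on such a before/after comparison, at the cost of also flagging assignments that legitimately set the same host again.
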