List-Id: internals.lists.php.net
Date: Fri, 21 Feb 2025 13:06:57 +0100
From: tim@bastelstu.be (Tim Düsterhus)
To: Máté Kocsis
Cc: Dennis Snell, Internals
Subject: Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API

Hi

On 2025-02-16 23:01, Máté Kocsis wrote:
>> I only harp on the WhatWG spec so much because for many people this
>> will be the only one they are aware of, if they are aware of any spec
>> at all, and this is a sizable vector of attack targeting servers from
>> user-supplied content. I’m curious to hear from folks here what
>> fraction of the actual PHP code deals with RFC3986 URLs, and of those,
>> if the systems using them are truly RFC3986 systems or if the
>> common-enough URLs are valid in both specs.
>>
>
> I think Ignace's examples already highlighted that the two
> specifications differ in nuances so much that even I had to admit after
> months of trying to squeeze them into the same interface that doing so
> would be irresponsible.

I think this is also a good argument in favor of finally making the
classes final.
Not making them final would allow for irresponsible sub-classes :-)

> echo $url->getHost();           // xn--go8h.com
> echo $url->getHostForDisplay(); // 🐘.com
> echo $url->toString();          // https://xn--go8h.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98
> echo $url->toDisplayString();   // https://🐘.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98

The naming of these methods seems a little inconsistent. It should
either be:

->getHostForDisplay()
->toStringForDisplay()

or

->getDisplayHost()
->toDisplayString()

but not a mix of both.

> I think the RFC is now mature enough to consider voting in the
> foreseeable future, since most of the concerns which came up until now
> are addressed some way or another. However, the only remaining question
> that I still have is whether the Uri\Rfc3986\Uri and the Uri\WhatWg\Url
> classes should be final? Personally, I don't see much problem with
> opening them for

Yes. Besides the remark above, my previous arguments still apply (e.g.
`with()`ers not being able to construct instances for subclasses, which
requires overriding all of them). I'm also noticing that serialization
is unsafe with subclasses that add a `$__uri` property (or perhaps any
property at all?).

--------------------

We already had extensive off-list discussion about the RFC and I agree
it's in good shape now. I've given it another read and here are my
remarks:

1. The `toDisplayString()` method that you mentioned above is not in the
RFC. Did you mean `toHumanFriendlyString()`? Which one is correct?

2. The example output of the `$errors` array does not match the stub. It
contains a `failure` property; should that be `softError` instead?

3. The RFC states "When trying to instantiate a WHATWG Url via its
constructor, a Uri\InvalidUriException is thrown when parsing results in
a failure."

What happens for Rfc3986 when passing an invalid URI to the constructor?
Will an exception be thrown? What will the error array contain?
Is it perhaps necessary to subclass Uri\InvalidUriException for use with
WhatWgUrl, since `$errors` is not applicable for 3986?

4. The RFC does not specify when `UninitializedUriException` is thrown.

5. The RFC does not specify when `UriOperationException` is thrown.

6. Generally speaking, I believe it would aid understanding if you added
a `/** @throws InvalidUriException */` to each of the methods in the
stub to make it clear which ones are able to throw (e.g. resolve(), or
the withers). It's harder to find this out from “English” than from
“code” :-)

7. In the “Component retrieval” section: Please add even more examples
of what kind of percent-decoding will happen. For example, it's
important to know if `%26` is decoded to `&` in a query-string. Or if
`%3D` is decoded to `=`. This really is the same case as with `%2F` in a
path. The explanation

"the URI is normalized (when applicable), and then the reserved
characters in the context of the given component are percent-decoded.
This means that only those reserved characters are percent-decoded that
are not allowed in a component. This behavior is needed to be able to
unambiguously retrieve components."

alone is not clear to me, in particular “reserved characters that are
not allowed in a component”. I assume this means that `%2F` (/) in a
path will not be decoded, but `%3F` (?) will, because a bare `?` can't
appear in a path?

8. In the “Component retrieval” section: You compare the behavior of
WhatWgUrl and Rfc3986Uri. It would be useful to add something like:

     $url->getRawScheme() // does not exist, because WhatWgUrl always normalizes the scheme

to better point out the differences between the two APIs with regard to
normalization (it's mentioned, but having it in the code blocks would
make it more visible).

9. In the “Component Modification” section, the RFC states that
WhatWgUrl will automatically encode `?` and `#` as necessary. Will the
same happen for Rfc3986?
Will the encoding of `#` also happen for the query-string component? The
RFC only mentions the path component. I'm also wondering if there are
cases where the withers would not round-trip, i.e. where
`$url->withPath($url->getPath())` would not result in the original URL?

10. Can you add examples where the authority / host contains IPv6
literals? It would be useful to specifically show whether or not the
square brackets are returned when using the getters. It would also be
interesting to see whether or not IPv6 addresses are normalized (e.g.
shortening `2001:db8:0:0:0:0:0:1` to `2001:db8::1`).

11. In “Component Recomposition” the RFC states "The
Uri\Rfc3986\Uri::toString() returns the unnormalized URI string". Does
this mean that toString() for Rfc3986 will always return the original
input?

12. It would be useful to know whether or not the classes implement
`__debugInfo()` / how they appear when `var_dump()`ing them.

Best regards
Tim Düsterhus
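
As an illustration of the shortening asked about in point 10, here is a minimal sketch using PHP's existing `inet_pton()` / `inet_ntop()` functions. It only shows the address normalization itself and does not use the proposed Uri classes; whether their getters shorten addresses or return the square brackets is exactly the open question. The `$literal` input is a made-up example value.

```php
<?php
// Illustration only: existing PHP functions, NOT the proposed Uri API.
// Hypothetical input: an IPv6 literal as it would appear in a URI authority.
$literal = '[2001:db8:0:0:0:0:0:1]';

// Strip the square brackets that delimit the literal inside the authority.
$address = trim($literal, '[]');

// Convert to the packed binary form; inet_pton() returns false on invalid input.
$packed = inet_pton($address);
if ($packed === false) {
    throw new InvalidArgumentException("Not a valid IP address: $address");
}

// Converting back yields the shortened textual form.
$normalized = inet_ntop($packed);

echo $normalized, "\n";     // 2001:db8::1
echo "[$normalized]", "\n"; // [2001:db8::1]
```

Both spellings describe the same address, so examples in the RFC would show which of the two forms (and with or without brackets) each getter returns.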