Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:126486
X-Original-To: internals@lists.php.net
Delivered-To: internals@lists.php.net
Received: from php-smtp4.php.net (php-smtp4.php.net [45.112.84.5])
	by qa.php.net (Postfix) with ESMTPS id 6F3EB1A00C1
	for <internals@lists.php.net>; Mon, 24 Feb 2025 09:18:22 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=php.net; s=mail;
	t=1740388543; bh=IjZnZaLf1AnLcOUOuWq+mDvSz+XEzomBz7U/YAQKvKM=;
	h=Date:Subject:To:References:From:In-Reply-To:From;
	b=do38aGNzxRC1PcnC+KrwSz+yAZkKvAxzye2Db/rDpNaJmat5UGeWjlF9VSc74RiaX
	 w/+UaY+/CSocNKpZIVIpMySkHotJaq1s/gzpgAoA9Jem9jFF5LnsRvr5Kt4g1Ogt9q
	 rrXRkR2WC2MxxpKGS70Aa8Xg7vDckonjzQP4av8iu24GRDu8tpCor84QtQ64LIeT9y
	 gcIREqc0YtwVm/droFCF4rVbd8dJabBRlO4vZ+ZBneQ4ZIc0nxWIx1x0pvntvNRceG
	 37V/+Rx7PjLhRZpP4T1tVx22dUCkAM5SgmMgHQkwWfmTXvIYNEquJAfSCZ/lugmpeW
	 IlPyqbjdgNoJw==
Received: from php-smtp4.php.net (localhost [127.0.0.1])
	by php-smtp4.php.net (Postfix) with ESMTP id 77CB11804FD
	for <internals@lists.php.net>; Mon, 24 Feb 2025 09:15:40 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on php-smtp4.php.net
X-Spam-Level: 
X-Spam-Status: No, score=-3.9 required=5.0 tests=BAYES_00,DKIM_SIGNED,
	DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,DMARC_PASS,FREEMAIL_FROM,
	RCVD_IN_DNSWL_NONE,RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_PASS
	autolearn=no autolearn_force=no version=4.0.0
X-Spam-Virus: No
X-Envelope-From: <nyamsprod@gmail.com>
Received: from mail-wr1-f44.google.com (mail-wr1-f44.google.com [209.85.221.44])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256)
	(No client certificate requested)
	by php-smtp4.php.net (Postfix) with ESMTPS
	for <internals@lists.php.net>; Mon, 24 Feb 2025 09:15:35 +0000 (UTC)
Received: by mail-wr1-f44.google.com with SMTP id ffacd0b85a97d-38f1e8efef5so2377450f8f.1
        for <internals@lists.php.net>; Mon, 24 Feb 2025 01:18:13 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20230601; t=1740388692; x=1740993492; darn=lists.php.net;
        h=content-transfer-encoding:in-reply-to:from:content-language
         :references:to:subject:user-agent:mime-version:date:message-id:from
         :to:cc:subject:date:message-id:reply-to;
        bh=qaBKqPCvKfJ1Z2F9sWMACYEbPNgvLY6iH7t2KYDeF/k=;
        b=LsKckRoRNZNy71SeJlBdGVqvYBxeKXAXXQ3T8PdFRigRqe9vxZkRHR8wm3JmcRt9tO
         kSNK/86ErHnxEnouCWe1FtbhcSbF3pFXmXvkjkmDofZcCsfQ1PKa3WvLz71PNf8SS7zH
         g9cTMIRLdwyopHluteEUeyfxQ41WGjlmztV4qDvf1fi7ltkv++OxUBjaJu7g194Z3eqg
         2z5wu6KlgOEPT93TJi2t5P8X6Ck6/806ix7ZwqsYzXee006uU8MzNxfvGfrryvlHYjn1
         9S373GjpGaLaYHGoFMeYuMBU41BOgNWEEZ3PRCJyLXdzgtcLpA/D9NkK4rFGDfGA04BN
         jL0A==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1740388692; x=1740993492;
        h=content-transfer-encoding:in-reply-to:from:content-language
         :references:to:subject:user-agent:mime-version:date:message-id
         :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to;
        bh=qaBKqPCvKfJ1Z2F9sWMACYEbPNgvLY6iH7t2KYDeF/k=;
        b=bwSDXndUFiLNnokgT0CCsPgiB54mPa3c+qXSFk3nU2mR1i9e0Me5X/4kM0x3sCfhSz
         6NBlTgZ6CUdKT6mUBXPvzZYpEhV2UpCs8J7F5FNp9eTiT8sO0wwrdrnw1OecuCFs7hSO
         jdlYpPnRxVFoUFMuF3qA9Fgc+H8qxDQttBZIkYX0Lg6FqK27EElHy60z+Rv/4sSUP3Fa
         9g0NhU0RFE3qN2KrIdnl3+olfPKZFDiHUzJ2Paaa2TTC9rXT45HqisZ8YAULazoTPF72
         wSx5UMn4QsjYhNEtUmjAgLekYm0KFbpcdJydE9h3f2TJSqBE/BICN8SWjhUWrkYNSvI7
         xJug==
X-Gm-Message-State: AOJu0YzUEdzHTr8lPjzG98e72dGvPw7ZMHUKndJOd/rWcPTkd03ukID2
	rKbjkx3AOCyT8b/Sb4GzIljWzAg3mFhxBwfkRxGTcbF4WHjw/n1AhAWt6Q==
X-Gm-Gg: ASbGncsJFcgz273USf3FVymSGvBqyRZI7XJoIR/bNOK++Fa+OzIQ1r7I5y+q+b1YzT8
	vz4/eyVayEVnuZb1X1MMJMryCFPnqLdAh8C709bjITIe3ZqLRg3D60riLpJWXR++zxTY4L8jo10
	S+7/EgksbGHRTqMhKJI/W05YG8yIK9IyeRN54C5Duhp43PrpBLq2NpjW6Qw0xAIpa3NHqkkjgnw
	/NtlQ//uyBbGpxH9TO14J5a5M+NthKY88NQcsTMVicNSiwu5yFWjVnxojmyxmPXhPz4/82EsXLz
	rkQ+ZM67+dnsgi62IH0G0UfzP/thmUwePa6OD7boo75+jn1/ISKZXclV8nGny1hSdrYZyqOtvPf
	OFOUBEOHeNN1NAOU1mjuE3Pru2KvepptoUDB/mOAFHORbaG4N/hncuxMMib/SwPq0R8CJ4n94IL
	MCaBY9
X-Google-Smtp-Source: AGHT+IERsQsybeCK0COfMulOSHOzsK7r29ZpT7BTPE+6h/o6Qo1OiDRCdzKpMFI4swlKbx5prE8c+w==
X-Received: by 2002:a05:6000:400f:b0:38f:475d:1411 with SMTP id ffacd0b85a97d-38f6e95c042mr12138641f8f.18.1740388692283;
        Mon, 24 Feb 2025 01:18:12 -0800 (PST)
Received: from ?IPV6:2a02:1811:3716:cb00:5969:4587:cb75:7795? (ptr-9c16nbct5s3cflsyd91.18120a2.ip6.access.telenet.be. [2a02:1811:3716:cb00:5969:4587:cb75:7795])
        by smtp.gmail.com with ESMTPSA id ffacd0b85a97d-38f6bfbe8efsm10439552f8f.46.2025.02.24.01.18.11
        for <internals@lists.php.net>
        (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128);
        Mon, 24 Feb 2025 01:18:11 -0800 (PST)
Message-ID: <9bf11a89-39d9-457b-b0ea-789fd07d7370@gmail.com>
Date: Mon, 24 Feb 2025 10:18:10 +0100
Precedence: bulk
list-help: <mailto:internals+help@lists.php.net
list-unsubscribe: <mailto:internals+unsubscribe@lists.php.net>
list-post: <mailto:internals@lists.php.net>
List-Id: internals.lists.php.net
x-ms-reactions: disallow
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API
To: internals@lists.php.net
References: <CAH5C8xUb1O20ZDrOQNC=ckFxHUUWSK7sw_njQQzFBd0qgQqoww@mail.gmail.com>
 <dd61999c-1ebd-4765-9add-cd8065968965@gmail.com>
 <CAOV5rgYZ_s1of-igLFEK7oqqh6=3HeYv0=JvvKvvjkcp0F6Q9Q@mail.gmail.com>
 <CAH5C8xV67frqOBCvLt73RM7QO86_pu40sHP5h71_kDRbKPtA8Q@mail.gmail.com>
 <1BCB4144-231D-45EA-A914-98EE8F0F503A@automattic.com>
 <8E614C9C-BA85-45D8-9A4E-A30D69981C5D@automattic.com>
 <CAH5C8xV69pHnSKWJrVB8EQab7vPhcaXAwYezoG6z3Oj7ZkXBaQ@mail.gmail.com>
 <bc5472e39bc607e2c7accc9d2bd6f239@bastelstu.be>
Content-Language: fr
In-Reply-To: <bc5472e39bc607e2c7accc9d2bd6f239@bastelstu.be>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
From: nyamsprod@gmail.com (Ignace Nyamagana Butera)

On 21/02/2025 13:06, Tim Düsterhus wrote:

> Hi
>
> Am 2025-02-16 23:01, schrieb Máté Kocsis:
>>> I only harp on the WhatWG spec so much because for many people this 
>>> will
>>> be the only one they are aware of, if they are aware of any spec at 
>>> all,
>>> and this is a sizable vector of attack targeting servers from 
>>> user-supplied
>>> content. I’m curious to hear from folks here hat fraction of the 
>>> actual PHP
>>> code deals with RFC3986 URLs, and of those, if the systems using 
>>> them are
>>> truly RFC3986 systems or if the common-enough URLs are valid in both 
>>> specs.
>>>
>>
>> I think Ignace's examples already highlighted that the two 
>> specifications
>> differ in nuances so much that even I had to admit after months of 
>> trying
>> to squeeze them into the same interface that doing so would be
>> irresponsible.
>
> I think this is also a good argument in favor of finally making the 
> classes final. Not making them final would allow for irresponsible 
> sub-classes :-)
>
>> echo $url->getHost();                // xn--go8h.com
>> echo $url->getHostForDisplay();      // 🐘.com
>> echo $url->toString();               //
>> https://xn--go8h.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98
>> echo $url->toDisplayString();        /
>> https://🐘.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98
>
> The naming of these methods seems to be a little inconsistent. It 
> should either be:
>
>     ->getHostForDisplay()
>     ->toStringForDisplay()
>
> or
>
>     ->getDisplayHost()
>     ->toDisplayString()
>
> but not a mix between both of them.
>
>> I think the RFC is now mature enough to consider voting in the
>> foreseeable future, since most of the concerns which came up until 
>> now are
>> addressed some way or another. However, the only remaining question 
>> that I
>> still have is whether the Uri\Rfc3986\Uri and the Uri\WhatWg\Url classes
>> should be final? Personally, I don't see much problem with opening 
>> them for
>
> Yes. Besides the remark above, my previous arguments still apply (e.g. 
> `with()`ers not being able to construct instances for subclasses, 
> requiring to override all of them). I'm also noticing that 
> serialization is unsafe with subclasses that add a `$__uri` property 
> (or perhaps any property at all?).
>
> --------------------
>
> We already had extensive off-list discussion about the RFC and I agree 
> it's in a good shape now. I've given it another read and here's my 
> remarks:
>
> 1.
>
> The `toDisplayString()` method that you mentioned above is not in the 
> RFC. Did you mean `toHumanFriendlyString()`? Which one is correct?
>
> 2.
>
> The example output of the `$errors` array does not match the stub. It 
> contains a `failure` property, should that be `softError` instead?
>
> 3.
>
> The RFC states "When trying to instantiate a WHATWG Url via its 
> constructor, a Uri\InvalidUriException is thrown when parsing results 
> in a failure."
>
> What happens for Rfc3986 when passing an invalid URI to the 
> constructor? Will an exception be thrown? What will the error array 
> contain? Is it perhaps necessary to subclass Uri\InvalidUriException 
> for use with WhatWgUrl, since `$errors` is not applicable for 3986?
>
> 4.
>
> The RFC does not specify when `UninitializedUriException` is thrown.
>
> 5.
>
> The RFC does not specify when `UriOperationException` is thrown.
>
> 6.
>
> Generally speaking I believe it would help understanding if you would 
> add a `/** @throws InvalidUriException */` to each of the methods in 
> the stub to make it clear which ones are able to throw (e.g. 
> resolve(), or the withers). It's harder to find this out from 
> “English” rather than “code” :-)
>
> 7.
>
> In the “Component retrieval” section: Please add even more examples of 
> what kind of percent-decoding will happen. For example, it's important 
> to know if `%26` is decoded to `&` in a query-string. Or if `%3D` is 
> decoded to `=`. This really is the same case as with `%2F` in a path. 
> The explanation
>
> "the URI is normalized (when applicable), and then the reserved 
> characters in the context of the given component are percent-decoded. 
> This means that only those reserved characters are percent-decoded 
> that are not allowed in a component. This behavior is needed to be 
> able to unambiguously retrieve components."
>
> alone is not clear to me. “reserved characters that are not allowed in 
> a component”. I assume this means that `%2F` (/) in a path will not be 
> decoded, but `%3F` (?) will, because a bare `?` can't appear in a path?
>
> 8.
>
> In the “Component retrieval” section: You compare the behavior of 
> WhatWgUrl and Rfc3986Uri. It would be useful to add something like:
>
>     $url->getRawScheme() // does not exist, because WhatWgUrl always 
> normalizes the scheme
>
> to better point out the differences between the two APIs with regard 
> to normalization (it's mentioned, but having it in the code blocks 
> would make it more visible).
>
> 9.
>
> In the “Component Modification” section, the RFC states that WhatWgUrl 
> will automatically encode `?` and `#` as necessary. Will the same 
> happen for Rfc3986? Will the encoding of `#` also happen for the 
> query-string component? The RFC only mentions the path component.
>
> I'm also wondering if there are cases where the withers would not 
> round-trip, i.e. where `$url->withPath($url->getPath())` would not 
> result in the original URL?
>
> 10.
>
> Can you add examples where the authority / host contains IPv6 
> literals? It would be useful to specifically show whether or not the 
> square brackets are returned when using the getters. It would also be 
> interesting to see whether or not IPv6 addresses are normalized (e.g. 
> shortening `2001:db8:0:0:0:0:0:1` to `2001:db8::1`).
>
> 11.
>
> In “Component Recomposition” the RFC states "The 
> Uri\Rfc3986\Uri::toString() returns the unnormalized URI string".
>
> Does this mean that toString() for Rfc3986 will always return the 
> original input?
>
> 12.
>
> It would be useful to know whether or not the classes implement 
> `__debugInfo()` / how they appear when `var_dump()`ing them.
>
> Best regards
> Tim Düsterhus


Hi Maté I just read the final proposal and here's my quick remarks it 
may be possible other have already highlighted some of those remarks:

I believe there's a typo in the RFC

 > All URI components - with the exception of the host - can be 
retrieved in two formats:

I believe you mean - with the excepotion of the Port


0 - It is a unfortunate that there's no IDNA support for RFC3986, I 
understand the reasoning behind that decision but I was wondering if it 
was possible to optin its use when the ext-intl extension is present ?

1 - Does it means that if/when Rfc3986/Uri get Rfc3987 supports they 
will also get a `Uri::toDisplayString` and `Uri::getHostForDisplay` 
maybe this should be stated in the Futurscope ?

2 - I would go with final classes for both classes and promote 
decoration for extension. This would reduce security issues a lot.

3 - I would make the constructor private using a `from` , `tryFrom` or 
`parse` and `tryParse` methods to highlight the difference in result

4 - For consistency I would use toRawString and toString just like it is 
done for components.

5 - Can the returned array from __debugInfo be used in a "normal" method 
like `toComponents` naming can be changed/improve to ease migration from 
parse_url or is this left for userland library ?