Newsgroups: php.internals Path: news.php.net Xref: news.php.net php.internals:126486 X-Original-To: internals@lists.php.net Delivered-To: internals@lists.php.net Received: from php-smtp4.php.net (php-smtp4.php.net [45.112.84.5]) by qa.php.net (Postfix) with ESMTPS id 6F3EB1A00C1 for ; Mon, 24 Feb 2025 09:18:22 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=php.net; s=mail; t=1740388543; bh=IjZnZaLf1AnLcOUOuWq+mDvSz+XEzomBz7U/YAQKvKM=; h=Date:Subject:To:References:From:In-Reply-To:From; b=do38aGNzxRC1PcnC+KrwSz+yAZkKvAxzye2Db/rDpNaJmat5UGeWjlF9VSc74RiaX w/+UaY+/CSocNKpZIVIpMySkHotJaq1s/gzpgAoA9Jem9jFF5LnsRvr5Kt4g1Ogt9q rrXRkR2WC2MxxpKGS70Aa8Xg7vDckonjzQP4av8iu24GRDu8tpCor84QtQ64LIeT9y gcIREqc0YtwVm/droFCF4rVbd8dJabBRlO4vZ+ZBneQ4ZIc0nxWIx1x0pvntvNRceG 37V/+Rx7PjLhRZpP4T1tVx22dUCkAM5SgmMgHQkwWfmTXvIYNEquJAfSCZ/lugmpeW IlPyqbjdgNoJw== Received: from php-smtp4.php.net (localhost [127.0.0.1]) by php-smtp4.php.net (Postfix) with ESMTP id 77CB11804FD for ; Mon, 24 Feb 2025 09:15:40 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on php-smtp4.php.net X-Spam-Level: X-Spam-Status: No, score=-3.9 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,DMARC_PASS,FREEMAIL_FROM, RCVD_IN_DNSWL_NONE,RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_PASS autolearn=no autolearn_force=no version=4.0.0 X-Spam-Virus: No X-Envelope-From: Received: from mail-wr1-f44.google.com (mail-wr1-f44.google.com [209.85.221.44]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by php-smtp4.php.net (Postfix) with ESMTPS for ; Mon, 24 Feb 2025 09:15:35 +0000 (UTC) Received: by mail-wr1-f44.google.com with SMTP id ffacd0b85a97d-38f1e8efef5so2377450f8f.1 for ; Mon, 24 Feb 2025 01:18:13 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1740388692; x=1740993492; darn=lists.php.net; h=content-transfer-encoding:in-reply-to:from:content-language :references:to:subject:user-agent:mime-version:date:message-id:from :to:cc:subject:date:message-id:reply-to; bh=qaBKqPCvKfJ1Z2F9sWMACYEbPNgvLY6iH7t2KYDeF/k=; b=LsKckRoRNZNy71SeJlBdGVqvYBxeKXAXXQ3T8PdFRigRqe9vxZkRHR8wm3JmcRt9tO kSNK/86ErHnxEnouCWe1FtbhcSbF3pFXmXvkjkmDofZcCsfQ1PKa3WvLz71PNf8SS7zH g9cTMIRLdwyopHluteEUeyfxQ41WGjlmztV4qDvf1fi7ltkv++OxUBjaJu7g194Z3eqg 2z5wu6KlgOEPT93TJi2t5P8X6Ck6/806ix7ZwqsYzXee006uU8MzNxfvGfrryvlHYjn1 9S373GjpGaLaYHGoFMeYuMBU41BOgNWEEZ3PRCJyLXdzgtcLpA/D9NkK4rFGDfGA04BN jL0A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1740388692; x=1740993492; h=content-transfer-encoding:in-reply-to:from:content-language :references:to:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=qaBKqPCvKfJ1Z2F9sWMACYEbPNgvLY6iH7t2KYDeF/k=; b=bwSDXndUFiLNnokgT0CCsPgiB54mPa3c+qXSFk3nU2mR1i9e0Me5X/4kM0x3sCfhSz 6NBlTgZ6CUdKT6mUBXPvzZYpEhV2UpCs8J7F5FNp9eTiT8sO0wwrdrnw1OecuCFs7hSO jdlYpPnRxVFoUFMuF3qA9Fgc+H8qxDQttBZIkYX0Lg6FqK27EElHy60z+Rv/4sSUP3Fa 9g0NhU0RFE3qN2KrIdnl3+olfPKZFDiHUzJ2Paaa2TTC9rXT45HqisZ8YAULazoTPF72 wSx5UMn4QsjYhNEtUmjAgLekYm0KFbpcdJydE9h3f2TJSqBE/BICN8SWjhUWrkYNSvI7 xJug== X-Gm-Message-State: AOJu0YzUEdzHTr8lPjzG98e72dGvPw7ZMHUKndJOd/rWcPTkd03ukID2 rKbjkx3AOCyT8b/Sb4GzIljWzAg3mFhxBwfkRxGTcbF4WHjw/n1AhAWt6Q== X-Gm-Gg: ASbGncsJFcgz273USf3FVymSGvBqyRZI7XJoIR/bNOK++Fa+OzIQ1r7I5y+q+b1YzT8 vz4/eyVayEVnuZb1X1MMJMryCFPnqLdAh8C709bjITIe3ZqLRg3D60riLpJWXR++zxTY4L8jo10 S+7/EgksbGHRTqMhKJI/W05YG8yIK9IyeRN54C5Duhp43PrpBLq2NpjW6Qw0xAIpa3NHqkkjgnw /NtlQ//uyBbGpxH9TO14J5a5M+NthKY88NQcsTMVicNSiwu5yFWjVnxojmyxmPXhPz4/82EsXLz rkQ+ZM67+dnsgi62IH0G0UfzP/thmUwePa6OD7boo75+jn1/ISKZXclV8nGny1hSdrYZyqOtvPf OFOUBEOHeNN1NAOU1mjuE3Pru2KvepptoUDB/mOAFHORbaG4N/hncuxMMib/SwPq0R8CJ4n94IL MCaBY9 X-Google-Smtp-Source: AGHT+IERsQsybeCK0COfMulOSHOzsK7r29ZpT7BTPE+6h/o6Qo1OiDRCdzKpMFI4swlKbx5prE8c+w== X-Received: by 2002:a05:6000:400f:b0:38f:475d:1411 with SMTP id ffacd0b85a97d-38f6e95c042mr12138641f8f.18.1740388692283; Mon, 24 Feb 2025 01:18:12 -0800 (PST) Received: from ?IPV6:2a02:1811:3716:cb00:5969:4587:cb75:7795? (ptr-9c16nbct5s3cflsyd91.18120a2.ip6.access.telenet.be. [2a02:1811:3716:cb00:5969:4587:cb75:7795]) by smtp.gmail.com with ESMTPSA id ffacd0b85a97d-38f6bfbe8efsm10439552f8f.46.2025.02.24.01.18.11 for (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Mon, 24 Feb 2025 01:18:11 -0800 (PST) Message-ID: <9bf11a89-39d9-457b-b0ea-789fd07d7370@gmail.com> Date: Mon, 24 Feb 2025 10:18:10 +0100 Precedence: bulk list-help: list-post: List-Id: internals.lists.php.net x-ms-reactions: disallow MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API To: internals@lists.php.net References: <1BCB4144-231D-45EA-A914-98EE8F0F503A@automattic.com> <8E614C9C-BA85-45D8-9A4E-A30D69981C5D@automattic.com> Content-Language: fr In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit From: nyamsprod@gmail.com (Ignace Nyamagana Butera) On 21/02/2025 13:06, Tim Düsterhus wrote: > Hi > > Am 2025-02-16 23:01, schrieb Máté Kocsis: >>> I only harp on the WhatWG spec so much because for many people this >>> will >>> be the only one they are aware of, if they are aware of any spec at >>> all, >>> and this is a sizable vector of attack targeting servers from >>> user-supplied >>> content. I’m curious to hear from folks here hat fraction of the >>> actual PHP >>> code deals with RFC3986 URLs, and of those, if the systems using >>> them are >>> truly RFC3986 systems or if the common-enough URLs are valid in both >>> specs. >>> >> >> I think Ignace's examples already highlighted that the two >> specifications >> differ in nuances so much that even I had to admit after months of >> trying >> to squeeze them into the same interface that doing so would be >> irresponsible. > > I think this is also a good argument in favor of finally making the > classes final. Not making them final would allow for irresponsible > sub-classes :-) > >> echo $url->getHost();                // xn--go8h.com >> echo $url->getHostForDisplay();      // 🐘.com >> echo $url->toString();               // >> https://xn--go8h.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98 >> echo $url->toDisplayString();        / >> https://🐘.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98 > > The naming of these methods seems to be a little inconsistent. It > should either be: > >     ->getHostForDisplay() >     ->toStringForDisplay() > > or > >     ->getDisplayHost() >     ->toDisplayString() > > but not a mix between both of them. > >> I think the RFC is now mature enough to consider voting in the >> foreseeable future, since most of the concerns which came up until >> now are >> addressed some way or another. However, the only remaining question >> that I >> still have is whether the Uri\Rfc3986\Uri and the Uri\WhatWg\Url classes >> should be final? Personally, I don't see much problem with opening >> them for > > Yes. Besides the remark above, my previous arguments still apply (e.g. > `with()`ers not being able to construct instances for subclasses, > requiring to override all of them). I'm also noticing that > serialization is unsafe with subclasses that add a `$__uri` property > (or perhaps any property at all?). > > -------------------- > > We already had extensive off-list discussion about the RFC and I agree > it's in a good shape now. I've given it another read and here's my > remarks: > > 1. > > The `toDisplayString()` method that you mentioned above is not in the > RFC. Did you mean `toHumanFriendlyString()`? Which one is correct? > > 2. > > The example output of the `$errors` array does not match the stub. It > contains a `failure` property, should that be `softError` instead? > > 3. > > The RFC states "When trying to instantiate a WHATWG Url via its > constructor, a Uri\InvalidUriException is thrown when parsing results > in a failure." > > What happens for Rfc3986 when passing an invalid URI to the > constructor? Will an exception be thrown? What will the error array > contain? Is it perhaps necessary to subclass Uri\InvalidUriException > for use with WhatWgUrl, since `$errors` is not applicable for 3986? > > 4. > > The RFC does not specify when `UninitializedUriException` is thrown. > > 5. > > The RFC does not specify when `UriOperationException` is thrown. > > 6. > > Generally speaking I believe it would help understanding if you would > add a `/** @throws InvalidUriException */` to each of the methods in > the stub to make it clear which ones are able to throw (e.g. > resolve(), or the withers). It's harder to find this out from > “English” rather than “code” :-) > > 7. > > In the “Component retrieval” section: Please add even more examples of > what kind of percent-decoding will happen. For example, it's important > to know if `%26` is decoded to `&` in a query-string. Or if `%3D` is > decoded to `=`. This really is the same case as with `%2F` in a path. > The explanation > > "the URI is normalized (when applicable), and then the reserved > characters in the context of the given component are percent-decoded. > This means that only those reserved characters are percent-decoded > that are not allowed in a component. This behavior is needed to be > able to unambiguously retrieve components." > > alone is not clear to me. “reserved characters that are not allowed in > a component”. I assume this means that `%2F` (/) in a path will not be > decoded, but `%3F` (?) will, because a bare `?` can't appear in a path? > > 8. > > In the “Component retrieval” section: You compare the behavior of > WhatWgUrl and Rfc3986Uri. It would be useful to add something like: > >     $url->getRawScheme() // does not exist, because WhatWgUrl always > normalizes the scheme > > to better point out the differences between the two APIs with regard > to normalization (it's mentioned, but having it in the code blocks > would make it more visible). > > 9. > > In the “Component Modification” section, the RFC states that WhatWgUrl > will automatically encode `?` and `#` as necessary. Will the same > happen for Rfc3986? Will the encoding of `#` also happen for the > query-string component? The RFC only mentions the path component. > > I'm also wondering if there are cases where the withers would not > round-trip, i.e. where `$url->withPath($url->getPath())` would not > result in the original URL? > > 10. > > Can you add examples where the authority / host contains IPv6 > literals? It would be useful to specifically show whether or not the > square brackets are returned when using the getters. It would also be > interesting to see whether or not IPv6 addresses are normalized (e.g. > shortening `2001:db8:0:0:0:0:0:1` to `2001:db8::1`). > > 11. > > In “Component Recomposition” the RFC states "The > Uri\Rfc3986\Uri::toString() returns the unnormalized URI string". > > Does this mean that toString() for Rfc3986 will always return the > original input? > > 12. > > It would be useful to know whether or not the classes implement > `__debugInfo()` / how they appear when `var_dump()`ing them. > > Best regards > Tim Düsterhus Hi Maté I just read the final proposal and here's my quick remarks it may be possible other have already highlighted some of those remarks: I believe there's a typo in the RFC > All URI components - with the exception of the host - can be retrieved in two formats: I believe you mean - with the excepotion of the Port 0 - It is a unfortunate that there's no IDNA support for RFC3986, I understand the reasoning behind that decision but I was wondering if it was possible to optin its use when the ext-intl extension is present ? 1 - Does it means that if/when Rfc3986/Uri get Rfc3987 supports they will also get a `Uri::toDisplayString` and `Uri::getHostForDisplay` maybe this should be stated in the Futurscope ? 2 - I would go with final classes for both classes and promote decoration for extension. This would reduce security issues a lot. 3 - I would make the constructor private using a `from` , `tryFrom` or `parse` and `tryParse` methods to highlight the difference in result 4 - For consistency I would use toRawString and toString just like it is done for components. 5 - Can the returned array from __debugInfo be used in a "normal" method like `toComponents` naming can be changed/improve to ease migration from parse_url or is this left for userland library ?