Newsgroups: php.internals
Xref: news.php.net php.internals:126587
Date: Wed, 05 Mar 2025 14:45:37 -0800 (PST)
From: dennis.snell@automattic.com (Dennis Snell)
To: Máté Kocsis
Cc: Internals
Subject: Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API
Message-ID: <044E7A8E-B79D-44DB-B572-102A80CDFC3C@automattic.com>
References: <1BCB4144-231D-45EA-A914-98EE8F0F503A@automattic.com> <8E614C9C-BA85-45D8-9A4E-A30D69981C5D@automattic.com>

On Feb 16, 2025, at 3:01 PM, Máté Kocsis <kocsismate90@gmail.com> wrote:

> Hi Dennis,
>
>> I only harp on the WhatWG spec so much because for many people this will be the only one they are aware of, if they are aware of any spec at all, and this is a sizable vector of attack targeting servers from user-supplied content. I’m curious to hear from folks here what fraction of the actual PHP code deals with RFC3986 URLs, and of those, if the systems using them are truly RFC3986 systems or if the common-enough URLs are valid in both specs.
>
> I think Ignace's examples already highlighted that the two specifications differ in nuances so much that even I had to admit after months of trying to squeeze them into the same interface that doing so would be irresponsible.
> The Uri\Rfc3986\Uri will be useful for many use cases (e.g. representing URNs or URIs with scheme-specific behavior, like ldap apparently), but even the UriInterface of PSR-7 can build upon it. On the other hand, Uri\WhatWg\Url will be useful for representing browser links and any other URLs for the web (e.g. an HTTP application router component should use this class).
>
>> Just to enlighten me and possibly others with less familiarity, how and when are RFC3986 URLs used and what are those systems supposed to do when an invalid URL appears, such as when dealing with percent-encodings as you brought up in response to Tim?
>
> I am not 100% sure what I brought up to Tim, but certainly, the biggest difference between the two specs regarding percent-encoding was recently documented in the RFC: https://wiki.php.net/rfc/url_parsing_api#percent-encoding. The other main difference is how the host component is stored: WHATWG automatically percent-decodes it, while RFC3986 doesn't. This is summarized in the https://wiki.php.net/rfc/url_parsing_api#component_retrieval section (a bit below).
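
As a rough illustration of the host difference mentioned above, here is a small sketch using the `URL` class that browsers and Node.js ship, which follows the WHATWG URL Standard. It is not the proposed PHP API, and the example address is made up.

```javascript
// The built-in URL class implements the WHATWG URL Standard.
// "ex%61mple.com" is an invented host; %61 is the letter "a".
const encodedHostUrl = new URL("https://ex%61mple.com/p%61th");

// The host is percent-decoded while parsing.
console.log(encodedHostUrl.hostname);

// The path keeps its percent-encoded octets exactly as written.
console.log(encodedHostUrl.pathname);
```

An RFC3986-style parser would be expected to keep the host as written, which is the distinction the quoted paragraph draws.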
>
>> This would be fine, knowing in hindsight that it was originally a relative path. Of course, this would mean that it’s critical that `https://example.com` does not replace the actual host part if one is provided in `$url`. For example, this code should work.
>>
>> ```
>>     $url = Uri\WhatWgUri::parse( 'https://wiki.php.net/rfc', 'https://example.com' );
>>     $url->domain === 'wiki.php.net'
>> ```
>
> Yes, that's the case. Both classes only use the base URL for relative URIs.
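
The same rule can be observed with the WHATWG `URL` class available in browsers and Node.js. This is only a sketch of the behavior under discussion, not the PHP API; the `/docs/` base path and the relative reference are assumptions made for the example.

```javascript
// Assumed base URL for the demonstration.
const baseUrl = "https://example.com/docs/";

// An absolute input ignores the base entirely.
const absoluteResult = new URL("https://wiki.php.net/rfc", baseUrl);
console.log(absoluteResult.hostname);

// A relative input is resolved against the base.
const relativeResult = new URL("rfc/url_parsing_api", baseUrl);
console.log(relativeResult.href);
```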
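>
>> Hopefully this won’t be too controversial, even though the concept was new to me when I started having to reliably work with URLs. I chose the example I did because of human risk factors in security exploits. "xn--google.com" is not in fact a Google domain, but an IDNA domain decoding to "䕮䕵䕶䕱.com"
>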
> I got your point, so I implemented your suggestion. Actually, I made yet another larger API change in the meantime, but in any case, the WHATWG implementation now supports IDNA the following way:
>
> $url = Uri\WhatWg\Url::parse("https://🐘.com/🐘?🐘=🐘", null);
>
> echo $url->getHost(); // xn--go8h.com
> echo $url->getHostForDisplay(); // 🐘.com
> echo $url->toString(); // https://xn--go8h.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98
> echo $url->toDisplayString(); // https://🐘.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98
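
For comparison, the machine-readable half of that output can be reproduced with the WHATWG `URL` class built into browsers and Node.js. This is a sketch only; that class has no counterpart to `getHostForDisplay()` or `toDisplayString()`, so the display forms are not shown.

```javascript
// Same input as the quoted PHP example.
const elephantUrl = new URL("https://🐘.com/🐘?🐘=🐘");

// The host is converted to its ASCII (Punycode) form.
console.log(elephantUrl.hostname);

// Path and query are percent-encoded as UTF-8 in the serialization.
console.log(elephantUrl.href);
```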
> Unfortunately, RFC3986 doesn't support IDNA (as Ignace already pointed out at the end of https://externals.io/message/126182#126184), and adding support for RFC3987 (therefore IRIs) would be a very heavy amount of work; it's just not feasible within this RFC :( To make things worse, its code should be written from scratch, since I haven't found any suitable C library yet for this purpose. That's why I'll leave them for
>
> On other notes, let me share some of the changes since my previous message to the mailing list:
>
> - First and foremost, I removed the Uri\Rfc3986\Uri::normalize() method from the proposal after Arnaud's feedback. Now, both the normalized (and decoded), as well as the non-normalized representation can equally be retrieved from the same URI instance. This was necessary to change in order for users to be able to consistently use URIs. Now, if someone needs an exact URI component value, they can use the getRaw*() getter. If they want the normalized and percent-decoded form then a get*() getter should be used. For more information, the https://wiki.php.net/rfc/url_parsing_api#component_retrieval section should be consulted.

This seems like a good change.

> - I made a few less important API changes, like converting the WhatWgError class to an enum, adding a Uri\Rfc3986\IUri::getUserInfo() method, changing the return type of some getters (removing nullability) etc.

Love this.

> - I fixed quite some smaller details of the implementation along with a very important spec incompatibility: until now, the "path" component didn't contain the leading "/" character when it should have. Now, both classes conform to their respective specifications with regards to path handling.

This is a late thought, and surely amenable to a later RFC, but I was thinking about the get/set path methods and the issue of the / and %2F.

 - If we exposed `getPathIterator()` or `getPathSegments()` could we not report these in their fully-decoded forms? That is, because the path segments are separated by some invocation or array element, they could be decoded?
 - Probably more valuably, if `withPath()` accepted an array, could we not allow fully non-escaped PHP strings as path segments which the URL class could safely and by-default handle the escaping for the caller?

Right now, if someone haphazardly joins path segments in order to set `withPath()` they will likely be unaware of that nuance and get the path wrong. On the grand scale of things, I suspect this is a really minor risk. However, if they could send in an array then they would never need to be aware of that nuance in order to provide a fully-reliable URL, up to the class rejecting path segments which cannot be represented.
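
To make the idea concrete, here is a minimal sketch in JavaScript against the built-in WHATWG `URL` class. The helpers `pathFromSegments()` and `segmentsFromPath()` are hypothetical names invented for this illustration; they are not part of the RFC or of any existing API.

```javascript
// Hypothetical helper: build a path from unescaped segments so callers
// never have to join and escape by hand.
function pathFromSegments(segments) {
  return "/" + segments.map((segment) => {
    // "." and ".." would be interpreted as dot-segments by the parser,
    // so they cannot be represented as ordinary path segments.
    if (segment === "." || segment === "..") {
      throw new RangeError(`Unrepresentable path segment: ${segment}`);
    }
    // encodeURIComponent turns "/" into %2F and "%" into %25.
    return encodeURIComponent(segment);
  }).join("/");
}

// Hypothetical inverse: split on "/" first, decode each piece second.
function segmentsFromPath(pathname) {
  return pathname.split("/").slice(1).map((piece) => decodeURIComponent(piece));
}

const fileUrl = new URL("https://example.com/");
fileUrl.pathname = pathFromSegments(["files", "a/b", "50% off"]);

console.log(fileUrl.pathname);
console.log(segmentsFromPath(fileUrl.pathname));
```

Because decoding happens only after the split, a segment containing a literal `/` survives the round trip, which is the nuance a caller joining strings by hand would likely miss.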


> I think the RFC is now mature enough to consider voting in the foreseeable future, since most of the concerns which came up until now are addressed some way or another. However, the only remaining question that I still have is whether the Uri\Rfc3986\Uri and the Uri\WhatWg\Url classes should be final? Personally, I don't see much problem with opening them for extension (other than some technical challenges that I already shared a few months ago), and I think people will have legitimate use cases for extending these classes. On the other hand, having final classes may allow us to make slightly more significant changes without BC concerns until we have a more battle-tested API, and of course completely eliminate the need to overcome the said technical challenges. According to Tim, it may also result in safer code because spec-compliant base classes cannot be extended by possibly non-spec compliant/buggy children. I don't necessarily fully agree with this specific concern, but here it is.

I’ve taken another fresh and full review of the RFC and I just want to share my appreciation for how well-written it seems, and how meticulously you have taken everyone’s feedback and incorporated it. It seems mature enough to me as well, and I think it’s in a good place. Still, here are some additional thoughts (and a previous one again) related to some aspects, mostly naming.

The HTML5 library has `::createFromString()` instead of `parse()`. Did you consider following this form? It doesn’t seem that important, but could be a nice improvement in consistency among the newer spec-compliant APIs. Further, I think `createFromString()` is a little more obvious in intent, as `parse()` is so generic.

Given the issues around equivalence, what about `isEquivalent()` instead of `equals()`? In the RFC I think you have been careful to use the “equivalence” terminology, but then in the actual interface we fall back to `equals()` and lose some of the nuance.
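
A short sketch of why "equivalent" may be the more honest word: the WHATWG URL Standard defines equivalence by comparing serializations, so two different input strings can denote equivalent URLs. The `isEquivalent()` function below is a hypothetical stand-in written for this illustration, not the proposed PHP method.

```javascript
// Hypothetical helper: two URLs are treated as equivalent when their
// serializations match, regardless of how the inputs were spelled.
function isEquivalent(a, b) {
  return a.href === b.href;
}

const shoutedInput = "HTTPS://EXAMPLE.com:443/a/../b";
const plainInput = "https://example.com/b";

// The raw strings are clearly not equal ...
console.log(shoutedInput === plainInput);

// ... but parsing lowercases the scheme and host, drops the default
// port, and resolves the dot-segment, so the URLs are equivalent.
console.log(isEquivalent(new URL(shoutedInput), new URL(plainInput)));
```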

Something about not implementing `getRawScheme()` and friends in the WHATWG class seems off. Your rationale makes sense, but then I wonder what the problem is in exposing the raw untranslated components, particularly since the “raw” part of the name already suggests some kind of danger or risk in using it as some semantic piece.

Tim brought up the naming of `getHost()` and `getHostForDisplay()` as well as the correspondence with the `toString()` methods. I’m not sure if it was overlooked or I missed the followup, but I wonder what your thoughts are on passing an enum to these methods indicating the rendering context. Here’s why: I see developers reach for the first method that looks right. In this case, that would almost always be `getHost()`, yet `getHost()` or `toString()` or whatever is going to be inappropriate in many common cases. I see two ways of baking education into the API surface: creating two symmetric methods (e.g. `getDisplayableHost()` and `getNonDisplayableHost()`); or requiring an enum forcing the choice (e.g. `getHost( ForDisplay | ForNonDisplay )`). In the case of an enum this could be equally applied across all of the relevant methods where such a distinction exists. On one hand this could be seen as forcing callers to make a choice, but on the other hand it can also be seen as a safeguard against an extremely-common foot-gun, making such an easy oversight impossible.
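
The forced-choice idea might look roughly like the sketch below. Everything in it is hypothetical: `HostContext`, the standalone `getHost()` function, and the plain object holding both host forms are invented for illustration, with the two host strings taken from the IDNA example quoted earlier in this message. JavaScript has no native enum, so a frozen object stands in for one.

```javascript
// Hypothetical stand-in for an enum of rendering contexts.
const HostContext = Object.freeze({
  ForDisplay: "for-display",
  ForNonDisplay: "for-non-display",
});

// Hypothetical getter: there is no default, so the caller must decide
// which representation is appropriate for the situation.
function getHost(host, context) {
  switch (context) {
    case HostContext.ForDisplay:
      return host.unicode;
    case HostContext.ForNonDisplay:
      return host.ascii;
    default:
      throw new TypeError("getHost() requires an explicit HostContext");
  }
}

// Both forms of the host from the quoted example.
const elephantHost = { ascii: "xn--go8h.com", unicode: "🐘.com" };

console.log(getHost(elephantHost, HostContext.ForNonDisplay));
console.log(getHost(elephantHost, HostContext.ForDisplay));
```

Calling `getHost(elephantHost)` without a context throws, which is the safeguard described above: the easy oversight becomes an immediate error instead of a silent wrong choice.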

Just this week I stumbled upon an issue with escaping the hash/fragment part of a URL. I think that browsers used to decode percent-encodings in the fragment but they all stopped and this was removed from the WHATWG HTML spec [no-percent-escaping]. The RFC currently shows `getFragment()` decoding percent-encoded fragments. However, I believe that the WHATWG URL spec only indicates percent-encoding when _setting_ the fragment. You can test this in a browser with the following example; Chrome, Firefox, and Safari exhibit the same behavior.

    u = new URL(window.location)
    u.hash = 'one and two';
    u.hash === '#one%20and%20two';
    u.toString() === '….#one%20and%20two';

So I think it may be more accurate and consistent to handle `Whatwg\Url::getFragment` in the same way as `getScheme()`. When setting a fragment we should percent-encode the appropriate characters, but when reading it, we should never interpret those characters; it should always return the “raw” value of the fragment.
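
The browser check above can also be run outside a browser. The sketch below uses the `URL` class in Node.js with a made-up `https://example.com/` address in place of `window.location`; it shows the existing WHATWG behavior, not the PHP implementation.

```javascript
// Assumed starting URL, standing in for window.location.
const pageUrl = new URL("https://example.com/");

// Setting the fragment percent-encodes the spaces.
pageUrl.hash = "one and two";
console.log(pageUrl.hash);
console.log(pageUrl.href);

// Reading a fragment that was already encoded returns it untouched:
// the getter never decodes.
const encodedFragmentUrl = new URL("https://example.com/#one%20and%20two");
console.log(encodedFragmentUrl.hash);
```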

[no-percent-escaping]: https://github.com/whatwg/url/issues/344

Once again, thank you for the great work you’ve put into this. I’m so excited to have it. All my comments should be understood exclusively within the WHATWG domain as I don’t have the same experience with the RFC3986 side.

Dennis Snell


> Regards,
> Máté
