From: Máté Kocsis <kocsismate90@gmail.com>
Date: Sat, 15 Mar 2025 23:05:14 +0100
Subject: Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API
To: Dennis Snell <dennis.snell@automattic.com>
Cc: Internals <internals@lists.php.net>
Message-ID: <CAH5C8xWpquzmvtbSk6=mL4U0z=Tupv2zD+FXag+zenF7q04HUQ@mail.gmail.com>
In-Reply-To: <044E7A8E-B79D-44DB-B572-102A80CDFC3C@automattic.com>

Hi Dennis,

> This is a late thought, and surely amenable to a later RFC, but I was
> thinking about the get/set path methods and the issue of the / and %2F.
>
>  - If we exposed `getPathIterator()` or `getPathSegments()` could we not
>    report these in their fully-decoded forms? That is, because the path
>    segments are separated by some invocation or array element, they could
>    be decoded?
>  - Probably more valuably, if `withPath()` accepted an array, could we not
>    allow fully non-escaped PHP strings as path segments which the URL
>    class could safely and by-default handle the escaping for the caller?

Yes, these are very good ideas, and they are in line with how I would
imagine a second iteration. getPathSegments() could probably return "%2F"
(the percent-encoded form of "/") percent-decoded, sure. But the rest of
the reserved characters would also be an issue, since they can also appear
percent-encoded within the path (e.g. "&" inside "Document & Settings").
So percent-decoding of reserved characters would still have to be taken
into account.

> Right now, if someone haphazardly joins path segments in order to set
> `withPath()` they will likely be unaware of that nuance and get the path
> wrong. On the grand scale of things, I suspect this is a really minor
> risk. However, if they could send in an array then they would never need
> to be aware of that nuance in order to provide a fully-reliable URL, up
> to the class rejecting path segments which cannot be represented.

Yes, consuming an array is also a good idea, but for the same reason as
above, correctly percent-encoding "/" alone is not enough to produce a
valid URI. (Of course, I'm still talking about RFC 3986; WHATWG still
performs automatic percent-encoding.)

> The HTML5 library has `::createFromString()` instead of `parse()`. Did
> you consider following this form? It doesn’t seem that important, but
> could be a nice improvement in consistency among the newer
> spec-compliant APIs. Further, I think `createFromString()` is a little
> more obvious in intent, as `parse()` is so generic.
>
> Given the issues around equivalence, what about `isEquivalent()` instead
> of `equals()`?
> In the RFC I think you have been careful to use the “equivalence”
> terminology, but then in the actual interface we fall back to `equals()`
> and lose some of the nuance.

In my implementation, I tried to choose terminology that people are
familiar with instead of the technical terms of URIs: instead of
recompose(), I used toString() (or some variant of it), and instead of
isEquivalent(), I used equals(). parse() is probably an outlier, since it
is the correct name of the exact process. In any case, I consider these
names adequately short, and I think they convey their intent very clearly.
The technical terms would probably suit those who have deep familiarity
with URIs even better, but this group will likely always be a minority.
For everyone else, the current names make more sense, so I'd prefer
keeping them as-is.

> Something about not implementing `getRawScheme()` and friends in the
> WHATWG class seems off. Your rationale makes sense, but then I wonder
> what the problem is in exposing the raw untranslated components,
> particularly since the “raw” part of the name already suggests some kind
> of danger or risk in using it as some semantic piece.

Hm, interesting remark. Do I understand correctly that you are suggesting
to expose getRawScheme() and getRawHost() with their original values? If
so, this has technical challenges: the WHATWG parser doesn't store the
original value of these two components, so they are effectively lost when
the automatic transformation happens during parsing. This is normal,
though, since the WHATWG specification doesn't really care about the
original value of these components.

> Tim brought up the naming of `getHost()` and `getHostForDisplay()` as
> well as the correspondence with the `toString()` methods.
> I’m not sure if it was overlooked or I missed the followup, but I wonder
> what your thoughts are on passing an enum to these methods indicating
> the rendering context. Here’s why: I see developers reach for the first
> method that looks right. In this case, that would almost always be
> `getHost()`, yet `getHost()` or `toString()` or whatever is going to be
> inappropriate in many common cases. I see two ways of baking in
> education into the API surface: creating two symmetric methods (e.g.
> `getDisplayableHost()` and `getNonDisplayableHost()`); or requiring an
> enum forcing the choice (e.g. `getHost( ForDisplay | ForNonDisplay )`).
> In the case of an enum this could be equally applied across all of the
> relevant methods where such a distinction exists. On one hand this could
> be seen as forcing callers to make a choice, but on the other hand it
> can also be seen as a safeguard against an extremely-common foot-gun,
> making such an easy oversight impossible.

I am also a bit lost among the countless names that I tried out in the
implementation, but I think I had toHumanFriendlyString() and
toDisplayFriendlyString() methods at some point. After some iterations,
these ended up being toString() and toDisplayString(). I would be OK with
renaming getHost() and toString() so that their names suggest they don't
use IDNA, but I'd clearly need a good enough suggestion, since neither
"MachineFriendly" nor "NonDisplayable" sounds like a good alternative to
me. I was also considering getIdnaHost() and toIdnaString(), but I
realized these are the worst-looking names I have come up with so far.

> Just this week I stumbled upon an issue with escaping the hash/fragment
> part of a URL. I think that browsers used to decode percent-encodings in
> the fragment but they all stopped and this was removed from the WHATWG
> HTML spec [no-percent-escaping].
> The RFC currently shows `getFragment()` decoding percent-encoded
> fragments. However, I believe that the WHATWG URL spec only indicates
> percent-encoding when _setting_ the fragment. You can test this in a
> browser with the following example: Chrome, Firefox, and Safari exhibit
> the same behavior.
>
>     u = new URL(window.location)
>     u.hash = 'one and two';
>     u.hash === '#one%20and%20two';
>     u.toString() === '….#one%20and%20two';
>
> So I think it may be more accurate and consistent to handle
> `Whatwg\Url::getFragment` in the same way as `getScheme()`. When setting
> a fragment we should percent-encode the appropriate characters, but when
> reading it, we should never interpret those characters; it should always
> return the “raw” value of the fragment.
>
> [no-percent-escaping]: https://github.com/whatwg/url/issues/344

Thank you for the suggestion and for noticing this problem. I believe you
must have read a version of the RFC in which I was still trying to find
out the correct percent-decoding rules for WHATWG. At some point, I was
completely misunderstanding what the specification prescribed, so I had to
make quite a few changes to the RFC regarding this aspect, and I finally
managed to describe the reasoning behind the choices in detail. Now I
think the rules make sense.

Yes, my implementation automatically percent-encodes the input when
parsing or modifying a WHATWG URL. You are also right that WHATWG never
percent-decodes the output, for the following reason:

> ... the point of view of a maintainer of the WHATWG specification is that
> webservers may legitimately choose to consider encoded and decoded paths
> distinct, and a standard cannot force them not to do so.
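
The behaviour discussed above (percent-encode when setting, never
percent-decode when reading) can be observed with the standard WHATWG
`URL` class available in browsers and Node.js. This is only a minimal
sketch: it uses the JavaScript API, not the PHP API proposed in the RFC,
and the example.com URL and the variable names are made up for
illustration.

```javascript
// Sketch only: standard WHATWG `URL` class (browsers, Node.js),
// NOT the PHP API proposed in the RFC. The URL is a made-up example.
const url = new URL("https://example.com/docs/a%2Fb");

// Setting a component percent-encodes what the specification requires.
url.hash = "one and two";
const fragment = url.hash; // "#one%20and%20two"

// Reading a component never percent-decodes it: "%2F" stays "%2F".
const path = url.pathname; // "/docs/a%2Fb"

// Already-encoded input is not encoded a second time, because "%" is not
// part of the fragment percent-encode set.
url.hash = "one%20and%20two";
const unchanged = url.hash; // "#one%20and%20two"

console.log(fragment, path, unchanged);
```

Since the getters return the stored, encoded form, a caller who wants
"/docs/a/b" has to decode explicitly and accept that the result may no
longer identify the same resource on the server.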

The maintainer in question made this clear in multiple comments, but this
one is linked in the RFC:
https://github.com/whatwg/url/issues/606#issuecomment-926395864

So basically all the non-raw getters return a value that WHATWG considers
non-equivalent to the original input. This is now also explained in more
detail in the "Component retrieval" section
(https://wiki.php.net/rfc/url_parsing_api#component_retrieval). I hope

Regards,
Máté