Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:126780
Precedence: bulk
MIME-Version: 1.0
References: <CAH5C8xUb1O20ZDrOQNC=ckFxHUUWSK7sw_njQQzFBd0qgQqoww@mail.gmail.com>
 <dd61999c-1ebd-4765-9add-cd8065968965@gmail.com> <CAOV5rgYZ_s1of-igLFEK7oqqh6=3HeYv0=JvvKvvjkcp0F6Q9Q@mail.gmail.com>
 <CAH5C8xV67frqOBCvLt73RM7QO86_pu40sHP5h71_kDRbKPtA8Q@mail.gmail.com>
 <1BCB4144-231D-45EA-A914-98EE8F0F503A@automattic.com> <8E614C9C-BA85-45D8-9A4E-A30D69981C5D@automattic.com>
 <CAH5C8xV69pHnSKWJrVB8EQab7vPhcaXAwYezoG6z3Oj7ZkXBaQ@mail.gmail.com> <044E7A8E-B79D-44DB-B572-102A80CDFC3C@automattic.com>
In-Reply-To: <044E7A8E-B79D-44DB-B572-102A80CDFC3C@automattic.com>
Date: Sat, 15 Mar 2025 23:05:14 +0100
Message-ID: <CAH5C8xWpquzmvtbSk6=mL4U0z=Tupv2zD+FXag+zenF7q04HUQ@mail.gmail.com>
Subject: Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API
To: Dennis Snell <dennis.snell@automattic.com>
Cc: Internals <internals@lists.php.net>
Content-Type: multipart/alternative; boundary="00000000000091ea84063068c0eb"
From: kocsismate90@gmail.com (=?UTF-8?B?TcOhdMOpIEtvY3Npcw==?=)

--00000000000091ea84063068c0eb
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi Dennis,


> This is a late thought, and surely amenable to a later RFC, but I was
> thinking about the get/set path methods and the issue of the / and %2F.
>
>  - If we exposed `getPathIterator()` or `getPathSegments()` could we not
> report these in their fully-decoded forms? That is, because the path
> segments are separated by some invocation or array element, they could be
> decoded?
>  - Probably more valuably, if `withPath()` accepted an array, could we not
> allow fully non-escaped PHP strings as path segments which the URL class
> could safely and by-default handle the escaping for the caller?
>

Yes, these are very good ideas, and they are in line with how I would
imagine a second iteration. getPathSegments() could probably return "%2F"
(the percent-encoded form of "/") percent-decoded, sure. But the rest of the
reserved characters are also an issue, since they can appear percent-encoded
within the path as well (e.g. "&" inside "Document & Settings"). So
percent-decoding of the other reserved characters still has to be taken
into account.
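
To illustrate the point, here is a minimal JavaScript sketch of why decoding
is only safe per segment. decodePathSegments() is a hypothetical helper made
up for this example (it is not part of the RFC or of any proposed API), and
the sample path is invented:

```javascript
// Hypothetical helper: split an encoded path first, then decode each segment.
function decodePathSegments(encodedPath) {
  return encodedPath
    .split("/")
    // Drop the empty piece that precedes the leading "/".
    .filter((segment, index) => !(index === 0 && segment === ""))
    .map((segment) => decodeURIComponent(segment));
}

const encodedPath = "/Document%20%26%20Settings/a%2Fb";

// Per-segment decoding keeps "a/b" together as one segment.
const decodedSegments = decodePathSegments(encodedPath);
console.log(decodedSegments); // [ 'Document & Settings', 'a/b' ]

// Decoding the whole path before splitting is lossy: "%2F" turns into a
// separator and the original segment boundaries can no longer be recovered.
const naiveSegments = decodeURIComponent(encodedPath).split("/");
console.log(naiveSegments); // [ '', 'Document & Settings', 'a', 'b' ]
```

Once the segments are separate array elements, reserved characters such as
"&" or "/" can appear decoded without changing the structure of the path.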

> Right now, if someone haphazardly joins path segments in order to set
> `withPath()` they will likely be unaware of that nuance and get the path
> wrong. On the grand scale of things, I suspect this is a really minor risk.
> However, if they could send in an array then they would never need to be
> aware of that nuance in order to provide a fully-reliable URL, up to the
> class rejecting path segments which cannot be represented.
>

Yes, consuming an array is also a good idea, but for the same reason as
above, correctly percent-encoding "/" alone is not enough to produce a valid
URI. (Of course, I'm still talking about RFC 3986; WHATWG performs automatic
percent-encoding anyway.)
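
As a rough sketch of what an array-consuming API would have to do, assuming
JavaScript's encodeURIComponent() as a stand-in encoder: buildPath() is a
hypothetical name used only for this example, and encodeURIComponent()
escapes more characters than RFC 3986 strictly requires inside a segment, so
this is an illustration rather than the proposed behavior.

```javascript
// Hypothetical helper: build a path from raw, unescaped segments.
// Every segment is encoded on its own, so "/", "?", "#", "%" and friends
// inside a segment cannot be mistaken for URL syntax.
function buildPath(rawSegments) {
  return "/" + rawSegments.map((segment) => encodeURIComponent(segment)).join("/");
}

const builtPath = buildPath(["Document & Settings", "a/b"]);
console.log(builtPath); // /Document%20%26%20Settings/a%2Fb

// Encoding only "/" would not be enough: "?" and "%" also need escaping.
const trickyPath = buildPath(["what?", "50%"]);
console.log(trickyPath); // /what%3F/50%25

// A WHATWG parser keeps the escapes as they are, so "a%2Fb" stays one segment.
const parsed = new URL(builtPath, "https://example.com");
console.log(parsed.pathname); // /Document%20%26%20Settings/a%2Fb
```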


>
> The HTML5 library has `::createFromString()` instead of `parse()`. Did you
> consider following this form? It doesn’t seem that important, but could be
> a nice improvement in consistency among the newer spec-compliant APIs.
> Further, I think `createFromString()` is a little more obvious in intent,
> as `parse()` is so generic.
>
> Given the issues around equivalence, what about `isEquivalent()` instead
> of `equals()`? In the RFC I think you have been careful to use the
> “equivalence” terminology, but then in the actual interface we fall back to
> `equals()` and lose some of the nuance.
>

In my implementation, I tried to choose terminology that people are
familiar with instead of the technical terms of the URI specifications:
toString() (or some variant of it) instead of recompose(), and equals()
instead of isEquivalent(). parse() is probably an outlier, since it is the
correct name of the exact process. In any case, I consider these names
adequately short, and I think they convey their intent very clearly. The
technical terms would probably suit those with deep familiarity with URIs
even better, but this group will likely remain a minority. For everyone
else, the current names make more sense, so I'd prefer keeping them as-is.


>
> Something about not implementing `getRawScheme()` and friends in the
> WHATWG class seems off. Your rationale makes sense, but then I wonder what
> the problem is in exposing the raw untranslated components, particularly
> since the “raw” part of the name already suggests some kind of danger or
> risk in using it as some semantic piece.
>

Hm, interesting remark. Do I understand correctly that you are suggesting
to expose getRawScheme() and getRawHost() with their original values? If so,
this has technical challenges: the WHATWG parser doesn't store the original
value of these two components, so they are effectively lost once the
automatic transformation happens during parsing. But this is expected, since
the WHATWG specification doesn't really care about the original value of
these components.
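
The same loss can be observed with the WHATWG URL class that browsers and
Node.js ship. This is only a sketch with a made-up input URL, and it shows
the JavaScript API rather than the PHP one proposed in the RFC:

```javascript
// Scheme and host are normalized while parsing; the original spelling
// is not kept anywhere on the resulting object.
const normalized = new URL("HTTPS://EXAMPLE.com/Path");

console.log(normalized.protocol); // https:
console.log(normalized.hostname); // example.com
console.log(normalized.href);     // https://example.com/Path (path case is kept)
```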


> Tim brought up the naming of `getHost()` and `getHostForDisplay()` as well
> as the correspondence with the `toString()` methods. I’m not sure if it was
> overlooked or I missed the followup, but I wonder what your thoughts are on
> passing an enum to these methods indicating the rendering context. Here’s
> why: I see developers reach for the first method that looks right. In this
> case, that would almost always be `getHost()`, yet `getHost()` or
> `toString()` or whatever is going to be inappropriate in many common cases.
> I see two ways of baking in education into the API surface: creating two
> symmetric methods (e.g. `getDisplayableHost()` and
> `getNonDisplayableHost()`); or requiring an enum forcing the choice (e.g.
> `getHost( ForDisplay | ForNonDisplay )`). In the case of an enum this could
> be equally applied across all of the relevant methods where such a
> distinction exists. On one hand this could be seen as forcing callers to
> make a choice, but on the other hand it can also be seen as a safeguard
> against an extremely-common foot-gun, making such an easy oversight
> impossible.
>

I am a bit lost myself among the countless names I tried out in the
implementation, but I think I had toHumanFriendlyString() and
toDisplayFriendlyString() methods at some point. These ended up being
toString() and toDisplayString() after some iterations. I would be OK with
renaming getHost() and toString() so that their names suggest they don't
use IDNA, but I'd need a good enough suggestion first, since neither
"MachineFriendly" nor "NonDisplayable" sounds like the best alternative to
me. I was also considering getIdnaHost() and toIdnaString(), but I realized
these are the worst-looking names I have come up with so far.
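
For readers following along, this is the difference the naming is trying to
capture, sketched with the JavaScript WHATWG URL class and a made-up
internationalized domain. The JavaScript class only offers the ASCII form,
so the display form appears here only as the input string:

```javascript
// The WHATWG parser stores the host in its ASCII (Punycode) form.
const idnUrl = new URL("https://münchen.de/");

console.log(idnUrl.hostname); // xn--mnchen-3ya.de
console.log(idnUrl.href);     // https://xn--mnchen-3ya.de/

// A display-oriented getter would have to convert this back to "münchen.de",
// which is a separate step from what the plain getter returns.
```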


>
> Just this week I stumbled upon an issue with escaping the hash/fragment
> part of a URL. I think that browsers used to decode percent-encodings in
> the fragment but they all stopped and this was removed from the WHATWG HTML
> spec [no-percent-escaping]. The RFC currently shows `getFragment()`
> decoding percent-encoded fragments. However, I believe that the WHATWG URL
> spec only indicates percent-encoding when _setting_ the fragment. You can
> test this in a browser with the following example: Chrome, Firefox, and
> Safari exhibit the same behavior.
>
>     u = new URL(window.location)
>     u.hash = 'one and two';
>     u.hash === '#one%20and%20two';
>     u.toString() === '….#one%20and%20two';
>
> So I think it may be more accurate and consistent to handle
> `Whatwg\Url::getFragment` in the same way as `getScheme()`. When setting a
> fragment we should percent-encode the appropriate characters, but when
> reading it, we should never interpret those characters; it should always
> return the “raw” value of the fragment.
>
> [no-percent-escaping]: https://github.com/whatwg/url/issues/344
>
>
Thank you for the suggestion and for noticing this problem. I believe you
must have read a version of the RFC where I was still trying to find out
the correct percent-decoding rules for WHATWG. At some point, I completely
misunderstood what the specification prescribes, so I had to make quite a
few changes to the RFC in this regard, and I finally managed to describe
the reasoning behind the choices in detail. Now I think the rules make
sense.

Yes, my implementation automatically percent-encodes the input when parsing
or modifying a WHATWG URL. You are also right that WHATWG never
percent-decodes the output, for the following reason:

> ... the point of view of a maintainer of the WHATWG specification is that
> webservers may legitimately choose to consider encoded and decoded paths
> distinct, and a standard cannot force them not to do so.


The maintainer in question made this clear in multiple comments; this is
the one linked in the RFC:
https://github.com/whatwg/url/issues/606#issuecomment-926395864
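
The encode-on-write, never-decode-on-read behavior can be checked with the
JavaScript WHATWG URL class. This is a sketch with made-up URLs, shown in
JavaScript rather than with the PHP API from the RFC:

```javascript
const whatwgUrl = new URL("https://example.com/a%2Fb#one%20two");

// Reading: existing percent-encodings are returned untouched.
console.log(whatwgUrl.pathname); // /a%2Fb
console.log(whatwgUrl.hash);     // #one%20two

// Writing: characters that need it are percent-encoded automatically.
whatwgUrl.hash = "one and two";
whatwgUrl.pathname = "/x y";
console.log(whatwgUrl.hash);     // #one%20and%20two
console.log(whatwgUrl.pathname); // /x%20y

// Encoded and decoded paths stay distinct URLs.
const encodedHref = new URL("https://example.com/a%2Fb").href;
const decodedHref = new URL("https://example.com/a/b").href;
console.log(encodedHref === decodedHref); // false
```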

So basically, every non-raw getter returns a value that WHATWG considers
non-equivalent to the original input. This is now also explained in more
detail in the "Component retrieval" section
(https://wiki.php.net/rfc/url_parsing_api#component_retrieval). I hope

Regards,
Máté

--00000000000091ea84063068c0eb--