Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:125984
Precedence: bulk
MIME-Version: 1.0
References: <CAH5C8xUb1O20ZDrOQNC=ckFxHUUWSK7sw_njQQzFBd0qgQqoww@mail.gmail.com>
 <dd61999c-1ebd-4765-9add-cd8065968965@gmail.com> <CAOV5rgYZ_s1of-igLFEK7oqqh6=3HeYv0=JvvKvvjkcp0F6Q9Q@mail.gmail.com>
 <CAH5C8xV67frqOBCvLt73RM7QO86_pu40sHP5h71_kDRbKPtA8Q@mail.gmail.com> <1BCB4144-231D-45EA-A914-98EE8F0F503A@automattic.com>
In-Reply-To: <1BCB4144-231D-45EA-A914-98EE8F0F503A@automattic.com>
Date: Tue, 19 Nov 2024 09:49:41 +0100
Message-ID: <CAH5C8xXFpBU=h2wgdHJ6vf6CT8tvKa5V9KLBVseJJTpST=rM7Q@mail.gmail.com>
Subject: Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API
To: Dennis Snell <dennis.snell@automattic.com>
Cc: Internals <internals@lists.php.net>
Content-Type: multipart/alternative; boundary="000000000000daf2d80627401d37"
From: kocsismate90@gmail.com (Máté Kocsis)

--000000000000daf2d80627401d37
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi Dennis,

Even though I haven't replied for a long time, I have been improving my RFC
implementation in the meantime, as well as evaluating your suggestions.

> I’m worried about the side-effects that having a global
> uri.default_handler could have with code running differently for no
> apparent reason, or differently based on what is calling it. If someone is
> writing code for a controlled system I could see this being valuable, but
> if someone is writing a framework like WordPress and has no control over
> the environments in which code runs, it seems dangerous to hope that every
> plugin and every host runs compatible system configurations. Nobody is
> going to check `ini_get( 'uri.default_handler' )` before every line that
> parses URLs. Beyond this, even just *allowing* a pluggable parser invites
> broken deployments because PHP code that is reading from a browser or
> sending output to one needs to speak the language the browser is speaking,
> not some arbitrary language that’s similar to it.
>

Your arguments about the problems a global uri.default_handler INI setting
can cause convinced me, especially after I read Daniel Stenberg's blog post
on the topic (https://daniel.haxx.se/blog/2022/01/10/dont-mix-url-parsers/).
That's why I removed the setting from the RFC in favor of configuring the
parser at the level of each individual feature. However, I don't agree with
removing the pluggable parser, for the following reasons:

- the current parse_url()-based parser is already doomed: it isn't
compliant with any spec, so it already doesn't speak the language the
browser is speaking
- even though the majority does, not everyone builds browser-facing
applications with PHP, especially since URIs are not necessarily accessible
on the web
- in addition, there are tools which comply with a spec other than WHATWG.
Most prominently, cURL is mostly RFC 3986 compliant with some additional
flavour of WHATWG, according to
https://everything.curl.dev/cmdline/urls/browsers.html

That's why I intend to keep support for pluggability.
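
To make the parser-mixing concern concrete, here is a minimal sketch in
Python (not the RFC's PHP API). It assumes Python's urllib.parse as an
RFC 3986-style parser and uses a deliberately crude, hypothetical
browser_style_host() helper as a stand-in for a WHATWG parser; all function
names are illustrative. Passing the parser explicitly at the call site,
instead of reading it from a global setting, is one way to picture
configuration at the individual feature level.

```python
from typing import Callable, Optional, Set
from urllib.parse import urlsplit

# A "parser" here is any callable that extracts the host from a URL string.
HostParser = Callable[[str], Optional[str]]


def rfc3986_style_host(url: str) -> Optional[str]:
    """Host as seen by urllib.parse, which splits in an RFC 3986-like way."""
    return urlsplit(url).hostname


def browser_style_host(url: str) -> Optional[str]:
    """Crude, hypothetical stand-in for a WHATWG parser.

    Browsers treat a backslash like a forward slash in http(s) URLs, so the
    backslashes are rewritten before splitting. A real WHATWG parser does
    much more than this.
    """
    return urlsplit(url.replace("\\", "/")).hostname


def is_allowed(url: str, allowed: Set[str], host_of: HostParser) -> bool:
    # The parser is an explicit argument, so the call site shows which
    # parsing rules are applied to this particular check.
    return host_of(url) in allowed


tricky = "https://example.com\\@evil.test/"
print(rfc3986_style_host(tricky))
print(browser_style_host(tricky))
```

The two flavours extract different hosts from the same string, which is the
kind of disagreement the blog post above warns about when one component
validates a URL and another one fetches it.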


> Being able to parse a relative URL and know if a URL is relative or
> absolute would help WordPress, which often makes decisions differently
> based on this property (for instance, when reading an `href` property of a
> link). I know these aren’t spec-compliant URLs, but they still represent
> valid values for URL fields in HTML and knowing if they are relative or not
> requires some amount of parsing specific details everywhere, vs. in a class
> that already parses URLs. Effectively, this would imply that PHP’s new URL
> parser decodes `document.querySelector( 'a' ).getAttribute( 'href' )`,
> which should be the same as `document.querySelector( 'a' ).href`, and
> indicates whether it found a full URL or only a portion of one.
>
>   * `$url->is_relative` or `$url->is_absolute`
>   * `$url->specificity = URL::Relative | URL::Absolute`
>

The Uri\WhatWgUri::parse() method only accepts a relative URI as its first
argument when the second (base URI) argument is provided. So essentially you
would use this variant of parse() to parse a WHATWG-compliant URL, and
WhatWgUri should then tell you whether the URI originally passed in was
relative or not. Did I get you right? This is certainly possible with
RFC 3986 URIs (even without the base parameter), but WHATWG requires the
above-mentioned workaround for parsing, and I still have to look into how
this can be implemented...
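
As a rough illustration of the RFC 3986 side of this, here is a minimal
Python sketch (urllib.parse stands in for the proposed PHP classes; the
helper names is_relative_reference() and resolve() are hypothetical). Under
RFC 3986 a reference without a scheme is a relative reference, so the check
does not need a base URI, while resolving it does.

```python
from typing import Tuple
from urllib.parse import urljoin, urlsplit


def is_relative_reference(ref: str) -> bool:
    # RFC 3986: a URI reference that has no scheme is a relative reference.
    return urlsplit(ref).scheme == ""


def resolve(ref: str, base: str) -> Tuple[str, bool]:
    """Resolve ref against base and report whether ref was relative."""
    return urljoin(base, ref), is_relative_reference(ref)


print(resolve("../img/a.png", "https://example.com/blog/post"))
print(resolve("https://other.test/x", "https://example.com/blog/post"))
```

Returning the flag together with the resolved URL mirrors the idea of a
parsed object that remembers whether its input was relative.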

> Having methods to add query arguments, change the path, etc… would be a
> great way to simplify user-space code working with URLs. For instance,
> read a URL and then add a query argument if some condition within the URL
> warrants it (for example, the path ends in `.png`).
>

I managed to retain support for the "wither" methods that were originally
part of the proposal. This required custom code for the uriparser library,
while the maintainer of Lexbor was kind enough to add native support for
modification after I submitted a feature request. However, convenience
methods for manipulating query parameters are still not part of the RFC,
both because they would increase its scope even more and because of other
issues highlighted by Ignace in his prior email:
https://externals.io/message/123997#124077. As I really want such a
feature, I'd be eager to create a follow-up RFC dedicated to handling query
strings.
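
For readers unfamiliar with the "wither" pattern, here is a minimal Python
sketch of the `.png` use case quoted above. It is not the RFC's API:
urllib.parse's SplitResult and its _replace() method stand in for an
immutable URI object with wither methods, and the function name and the
format=webp parameter are made up for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_param_if_png(url: str) -> str:
    """Append format=webp to the query, but only for paths ending in .png."""
    parts = urlsplit(url)
    if not parts.path.endswith(".png"):
        return url
    # Decode the existing query into pairs, append the new one, re-encode.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    pairs.append(("format", "webp"))
    # _replace() returns a modified copy; `parts` itself is left untouched,
    # which is the essence of a "wither" method.
    return urlunsplit(parts._replace(query=urlencode(pairs)))


print(add_param_if_png("https://example.com/img/logo.png?v=2"))
print(add_param_if_png("https://example.com/index.html"))
```

Note how much of the function deals with the query string alone, which
hints at why dedicated query helpers are a topic of their own.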

> My counter-point to this argument is that I see security exploits appear
> everywhere that functions which implement specifications are pluggable and
> extendable. It’s easy to see the need to create a class that *limits*
> possible URLs, but that also doesn’t require extending a class. A class can
> wrap a URL parser just as it could extend one. Magic methods would make it
> even easier.
>

Right now, it's only possible to plug internal URI implementations into PHP;
userland classes cannot be used, so this probably reduces the issue.
However, I recently bumped into a technical issue caused by the URI classes
not being final, and I am currently trying to assess how to solve it. More
information is available in one of my comments on my PR:
https://github.com/php/php-src/pull/14461/commits/8e21e6760056fc24954ec36c06124aa2f331afa8#r1847316607
As far as I can see at the moment, it would probably be better to make these
classes final so that similar unforeseen issues and inconsistencies cannot
happen again (we can unfinalize them later anyway).
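
The "wrap instead of extend" idea from the quoted paragraph can be sketched
as follows. This is a minimal Python illustration, assuming urllib.parse as
the wrapped parser; the HttpsUrl class is hypothetical and only shows that
restricting the accepted URLs works through composition, so it would keep
working if the underlying classes were final.

```python
from urllib.parse import urlsplit


class HttpsUrl:
    """Wraps a parsed URL and only accepts https URLs that have a host."""

    def __init__(self, url: str) -> None:
        parts = urlsplit(url)
        if parts.scheme != "https" or not parts.hostname:
            raise ValueError(f"not an https URL: {url!r}")
        self._parts = parts

    @property
    def host(self) -> str:
        return self._parts.hostname

    def __getattr__(self, name: str):
        # "Magic method" delegation: unknown attributes (path, query, ...)
        # are looked up on the wrapped result instead of being inherited.
        if name == "_parts":
            raise AttributeError(name)
        return getattr(self._parts, name)

    def __str__(self) -> str:
        return self._parts.geturl()


url = HttpsUrl("https://example.com/x?y=1")
print(url.host, url.path, url.query)
```

The wrapper decides what it exposes, and the wrapped parser never has to be
subclassed.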


> Finally, I frequently find the need to be able to consider a URL in both
> the *display* context and the *serialization* context. With Ada we have
> `normalize_url()`, `parse_search_params()`, and the IDNA functions to
> convert between the two representations. In order to keep strong
> boundaries between security domains, it would be nice if PHP could expose
> the two variations: one is an encoded form of a URL that machines can
> easily parse while the other is a “plain string” in PHP that’s easier for
> humans to parse but which might not even be a valid URL. Part of the reason
> for this need is that I often see user-space code treating an entire URL as
> a single text span that requires one set of rules for full decoding; it’s
> multiple segments that each have their own decoding rules.
>
>  - Original [ https://xn--google.com/secret/../search?q=🍔 ]
>  - `$url->normalize()` [ https://xn--google.com/search?q=%F0%9F%8D%94 ]
>  - `$url->for_display()` Displayed [ https://䕮䕵䕶䕱.com/search?q=🍔 ]
>

Even though I didn't implement this suggestion in its entirety, I added
normalization support:
- the normalize() method can be used to create a new URI instance whose
components are normalized based on the current object
- the toNormalizedString() method can be used when only the normalized
string representation is needed
- the newly added equalsTo() method also makes use of normalization to
better identify equal URIs

For more information, please refer to the relevant section of the RFC:
https://wiki.php.net/rfc/url_parsing_api#api_design. The forDisplay()
method also seems useful at first glance, but since it may be a
controversial optional feature, I'd defer it for later...
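
To give a feel for what normalization-based comparison involves, here is a
minimal Python sketch of a few RFC 3986 normalization steps (lowercasing
the scheme and host, removing dot segments, uppercasing percent-escapes).
It is an illustration under those assumptions only, not the RFC's
implementation; the function names merely echo normalize() and equalsTo(),
and urllib.parse stands in for the PHP classes.

```python
import re
from urllib.parse import urlsplit, urlunsplit

_PCT = re.compile(r"%[0-9a-fA-F]{2}")


def _upper_escapes(text: str) -> str:
    # Percent-escapes are case-insensitive; uppercase is the normal form.
    return _PCT.sub(lambda m: m.group(0).upper(), text)


def remove_dot_segments(path: str) -> str:
    """Resolve "." and ".." in an absolute path (RFC 3986, section 5.2.4)."""
    segments = path.split("/")
    out = []
    for seg in segments:
        if seg == ".":
            continue
        if seg == "..":
            if len(out) > 1:  # never pop the leading empty segment
                out.pop()
            continue
        out.append(seg)
    if segments[-1] in (".", ".."):
        out.append("")  # keep the trailing slash, e.g. "/a/b/.." -> "/a/"
    return "/".join(out)


def normalize(url: str) -> str:
    parts = urlsplit(url)  # urlsplit already lowercases the scheme
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()  # userinfo stays case-sensitive
    path = parts.path
    if path.startswith("/"):
        path = remove_dot_segments(path)
    elif netloc and not path:
        path = "/"  # an empty path after an authority is equivalent to "/"
    return urlunsplit((parts.scheme, netloc, _upper_escapes(path),
                       _upper_escapes(parts.query), parts.fragment))


def equals_to(a: str, b: str) -> bool:
    # Two URIs are considered equal when their normalized strings match.
    return normalize(a) == normalize(b)


print(normalize("HTTPS://Example.COM/secret/../search?q=%f0%9f%8d%94"))
```

Comparing normalized strings rather than raw input is what lets two
differently spelled but equivalent URIs be recognized as equal.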

Regards,
Máté

--000000000000daf2d80627401d37--