Hi Everyone,
I've been working on a new RFC for a while now, and time has come to
present it to a wider audience.
Last year, I learnt that PHP doesn't have built-in support for parsing URLs
according to any well-established standard (RFC 1738 or the WHATWG URL
living standard), since the parse_url() function is optimized for
performance instead of correctness.
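To make this concrete (an illustration of mine, not an example from the RFC):
parse_url() splits its input without applying the normalization a browser would.

    var_dump(parse_url('https:example.com/path'));
    // ['scheme' => 'https', 'path' => 'example.com/path']
    // parse_url() sees no "//", so everything after the scheme becomes
    // the path. A WHATWG-compliant parser treats https: as a "special"
    // scheme and interprets the same input as https://example.com/path.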
In order to improve compatibility with external tools consuming URLs (like
browsers), my new RFC would add WHATWG-compliant URL parsing functionality
to the standard library. The API itself is not final by any means; the RFC
only represents how I first imagined it.
You can find the RFC at the following link:
https://wiki.php.net/rfc/url_parsing_api
Regards,
Máté
Hey Máté,
So far, amazing! 👏
This is a great addition to have! I see there's nothing specifically about
__toString in the RFC; is this aiming to do the same as PSR-7?
Hi Máté

+1 from me, I'm all for modern web-related APIs as you know.

Some questions/remarks:
- Why did you choose UrlParser to be a "static" class? Right now it's just a fancy namespace. I can see the point of having a UrlParser class that you can e.g. configure with which URL standard you want, but as it is now there is no such capability.
- It's a bit of a shame that the PSR interface treats queries as strings. In JavaScript we have the URLSearchParams class that we can use as a key-value storage for query parameters. This JavaScript class also handles escaping them nicely. (See the snippet after this list for what PHP offers today.)
- Why is UrlComponent a backed enum?
- A nit: we didn't bundle the entire Lexbor engine, only select parts of it. Just thought I'd make it clear.
- About edge cases: e.g. what happens if I call the Url constructor and leave every string field empty?
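On the query-string point: today the closest built-in equivalents already treat
the query as a key-value structure, which is roughly what a URLSearchParams-like
API would formalize (a quick sketch, not part of the RFC):

    parse_str('tags[]=php&tags[]=url&q=a b', $params);
    // $params === ['tags' => ['php', 'url'], 'q' => 'a b']
    echo http_build_query($params);
    // tags%5B0%5D=php&tags%5B1%5D=url&q=a+b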
Overall seems good.
Kind regards
Niels
- Why did you choose UrlParser to be a "static" class?
Because "static class" is the hip new cool ;)
Bilge
- It's a bit of a shame that the PSR interface treats queries as strings.
In Javascript we have the URLSearchParams class that we can use as a key-value storage for query parameters.
This Javascript class also handles escaping them nicely.
Agreed, this is a weird choice to me, but I'm also not surprised by weird choices via php-fig (log level constants, I'm looking at you).

We hear all the time how userland is more flexible and can change quicker, and yet here we see a potential built-in class having a worse API because it wants to be compatible with an existing userland interface with the same bad API...
Cheers
Stephen
I personally ignore PSR when it doesn't make sense to use it. PSRs are nice for library compatibility, but I will happily toss compatibility when it doesn't make sense to be compatible. This might be one of those cases, as there is no reason it has to be PSR compliant. In fact, a wrapper may be written to make it compliant, if one so chooses. I suspect it is better to be realistic and learn from the shortcomings of PSR and apply those learnings here, vs. reiterating them and "engraving them in stone" (so to speak).
— Rob
While I do not think the debate should be about compatibility with PSR-7,
some historical context should be brought to light for a fair discussion:

- parse_url and parse_str predate RFC 3986.
- URLSearchParams was ratified before PSR-7, BUT the first implementation landed a year AFTER PSR-7 was released and already implemented.
- PHP's historical query parser parse_str logic is so bad (mangled parameter names, for instance; illustrated below) that PSR-7 was right not to embed that parsing algorithm in its specification.
- If you take aside the URITemplate specification and now URLSearchParams, there is no official, referenced and/or agreed-upon rules/document on how a query string MUST or SHOULD be parsed.
- Last but not least, URLSearchParams encoding/decoding rules follow NEITHER RFC 1738 nor RFC 3986 (they follow form data encoding, which is kind of a mix between both RFCs).

This means that just adding a method or a class that mimics URLSearchParams
100% will constitute a major departure in how PHP treats query strings: you
will no longer have a 1:1 relation between the data you have inside your
$_GET array and the one in URLSearchParams, for better or for worse.

For all these arguments I would keep the proposed Url free of all
these concerns and lean toward a nullable string for the query string
representation, and defer this debate to its own RFC regarding query
string parsing handling in PHP.
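To illustrate the parameter-name mangling mentioned above (current PHP
behaviour, example mine):

    parse_str('a.b=1&a b=2', $out);
    // Dots and spaces in top-level parameter names are both converted
    // to underscores, so the two names collide:
    var_dump($out); // ['a_b' => '2']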
Hi Ignace,
As far as I understand it, if this RFC were to pass as is, it will model
PHP URLs on the WHATWG specification. While this specification is
getting a lot of traction lately, I believe it will restrict URL usage in
PHP instead of making developers' lives easier. While PHP started as a
"web" language, it is first and foremost a server side general purpose
language. The WHATWG spec, on the other hand, is created by browser
vendors and is geared toward browsers (client side), and because of
browser history it restricts by design a lot of what PHP developers can
currently do using parse_url. In my view the Url class in
PHP should allow dealing with any IANA registered scheme, which is not
the case for the WHATWG specification.
Supporting IANA registered schemes is a valid request, and is definitely
useful. However, I think this feature is not strictly required in the
current RFC. Anyone who needs to support features that are not offered by
the WHATWG standard can still rely on parse_url(). And of course, we can
(and should) add support for other standards later. If we wanted to do all
of this in the same RFC, then the scope of the RFC would become way too
large IMO. That's why I opt for incremental improvements.

Besides, I fail to see why a WHATWG compliant parser wouldn't be useful in
PHP: yes, PHP is server side, but it still interacts with browsers very
heavily. Among other use-cases I cannot yet imagine, the major one is most
likely validating user-supplied URLs for opening in the browser. As far as
I see the situation, there is currently no acceptably reliable way to
decide whether a URL can be opened in browsers or not.
- parse_url and parse_str predate RFC 3986
- URLSearchParams was ratified before PSR-7 BUT the first implementation landed a year AFTER PSR-7 was released and already implemented.
Thank you for the historical context!
Based on your and others' feedback, it has now become clear to me that
parse_url() is still useful and ext/url needs quite some additional
capabilities before this function really becomes superfluous. That's why it
now seems to me that the behavior of parse_url() could be leveraged in
ext/url so that it would work with a Url\Url class (e.g. we had a
PhpUrlParser class extending Url\UrlParser, or a Url\Url::fromPhpParser()
method, depending on which object model we choose; of course the names are
TBD).
For all these arguments I would keep the proposed Url free of all
these concerns and lean toward a nullable string for the query string
representation, and defer this debate to its own RFC regarding query
string parsing handling in PHP.

My WIP implementation still uses nullable properties and return types; I
only changed those when I wrote the RFC. Since I see that PSR-7
compatibility is a very low priority for everyone involved in the
discussion, I think making these types nullable is fine. It wasn't my top
priority either, but I had to start the object design somewhere, so I went
with this.
Again, thank you for your constructive criticism.
Regards,
Máté
Supporting IANA registered schemes is a valid request, and is definitely
useful. However, I think this feature is not strictly required in the
current RFC. Anyone who needs to support features that are not offered by
the WHATWG standard can still rely on parse_url(). And of course, we can
(and should) add support for other standards later. If we wanted to do all
of this in the same RFC, then the scope of the RFC would become way too
large IMO. That's why I opt for incremental improvements.
It's also worth pointing out (as another reason not to do this) that IANA registration may or may not be valid on the current network. For example, TOR, Handshake, IPFS, Freenet, etc. all have their own DNS schemes and do not (usually) use IANA registered schemes, and many people create sites that cater to those networks.
Besides, I fail to see why a WHATWG compliant parser wouldn't be useful in
PHP: yes, PHP is server side, but it still interacts with browsers very
heavily. Among other use-cases I cannot yet imagine, the major one is most
likely validating user-supplied URLs for opening in the browser. As far as
I see the situation, there is currently no acceptably reliable way to
decide whether a URL can be opened in browsers or not.

Looking at the WHATWG spec, it looks like example%2Ecom will be parsed as a
valid URL and transformed to example.com, while this doesn't currently
happen in parse_url(). I don't know if that is an issue, but it might be if
you are expecting the string to remain URL-encoded.
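For reference, current behaviour (example mine):

    var_dump(parse_url('https://example%2Ecom/'));
    // ['scheme' => 'https', 'host' => 'example%2Ecom', 'path' => '/']
    // parse_url() keeps the host byte-for-byte; a WHATWG parser
    // percent-decodes and normalizes it to "example.com".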
My WIP implementation still uses nullable properties and return types; I
only changed those when I wrote the RFC. Since I see that PSR-7
compatibility is a very low priority for everyone involved in the
discussion, I think making these types nullable is fine.
The spec contains elements and their types. It would be good to adhere to the spec (it simplifies documentation); a sketch follows the list:
- scheme may be null or an empty string
- port may be null
- path is never null, but may be an empty string
- query may be null
- fragment may be null
- user/password may be null (to differentiate between an empty password and no password)
- host may be null (for relative URLs)
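A hypothetical constructor shape matching that list (all names are
placeholders, not the RFC's API):

    final class Url
    {
        public function __construct(
            public readonly ?string $scheme,
            public readonly ?string $user,
            public readonly ?string $password,
            public readonly ?string $host,     // null for relative URLs
            public readonly ?int    $port,
            public readonly string  $path,     // never null, may be ''
            public readonly ?string $query,
            public readonly ?string $fragment,
        ) {}
    }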
— Rob
Here's a list of examples worth adding to the RFC:
//example.com?
ftp://user@example.com/path/to/file
https://user:@example.com
https://user:pass@example%2Ecom/?something=other&bool#heading
etc.
— Rob
Hi Máté,
Supporting IANA registered schemes is a valid request, and is
definitely useful. However, I think this feature is not strictly
required to have in the current RFC.
True. Having a WHATWG compliant parser in PHP source code is a big +1
from me; I have nothing against that inclusion.
Based on your and others' feedback, it has now become clear for me
that parse_url() is still useful and ext/url needs quite some additional
capabilities until this function really becomes superfluous.

parse_url can only be deprecated when an RFC 3986 compliant parser is
added to php-src, hence why I insist on having that parser present too.

I will also add that everything up to now in PHP uses RFC 3986 as the
basis for generating or representing URLs (the cURL extension, streams,
etc.). Having the first and only OOP representation of a URL in the
language not follow that same specification seems odd to me. It opens the
door to inconsistencies that will only be resolved once an equivalent
RFC 3986 URL object makes its way into the source code.
On the public API side I would recommend the following:

- If you are to strictly follow the WHATWG specification, no URI component can be null; they must all be strings. If we plan to use the same object for an RFC 3986 compliant parser, then all components should be nullable except for the path component, which can never be null as it is always present.
- As others have mentioned, we should add a method to resolve a URI against a base URI, something like Url::resolve(string $url, Url|string|null $baseUrl), where the $baseUrl argument should be an absolute URL if present. If absent, the $url argument must be absolute, otherwise an exception should be thrown (see the sketch after this list).
- Last but not least, the WHATWG specification is not only a URL parser but also a URL validator, and it can apply some "corrections" to malformed URLs and report them. The specification has a provision for a structure to report malformed URL errors. I failed to see this mechanism mentioned anywhere in the RFC. Will the URL only trigger exceptions, or will it also trigger warnings? For inspiration, the excellent PHP userland WHATWG URL parser from Trevor Rowbotham (https://github.com/TRowbotham/URL-Parser) allows using a PSR-3 logger to record those errors.
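A sketch of how the suggested resolution method could look in use
(Url::parse() here is a placeholder factory name, not part of the RFC):

    $base = Url::parse('https://example.com/a/b/c');
    $url  = Url::resolve('../d?e=f', $base);
    // RFC 3986 reference resolution yields https://example.com/a/d?e=f

    Url::resolve('../d?e=f', null); // relative input with no base: throws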
Best regards,
Ignace
If you are to strictly follow the WHATWG specification, no URI component
can be null; they must all be strings.
This isn't true. It's just that in the language the spec is written in, any element can be null (i.e., there are no nullable types). It specifies what may be null here: https://url.spec.whatwg.org/#url-representation
— Rob
Hi Máté,
Fantastic RFC :)
On Sun, Jul 7, 2024 at 11:17, Máté Kocsis kocsismate90@gmail.com wrote:
Supporting IANA registered schemes is a valid request, and is definitely
useful. However, I think this feature is not strictly required in the
current RFC. Anyone who needs to support features that are not offered by
the WHATWG standard can still rely on parse_url().
If I may, parse_url is showing its age, and issues like
https://github.com/php/php-src/issues/12703 make it unreliable. We need an
escape plan from it.

FYI, we're discussing whether a Uri component should make it into Symfony
precisely to work around parse_url's issues in
https://github.com/php/php-src/issues/12703

Your RFC would be the perfect answer to this discussion, but IRI would need
to be part of it.

I agree with everything Ignace said. Supporting RFC 3986 from day 1 would
be absolutely great!

Note that we use parse_url for http URLs, but also to parse DSNs like
redis://localhost and the likes.
Hey Ignace, Nicolas,
Based on your request for adding support for RFC 3986 spec compatible
parsing, I evaluated another library
(https://github.com/uriparser/uriparser/) in recent days in order to add
support for the requested functionality. As far as I can tell, the results
were very promising, so I'm OK with including this in my proposal (I
haven't pushed my changes yet and haven't updated the RFC yet).
Regarding the reference resolution feature
(https://uriparser.github.io/doc/api/latest/#resolution) which has also
already been asked for, I'm genuinely wondering what the use-case is. But
in any case, I'm fine with incorporating this into the RFC as well, since
apparently both Lexbor and uriparser support it (naturally).
What I became puzzled about is the correct object structure and naming. Now
that uriparser, which can deal with URIs, came into the picture, while
Lexbor can parse URLs, I don't know if it's a good idea to have a dedicated
URI class and a URL class extending the former... If it is, then in my
opinion the logical behavior would be that Lexbor always instantiates URL
classes, while uriparser would have to decide whether the passed-in URI is
actually a URL, and choose the instantiated class based on this factor...
But in this case the differences between the RFC 3986 and WHATWG
specifications couldn't be spelled out, since URL objects could hold URLs
parsed based on both specs (and therefore having a unified interface is
required).
Or rather, should we have a separate URI and a WhatwgUrl class, so that the
former would always be created by uriparser, while the latter by Lexbor?
This way we could have a dedicated object interface for both standards
(e.g. the RFC 3986 related one could have a getUserInfo() method, while the
WHATWG related one could have both getUser() and getPassword() methods).
But then the question is how interchangeable these classes should be, i.e.
should we be able to convert them back and forth, or should there be an
interface that is implemented by the two classes?
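A minimal sketch of the latter option, assuming a shared interface (all
names illustrative, not a proposal):

    interface Uri
    {
        public function getScheme(): ?string;
        public function toString(): string;
    }

    // Each spec then gets its own implementation with spec-specific
    // accessors: e.g. an Rfc3986Uri adding getUserInfo(), and a
    // WhatwgUrl adding getUser() and getPassword().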
I'd appreciate any suggestions regarding these questions.
P.S. due to its bad reception, I got rid of the UrlParser class as well as
the UrlComponent enum from my implementation in the meantime.
Regards,
Máté
I apologize if I missed this up-thread somewhere, but what precisely are the differences between URI and URL? My understanding was that URL is a subset of URI (all URLs are URIs, but not all URIs are URLs). You're saying they're slightly disjoint sets? Can you give some concrete examples of where the parsing rules would produce different results? That may give us a better sense of what the logic should be.
--Larry Garfield
Hi Máté,
As far as I can tell, the results were very promising, so I'm ok to
include this into my proposal (I haven't pushed my changes yet and
haven't updated the RFC yet).

This is great news: if it is indeed possible to release both
specifications at the same time, that would be really great.
Regarding the reference resolution feature
(https://uriparser.github.io/doc/api/latest/#resolution) which has also
already been asked for, I'm genuinely wondering what the use-case is?

Resolution is common when using an HTTP client: you define a base URI and
then construct subsequent URIs from that base URI using resolution.
What I became puzzled about is the correct object structure and naming.
Now that uriparser, which can deal with URIs, came into the picture,
while Lexbor can parse URLs, I don't know if it's a good idea to have a
dedicated URI class and a URL class extending the former...

Both specifications parse URLs that can be represented by a URL value
object. The main differences between the two implementations are around
normalization and encoding. RFC 3986 only allows non-destructive
normalization, which is not true of the WHATWG spec. Here's a simple
example to illustrate the differences:

HttPs://0300.0250.0000.0001/path?query=foo%20bar

- with RFC 3986 you will end up with https://0300.0250.0000.0001/path?query=foo%20bar
- with WHATWG you will end up with https://192.168.0.1/path?query=foo+bar

In the case of WHATWG the host is changed and the query string follows a
distinctive encoding spec.
From my POV you have two choices: either you use one URL object for both
specifications with distinctive named constructors fromRFC3986 and
fromWhatwg, or you have one interface and two distinctive implementations.
I do not think that one can be extended to create the other, at least
that's my POV.
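Expressed with those hypothetical named constructors (and assuming the
objects are stringable; outputs taken from the example above):

    $input = 'HttPs://0300.0250.0000.0001/path?query=foo%20bar';
    echo Url::fromRFC3986($input);
    // https://0300.0250.0000.0001/path?query=foo%20bar
    echo Url::fromWhatwg($input);
    // https://192.168.0.1/path?query=foo+bar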
Hope this helps you in your implementation.
Best regards,
Ignace
Hi Niels,
First of all, thank you for your support!
Why did you choose UrlParser to be a "static" class? Right now it's just a
fancy namespace.

That's a good question, let me explain the reason: one of my major design
goals was to make the UrlParser class extendable and configurable (e.g. via
an "engine" property, similar to what Random\Randomizer has). Of course,
UrlParser doesn't support any of this yet, but at least the possibility is
there for follow-up RFCs due to the class being final. Since I knew it
would be overkill to require instantiating a UrlParser instance for a task
which is stateless (URL parsing), I finally settled on using static methods
for the purpose. Later, if the need arises, the static methods could be
converted to non-static ones with minimal BC impact.
It's a bit of a shame that the PSR interface treats queries as strings.
In Javascript we have the URLSearchParams class that we can use as a
key-value storage for query parameters.
Hm, yes, that's an observation I can agree with. However, this restriction
shouldn't prevent follow-ups from adding key-value storage support for
query parameters. Although, as far as I could determine, Lexbor isn't
currently capable of such a thing either.
Why is UrlComponent a backed enum?
To be honest, it has no specific reason apart from that's what I am used
to. I'm fine with whatever choice, even with getting rid of
UrlComponent completely. I added the UrlParser::parseUrlComponent() method
(and hence the UrlComponent enum) to the
proposal in order to have a direct replacement for parse_url()
when it's
called with the $component parameter set, but I wasn't
really sure whether this is needed at all... So I'm eager to hear any
recommendations regarding this problem.
A nit: We didn't bundle the entire Lexbor engine, only select parts of it.
Just thought I'd make it clear.
Yes, my wording was slightly misleading. I'll clarify this in the RFC.
About edge cases: e.g. what happens if I call the Url constructor and leave
every string field empty?
Nothing :) The Url class in its current form can store invalid URLs. I know
that URLs are generally modeled as value objects (that's also why the
proposed class is immutable), and generally speaking, value objects should
protect their invariants. However, due to separating the parser into its
own class, I abandoned this "rule". So this is one more downside of the
current API.
Regards,
Máté
I am all for proper data modeling of all the things, so I support this effort.

Comments:

- There's no need for UrlComponent to be backed.
- I don't understand why UrlParser is a static class. We just had a whole big debate about that. :-) There's a couple of ways I could see it working, and I'm not sure which I prefer:
  - Better if we envision the parser getting options or configuration in the future: $url = new UrlParser()->parseUrl(): Url;
  - The named-constructor pattern is quite common: $url = Url::parseFromString(); $url = Url::parseToArray();
- I... do not understand the point of having public properties AND getters/withers. A readonly class with withers, OK, a bit clunky to implement, but it would be your problem in C, not mine, so I don't care. :-) But why getters AND public properties? If going that far, why not finish up clone-with, and then we don't need the withers, either? :-)
- Making all the parameters to Url required except port makes little sense to me. User/pass is more likely to be omitted 99% of the time than port. In practice, most components are optional, in which case it would be inaccurate not to make them nullable. Empty string wouldn't be quite the same, as that is still a value, and code that knows to skip empty strings when doing something is basically the same as code that knows to skip nulls. We should assume people are going to instantiate this class themselves often, not just get it from the parser, so it should be designed to support that.
- I would not make Url final. "OMG but then people can extend it!" Exactly. I can absolutely see a case for an HttpUrl subclass that enforces scheme as http/https, or an FtpUrl that enforces a scheme of ftp, etc. Or even an InternalUrl that assumes the host is one particular company, or something. (If this sounds like scope creep, it's because I am confident that people will want to creep in this direction and we should plan ahead for it.)
- If the intent of the withers is to mimic PSR-7, I don't think it does so effectively. Without the interface, it couldn't be a drop-in replacement for UriInterface anyway. And we cannot extend it to add the interface if it's final. Widening the parameters in PSR-7 interfaces to support both wouldn't work, as that would be a hard BC break for any existing implementations. So I don't really see what the goal is here.
- If we ever get "data classes", this would be a good candidate. :-)
- Crazy idea: new UriParser(HttpUrl::class)->parse($string); to allow a more restrictive set of rules, or even just to cast the object to that child class.

--Larry Garfield
Hi Larry,
Thank you very much for your feedback! I think I have already partially
answered some of your questions in my previous email to Niels,
but let me answer your other questions below:
- I... do not understand the point of having public properties AND
getters/withers. A readonly class with withers, OK, a bit clunky to
implement but it would be your problem in C, not mine, so I don't care.
:-) But why getters AND public properties? If going that far, why not
finish up clone-with and then we don't need the withers, either? :-)
I know it's disappointing, but the public modifiers are just a typo,
forgotten there from the very first iteration of the API :) However, I'm
fine with having public readonly properties without getters as well, as
long as we declare this a policy that we are going to adopt... Withers are
indeed a must for now (and their implementation indeed requires some magic
in C...).
- Making all the parameters to Url required except port makes little sense
to me. User/pass is more likely to be omitted 99% of the time than port.
In practice, most components are optional, in which case it would be
inaccurate to not make them nullable. Empty string wouldn't be quite the
same, as that is still a value and code that knows to skip empty string
when doing something is basically the same as code that knows to skip
nulls. We should assume people are going to instantiate this class
themselves often, not just get it from the parser, so it should be designed
to support that.
I may have misunderstood what you wrote, but all the parameters, including
port, are required. If you really meant "nullable" instead of "required",
then you are right. Apart from this, I'm completely fine with making these
parameters optional, especially if we decide not to have the UrlParser (my
initial assumption was that the Url class is going to be instantiated via
UrlParser::parseUrl() calls).
- I would not make Url final. "OMG but then people can extend it!"
Exactly. I can absolutely see a case for an HttpUrl subclass that enforces
scheme as http/https, or an FtpUrl that enforces a scheme of ftp, etc. Or
even an InternalUrl that assumes the host is one particular company, or
something. (If this sounds like scope creep, it's because I am confident
that people will want to creep this direction and we should plan ahead for
it.)
Without having thought much about its consequences on the implementation,
I'm fine with removing the final modifier.
- If the intent of the withers is to mimic PSR-7, I don't think it does so
effectively. Without the interface, it couldn't be a drop-in replacement
for UriInterface anyway. And we cannot extend it to add the interface if
it's final. Widening the parameters in PSR-7 interfaces to support both
wouldn't work, as that would be a hard-BC break for any existing
implementations. So I don't really see what the goal is here.
I've just answered this to Ben, but let me reiterate: PSR-7's UriInterface
is only needed because PHP doesn't have an internal Url class. :)
Máté
The RFC states:

<snip> The Url\Url class is intentionally compatible with the PSR-7 UriInterface. </snip>

It mirrors the interface, but it can’t be swapped out for a UriInterface instance, especially since it can’t be extended, so I wouldn’t consider it compatible. I would still need to write a compatibility layer that composes Url\Url and implements UriInterface.

<snip> This makes it possible for a next iteration of the PSR-7 standard to use Url\Url directly instead of requiring implementations to provide their own Psr\Http\Message\UriInterface implementation. </snip>

Since PSRs are concerned with shared interfaces and this class is final and does not implement any interfaces, I’m not sure how you envision “a next iteration” of PSR-7 using this directly, unless what you mean is that UriInterface would be deprecated and applications would type directly against Url\Url.
Cheers,
Ben
As a maintainer of a PHP userland URI toolkit I have a couple of
questions/remarks on the proposal. First, I look forward to finally having
a real URL parser AND validator in PHP core. Any effort in that direction
is always welcome news.
As far as I understand it, if this RFC were to pass as is, it will model
PHP URLs on the WHATWG specification. While this specification is getting a
lot of traction lately, I believe it will restrict URL usage in PHP instead
of making developers' lives easier. While PHP started as a "web" language,
it is first and foremost a server side general purpose language. The WHATWG
spec, on the other hand, is created by browser vendors and is geared toward
browsers (client side), and because of browser history it restricts by
design a lot of what PHP developers can currently do using parse_url. In my
view the Url class in PHP should allow dealing with any IANA registered
scheme, which is not the case for the WHATWG specification.
Therefore, I would rather suggest we ALSO include support for the RFC 3986
and RFC 3987 specifications properly and give both specs a go (at the same
time!), with a clear way to instantiate your Url with one or the other
spec. In clear terms, my ideal situation would be to add at least 2 named
constructors to the parser, UrlParser::fromRFC3986 and
UrlParser::fromWHATWG, or something similar (names can be changed or
improved).

While this is an old article by Daniel Stenberg
(https://daniel.haxx.se/blog/2017/01/30/one-url-standard-please/), it
conveys with more in-depth analysis my issues with the WHATWG spec and its
usage in PHP if it were to be used as the ONLY available parser in PHP core
for URLs.
The PSR-7 relation is also unfortunate from my POV: PSR-7's UriInterface is
designed to be, at its core, an HTTP URI representation (so it shares the
same type of issue as the WHATWG spec!), meaning in the absence of a scheme
it falls back to HTTP scheme validation. This is why the interface can
forgo any nullable components: the HTTP spec allows it, other schemes do
not. For instance, the FTP scheme prohibits the presence of the query and
fragment components, which means they MUST be null in that case.
By removing PSR-7 constraints we could add:

- a Url::(get|to)Components method: it would mimic the parse_url return value and as such ease migration from parse_url
- Url::getUsername and Url::getPassword to access the username and password components individually. You would still use the withUserInfo method to update them, but you give the developer the ability to access both components directly from the Url object.

These additions would remove the need for:

- UrlParser::parseUrlToArray
- UrlParser::parseUrlComponent
- the UrlComponent enum
Cheers,
Ignace
Therefore, I would rather suggest we ALSO include support for the RFC 3986
and RFC 3987 specifications properly and give both specs a go (at the same
time!), with a clear way to instantiate your Url with one or the other
spec. In clear terms, my ideal situation would be to add at least 2 named
constructors to the parser, UrlParser::fromRFC3986 and
UrlParser::fromWHATWG, or something similar (names can be changed or
improved).
I agree that I would love to see a more general IRI parser, with maybe a URI parser being a subtype of an IRI parser.
Cheers,
Ben
Hey,

That's great that you've made the Url class readonly. Immutability is
reliable. And I fully agree that a better parser is needed.

I agree with the others that:
- the enum might be fine without the backing, if it's needed at all
- I'm not convinced a separate UrlParser is needed; Url::someFactory($str) should be enough
- getters seem unnecessary; they should only be added if you can be sure they are going to be used for compatibility with PSR-7
- treating $query as a single string is clumsy; having some kind of bag or at least an array to represent it would be cooler and easier to build and manipulate

I wanted to add that it might be more useful to make all the Url
constructor arguments optional, either nullable or with reasonable
defaults. So you could do things like:

    $url = new Url(path: 'robots.txt');
    foreach ($domains as $d) $r[] = file_get_contents($url->withHost($d));

Similar modifiers would be very useful for the query stuff, e.g.:

    $u = Url::current();
    return $u->withQueryParam('page', $u->queryParam->page + 1);

Sure, all of that can be done in userland as long as you drop final :)

BR,
Juris
[…] add a WHATWG compliant URL parser functionality to the standard library. The API itself is not final by any means, the RFC only represents how I imagined it first.
You can find the RFC at the following link: https://wiki.php.net/rfc/url_parsing_api
First-pass comments/thoughts.

As others have mentioned, it seems the class would/could not actually satisfy PSR-7. Realistically, the PSR-7 interface package or someone else would need to create a new class that combines the two, potentially as part of a transition away from it to the built-in class, with future PSRs building directly on Url. If we take that as a given, we might as well design for the end state and accept that there will be a (minimal) transition. This end state would benefit from being designed with the logical constraints of PSR-7 (so that migration is possible without major surprises), but without restricting us to its exact API shape, since an intermediary class would come into existence either way.

For example, Url could be a value class with merely 8 public properties. Possibly with a UrlImmutable subclass, akin to DateTime, where the properties are read-only (or instead a clone method could return Url?).

It might be more ergonomic to leave the parser as an implementation detail, allowing the API to be accessed from a single import rather than requiring two. This could look like Url::parse() or Url::parseFromString().

For the Url::parseComponent() method, did you consider accepting the existing PHP_URL_* constants? They appear to fit exactly, in naming, description, and associated return types.

Without UrlParser/UrlComponent, I'd adopt it directly in applications and frameworks. With them, further wrapping seems likely for improved usability. This is sometimes beneficial when exposing low-level APIs, but it seems like this is close to fitting in a single class, as demonstrated by the WHATWG URL API.

One thing I feel is missing is a method to parse a (partial) URL relative to another, e.g. to expand or translate paths between two URLs. Consider expanding "/w/index.php" or "index.php" relative to "https://wikipedia.org/w/", or expanding "//example.org" relative to either "https://wikipedia.org" vs "http://wikipedia.org". The WHATWG URL API does this in the form of a second optional string|Stringable parameter to Url::parse(). Implementing "expand URL" with parsing of incomplete URLs is error-prone and hard to get right. Including this would be valuable.

See also Net_URL2 and its resolve() method: https://pear.php.net/package/Net_URL2 https://github.com/pear/Net_URL2
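A sketch of what that two-argument form could look like in PHP, mirroring
new URL(input, base) from the WHATWG URL API (Url::parse() is a placeholder
name):

    $u1 = Url::parse('index.php', 'https://wikipedia.org/w/');
    // https://wikipedia.org/w/index.php
    $u2 = Url::parse('//example.org', 'https://wikipedia.org');
    // https://example.org/ (protocol-relative: scheme taken from the base)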
--
Timo Tijhof
https://timotijhof.net/
I was exploring wrapping ada_url for PHP
(https://github.com/lnear-dev/ada-url). It works, but it's a bit slower,
likely due to the implementation of the objects. I was planning to embed
the zvals directly in the object, similar to PhpToken, but I haven't had
the chance and don't really need it anymore. It shouldn't be too much work
to clean it up, though.
I’ve updated the implementation, and with Ada 2.9.0, the performance is now
closer to parse_url
for short URLs and even outperforms it for longer
URLs. You can see the benchmarks in the "Run benchmark script" section of
this GitHub Actions run.
cheers,
Lanre
Hi Máté
Something that I thought about lately is how the existing URL parser in PHP is used in various different places.
So for example, in the http fopen wrapper or in the filter extension we rely on the built-in URL parser.
I think it would be beneficial if a URL parser was "pluggable" and the url extension could be used instead of the current one for those usages (opt-in).
Kind regards
Niels
Hi Niels,
As mentioned before, I believe the "pluggable" system can only be applied once an RFC3986 URL object is available; using the WHATWG URL would constitute a major BC break. I would even go a step further and state that even with the RFC3986 URL object you would still face some issues, for instance with file-scheme URLs: those are not parsed the same way by the parse_url function as under RFC3986 rules.
That change may land in PHP 9, or the behaviour may be deprecated and removed in PHP 10, whenever that one happens.
Hi Ignace, Niels,
Sorry for being silent for so long; I was working hard on the implementation besides some summer activities :) I can say that I made really good progress in the last month, and now I think (hope) that I managed to address most of the concerns/suggestions people mentioned in this thread. To summarize the most important changes:
- The uriparser library is now used for parsing URIs based on RFC 3986.
- I renamed the extension from "url" to "uri" in order to make the name more generic and to express the new use case.
- There is no Url\UrlParser class anymore. The Uri\Uri class now includes the relevant factory methods.
- Uri\Uri is now an abstract class which is implemented by 2 concrete classes: Uri\Rfc3986Uri and Uri\WhatwgUri.
- WhatWG URL parsing now returns the exact error code according to the specification (although a reference parameter is used for now - but this is TBD).
- As suggested by Niels, it's now possible to plug a URI parsing implementation into PHP. A new uri.default_handler INI option is also added. Currently, integration is only implemented for FILTER_VALIDATE_URL though. The approach also makes it possible to register additional 3rd-party libraries for parsing URIs (like Ada URL).
- Performance seems to have improved significantly according to the rough benchmarks performed in CI.
Please re-read the RFC as it shares a bit more detail than my quick summary above: https://wiki.php.net/rfc/url_parsing_api
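To illustrate, basic usage now looks roughly like this (exact method names and signatures are still in flux and may differ from the final RFC):

    $uri = Uri\Rfc3986Uri::parse('https://example.com:8080/foo?bar=baz');
    echo $uri->getHost(); // example.com

    // WhatWg parsing reports the spec-defined error code, currently through
    // a by-ref parameter (still TBD, as noted above):
    $url = Uri\WhatwgUri::parse('https://exa mple.org/', null, $errors);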
There are some questions I still didn't manage to find an answer for, though. Most importantly, the URI parser libraries used don't support modification of the URI. That's why I had to get rid of the "wither" methods for now, which were originally part of the API. I think it's unfortunate, and I'll try my best to reclaim them.
Additionally, for technical reasons, extending the Uri\Uri class in userland is only possible if all the methods are overridden by the child. This is because I had to use "computed" properties in the implementation (roughly, they are stored in an internal C struct, unlike regular properties). That's why it may be better if userland code could use (and possibly implement) a Uri\Uri interface instead.
In one of my previous emails, I raised concerns about whether the RFC 3986 and WhatWg specs can really share the same interface (they do in my current implementation, despite being different classes). I still have this concern, because WhatWg specifies the "user" and "password" URL components, while RFC 3986 only specifies the notion of "userinfo" (which is usually just user:password, but not necessarily, as far as I understood). My implementation of the RFC 3986 parser currently splits the "userinfo" component at the ":" character, but doing so doesn't seem very spec-compliant.
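To illustrate with a concrete (hypothetical) example:

    // RFC 3986 only defines a single "userinfo" component; this one contains
    // two colons, so splitting at the first ":" is just a guess:
    $uri = Uri\Rfc3986Uri::parse('ftp://alice:se:cret@example.com/');
    // userinfo = "alice:se:cret"           (what RFC 3986 actually specifies)
    // user = "alice", password = "se:cret" (what the current split yields)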
Arnaud suggested that it would be better if the query parameters could be retrieved both escaped and unescaped after parsing. I haven't had time to investigate the possibilities, but my gut feeling is that this can only be achieved with some custom code. Arnaud also had questions regarding canonization. Currently, it's not performed when calling the __toString() method, because only the uriparser library supports this feature, and I didn't want the two implementations to diverge. I'm not even sure that always doing it is a good idea, so I'm thinking about making this feature selectively enabled (i.e. adding a separate "toCanonizedString" method).
Regards,
Máté
Máté, thanks for putting this together.
Whenever I need to work with URLs there are a few things missing that I would love to see incorporated into any change in PHP that brings us a spec-compliant parsing class.
First of all, I typically care most about WhatWG URLs because the PHP code I'm working with is making decisions about HTML that a browser will interpret. Paramount above all other concerns is that code on the server can understand content in the same way that browsers will; otherwise we invite security issues. People may have valid critiques of the WhatWG specification, but it's also the most relevant specification for users of much or most of the PHP code we write, and it's valuable because it allows us to talk about URLs in the same way a browser would.
I’m worried about the side-effects that having a global uri.default_handler could have with code running differently for no apparent reason, or differently based on what is calling it. If someone is writing code for a controlled system I could see this being valuable, but if someone is writing a framework like WordPress and has no control over the environments in which code runs, it seems dangerous to hope that every plugin and every host runs compatible system configurations. Nobody is going to check ini_get( ‘uri.default_handler’ )
before every line that parses URLs. Beyond this, even just allowing a pluggable parser invites broken deployments because PHP code that is reading from a browser or sending output to one needs to speak the language the browser is speaking, not some arbitrary language that’s similar to it.
One thing I feel is missing is a method to parse a (partial) URL relative to another
Being able to parse a relative URL and know if a URL is relative or absolute would help WordPress, which often makes decisions differently based on this property (for instance, when reading an href property of a link). I know these aren't spec-compliant URLs, but they still represent valid values for URL fields in HTML, and knowing if they are relative or not requires some amount of parsing specific details everywhere, vs. in a class that already parses URLs. Effectively, this would imply that PHP's new URL parser decodes document.querySelector( 'a' ).getAttribute( 'href' ), which should be the same as document.querySelector( 'a' ).href, and indicates whether it found a full URL or only a portion of one.
- $url->is_relative or $url->is_absolute
- $url->specificity = URL::Relative | URL::Absolute
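A minimal sketch of how either shape might read in practice (names are illustrative only):

    $url = URL::parse( '/wp-content/uploads/image.png' );
    if ( $url->is_relative ) {
        // resolve against the document's base URL before comparing hosts, etc.
    }

    // Or, with the second shape:
    if ( URL::Relative === $url->specificity ) { /* ... */ }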
the URI parser libraries used don't support modification of the URI
Having methods to add query arguments, change the path, etc. would be a great way to simplify user-space code working with URLs. For instance, read a URL and then add a query argument if some condition within the URL warrants it (for example, the path ends in .png).
Was it intended to add this to the RFC before it's finalized?
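For example, with hypothetical wither methods (names invented here for illustration), the .png case could read:

    $url = URL::parse( 'https://example.com/images/photo.png' );
    if ( str_ends_with( $url->path, '.png' ) ) {
        $url = $url->withQuery( 'format=webp' ); // returns a new instance
    }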
I would not make Url final. "OMG but then people can extend it!" Exactly.
My counter-point to this argument is that I see security exploits appear wherever functions that implement specifications are made pluggable and extendable. It's easy to see the need to create a class that limits possible URLs, but that also doesn't require extending a class: a class can wrap a URL parser just as it could extend one. Magic methods would make it even easier.
A problem that can arise with adding additional rules onto a specification like this is that the subclass gets used in more places than it should, and then somewhere some PHP code allows a malicious URL because it failed to parse and the inspection rules weren't applied.
Finally, I frequently find the need to be able to consider a URL in both the display context and the serialization context. With Ada we have normalize_url(), parse_search_params(), and the IDNA functions to convert between the two representations. In order to keep strong boundaries between security domains, it would be nice if PHP could expose the two variations: one is an encoded form of a URL that machines can easily parse, while the other is a "plain string" in PHP that's easier for humans to parse but which might not even be a valid URL. Part of the reason for this need is that I often see user-space code treating an entire URL as a single text span that requires one set of rules for full decoding; it's multiple segments that each have their own decoding rules.
- Original: https://xn--google.com/secret/../search?q=🍔
- $url->normalize(): https://xn--google.com/search?q=%F0%9F%8D%94
- $url->for_display(): https://䕮䕵䕶䕱.com/search?q=🍔
Having this in the RFC would give everyone the tools they need to effectively and safely set links within an HTML document.
All the best,
Dennis Snell
Hi Dennis,
Even though I didn't answer for a long time, I was improving my RFC implementation in the meantime, as well as evaluating your suggestions.
I’m worried about the side-effects that having a global
uri.default_handler could
have with code running differently for no apparent reason, or differently
based on what is calling it. If someone is writing code for a controlled
system I could see this being valuable, but if someone is writing a
framework like WordPress and has no control over the environments in which
code runs, it seems dangerous to hope that every plugin and every host runs
compatible system configurations. Nobody is going to check ini_get( 'uri.default_handler' )
before every line that parses URLs. Beyond this,
even just allowing a pluggable parser invites broken deployments
because PHP code that is reading from a browser or sending output to one
needs to speak the language the browser is speaking, not some arbitrary
language that’s similar to it.
You convinced me with your arguments about the issues a global uri.default_handler INI config can cause, especially after having read a blog post by Daniel Stenberg on the topic (https://daniel.haxx.se/blog/2022/01/10/dont-mix-url-parsers/). That's why I removed this from the RFC in favor of relying on configuring the parser at the individual feature level. However, I don't agree with removing the pluggable parser, for the following reasons:
- the current parse_url()-based parser is already doomed: it isn't compliant with any spec, so it already doesn't speak the language the browser is speaking
- even though the majority does, not everyone builds a browser application with PHP, especially because URIs are not necessarily accessible on the web
- in addition, there are tools which aren't compliant with the WhatWg spec but with some other one; most prominently, cURL is mostly RFC3986-compliant with some additional flavour of WhatWg, according to https://everything.curl.dev/cmdline/urls/browsers.html
That's why I intend to keep support for pluggability.
Being able to parse a relative URL and know if a URL is relative or absolute would help WordPress, which often makes decisions differently based on this property (for instance, when reading an href property of a link). I know these aren't spec-compliant URLs, but they still represent valid values for URL fields in HTML, and knowing if they are relative or not requires some amount of parsing specific details everywhere, vs. in a class that already parses URLs. Effectively, this would imply that PHP's new URL parser decodes document.querySelector( 'a' ).getAttribute( 'href' ), which should be the same as document.querySelector( 'a' ).href, and indicates whether it found a full URL or only a portion of one.
- $url->is_relative or $url->is_absolute
- $url->specificity = URL::Relative | URL::Absolute
The Uri\WhatWgUri::parse() method accepts a (relative) URI parameter when the 2nd (base URI) parameter is provided. So essentially you need to use this variant of the parse() method if you want to parse a WhatWg-compliant URL, and then WhatWgUri should let you know whether the originally passed-in URI was relative or not; did I get you right? This feature is certainly possible with RFC3986 URIs (even without the base parameter), but WhatWg requires the above-mentioned workaround for parsing, plus I have to look into how this can be implemented...
Having methods to add query arguments, change the path, etc. would be a great way to simplify user-space code working with URLs. For instance, read a URL and then add a query argument if some condition within the URL warrants it (for example, the path ends in .png).
I managed to retain support for the "wither" methods that were originally part of the proposal. This required custom code for the uriparser library, while the maintainer of Lexbor was kind enough to add native support for modification after I submitted a feature request. However, convenience methods for manipulating query parameters are still not part of the RFC, because they would increase the scope of the RFC even more, and due to other issues highlighted by Ignace in his prior email: https://externals.io/message/123997#124077. As I really want such a feature, I'd be eager to create a follow-up RFC dedicated to handling query strings.
My counter-point to this argument is that I see security exploits appear wherever functions that implement specifications are made pluggable and extendable. It's easy to see the need to create a class that limits possible URLs, but that also doesn't require extending a class: a class can wrap a URL parser just as it could extend one. Magic methods would make it even easier.
Right now, it's only possible to plug internal URI implementations into PHP; userland classes cannot be used, so this probably reduces the issue. However, I recently bumped into a technical issue with URIs not being final, which I am still assessing how to solve. More information is available in one of my comments on my PR:
https://github.com/php/php-src/pull/14461/commits/8e21e6760056fc24954ec36c06124aa2f331afa8#r1847316607
As I currently see the situation, it would probably be better to make these classes final so that similar unforeseen issues and inconsistencies cannot happen again (we can unfinalize them later anyway).
Finally, I frequently find the need to be able to consider a URL in both the display context and the serialization context. With Ada we have normalize_url(), parse_search_params(), and the IDNA functions to convert between the two representations. In order to keep strong boundaries between security domains, it would be nice if PHP could expose the two variations: one is an encoded form of a URL that machines can easily parse, while the other is a "plain string" in PHP that's easier for humans to parse but which might not even be a valid URL. Part of the reason for this need is that I often see user-space code treating an entire URL as a single text span that requires one set of rules for full decoding; it's multiple segments that each have their own decoding rules.
- Original: https://xn--google.com/secret/../search?q=🍔
- $url->normalize(): https://xn--google.com/search?q=%F0%9F%8D%94
- $url->for_display(): https://䕮䕵䕶䕱.com/search?q=🍔
Even though I didn't entirely implement this suggestion, I added normalization support:
- the normalize() method can be used to create a new URI instance whose components are normalized based on the current object
- the toNormalizedString() method can be used when only the normalized string representation is needed
- the newly added equalsTo() method also makes use of normalization to better identify equal URIs
For more information, please refer to the relevant section of the RFC: https://wiki.php.net/rfc/url_parsing_api#api_design. The forDisplay() method also seems useful at first glance, but since it may be a controversial optional feature, I'd defer it for later...
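Usage could look roughly like this (output shown as I would expect it, pending the final implementation):

    $a = Uri\WhatwgUri::parse('https://EXAMPLE.com/a/../b');
    $b = Uri\WhatwgUri::parse('https://example.com/b');

    echo $a->toNormalizedString(); // presumably "https://example.com/b"
    var_dump($a->equalsTo($b));    // presumably bool(true), via normalization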
Regards,
Máté
I'm not fluent enough in the different parsing styles to comment on the difference there.
I do have concerns about the class design, though. Given the improvements to the language, the accessor methods offer zero benefit at all. Public-read properties (readonly or otherwise) would be faster and offer no less of a guarantee. If you want to allow someone to extend the class and provide some custom logic, use asymmetric visibility instead of readonly, and extenders can use property hooks instead of the methods. The getters don't offer any value anymore.
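A minimal sketch of that approach with PHP 8.4 asymmetric visibility and property hooks (the class shape here is illustrative, not the RFC's):

    class Uri
    {
        // Public read, private write: no getter method needed.
        public private(set) string $scheme;

        // Custom logic via a property hook instead of an accessor method;
        // inside the hook, $this->host reads the backing value.
        public string $host {
            get => strtolower($this->host);
        }
    }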
It took me a while to realize that, I think, the fromWhatWg() method is using an in/out parameter for error handling. That is an insta-no on my part. In/out reference parameters make sense in C, maybe C++, and basically nowhere else; I view them as a code smell everywhere they're used in PHP. Better alternatives include exceptions or union returns.
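To contrast the two styles (all names hypothetical, sketching the alternatives rather than the RFC's actual API):

    // in/out reference parameter:
    $url = Uri\WhatwgUri::fromWhatWg($input, $errors);

    // exception-based alternative:
    try {
        $url = Uri\WhatwgUri::fromWhatWg($input);
    } catch (Uri\WhatwgParseException $e) {
        // $e->errors carries the spec-defined error codes
    }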
It looks like you've removed the with*() methods. Why? That means it cannot be used as a builder mechanism, which is plenty valuable. (Though query as a string vs. an array could be an issue.)
The WhatWgError looks to me like it's begging to be an Enum.
I am confused by the new ini value. It's for use in cases where you're NOT parsing the URL yourself, but relying on some other extension that does URL parsing internally as a side effect?
As usual, I am not a fan of an ini setting, but I cannot think of a different alternative off hand.
--Larry Garfield