Hi Everyone,
I've been working on a new RFC for a while now, and time has come to
present it to a wider audience.
Last year, I learnt that PHP doesn't have built-in support for parsing URLs
according to any well established standards (RFC 1738 or the WHATWG URL
living standard), since the parse_url()
function is optimized for
performance instead of correctness.
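For illustration (this example is mine, not part of the RFC text):
parse_url() performs no normalization at all, while a WHATWG compliant
parser would at least lowercase the scheme and the host:

    var_dump(parse_url('HTTPS://EXAMPLE.com/path'));
    // parse_url keeps everything byte-for-byte as given:
    // ['scheme' => 'HTTPS', 'host' => 'EXAMPLE.com', 'path' => '/path']
    // a WHATWG parser would normalize this URL to https://example.com/path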
In order to improve compatibility with external tools consuming URLs (like
browsers), my new RFC would add a WHATWG compliant URL parser functionality
to the standard library. The API itself is not final by any means, the RFC
only represents how I imagined it first.
You can find the RFC at the following link:
https://wiki.php.net/rfc/url_parsing_api
Regards,
Máté
Hey Máté,
So far, amazing! 👏
This is a great addition to have! I see there's nothing specifically about
__toString in the RFC; is this aiming to do the same as PSR-7?
Hi Máté
+1 from me, I'm all for modern web-related APIs as you know.
Some questions/remarks:

- Why did you choose UrlParser to be a "static" class? Right now it's just
a fancy namespace. I can see the point of having a UrlParser class where
you can e.g. configure it with which URL standard you want, but as it is
now there is no such capability.
- It's a bit of a shame that the PSR interface treats queries as strings.
In JavaScript we have the URLSearchParams class that we can use as a
key-value storage for query parameters. This JavaScript class also handles
escaping them nicely (see the sketch after this list).
- Why is UrlComponent a backed enum?
- A nit: we didn't bundle the entire Lexbor engine, only select parts of
it. Just thought I'd make it clear.
- About edge cases: e.g. what happens if I call the Url constructor and
leave every string field empty?
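Something in this spirit, purely hypothetical and mirroring the JavaScript
semantics (no such class is proposed in the RFC; the name is illustrative):

    // Hypothetical PHP counterpart of JavaScript's URLSearchParams:
    $params = new Url\UrlSearchParams('q=php rfc&lang=en');
    $params->get('q');        // 'php rfc'
    $params->set('page', '2');
    (string) $params;         // 'q=php+rfc&lang=en&page=2'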
Overall seems good.
Kind regards
Niels
- Why did you choose UrlParser to be a "static" class?
Because "static class" is the hip new cool ;)
Bilge
- It's a bit of a shame that the PSR interface treats queries as strings.
In Javascript we have the URLSearchParams class that we can use as a key-value storage for query parameters.
This Javascript class also handles escaping them nicely.
Agreed, this is a weird choice to me, but I'm also not surprised by weird
choices via php-fig (log level constants, I'm looking at you).

We hear all the time how userland is more flexible and can change quicker,
and yet here we see a potential built-in class having a worse API because
it wants to be compatible with an existing userland interface with the
same bad API...
Cheers
Stephen
I personally ignore PSR when it doesn't make sense to use it. They're nice for library compatibility, but I will happily toss compatibility when it doesn't make sense to be compatible. This might be one of those cases as there is no reason it has to be PSR compliant. In fact, a wrapper may be written to make it compliant, if one so chooses. I suspect it is better to be realistic and learn from the short-comings of PSR and apply those learnings here, vs. reiterating them and "engraving them in stone" (so to speak).
— Rob
While I do not think the debate should be about compatibility with PSR-7,
some historical context should be brought to light for a fair discussion:

- parse_url and parse_str predate RFC 3986.
- URLSearchParams was ratified before PSR-7, BUT the first implementation
landed a year AFTER PSR-7 was released and already implemented.
- PHP's historical query parsing logic in parse_str is so bad (mangled
parameter names, for instance; see the example below) that PSR-7 was right
not to embed that parsing algorithm in its specification.
- If you set aside the URI Template specification and now URLSearchParams,
there is no official, referenced and/or agreed-upon rules/document on how
a query string MUST or SHOULD be parsed.
- Last but not least, URLSearchParams encoding/decoding rules DO NOT
follow either RFC 1738 or RFC 3986 (they follow the form data rules, which
are kind of a mix between both RFCs).

This means that just adding a method or a class that mimics
URLSearchParams 100% will constitute a major departure in how PHP treats
query strings: you will no longer have a 1:1 relation between the data you
have inside your $_GET array and the one in URLSearchParams, for better or
for worse.
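To make the mangling point concrete, this is real, current PHP behaviour:

    parse_str('first.name=Ann&first name=Bea', $result);
    var_dump($result);
    // ['first_name' => 'Bea']: parse_str mangles both keys to 'first_name',
    // so the second value silently overwrites the first; a
    // URLSearchParams-style parser would keep the two keys distinct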
For all these arguments I would keep the proposed Url free of all these
concerns and lean toward a nullable string for the query string
representation, and defer this debate to its own RFC regarding query
string parsing handling in PHP.
Hi Ignace,
As far as I understand it, if this RFC were to pass as is, it will model
PHP URLs on the WHATWG specification. While this specification is getting
a lot of traction lately, I believe it will restrict URL usage in PHP
instead of making developers' lives easier. While PHP started as a "web"
language, it is first and foremost a server-side general-purpose language.
The WHATWG spec on the other hand is created by browser vendors and is
geared toward browsers (client side), and because of browser history it
restricts by design a lot of what PHP developers can currently do using
parse_url. In my view the Url class in PHP should allow dealing with any
IANA-registered scheme, which is not the case for the WHATWG
specification.
Supporting IANA registered schemes is a valid request, and is definitely
useful. However, I think this feature is not strictly required in the
current RFC. Anyone who needs to support features that are not offered by
the WHATWG standard can still rely on parse_url(). And of course, we can
(and should) add support for other standards later. If we wanted to do all
of this in the same RFC, then the scope of the RFC would become way too
large IMO. That's why I opt for incremental improvements.
Besides, I fail to see why a WHATWG compliant parser wouldn't be useful in
PHP: yes, PHP is server side, but it still interacts with browsers very
heavily. Among other use-cases I cannot yet imagine, the major one is most
likely validating user-supplied URLs for opening in the browser. As far as
I see the situation, there is currently no acceptably reliable way to
decide whether a URL can be opened in browsers or not.
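For instance, something along these lines would become possible (the
parser class comes from the RFC, but the API is not final and the
exception name here is purely illustrative):

    // Validate a user-supplied URL before emitting it as a browser-facing link:
    try {
        $url = Url\UrlParser::parseUrl($userInput);
        // WHATWG-valid: browsers should be able to open it
    } catch (Url\UrlParserException $e) {
        // reject the user-supplied URL
    }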
- parse_url and parse_str predate RFC 3986.
- URLSearchParams was ratified before PSR-7, BUT the first implementation
landed a year AFTER PSR-7 was released and already implemented.
Thank you for the historical context!

Based on your and others' feedback, it has now become clear to me that
parse_url() is still useful, and ext/url needs quite some additional
capabilities until this function really becomes superfluous. That's why it
now seems to me that the behavior of parse_url() could be leveraged in
ext/url so that it would work with a Url\Url class (e.g. we had a
PhpUrlParser class extending Url\UrlParser, or a Url\Url::fromPhpParser()
method, depending on which object model we choose. Of course the names are
TBD).
For all these arguments I would keep the proposed Url free of all these
concerns and lean toward a nullable string for the query string
representation, and defer this debate to its own RFC regarding query
string parsing handling in PHP.
My WIP implementation still uses nullable properties and return types; I
only changed those when I wrote the RFC. Since I see that PSR-7
compatibility is very low priority for everyone involved in the
discussion, I think making these types nullable is fine. It wasn't my top
priority either, but I had to start the object design somewhere, so I went
with this.
Again, thank you for your constructive criticism.
Regards,
Máté
Supporting IANA registered schemes is a valid request, and is definitely
useful. However, I think this feature is not strictly required in the
current RFC. Anyone who needs to support features that are not offered by
the WHATWG standard can still rely on parse_url().
It's also worth pointing out (as another reason not to do this) that IANA
registrations may or may not be valid on the current network. For example,
TOR, Handshake, IPFS, Freenet, etc. all have their own DNS schemes and do
not (usually) use IANA-registered schemes, and many people create sites
that cater to those networks.
As far as I see the situation, there is currently no acceptably reliable
way to decide whether a URL can be opened in browsers or not.
Looking at the WHATWG spec, it looks like example%2Ecom will be parsed as
a valid URL and transformed to example.com, while this doesn't currently
happen in parse_url(). I don't know if that may be an issue, but it might
be if you are expecting the string to remain URL-encoded.
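To illustrate (the parse_url() output is current PHP behaviour; the WHATWG
result follows the spec's host parser, not any existing PHP API):

    var_dump(parse_url('https://example%2Ecom/'));
    // parse_url keeps the host percent-encoded:
    // ['scheme' => 'https', 'host' => 'example%2Ecom', 'path' => '/']
    // a WHATWG parser percent-decodes the host and normalizes the URL to:
    // https://example.com/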
My WIP implementation still uses nullable properties and return types; I
only changed those when I wrote the RFC. Since I see that PSR-7
compatibility is very low priority for everyone involved in the
discussion, I think making these types nullable is fine.
The spec contains the elements and their types. It would be good to adhere
to the spec (it simplifies documentation):

- scheme may be null or empty string
- port may be null
- path is never null, but may be empty string
- query may be null
- fragment may be null
- user/password may be null (to differentiate between an empty password
and no password)
- host may be null (for relative URLs)
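As a rough sketch, a value object following these component types could
look like this (illustrative only; the RFC's actual API may differ):

    final readonly class Url
    {
        public function __construct(
            public ?string $scheme,   // may be null or ''
            public ?string $user,
            public ?string $password, // null vs '' distinguishes "no password"
            public ?string $host,     // null for relative URLs
            public ?int    $port,
            public string  $path,     // never null, may be ''
            public ?string $query,
            public ?string $fragment,
        ) {}
    }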
— Rob
Here's a list of examples worth adding to the RFC:
//example.com?
ftp://user@example.com/path/to/ffile
https://user:@example.com
https://user:pass@example%2Ecom/?something=other&bool#heading
etc.
— Rob
Hi Máté,
Supporting IANA registered schemes is a valid request, and is definitely
useful. However, I think this feature is not strictly required in the
current RFC.
True. Having a WHATWG compliant parser in PHP source code is a big +1 from
me; I have nothing against that inclusion.
Based on your and others' feedback, it has now become clear to me that
parse_url() is still useful, and ext/url needs quite some additional
capabilities until this function really becomes superfluous.
parse_url can only be deprecated when an RFC 3986 compliant parser is
added to php-src, hence why I insist on having that parser present too.
I will also add that everything up to now in PHP uses RFC 3986 as the
basis for generating or representing URLs (cURL extension, streams,
etc.). Having the first and only OOP representation of a URL in the
language not follow that same specification seems odd to me. It opens the
door to inconsistencies that will only be resolved once an equivalent RFC
3986 URL object makes its way into the source code.
On the public API side I would recommend the following:

- If you are to strictly follow the WHATWG specification, no URI component
can be null: they must all be strings. If we plan to use the same object
for an RFC 3986 compliant parser, then all components should be nullable
except for the path component, which can never be null as it is always
present.
- As others have mentioned, we should add a method to resolve a URI
against a base URI, something like Url::resolve(string $url,
Url|string|null $baseUrl), where the baseUrl argument should be an
absolute URL if present. If absent, the url argument must be absolute,
otherwise an exception should be thrown (see the sketch after this list).
- Last but not least, the WHATWG specification is not only a URL parser
but also a URL validator, and can apply some "corrections" to malformed
URLs and report them. The specification has a provision for a structure to
report malformed URL errors. I failed to see this mechanism mentioned
anywhere in the RFC. Will the URL only trigger exceptions, or will it also
trigger warnings? For inspiration, the excellent PHP userland WHATWG URL
parser from Trevor Rowbotham (https://github.com/TRowbotham/URL-Parser)
allows using a PSR-3 logger to record those errors.
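Regarding the resolve() suggestion above: the expected results would
follow RFC 3986 section 5.4 reference resolution (the method name and
signature are only my suggestion, not part of the RFC):

    $base = 'http://a/b/c/d;p?q';
    Url::resolve('g', $base);    // http://a/b/c/g
    Url::resolve('../g', $base); // http://a/b/g
    Url::resolve('//g', $base);  // http://g
    Url::resolve('g');           // throws: no base and 'g' is not absolute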
Best regards,
Ignace
- If you are to strictly follow the WHATWG specification, no URI component
can be null: they must all be strings.
This isn't true. It's just that in the language the spec is written in,
any element can be null (there is no notion of nullable vs non-nullable
types). The spec states what may be null here: URL Standard (whatwg.org)
https://url.spec.whatwg.org/#url-representation
— Rob
Hi Máté,
Fantastic RFC :)
On Sun, Jul 7, 2024 at 11:17, Máté Kocsis kocsismate90@gmail.com wrote:
Anyone who needs to support features that are not offered by the WHATWG
standard can still rely on parse_url().
If I may, parse_url is showing its age, and issues like
https://github.com/php/php-src/issues/12703 make it unreliable. We need an
escape plan from it.

FYI, we're discussing whether a Uri component should make it into Symfony,
precisely to work around those parse_url issues.

Your RFC would be the perfect answer to this discussion, but IRI would
need to be part of it. I agree with everything Ignace said. Supporting RFC
3986 from day one would be absolutely great!

Note that we use parse_url for http URLs, but also to parse DSNs like
redis://localhost and the like.
Hey Ignace, Nicolas,
Based on your request for adding support for RFC 3986 compatible parsing,
I evaluated another library (https://github.com/uriparser/uriparser/) in
recent days in order to add support for the requested functionality. As
far as I can tell, the results were very promising, so I'm OK to include
this in my proposal (I haven't pushed my changes and haven't updated the
RFC yet).
Regarding the reference resolution feature
(https://uriparser.github.io/doc/api/latest/#resolution) which has also
already been asked for, I'm genuinely wondering what the use-case is? But
in any case, I'm fine with incorporating this as well into the RFC, since
apparently both Lexbor and uriparser support this (naturally).
What I became puzzled about is the correct object structure and naming.
Now that uriparser, which can deal with URIs, came into the picture, while
Lexbor can parse URLs, I don't know if it's a good idea to have a
dedicated URI class and a URL class extending the former... If it is, then
in my opinion the logical behavior would be that Lexbor always
instantiates URL classes, while uriparser would have to decide whether the
passed-in URI is actually a URL, and choose the instantiated class based
on this factor... But in this case the differences between the RFC 3986
and WHATWG specifications couldn't be spelled out, since URL objects could
hold URLs parsed based on both specs (and therefore having a unified
interface would be required).
Or rather, should we have separate URI and WhatwgUrl classes, so that the
former would always be created by uriparser, while the latter by Lexbor?
This way we could have a dedicated object interface for both standards
(e.g. the RFC 3986 related one could have a getUserInfo() method, while
the WHATWG related one could have both getUser() and getPassword()
methods). But then the question is how interchangeable these classes
should be. I.e. should we be able to convert them back and forth, or
should there be an interface that is implemented by the two classes?
I'd appreciate any suggestions regarding these questions.
P.S. Due to its bad reception, I got rid of the UrlParser class as well as
the UrlComponent enum from my implementation in the meantime.
Regards,
Máté
What I became puzzled about is the correct object structure and naming.
Now that uriparser, which can deal with URIs, came into the picture, while
Lexbor can parse URLs, I don't know if it's a good idea to have a
dedicated URI class and a URL class extending the former...
I apologize if I missed this up-thread somewhere, but what precisely are the differences between URI and URL? My understanding was that URL is a subset of URI (all URLs are URIs, but not all URIs are URLs). You're saying they're slightly disjoint sets? Can you give some concrete examples of where the parsing rules would produce different results? That may give us a better sense of what the logic should be.
--Larry Garfield
Hi Máté,
As far as I can tell, the results were very promising, so I'm ok to
include this into my proposal (I haven't pushed my changes yet and
haven't updated the RFC yet).
This is great news: if indeed it is possible to release both
specifications at the same time, that would be really great.
Regarding the reference resolution
(https://uriparser.github.io/doc/api/latest/#resolution)
feature which has also already been asked for, I'm genuinely wondering
what the use-case is?
Resolution is common when using an HTTP client: you define a base URI and
then you can construct subsequent URIs from that base URI using
resolution.
What I became puzzled about is the correct object structure and
naming. Now that uriparser
which can deal with URIs came into the picture, while Lexbor can parse
URLs, I don't
know if it's a good idea to have a dedicated URI and a URL class
extending the former one...
Both specifications parse URLs that can be represented by a URL value
object. The main difference between the two implementations is around
normalization and encoding. RFC 3986 only allows non-destructive
normalization, which is not true in the case of the WHATWG spec.
Here's a simple example to illustrate the differences, parsing
HttPs://0300.0250.0000.0001/path?query=foo%20bar:

- with RFC 3986 you will end up with
https://0300.0250.0000.0001/path?query=foo%20bar
- with WHATWG you will end up with
https://192.168.0.1/path?query=foo+bar

In the case of WHATWG the host is changed and the query string follows a
distinct encoding spec.
From my POV you have two choices: either you use one URL object for both
specifications with distinct named constructors fromRFC3986 and
fromWhatwg, or you have one interface and two distinct implementations.
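In code, the first option could look like this (constructor names as
suggested above; everything else is illustrative, reusing the example URL
from earlier):

    $raw = 'HttPs://0300.0250.0000.0001/path?query=foo%20bar';
    // RFC 3986: non-destructive normalization only
    Url::fromRFC3986($raw); // https://0300.0250.0000.0001/path?query=foo%20bar
    // WHATWG: the host is rewritten and the query is re-encoded
    Url::fromWhatwg($raw);  // https://192.168.0.1/path?query=foo+bar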
I do not think that one can be extended to create the other, at least
that's my POV.
Hope this helps you in your implementation.
Best regards,
Ignace
Hi Niels,
First of all, thank you for your support!
Why did you choose UrlParser to be a "static" class? Right now it's just a
fancy namespace.
That's a good question, let me explain the reason: one of my major design
goals was to make the UrlParser class extendable and configurable (e.g.
via an "engine" property similar to what Random\Randomizer has). Of
course, UrlParser doesn't support any of this yet, but at least the
possibility is there for follow-up RFCs due to the class being final.
Since I knew it would be overkill to require instantiating a UrlParser
instance for a task which is stateless (URL parsing), I finally settled on
using static methods for the purpose. Later, if the need arises, the
static methods could be converted to non-static ones with minimal BC
impact.
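To make the idea concrete, a follow-up could end up looking roughly like
this (all names hypothetical; nothing like this is in the current RFC):

    // Mirroring Random\Randomizer's engine design; the engine class name
    // is made up purely for illustration:
    $parser = new Url\UrlParser(new Url\Engine\Whatwg());
    $url = $parser->parseUrl('https://example.com/');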
It's a bit of a shame that the PSR interface treats queries as strings.
In Javascript we have the URLSearchParams class that we can use as a
key-value storage for query parameters.
Hm, yes, that's an observation I can agree with. However, this restriction
shouldn't prevent follow-ups from adding key-value storage support for
query parameters. Although, as far as I could determine, Lexbor isn't
currently capable of such a thing either.
Why is UrlComponent a backed enum?
To be honest, it has no specific reason apart from that's what I am used
to. I'm fine with whatever choice, even with getting rid of UrlComponent
completely. I added the UrlParser::parseUrlComponent() method (and hence
the UrlComponent enum) to the proposal in order to have a direct
replacement for parse_url() when it's called with the $component parameter
set, but I wasn't really sure whether this is needed at all... So I'm
eager to hear any recommendations regarding this problem.
A nit: We didn't bundle the entire Lexbor engine, only select parts of it.
Just thought I'd make it clear.
Yes, my wording was slightly misleading. I'll clarify this in the RFC.
About edge cases: e.g. what happens if I call the Url constructor and leave
every string field empty?
Nothing :) The Url class in its current form can store invalid URLs. I
know that URLs are generally modeled as value objects (that's also why the
proposed class is immutable), and generally speaking, value objects should
protect their invariants. However, due to separating the parser into its
own class, I abandoned this "rule". So this is one more downside of the
current API.
Regards,
Máté
I am all for proper data modeling of all the things, so I support this effort.
Comments:

- There's no need for UrlComponent to be backed.
- I don't understand why UrlParser is a static class. We just had a whole
big debate about that. :-) There's a couple of ways I could see it
working, and I'm not sure which I prefer:
  - Better if we envision the parser getting options or configuration in
the future: $url = new UrlParser()->parseUrl(): Url;
  - The named-constructor pattern is quite common: $url =
Url::parseFromString() or $url = Url::parseToArray();
- I... do not understand the point of having public properties AND
getters/withers. A readonly class with withers, OK, a bit clunky to
implement but it would be your problem in C, not mine, so I don't care.
:-) But why getters AND public properties? If going that far, why not
finish up clone-with and then we don't need the withers, either? :-)
- Making all the parameters to Url required except port makes little sense
to me. User/pass is more likely to be omitted 99% of the time than port.
In practice, most components are optional, in which case it would be
inaccurate to not make them nullable. Empty string wouldn't be quite the
same, as that is still a value, and code that knows to skip empty string
when doing something is basically the same as code that knows to skip
nulls. We should assume people are going to instantiate this class
themselves often, not just get it from the parser, so it should be
designed to support that.
- I would not make Url final. "OMG but then people can extend it!"
Exactly. I can absolutely see a case for an HttpUrl subclass that enforces
scheme as http/https, or an FtpUrl that enforces a scheme of ftp, etc. Or
even an InternalUrl that assumes the host is one particular company, or
something. (If this sounds like scope creep, it's because I am confident
that people will want to creep this direction and we should plan ahead for
it.)
- If the intent of the withers is to mimic PSR-7, I don't think it does so
effectively. Without the interface, it couldn't be a drop-in replacement
for UriInterface anyway. And we cannot extend it to add the interface if
it's final. Widening the parameters in PSR-7 interfaces to support both
wouldn't work, as that would be a hard BC break for any existing
implementations. So I don't really see what the goal is here.
- If we ever get "data classes", this would be a good candidate. :-)
- Crazy idea: new UriParser(HttpUrl::class)->parse(string); to allow a
more restrictive set of rules. Or even just to cast the object to that
child class.
--Larry Garfield
Hi Larry,
Thank you very much for your feedback! I think I have already partially
answered some of your questions in my previous email to Niels,
but let me answer your other questions below:
- I... do not understand the point of having public properties AND
getters/withers. A readonly class with withers, OK, a bit clunky to
implement but it would be your problem in C, not mine, so I don't care.
:-) But why getters AND public properties? If going that far, why not
finish up clone-with and then we don't need the withers, either? :-)
I know it's disappointing, but the public modifiers are just a typo,
forgotten there from the very first iteration of the API :) However, I'm
fine with having public readonly properties without getters as well, as
long as we declare this a policy that we are going to adopt... Withers are
indeed a must for now (and their implementation indeed requires some magic
in C...).
- Making all the parameters to Url required except port makes little sense
to me. User/pass is more likely to be omitted 99% of the time than port.
In practice, most components are optional, in which case it would be
inaccurate to not make them nullable. Empty string wouldn't be quite the
same, as that is still a value and code that knows to skip empty string
when doing something is basically the same as code that knows to skip
nulls. We should assume people are going to instantiate this class
themselves often, not just get it from the parser, so it should be designed
to support that.
I may have misunderstood what you wrote, but all the parameters - including
port - are required. If you really meant "nullable" instead of "required",
then you are right. Apart from this, I'm completely fine with making these
parameters optional, especially if we decide not to have the UrlParser (my
initial assumption was that the Url class is going to be instantiated via
UrlParser::parseUrl() calls).
- I would not make Url final. "OMG but then people can extend it!"
Exactly. I can absolutely see a case for an HttpUrl subclass that enforces
scheme as http/https, or an FtpUrl that enforces a scheme of ftp, etc. Or
even an InternalUrl that assumes the host is one particular company, or
something. (If this sounds like scope creep, it's because I am confident
that people will want to creep this direction and we should plan ahead for
it.)
Without having thought much about its consequences on the implementation,
I'm fine with removing the final modifier.
- If the intent of the withers is to mimic PSR-7, I don't think it does so
effectively. Without the interface, it couldn't be a drop-in replacement
for UriInterface anyway. And we cannot extend it to add the interface if
it's final. Widening the parameters in PSR-7 interfaces to support both
wouldn't work, as that would be a hard-BC break for any existing
implementations. So I don't really see what the goal is here.
I've just answered this to Ben, but let me reiterate: PSR-7's UriInterface
is only needed because PHP doesn't have a Url internal class. :)
Máté
The RFC states:

<snip> The Url\Url class is intentionally compatible with the PSR-7
UriInterface. </snip>

It mirrors the interface, but it can't be swapped out for a UriInterface
instance, especially since it can't be extended, so I wouldn't consider it
compatible. I would still need to write a compatibility layer that
composes Url\Url and implements UriInterface.

<snip> This makes it possible for a next iteration of the PSR-7 standard
to use Url\Url directly instead of requiring implementations to provide
their own Psr\Http\Message\UriInterface implementation. </snip>

Since PSRs are concerned with shared interfaces and this class is final
and does not implement any interfaces, I'm not sure how you envision "a
next iteration" of PSR-7 using this directly, unless what you mean is that
UriInterface would be deprecated and applications would type directly
against Url\Url.
Cheers,
Ben
As a maintainer of a PHP userland URI toolkit, I have a couple of
questions/remarks on the proposal. First, I look forward to finally having
a real URL parser AND validator in PHP core. Any effort in that direction
is always welcome good news.
As far as I understand it, if this RFC were to pass as is, it will model
PHP URLs on the WHATWG specification. While this specification is getting
a lot of traction lately, I believe it will restrict URL usage in PHP
instead of making developers' lives easier. While PHP started as a "web"
language, it is first and foremost a server-side general-purpose language.
The WHATWG spec on the other hand is created by browser vendors and is
geared toward browsers (client side), and because of browser history it
restricts by design a lot of what PHP developers can currently do using
parse_url. In my view the Url class in PHP should allow dealing with any
IANA-registered scheme, which is not the case for the WHATWG
specification.
Therefore, I would rather suggest we ALSO include support for the RFC 3986
and RFC 3987 specifications properly and give both specs a go (at the same
time!), with a clear way to instantiate your Url with one or the other
spec. In concrete terms, my ideal situation would be to add to the parser
at least two named constructors, UrlParser::fromRFC3986 and
UrlParser::fromWHATWG, or something similar (names can be changed or
improved).
While this is an old article by Daniel Stenberg
(https://daniel.haxx.se/blog/2017/01/30/one-url-standard-please/), it
conveys with more in-depth analysis my issues with the WHATWG spec and its
usage in PHP if it were to be the ONLY available parser in PHP core for
URLs.
The PSR-7 relation is also unfortunate from my POV: PSR-7's UriInterface
is designed to be at its core an HTTP URI representation (so it shares the
same type of issue as the WHATWG spec!), meaning that in the absence of a
scheme it falls back to HTTP scheme validation. This is why the interface
can forgo any nullable components: the HTTP spec allows it, but other
schemes do not. For instance, the FTP scheme prohibits the presence of the
query and fragment components, which means they MUST be null in that case.
By removing PSR-7 constraints we could add:

- the Url::(get|to)Components method: it would mimic parse_url's returned
value and as such ease migration from parse_url
- the Url::getUsername and Url::getPassword methods to access the username
and password components individually. You would still use the withUserInfo
method to update them, but you give the developer the ability to access
both components directly from the Url object.

These additions would remove the need for:

- UrlParser::parseUrlToArray
- UrlParser::parseUrlComponent
- the UrlComponent enum
Cheers,
Ignace
Therefore, I would rather suggest we ALSO include support for the RFC 3986
and RFC 3987 specifications properly and give both specs a go (at the same
time!), with a clear way to instantiate your Url with one or the other
spec.
I agree that I would love to see a more general IRI parser, with maybe a URI parser being a subtype of an IRI parser.
Cheers,
Ben
Hey,
That's great that you've made the Url class readonly. Immutability is
reliable. And I fully agree that a better parser is needed.
I agree with the others that:

- the enum might be fine without the backing, if it's needed at all
- I'm not convinced a separate UrlParser is needed;
Url::someFactory($str) should be enough
- getters seem unnecessary; they should only be added if you can be sure
they are going to be used for compatibility with PSR-7
- treating $query as a single string is clumsy; having some kind of bag or
at least an array to represent it would be cooler and easier to build and
manipulate
I wanted to add that it might be more useful to make all the Url
constructor arguments optional, either nullable or with reasonable
defaults. So you could do:

    $url = new Url(path: 'robots.txt');
    foreach ($domains as $d) $r[] = file_get_contents($url->withHost($d));

and stuff like that. Similar modifiers would be very useful for the query
stuff, e.g.:

    $u = Url::current();
    return $u->withQueryParam('page', $u->queryParam->page + 1);
Sure, all of that can be done in userland as long as you drop final :)
BR,
Juris
[…] add a WHATWG compliant URL parser functionality to the standard library. The API itself is not final by any means, the RFC only represents how I imagined it first.
You can find the RFC at the following link: https://wiki.php.net/rfc/url_parsing_api
First-pass comments/thoughts.
As others have mentioned, it seems the class would/could not actually satisfy PSR-7. Realistically, the PSR-7 interface package or someone else would need to create a new class that combines the two, potentially as part of a transition away from it to the built-in class, with future PSRs building directly on Url. If we take that as given, we might as well design for the end state, and accept that there will be a (minimal) transition. This end state would benefit from being designed with the logical constraints of PSR-7 (so that migration is possible without major surprises), but without restricting us to its exact API shape, since an intermediary class would come into existence either way.
For example, Url could be a value class with merely 8 public properties,
possibly with a UrlImmutable subclass, akin to DateTime, where the
properties are read-only (or instead, a clone method could return a Url?).
It might be more ergonomic to leave the parser as implementation detail, allowing the API to be accessed from a single import rather than requiring two. This could look like Url::parse() or Url::parseFromString().
For the Url::parseComponent() method, did you consider accepting the existing PHP_URL_* constants? They appear to fit exactly, in naming, description, and associated return types.
Without UrlParser/UrlComponent, I'd adopt it directly in applications and
frameworks. With them, further wrapping seems likely for improved
usability. This is sometimes beneficial when exposing low-level APIs, but
it seems like this is close to fitting in a single class, as demonstrated
by the WHATWG URL API.
One thing I feel is missing, is a method to parse a (partial) URL relative to another. E.g. to expand or translate paths between two URLs. Consider expanding "/w/index.php", or "index.php" relative to "https://wikipedia.org/w/". Or expanding "//example.org" relative to either "https://wikipedia.org" vs "http://wikipedia.org". The WHATWG URL API does this in the form of a second optional string|Stringable parameter to Url::parse(). Implementing "expand URL" with parsing of incomplete URLs is error-prone and hard to get right. Including this would be valuable.
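For example, mirroring the WHATWG URL API's optional base argument, a
hypothetical PHP shape could be (class name and signature illustrative
only, not from the RFC):

    // Url::parse(string $url, Url|string|null $base = null): Url
    Url::parse('/w/index.php', 'https://wikipedia.org/w/');
    // https://wikipedia.org/w/index.php
    Url::parse('index.php', 'https://wikipedia.org/w/');
    // https://wikipedia.org/w/index.php
    Url::parse('//example.org', 'https://wikipedia.org');
    // https://example.org/
    Url::parse('//example.org', 'http://wikipedia.org');
    // http://example.org/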
See also Net_URL2 and its resolve() method https://pear.php.net/package/Net_URL2 https://github.com/pear/Net_URL2
--
Timo Tijhof
https://timotijhof.net/
I was exploring wrapping ada_url for PHP (
https://github.com/lnear-dev/ada-url). It works, but it's a bit slower,
likely due to the implementation of the objects. I was planning to embed
the zvals directly in the object, similar to PhpToken, but I haven't had
the chance and don't really need it anymore. Shouldn't be too much work to
clean it up though
I’ve updated the implementation, and with Ada 2.9.0, the performance is now closer to parse_url for short URLs and even outperforms it for longer URLs. You can see the benchmarks in the "Run benchmark script" section of this GitHub Actions run.
cheers,
Lanre
Hi Máté
Something that I thought about lately is how the existing URL parser in PHP is used in various different places.
So for example, in the http fopen wrapper or in the filter extension we rely on the built-in URL parser.
I think it would be beneficial if a URL parser was "pluggable" and the url extension could be used instead of the current one for those usages (opt-in).
Kind regards
Niels
Hi Niels,
As mentioned before, I believe the "pluggable" system can only be applied once an RFC3986 URL object is available; using the WHATWG URL would constitute a major BC break. I would even go a step further and state that even by using the RFC3986 URL object you would still face some issues, for instance with regard to file-scheme URLs: those are not parsed the same way by the parse_url function and under RFC3986 rules.
Maybe that change could land in PHP 9, or the behaviour could be deprecated and removed in PHP 10, whenever that happens.
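A small illustration of that divergence (the parse_url() output reflects current behaviour as I understand it):

var_dump(parse_url('file:///etc/hosts'));
// ["scheme" => "file", "path" => "/etc/hosts"]: no host at all, whereas
// under RFC 3986 the "//" introduces a present-but-empty host component.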
Hi Ignace, Niels,
Sorry for being silent for so long; I was working hard on the implementation besides some summer activities :) I can say that I made really good progress in the last month, and now I think (hope) that I managed to address most of the concerns/suggestions people mentioned in this thread. To summarize the most important changes:
- The uriparser library is now used for parsing URIs based on RFC 3986.
- I renamed the extension to "uri" in favor of "url" in order to make the name more generic and to express the new use-case.
- There is no Url\UrlParser class anymore. The Uri\Uri class now includes the relevant factory methods.
- Uri\Uri is now an abstract class which is implemented by 2 concrete classes: Uri\Rfc3986Uri and Uri\WhatwgUri.
- WhatWG URL parsing now returns the exact error code according to the specification (although a reference parameter is used for now - but this is TBD).
- As suggested by Niels, it's now possible to plug a URI parsing implementation into PHP. A new uri.default_handler INI option is also added. Currently, integration is only implemented for FILTER_VALIDATE_URL though. The approach also makes it possible to register additional 3rd-party libraries for parsing URIs (like ADA URL).
- It looks like performance significantly improved according to the rough benchmarks performed in CI.
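As a rough usage sketch of the reworked API described above (the parse() factory and getter names are my assumption at this point, based on later messages in this thread):

$url = Uri\WhatwgUri::parse('https://example.com/foo?bar=baz');
echo $url->getHost(); // example.com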
Please re-read the RFC as it shares a bit more details than my quick
summary above: https://wiki.php.net/rfc/url_parsing_api
There are some questions I still didn't manage to find an answer for
though. Most importantly, the URI parser libraries used don't support
modification
of the URI. That's why I had to get rid of the "wither" methods for now
which were originally part of the API. I think it's unfortunate, and I'll
try to do my
best to reclaim them.
Additionally, due to technical reasons, extending the Uri\Uri class in
userland is only possible if all the methods are overridden by the child.
It's because
I had to use "computed" properties in the implementation (roughly, they are
stored in an internal C struct unlike regular properties). That's why it
may be
better if userland code could use (and possibly implement) an Uri\Uri
interface instead.
In one of my previous emails, I had some concerns about whether the RFC 3986 and WhatWg specs can really share the same interface (they do in my current implementation, despite being different classes). I still share this concern because WhatWg specifies the "user" and "password" URL components, while RFC 3986 only specifies the notion of "userinfo" (which is usually just user:password, but that's not necessarily the case as far as I understood). The RFC's implementation of the RFC 3986 parser currently splits the 'userinfo' component at the ":" character, but doing so doesn't seem very spec compliant.
Arnaud suggested that it would be better if the query parameters could be
retrieved both escaped and unescaped after parsing. I haven't had time to
investigate
the possibilities, but my gut feeling is that it's only possible to achieve
with some custom code. Arnaud also had questions regarding canonization.
Currently,
it's not performed when calling the __toString() method, because only
uriparser library supports this feature, and I didn't want to diverge the
two implementations.
I'm not even sure that it's a good idea to always do it so I'm thinking
about the possibility to selectively enable this feature (i.e. adding a
separate "toCanonizedString"
method).
Regards,
Máté
Máté, thanks for putting this together.
Whenever I need to work with URLs there are a few things missing that I would love to see incorporated into any change in PHP that brings us a spec-compliant parsing class.
First of all, I typically care most about WhatWG URLs because the PHP code I’m working with is making decisions about HTML that a browser will interpret. Paramount above all other concerns is that code on the server can understand content in the same way that the browsers will; otherwise we will invite security issues. People may have valid critiques of the WhatWG specification, but it’s also the most relevant specification for users of much or most of the PHP code we write, and it’s valuable because it allows us to talk about URLs in the same way a browser would.
I’m worried about the side-effects that having a global uri.default_handler could have with code running differently for no apparent reason, or differently based on what is calling it. If someone is writing code for a controlled system I could see this being valuable, but if someone is writing a framework like WordPress and has no control over the environments in which code runs, it seems dangerous to hope that every plugin and every host runs compatible system configurations. Nobody is going to check ini_get( ‘uri.default_handler’ )
before every line that parses URLs. Beyond this, even just allowing a pluggable parser invites broken deployments because PHP code that is reading from a browser or sending output to one needs to speak the language the browser is speaking, not some arbitrary language that’s similar to it.
One thing I feel is missing, is a method to parse a (partial) URL relative to another
Being able to parse a relative URL and know if a URL is relative or absolute would help WordPress, which often makes decisions differently based on this property (for instance, when reading an href
property of a link). I know these aren’t spec-compliant URLs, but they still represent valid values for URL fields in HTML and knowing if they are relative or not requires some amount of parsing specific details everywhere, vs. in a class that already parses URLs. Effectively, this would imply that PHP’s new URL parser decodes document.querySelector( ‘a’ ).getAttribute( ‘href’ )
, which should be the same as document.querySelector( ‘a’ ).href
, and indicates whether it found a full URL or only a portion of one.
- $url->is_relative or $url->is_absolute
- $url->specificity = URL::Relative | URL::Absolute
the URI parser libraries used don't support modification of the URI
Having methods to add query arguments, change the path, etc… would be a great way to simplify user-space code working with URLs. For instance, read a URL and then add a query argument if some condition within the URL warrants it (for example, the path ends in .png).
Was it intended to add this to the RFC before it’s finalized?
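As a sketch of the kind of user-space flow this would enable (the class and method names are hypothetical, not the RFC's confirmed API):

$url = Uri\WhatwgUri::parse('https://example.com/logo.png');
if (str_ends_with($url->getPath(), '.png')) {
    $url = $url->withQuery('format=webp');
}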
I would not make Url final. "OMG but then people can extend it!" Exactly.
My counter-point to this argument is that I see security exploits appear everywhere that functions which implement specifications are pluggable and extendable. It’s easy to see the need to create a class that limits possible URLs, but that also doesn’t require extending a class. A class can wrap a URL parser just as it could extend one. Magic methods would make it even easier.
A problem that can arise with adding additional rules onto a specification like this is that the subclass gets used in more places than it should and then somewhere some PHP code allows a malicious URL because it failed to parse and then the inspection rules weren’t applied.
Finally, I frequently find the need to be able to consider a URL in both the display context and the serialization context. With Ada we have normalize_url(), parse_search_params(), and the IDNA functions to convert between the two representations. In order to keep strong boundaries between security domains, it would be nice if PHP could expose the two variations: one is an encoded form of a URL that machines can easily parse while the other is a “plain string” in PHP that’s easier for humans to parse but which might not even be a valid URL. Part of the reason for this need is that I often see user-space code treating an entire URL as a single text span that requires one set of rules for full decoding; it’s multiple segments that each have their own decoding rules.
- Original [ https://xn--google.com/secret/../search?q=🍔 ]
- $url->normalize() [ https://xn--google.com/search?q=%F0%9F%8D%94 ]
- $url->for_display() Displayed [ https://䕮䕵䕶䕱.com/search?q=🍔 ]
Having this in the RFC would give everyone the tools they need to effectively and safely set links within an HTML document.
All the best,
Dennis Snell
Hi Dennis,
Even though I didn't answer for a long time, I was improving my RFC
implementation in the meanwhile as well as evaluating your suggestions.
I’m worried about the side-effects that having a global
uri.default_handler could
have with code running differently for no apparent reason, or differently
based on what is calling it. If someone is writing code for a controlled
system I could see this being valuable, but if someone is writing a
framework like WordPress and has no control over the environments in which
code runs, it seems dangerous to hope that every plugin and every host runs
compatible system configurations. Nobody is going to checkini_get( ‘uri.default_handler’ )
before every line that parses URLs. Beyond this,
even just allowing a pluggable parser invites broken deployments
because PHP code that is reading from a browser or sending output to one
needs to speak the language the browser is speaking, not some arbitrary
language that’s similar to it.
You convinced me with your arguments regarding the issues a global
uri.default_handler
INI config can cause, especially after having read a blog post by Daniel
Stenberg about the topic (
https://daniel.haxx.se/blog/2022/01/10/dont-mix-url-parsers/). That's why I
removed this from the RFC in favor of relying on configuring the parser at
the individual feature level. However, I don't agree with removing a
pluggable parser because of the following reasons:
- the current method (the parse_url() based parser) is already doomed: it isn't compliant with any spec, so it already doesn't speak the language the browser is speaking
- even though the majority does, not everyone builds a browser application with PHP, especially because URIs are not necessarily accessible on the web
- in addition, there are tools which aren't compliant with the WhatWg spec, but with some other. Most prominently, cURL is mostly RFC3986 compliant with some additional flavour of WhatWg according to https://everything.curl.dev/cmdline/urls/browsers.html
That's why I intend to keep support for pluggability.
Being able to parse a relative URL and know if a URL is relative or
absolute would help WordPress, which often makes decisions differently
based on this property (for instance, when reading anhref
property of a
link). I know these aren’t spec-compliant URLs, but they still represent
valid values for URL fields in HTML and knowing if they are relative or not
requires some amount of parsing specific details everywhere, vs. in a class
that already parses URLs. Effectively, this would imply that PHP’s new URL
parser decodesdocument.querySelector( ‘a’ ).getAttribute( ‘href’ )
,
which should be the same asdocument.querySelector( ‘a’ ).href
, and
indicates whether it found a full URL or only a portion of one.
$url->is_relative or $url->is_absolute
$url->specificity = URL::Relative | URL::Absolute
The Uri\WhatWgUri::parse() method accepts a (relative) URI parameter when
the 2nd (base URI) parameter is provided. So essentially you need to use
this variant of the parse() method if you want to parse a WhatWg compliant
URL, and then WhatWgUri should let you know whether the originally passed
in URI was relative or not, did I get you right? This feature is certainly
possible with RFC3986 URIs (even without the base parameter), but WhatWg
requires the above mentioned workaround for parsing + I have to look into
how this can be implemented...
Having methods to add query arguments, change the path, etc… would be a
great way to simplify user-space code working with URLs. For instance, read
a URL and then add a query argument if some condition within the URL
warrants it (for example, the path ends in.png
).
I managed to retain support for the "wither" methods that were originally
part of the proposal. This required using custom code for the uriparser
library, while the maintainer of Lexbor was kind enough to add native
support for modification after I submitted a feature request. However,
convenience methods for manipulating query parameters are still not part of
the RFC because it would increase the scope of the RFC even more, and due
to other issues highlighted by Ignace in his prior email:
https://externals.io/message/123997#124077. As I really want such a
feature, I'd be eager to create a followup RFC dedicated for handling query
strings.
My counter-point to this argument is that I see security exploits appear
everywhere that functions which implement specifications are pluggable and
extendable. It’s easy to see the need to create a class that limits possible
URLs, but that also doesn’t require extending a class. A class can wrap a
URL parser just as it could extend one. Magic methods would make it even
easier.
Right now, it's only possible to plug internal URI implementations into PHP (userland classes cannot be used), so this probably reduces the issue. However, I recently bumped into a technical issue with URIs not being final, which I am currently assessing how to solve. More information is available in one of my comments on my PR:
https://github.com/php/php-src/pull/14461/commits/8e21e6760056fc24954ec36c06124aa2f331afa8#r1847316607
As I currently see the situation, it would probably be better to make these classes final so that similar unforeseen issues and inconsistencies cannot happen again (we can unfinalize them later anyway).
Finally, I frequently find the need to be able to consider a URL in both
the display context and the serialization context. With Ada we have
normalize_url()
,parse_search_params()
, and the IDNA functions to
convert between the two representations. In order to keep strong boundaries
between security domains, it would be nice if PHP could expose the two
variations: one is an encoded form of a URL that machines can easily parse
while the other is a “plain string” in PHP that’s easier for humans to
parse but which might not even be a valid URL. Part of the reason for this
need is that I often see user-space code treating an entire URL as a single
text span that requires one set of rules for full decoding; it’s multiple
segments that each have their own decoding rules.
- Original [ https://xn--google.com/secret/../search?q=🍔 ]
- $url->normalize() [ https://xn--google.com/search?q=%F0%9F%8D%94 ]
- $url->for_display() Displayed [ https://䕮䕵䕶䕱.com/search?q=🍔 ]
Even though I didn't entirely implement this suggestion, I added normalization support:
- the normalize() method can be used to create a new URI instance whose components are normalized based on the current object
- the toNormalizedString() method can be used when only the normalized string representation is needed
- the newly added equalsTo() method also makes use of normalization to better identify equal URIs
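A rough sketch of the three methods above (class and factory names follow the RFC draft at this point in the discussion; the exact outputs are my assumption based on RFC 3986 normalization rules):

$a = Uri\Rfc3986Uri::parse('HTTPS://example.com/a/./b');
$b = Uri\Rfc3986Uri::parse('https://example.com/a/b');
var_dump($a->equalsTo($b));    // true: comparison uses the normalized forms
echo $a->toNormalizedString(); // https://example.com/a/b
$norm = $a->normalize();       // a new URI instance with normalized components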
For more information, please refer to the relevant section of the RFC:
https://wiki.php.net/rfc/url_parsing_api#api_design. The forDisplay()
method also seems to be useful at first glance, but since this may be a controversial optional feature, I'd defer it for later...
Regards,
Máté
It seems that I’ve mucked up the mailing list again by deleting an old message I intended on replying to. Apologies all around for replying to an older message of my own.
Máté, thanks for your continued work on the URL parsing RFC. I’m only now returning from a bit of an extended leave, so I appreciate your diligence and patience. Here are some thoughts in response to your message from Nov. 19, 2024.
even though the majority does, not everyone builds a browser application
with PHP, especially because URIs are not necessarily accessible on the web
This has largely been touched on elsewhere, but I will echo the idea that it seems valid to have two separate parsers for the two standards, and truly they diverge enough that it seems like it could be only a superficial thing for them to share an interface.
I only harp on the WhatWG spec so much because for many people this will be the only one they are aware of, if they are aware of any spec at all, and this is a sizable vector of attack targeting servers from user-supplied content. I’m curious to hear from folks here what fraction of the actual PHP code deals with RFC3986 URLs, and of those, if the systems using them are truly RFC3986 systems or if the common-enough URLs are valid in both specs.
Just to enlighten me and possibly others with less familiarity, how and when are RFC3986 URLs used and what are those systems supposed to do when an invalid URL appears, such as when dealing with percent-encodings as you brought up in response to Tim?
Coming from the XHTML/HTML/XML side I know that there was substantial effort to enforce standards on browsers and that led to decades of security exploits and confusion, when the “official” standards never fully existed in the way people thought. I don’t mean to start any flame wars, but is the URL story at all similar here?
I’m mostly worried that we could accidentally encourage risky behavior for developers who aren’t familiar with the nuances of having two URL specifications vs. having the simplest, least-specific interface point them in the right direction for what they will probably be doing. parse_url() is a great example of how the thing that looks right is actually terribly prone to failure.
The Uri\WhatWgUri::parse() method accepts a (relative) URI parameter when the 2nd (base URI) parameter is provided. So essentially you need to use
this variant of the parse() method if you want to parse a WhatWg compliant
URL
If this means passing something like the following then I suppose it’s okay. It would be nice to be able to know without passing the second parameter, as there are a multitude of cases where no such base URL would be available, and some dummy parameter would need to be provided.
$url = Uri\WhatWgUri::parse( $url, 'https://example.com' )
var_dump( $url->is_relative_or_something_like_that );
This would be fine, knowing in hindsight that it was originally a relative path. Of course, this would mean that it’s critical that https://example.com
does not replace the actual host part if one is provided in $url
. For example, this code should work.
$url = Uri\WhatWgUri::parse( 'https://wiki.php.net/rfc', 'https://example.com' );
$url->domain === 'wiki.php.net'
The forDisplay() method also seems to be useful at the first glance, but since this may be a controversial optional feature, I'd defer it for later…
Hopefully this won’t be too controversial, even though the concept was new to me when I started having to reliably work with URLs. I chose the example I did because of human risk factors in security exploits. "xn--google.com" is not in fact a Google domain, but an IDNA domain decoding to "䕮䕵䕶䕱.com".
This is a misleading URL to human readers, which is why the WhatWG indicates that “browsers should render a URL’s host by running domain to Unicode with the URL’s host and false.” [https://url.spec.whatwg.org/#url-rendering-i18n].
The lack of a standard method here means that (a) most code won’t render the URLs the way a human would recognize them, and (b) those who do will run to inefficient and likely-incomplete user-space code to try and decode/render these hosts.
It may be something fine for a follow-up to this work, but it’s also something I personally consider essential for any native support of handling URLs that are destined for human review. If sending to an href
attribute it should be the normalized URL; but if displayed as text it should be easy to prevent tricking people in this way.
In my HTML decoding RFC I tried to bake in this decision in the type of the function using an enum. Since I figured most people are unaware of the role of the context in which HTML text is decoded, I found the enum to be a suitable convenience as well as educational tool.
$url->toString( Uri\WhatWg\RenderContext::ForHumans ); // 䕮䕵䕶䕱.com
$url->toString( Uri\WhatWg\RenderContext::ForMachines ); // xn--google.com
The names probably are terrible in all of my code snippets, but at this point I’m not proposing actual names, just code samples good enough to illustrate the point. By forcing a choice here (no default value) someone will see the options and probably make the right call.
This is all looking quite nice. I’m happy to see how the RFC continues to develop, and I’m eagerly looking forward to being able to finally rely on PHP’s handling of URLs.
Happy new year,
Dennis Snell
Hi Dennis,
I’m curious to hear from folks here what fraction of the actual PHP
code deals with RFC3986 URLs, and of those, if the systems using them
are truly RFC3986 systems or if the common-enough URLs are valid in both
specs.
Here's my take on both specs. RFC3986/87 is a "parsing" RFC which leaves the validation to each individual scheme. For instance, the following URL is valid under RFC3986 but will be problematic under the WHATWG URL spec:
ldap://ldap1.example.net:6666/o=University%20of%20Michigan,c=US??sub?(cn=Babs%20Jensen)
The LDAP URL is RFC3986 compliant but adds its own validation rules on top of the RFC. This means that LDAP URL generation would be problematic if we only implement the WHATWG spec, hence why having an RFC3986/87 URI in PHP is crucial.
Furthermore, the WHATWG spec not only parses but at the same time validates and more aggressively normalizes the URL, something RFC3986 does not do; more precisely, RFC3986 recognizes and splits normalizations into two categories, the non-destructive and the destructive ones. These normalizations affect the scheme, the path and also the host, which can be very impactful in your application.
For the following URL 'https://0073.0232.0311.0377/b'
RFC3986: 'https://0073.0232.0311.0377/b'
WHATWG URL: 'https://59.154.201.255/b'
So this can be a source of confusion for developers. Last but not least, RFC3986 alone will never be able to parse IDN domain names and requires the RFC3987 IDN support to do so.
Hopefully with those examples you will understand the strengths and weaknesses of each spec and why, IMHO, PHP needs both to be up to date.
Hi Dennis,
I only harp on the WhatWG spec so much because for many people this will
be the only one they are aware of, if they are aware of any spec at all,
and this is a sizable vector of attack targeting servers from user-supplied
content. I’m curious to hear from folks here what fraction of the actual PHP
code deals with RFC3986 URLs, and of those, if the systems using them are
truly RFC3986 systems or if the common-enough URLs are valid in both specs.
I think Ignace's examples already highlighted that the two specifications
differ in nuances so much that even I had to admit after months of trying
to squeeze them into the same interface that doing so would be
irresponsible.
The Uri\Rfc3986\Uri class will be useful for many use-cases (i.e. representing
URNs or URIs with scheme-specific behavior - like ldap apparently), but
even the UriInterface of PSR-7 can build upon it. On the other hand,
Uri\WhatWg\Url will be useful for representing browser links and any other
URLs for the web (i.e. an HTTP application router component should use this
class).
Just to enlighten me and possibly others with less familiarity, how and
when are RFC3986 URLs used and what are those systems supposed to do when
an invalid URL appears, such as when dealing with percent-encodings as you
brought up in response to Tim?
I am not 100% sure what I brought up to Tim, but certainly, the biggest
difference between the two specs regarding percent-encoding was recently
documented in the RFC:
https://wiki.php.net/rfc/url_parsing_api#percent-encoding . The other main
difference is how the host component is stored: WHATWG automatically
percent-decodes it, while RFC3986 doesn't. This is summarized in the
https://wiki.php.net/rfc/url_parsing_api#component_retrieval section (a bit
below).
This would be fine, knowing in hindsight that it was originally a relative
path. Of course, this would mean that it’s critical that https://example.com does not replace the actual host part if one is provided in $url. For example, this code should work:
$url = Uri\WhatWgUri::parse( 'https://wiki.php.net/rfc', 'https://example.com' );
$url->domain === 'wiki.php.net'
Yes, that's the case. Both classes only use the base URL for relative URIs.
Hopefully this won’t be too controversial, even though the concept was new
to me when I started having to reliably work with URLs. I choose the
example I did because of human risk factors in security exploits. "
xn--google.com" is not in fact a Google domain, but an IDNA domain
decoding to "䕮䕵䕶䕱.com http://xn--google.com”
I got your point, so I implemented your suggestion. Actually, I made yet
another larger API change in the meanwhile, but in any case, the WHATWG
implementation now supports IDNA the following way:
$url = Uri\WhatWg\Url::parse("https://🐘.com/🐘?🐘=🐘", null);
echo $url->getHost(); // xn--go8h.com
echo $url->getHostForDisplay(); // 🐘.com
echo $url->toString(); // https://xn--go8h.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98
echo $url->toDisplayString(); // https://🐘.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98
Unfortunately, RFC3986 doesn't support IDNA (as Ignace already pointed out at the end of https://externals.io/message/126182#126184), and adding support for RFC3987 (and therefore IRIs) would be a very heavy amount of work; it's just not feasible within this RFC :( To make things worse, its code would have to be written from scratch, since I haven't found any suitable C library for this purpose yet. That's why I'll leave them for later.
On other notes, let me share some of the changes since my previous message to the mailing list:
- First and foremost, I removed the Uri\Rfc3986\Uri::normalize() method from the proposal after Arnaud's feedback. Now, both the normalized (and decoded) as well as the non-normalized representation can equally be retrieved from the same URI instance. This change was necessary in order for users to be able to consistently use URIs. Now, if someone needs an exact URI component value, they can use the getRaw*() getter. If they want the normalized and percent-decoded form, then a get*() getter should be used. For more information, the https://wiki.php.net/rfc/url_parsing_api#component_retrieval section should be consulted.
- I made a few less important API changes, like converting the WhatWgError class to an enum, adding a Uri\Rfc3986\IUri::getUserInfo() method, changing the return type of some getters (removing nullability), etc.
- I fixed quite a few smaller details of the implementation along with a very important spec incompatibility: until now, the "path" component didn't contain the leading "/" character when it should have. Now, both classes conform to their respective specifications with regards to path handling.
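To make the raw/normalized distinction concrete, a sketch (the getter names follow the RFC; the exact outputs are my assumption based on RFC 3986 normalization rules):

$uri = new Uri\Rfc3986\Uri('HTTPS://example.com/%7Euser');
echo $uri->getRawScheme(); // HTTPS (exactly as given)
echo $uri->getScheme();    // https (normalized)
echo $uri->getRawPath();   // /%7Euser
echo $uri->getPath();      // /~user (normalized and percent-decoded)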
I think the RFC is now mature enough to consider voting in the
foreseeable future, since most of the concerns which came up until now are
addressed some way or another. However, the only remaining question that I
still have is whether the Uri\Rfc3986\Uri and the Uri\WhatWg\Url classes
should be final? Personally, I don't see much problem with opening them for
extension (other than some technical challenges that I already shared a few
months ago), and I think people will have legitimate use cases for
extending these classes. On the other hand, having final classes may allow
us to make slightly more significant changes without BC concerns until we
have a more battle-tested API, and of course completely eliminate the need
to overcome the said technical challenges. According to Tim, it may also
result in safer code because spec-compliant base classes cannot be extended
by possibly non-spec compliant/buggy children. I don't necessarily fully
agree with this specific concern, but here it is.
Regards,
Máté
Hi
Am 2025-02-16 23:01, schrieb Máté Kocsis:
I only harp on the WhatWG spec so much because for many people this will be the only one they are aware of, if they are aware of any spec at all, and this is a sizable vector of attack targeting servers from user-supplied content. I’m curious to hear from folks here what fraction of the actual PHP code deals with RFC3986 URLs, and of those, if the systems using them are truly RFC3986 systems or if the common-enough URLs are valid in both specs.
I think Ignace's examples already highlighted that the two specifications differ in nuances so much that even I had to admit after months of trying to squeeze them into the same interface that doing so would be irresponsible.
I think this is also a good argument in favor of finally making the
classes final. Not making them final would allow for irresponsible
sub-classes :-)
echo $url->getHost(); // xn--go8h.com
echo $url->getHostForDisplay(); // 🐘.com
echo $url->toString(); // https://xn--go8h.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98
echo $url->toDisplayString(); // https://🐘.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98
The naming of these methods seems to be a little inconsistent. It should
either be:
->getHostForDisplay()
->toStringForDisplay()
or
->getDisplayHost()
->toDisplayString()
but not a mix between both of them.
I think the RFC is now mature enough to consider voting in the
foreseeable future, since most of the concerns which came up until now
are
addressed some way or another. However, the only remaining question
that I
still have is whether the Uri\Rfc3986\Uri and the Uri\WhatWg\Url
classes
should be final? Personally, I don't see much problem with opening them
for
Yes. Besides the remark above, my previous arguments still apply (e.g. with()ers not being able to construct instances for subclasses, requiring all of them to be overridden). I'm also noticing that serialization is unsafe with subclasses that add a $__uri property (or perhaps any property at all?).
We already had extensive off-list discussion about the RFC and I agree
it's in a good shape now. I've given it another read and here's my
remarks:
The toDisplayString() method that you mentioned above is not in the RFC. Did you mean toHumanFriendlyString()? Which one is correct?
The example output of the $errors array does not match the stub. It contains a failure property; should that be softError instead?
The RFC states "When trying to instantiate a WHATWG Url via its
constructor, a Uri\InvalidUriException is thrown when parsing results in
a failure."
What happens for Rfc3986 when passing an invalid URI to the constructor?
Will an exception be thrown? What will the error array contain? Is it
perhaps necessary to subclass Uri\InvalidUriException for use with
WhatWgUrl, since $errors is not applicable for 3986?
The RFC does not specify when UninitializedUriException is thrown.
The RFC does not specify when UriOperationException is thrown.
Generally speaking, I believe it would help understanding if you would add a /** @throws InvalidUriException */ to each of the methods in the stub to make it clear which ones are able to throw (e.g. resolve(), or the withers). It's harder to find this out from “English” rather than “code” :-)
In the “Component retrieval” section: Please add even more examples of what kind of percent-decoding will happen. For example, it's important to know if %26 is decoded to & in a query-string, or if %3D is decoded to =. This really is the same case as with %2F in a path.
The explanation
"the URI is normalized (when applicable), and then the reserved characters in the context of the given component are percent-decoded. This means that only those reserved characters are percent-decoded that are not allowed in a component. This behavior is needed to be able to unambiguously retrieve components."
alone is not clear to me. “Reserved characters that are not allowed in a component”: I assume this means that %2F (/) in a path will not be decoded, but %3F (?) will, because a bare ? can't appear in a path?
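(Expressed as code, that assumed reading would behave as follows; this is only the interpretation asked about above, not confirmed behaviour, and the getter names follow the RFC's getRaw*()/get*() convention:)

$uri = new Uri\Rfc3986\Uri('https://example.com/a%2Fb%3Fc');
echo $uri->getRawPath(); // /a%2Fb%3Fc (exactly as given)
echo $uri->getPath();    // /a%2Fb?c ("%3F" decodes unambiguously, since a
                         // bare "?" cannot occur in a path; "%2F" stays encoded)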
In the “Component retrieval” section: You compare the behavior of WhatWgUrl and Rfc3986Uri. It would be useful to add something like:
$url->getRawScheme() // does not exist, because WhatWgUrl always normalizes the scheme
to better point out the differences between the two APIs with regard to normalization (it's mentioned, but having it in the code blocks would make it more visible).
In the “Component Modification” section, the RFC states that WhatWgUrl will automatically encode ? and # as necessary. Will the same happen for Rfc3986? Will the encoding of # also happen for the query-string component? The RFC only mentions the path component.
I'm also wondering if there are cases where the withers would not round-trip, i.e. where $url->withPath($url->getPath()) would not result in the original URL?
Can you add examples where the authority / host contains IPv6 literals? It would be useful to specifically show whether or not the square brackets are returned when using the getters. It would also be interesting to see whether or not IPv6 addresses are normalized (e.g. shortening 2001:db8:0:0:0:0:0:1 to 2001:db8::1).
In “Component Recomposition” the RFC states "The
Uri\Rfc3986\Uri::toString() returns the unnormalized URI string".
Does this mean that toString() for Rfc3986 will always return the
original input?
It would be useful to know whether or not the classes implement __debugInfo() / how they appear when var_dump()ing them.
Best regards
Tim Düsterhus
Hi
[dropping Dennis from the Cc list]
Am 2025-02-21 13:06, schrieb Tim Düsterhus:
We already had extensive off-list discussion about the RFC and I agree
it's in a good shape now. I've given it another read and here's my
remarks:
One more thing that came to my mind, but where I'm not sure what the
correct choice is:
Naming of WhatWgError and WhatWgErrorType: they are placed within the Uri\WhatWg namespace, making the WhatWg in their name a little redundant.
For Exceptions the recommendation is to use this kind of redundant naming, to make implicit imports for catch blocks more convenient compared to needing to alias each and every Exception class. The same reasoning could also apply here, but here I find it less obvious.
The alternative would probably be Uri\WhatWg\Error and Uri\WhatWg\Error\Type.
No strong opinion from my side, but wanted to mention it nevertheless.
Best regards
Tim Düsterhus
-----Original Message-----
From: Tim Düsterhus tim@bastelstu.be
Sent: Sunday, February 23, 2025 5:05 PM
To: Máté Kocsis kocsismate90@gmail.com
Cc: Internals internals@lists.php.net
Subject: Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing
API
Naming of
WhatWgError
andWhatWgErrorType
. They are placed within the
Uri\WhatWg
namespace, making theWhatWg
in their name a little
redundant.
Hey,
As those are URI validation errors, maybe something like Uri\WhatWg\ValidationError would be both less clashy and less redundant?
If I'd see WhatWgError without seeing the "Uri" keyword, I'd probably think it's related to other aspects of the spec, e.g. something went wrong with the HTML parsing. Although I understand it's validating against the WhatWg spec, UriError would seem clearer to me.
BR,
Juris
Hi
Am 2025-02-23 18:47, schrieb Juris Evertovskis:
As those are URI validation errors, maybe something like
Uri\WhatWg\ValidationError
would be both less clashy and less
redundant?
I like that suggestion.
Best regards
Tim Düsterhus
Hi Juris and Tim,
Am 2025-02-23 18:47, schrieb Juris Evertovskis:
As those are URI validation errors, maybe something like
Uri\WhatWg\ValidationError
would be both less clashy and less
redundant?I like that suggestion.
Best regards
Tim Düsterhus
I liked it as well, so I changed the related classes the following way:
- Uri\WhatWg\WhatWgError became Uri\WhatWg\UrlValidationError
- Uri\WhatWg\WhatWgErrorType became Uri\WhatWg\UrlValidationErrorType
This way, WhatWg is not duplicated in the FQCN, but the class name is still
specific enough to possibly not clash with anything else.
I could also imagine removing the Url prefix, but I like it, since it
highlights that it's related to WHATWG URLs.
Regards,
Máté
Hi Máté, I just read the final proposal and here are my quick remarks; it's possible others have already highlighted some of them:
I believe there's a typo in the RFC:
"All URI components - with the exception of the host - can be retrieved in two formats:"
I believe you mean "with the exception of the port".
0 - It is unfortunate that there's no IDNA support for RFC3986. I understand the reasoning behind that decision, but I was wondering if it would be possible to opt in to its use when the ext-intl extension is present?
1 - Does it mean that if/when Rfc3986\Uri gets RFC3987 support, it will also get Uri::toDisplayString and Uri::getHostForDisplay? Maybe this should be stated in the future scope?
2 - I would go with final classes for both classes and promote
decoration for extension. This would reduce security issues a lot.
3 - I would make the constructor private, using from/tryFrom or parse/tryParse methods to highlight the difference in result.
4 - For consistency I would use toRawString and toString just like it is
done for components.
5 - Can the returned array from __debugInfo be used in a "normal" method like toComponents (naming can be changed/improved), to ease migration from parse_url, or is this left for userland libraries?
Hi
Am 2025-02-24 10:18, schrieb Ignace Nyamagana Butera:
5 - Can the returned array from __debugInfo be used in a "normal"
method liketoComponents
naming can be changed/improve to ease
migration from parse_url or is this left for userland library ?
I would prefer not expose this functionality for the same reason that
there are no raw properties provided: The user must make an explicit
choice whether they are interested in the raw or in the normalized
version of the individual components.
It can make sense to normalize a hostname, but not the path. My usual example against normalizing the path is that SAML signs the encoded URI instead of the payload, and changing the case in percent-encoded characters is sufficient to break the signature: e.g. %2f is different than %2F from a SAML signature perspective, requiring workarounds like this:
https://github.com/SAML-Toolkits/php-saml/blob/c89d78c4aa398767cf9775d9e32d445e64213425/lib/Saml2/Utils.php#L724-L737
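A minimal, runnable illustration of that case-sensitivity pitfall:

// Both encodings decode to the same byte, yet the signed strings differ.
var_dump(rawurldecode('%2f') === rawurldecode('%2F')); // bool(true)
var_dump('a%2fb' === 'a%2Fb');                         // bool(false)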
Best regards
Tim Düsterhus
Hi,
Thanks for all the efforts making this RFC happen, it'll be a game changer
in the domain!
I'm seeing a push to make the classes final. Please don't!
This would badly break the open/closed principle to me.
When shipping a new class, one ships two things: a behavior and a type. The
behavior is what some want to close by making the class final. But the
result is that the type will also be final. And this would lead to a
situation where people tightly couple their code to one single implementation - the internal one.
The situation I'm talking about is when one will accept an argument described as
function (\Uri\WhatWg\Url $url)
If the Url class is final, this signature means only one possible implementation can ever be passed: the native one. Composition cannot be achieved because there's no type to compose.
Fine-tuning the behavior provided by the RFC is what we might be most
interested in, but we should not forget that we also ship a type. By making
the type non-final, we keep things open enough for userland to build on it.
If not, we're going to end up with a fragmented community: some will
tightly couple to the native Url implementation, some others will define a
UriInterface of their own and will compose it with the native
implementation, all these with non-interoperable base types of course,
because interop is hard.
By making the classes non-final, there will be one base type to build upon
for userland.
(the alternative would be to define a native UrlInterface, but that'd increase complexity for little to no gain IMHO - although that'd solve my main concern).
5 - Can the returned array from __debugInfo be used in a "normal"
method like
toComponents
naming can be changed/improve to ease
migration from parse_url or is this left for userland library ?I would prefer not expose this functionality for the same reason that
there are no raw properties provided: The user must make an explicit
choice whether they are interested in the raw or in the normalized
version of the individual components.
The RFC is also missing whether __debugInfo returns raw or non-raw
components. Then, I'm wondering if we need this per-component break for
debugging at all? It might be less confusing (on this encoding aspect) to
dump basically what __serialize() returns (under another key than __uri of
course).
This would also close the avenue of calling __debugInfo() directly (at the
cost of making it possibly harder to move away from parse_url()
, but I
don't think we need to make this simpler - getting familiar with the new
API before would be required and welcome actually.)
It can make sense to normalize a hostname, but not the path. My usual
example against normalizing the path is that SAML signs the encoded
URI instead of the payload and changing the case in percent-encoded
characters is sufficient to break the signature
I would be careful with this argument: signature validation should be done
on raw bytes. Requiring an object to preserve byte-level accuracy while the
very purpose of OOP is to provide abstractions might be conflicting. The
signing topic can be solved by keeping the raw signed payload around.
Hi
On 2025-02-24 12:08, Nicolas Grekas wrote:
The situation I'm telling about is when one will accept an argument
described as
function (\Uri\WhatWg\Url $url)

If the Url class is final, this signature means only one possible
implementation can ever be passed: the native one. Composition cannot
be achieved because there's no type to compose.
Yes, that's the point: The behavior and the type are intimately tied
together. The Uri/Url classes are representing values, not services. You
wouldn't extend an int either. For DateTimeImmutable, inheritance being
legal causes a ton of needless bugs (especially around serialization
behavior).
Fine-tuning the behavior provided by the RFC is what we might be most
interested in, but we should not forget that we also ship a type. By
making
For a given specification (RFC 3986 / WHATWG) there is exactly one
correct interpretation of a given URL. “Fine-tuning” means that you are
no longer following the specification.
the type non-final, we keep things open enough for userland to build on
it.
This works:
final class HttpUrl {
private readonly \Uri\Rfc3986\Uri $uri;
public function __construct(string $uri) {
$this->uri = new \Uri\Rfc3986\Uri($uri);
if ($this->uri->getScheme() !== 'http') {
throw new ValueError('Scheme must be http');
}
}
public function toRfc3986(): \Uri\Rfc3986\Uri {
return $this->uri;
}
}
Userland can easily build their convenience wrappers around the classes,
they just need to export them to the native classes which will then
guarantee that the result is fully validated and actually a valid
URI/URL. Keep in mind that the ext/uri extension will always be
available, thus users can rely on the native implementation.
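To make that concrete, a consumer would type-hint the native class and the
wrapper would export itself at that boundary (a minimal sketch building on
the HttpUrl example above; fetch() is an illustrative name, not a proposed
API):

function fetch(\Uri\Rfc3986\Uri $uri): void
{
    // ... perform the request against a fully validated URI ...
}

$httpUrl = new HttpUrl('http://example.com/');
fetch($httpUrl->toRfc3986()); // the wrapper never leaks into the signature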
By making the classes non-final, there will be one base type to build
upon
for userland.
(the alternative would be to define native UrlInterface, but that'd
increase complexity for little to no gain IMHO - although that'd solve
my
main concern).
Máté already explained why a native UriInterface was intentionally
removed from the RFC in https://news-web.php.net/php.internals/126425.
The RFC is also missing whether __debugInfo returns raw or non-raw
components. Then, I'm wondering if we need this per-component breakdown for
debugging at all? It might be less confusing (on this encoding aspect)
to
dump basically what __serialize() returns (under another key than __uri
of
course).
That would also work for me.
It can make sense to normalize a hostname, but not the path. My usual
example against normalizing the path is that SAML signs the encoded
URI instead of the payload and changing the case in percent-encoded
characters is sufficient to break the signature

I would be careful with this argument: signature validation should be
done
on raw bytes. Requiring an object to preserve byte-level accuracy while
the
very purpose of OOP is to provide abstractions might be conflicting.
The
signing topic can be solved by keeping the raw signed payload around.
Yes, the SAML signature behavior is wrong, but I did not write the SAML
specification. I just pointed out that it is a possible use case where
choosing the raw or normalized form depends on the component and where a
“get all components” function would be dangerous.
Best regards
Tim Düsterhus
On 2025-02-24 12:08, Nicolas Grekas wrote:
The situation I'm telling about is when one will accept an argument
described as
function (\Uri\WhatWg\Url $url)

If the Url class is final, this signature means only one possible
implementation can ever be passed: the native one. Composition cannot
be
achieved because there's no type to compose.

Yes, that's the point: The behavior and the type are intimately tied
together. The Uri/Url classes are representing values, not services. You
wouldn't extend an int either. For DateTimeImmutable, inheritance being
legal causes a ton of needless bugs (especially around serialization
behavior).
DateTimeImmutable is a good example of the community-proven usefulness of
inheritance:
the Carbon package has had huge success because it adds a ton of nice
helpers (that are better maintained in userland) while still providing
compatibility with functions that accept the native type.
The fact that the native implementation had bugs when inheritance was used
doesn't mean inheritance is a problem. It's just bugs that need to be
fixed. Conceptually nothing makes those bugs inevitable.
Closing the class would have hindered community innovation. The same
applies here.
Then, if people make mistakes in their child classes, that's their problem. But
the community shouldn't be forbidden to extend a class just because
mistakes can happen.
Fine-tuning the behavior provided by the RFC is what we might be most
interested in, but we should not forget that we also ship a type. By
making

For a given specification (RFC 3986 / WHATWG) there is exactly one
correct interpretation of a given URL. “Fine-tuning” means that you are
no longer following the specification.
See the Carbon example; it's not specifically about fine-tuning. We cannot
anticipate how creative people are. Nor should we prevent them from being
so, from the PoV of the PHP engine designers.
the type non-final, we keep things open enough for userland to build on
it.
This works:
final class HttpUrl {
    private readonly \Uri\Rfc3986\Uri $uri;

    public function __construct(string $uri) {
        $this->uri = new \Uri\Rfc3986\Uri($uri);

        if ($this->uri->getScheme() !== 'http') {
            throw new ValueError('Scheme must be http');
        }
    }

    public function toRfc3986(): \Uri\Rfc3986\Uri {
        return $this->uri;
    }
}
Userland can easily build their convenience wrappers around the classes,
they just need to export them to the native classes which will then
guarantee that the result is fully validated and actually a valid
URI/URL. Keep in mind that the ext/uri extension will always be
available, thus users can rely on the native implementation.
This is an example of what I call community-fragmentation: one hardcoded
type that should only be used as an implementation detail, but will leak at
type-boundaries and will make things inflexible. Each project will have to
think about such designs, and many more will get it wrong. (We will be the
ones to blame since we're the ones educated on the topic.)
By making the classes non-final, there will be one base type to build
upon
for userland.
(the alternative would be to define native UrlInterface, but that'd
increase complexity for little to no gain IMHO - althought that'd solve
my
main concern).

Máté already explained why a native UriInterface was intentionally
removed from the RFC in https://news-web.php.net/php.internals/126425.
Only one option remains then: making the class non-final.
Nicolas
On Mon, 24 Feb 2025 at 14:45, Nicolas Grekas nicolas.grekas+php@gmail.com
wrote:
On 2025-02-24 12:08, Nicolas Grekas wrote:
The situation I'm telling about is when one will accept an argument
described as
function (\Uri\WhatWg\Url $url)

If the Url class is final, this signature means only one possible
implementation can ever be passed: the native one. Composition cannot
be
achieved because there's no type to compose.

Yes, that's the point: The behavior and the type are intimately tied
together. The Uri/Url classes are representing values, not services. You
wouldn't extend an int either. For DateTimeImmutable, inheritance being
legal causes a ton of needless bugs (especially around serialization
behavior).

DateTimeImmutable is a good example of the community-proven usefulness of
inheritance:
the Carbon package has had huge success because it adds a ton of nice
helpers (that are better maintained in userland) while still providing
compatibility with functions that accept the native type.

The fact that the native implementation had bugs when inheritance was used
doesn't mean inheritance is a problem. It's just bugs that need to be
fixed.

Closing the class would have hindered community innovation. The same
applies here.
TBH, data-point from someone that spends time removing Carbon usages here
:-P
The DateTimeImmutable type should've been final from the start: it is
trivial to declare a userland interface, and then use the
DateTimeImmutable type as an implementation detail of a userland-provided
interface.
PSR-7, for example, will benefit greatly from this new RFC, without ever
having to expose the underlying value type to userland.
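For illustration, the pattern looks roughly like this (the Clock names are
mine, not from any spec):

interface Clock
{
    public function now(): \DateTimeImmutable;
}

final class SystemClock implements Clock
{
    public function now(): \DateTimeImmutable
    {
        // the native type stays an implementation detail
        return new \DateTimeImmutable();
    }
}

Consumers type-hint Clock, so the concrete native class never appears in
their signatures.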
Inheritance is a tool to be used when there is LSP-compliant divergence
from the original type, and here, the PHP RFC aims at modeling something
that doesn't have alternative implementations: it's closed for
modification, and that's good.
Marco Pivetta
On 24.02.2025 at 14:57, Marco Pivetta wrote:
The DateTimeImmutable type should've been final from the start: it is
trivial to declare a userland interface, and then use the
DateTimeImmutable type as an implementation detail of a userland-
provided interface.
+1
I'm seeing a push to make the classes final. Please don't!
This would badly break the open/closed principle to me.

When shipping a new class, one ships two things: a behavior and a type. The behavior is what some want to close by making the class final. But the result is that the type will also be final. And this would lead to a situation where people tightly couple their code to one single implementation - the internal one.
The situation I'm telling about is when one will accept an argument described as
function (\Uri\WhatWg\Url $url)

If the Url class is final, this signature means only one possible implementation can ever be passed: the native one. Composition cannot be achieved because there's no type to compose.
Fine-tuning the behavior provided by the RFC is what we might be most interested in, but we should not forget that we also ship a type. By making the type non-final, we keep things open enough for userland to build on it. If not, we're going to end up with a fragmented community: some will tightly couple to the native Url implementation, some others will define a UriInterface of their own and will compose it with the native implementation, all these with non-interoperable base types of course, because interop is hard.
By making the classes non-final, there will be one base type to build upon for userland.
(the alternative would be to define native UrlInterface, but that'd increase complexity for little to no gain IMHO - although that'd solve my main concern).
The open/closed principle does not mean "open to inheritance".
Just pulling in the Wikipedia definition: [1]
In object-oriented programming, the open–closed principle (OCP) states "software entities (classes, modules, functions, etc.) should be open for extension, but closed for modification";
You can extend a class by using a decorator or the delegation pattern.
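For instance, a minimal delegation sketch (the TrackingUrl name is
illustrative, and it assumes a withQuery() wither as described in the
RFC's component modification section):

final class TrackingUrl
{
    public function __construct(private readonly \Uri\WhatWg\Url $url) {}

    public function withCampaign(string $campaign): self
    {
        // delegate to the final native class instead of inheriting from it;
        // this replaces the whole query, which is enough for the sketch
        return new self($this->url->withQuery('utm_campaign=' . rawurlencode($campaign)));
    }

    public function toNative(): \Uri\WhatWg\Url
    {
        return $this->url;
    }
}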
But most importantly, I would like to focus on the "closed for modification" part of the principle.
Unless we make all the methods final, inheritance allows you to modify the behaviour of the methods, which is in opposition to the principle.
Moreover, if you extend a WhatWg\Uri to behave differently to the WhatWg spec, then you do not have a WhatWg URI.
Which means the type becomes meaningless.
Quoting Dijkstra:
The purpose of abstracting is not to be vague, but to create a new semantic level in which one can be absolutely precise.
A concrete WhatWg\Uri type is abstracting over a raw string.
And it creates a new semantic level where when you are in possession of such a type,
you know with absolute certainty how it behaves and what you can do with it, and know that if a consumer needs a WhatWg URI it will not reject it.
This also means consumers of said WhatWg\Uri type do not need to care about validation of it.
If one is able to extend a WhatWg URI, then none of the above applies, and you just have a raw string with fancy methods.
I.e. you are now vague, and any consumer of the type needs to do validation because it cannot trust the type, and you have created a useless abstraction.
It also seems you did not read the relevant "Why a common URI interface is not supported?" [2] section of the RFC.
The major reason why this RFC has had so many iterations and been in discussion for so long is because Máté tried, again and again, to have a common interface.
But this just does not make any sense, you cannot make something extremely concrete vague and abstract, unless you want to lose all the benefits of the abstraction.
Best regards,
Gina P. Banyard
[1] https://en.wikipedia.org/wiki/Open%E2%80%93closed_principle
[2] https://wiki.php.net/rfc/url_parsing_api#why_a_common_uri_interface_is_not_supported
What's wrong with declaring all the methods as final, e.g.
https://github.com/lnear-dev/ada-url/blob/main/ada_url.stub.php
On Monday, 24 February 2025 at 11:08, Nicolas Grekas <
nicolas.grekas+php@gmail.com> wrote:

I'm seeing a push to make the classes final. Please don't!
This would badly break the open/closed principle to me.

When shipping a new class, one ships two things: a behavior and a type.
The behavior is what some want to close by making the class final. But the
result is that the type will also be final. And this would lead to a
situation where people tightly couple their code to one single
implementation - the internal one.

The situation I'm talking about is when one will accept an argument
described as
function (\Uri\WhatWg\Url $url)

If the Url class is final, this signature means only one possible
implementation can ever be passed: the native one. Composition cannot be
achieved because there's no type to compose.

Fine-tuning the behavior provided by the RFC is what we might be most
interested in, but we should not forget that we also ship a type. By making
the type non-final, we keep things open enough for userland to build on it.
If not, we're going to end up with a fragmented community: some will
tightly couple to the native Url implementation, some others will define a
UriInterface of their own and will compose it with the native
implementation, all these with non-interoperable base types of course,
because interop is hard.

By making the classes non-final, there will be one base type to build upon
for userland.
(the alternative would be to define native UrlInterface, but that'd
increase complexity for little to no gain IMHO - although that'd solve my
main concern).

The open/closed principle does not mean "open to inheritance".
Just pulling in the Wikipedia definition: [1]

In object-oriented programming, the open–closed principle (OCP) states
"software entities (classes, modules, functions, etc.) should be open for
extension, but closed for modification";You can extend a class by using a decorator or the delegation pattern.
But most importantly, I would like to focus on the "closed for
modification" part of the principle.
Unless we make all the methods final, inheritance allows you to modify
the behaviour of the methods, which is in opposition to the principle.

Moreover, if you extend a WhatWg\Uri to behave differently to the WhatWg
spec, then you do not have a WhatWg URI.
Which means the type becomes meaningless.

Quoting Dijkstra:
The purpose of abstracting is not to be vague, but to create a new
semantic level in which one can be absolutely precise.

A concrete WhatWg\Uri type is abstracting over a raw string.
And it creates a new semantic level where when you are in possession of
such a type,
you know with absolute certainty how it behaves and what you can do
with it, and know that if a consumer needs a WhatWg URI it will not reject
it.
This also means consumers of said WhatWg\Uri type do not need to care
about validation of it.

If one is able to extend a WhatWg URI, then none of the above applies,
and you just have a raw string with fancy methods.
I.e. you are now vague, and any consumer of the type needs to do
validation because it cannot trust the type, and you have created a
useless abstraction.

It also seems you did not read the relevant "Why a common URI interface
is not supported?" [2] section of the RFC.
The major reason why this RFC has had so many iterations and been in
discussion for so long is because Máté tried, again and again, to have a
common interface.
But this just does not make any sense, you cannot make something extremely
concrete vague and abstract, unless you want to lose all the benefits of
the abstraction.

Best regards,
Gina P. Banyard
[1] https://en.wikipedia.org/wiki/Open%E2%80%93closed_principle
[2]
https://wiki.php.net/rfc/url_parsing_api#why_a_common_uri_interface_is_not_supported
Hi
On 2025-02-24 15:05, Hammed Ajao wrote:
What's wrong with declaring all the methods as final eg.
https://github.com/lnear-dev/ada-url/blob/main/ada_url.stub.php
It is not possible to construct a subclass in a generic fashion, because
you don't know the constructor’s signature and you also don’t know if it
added some properties with a certain semantic. That means that the
with*()ers are unable to return an instance of the subclass, leading
to confusing behavior in cases like these:
final class HttpUrl extends \Uri\Rfc3986\Uri {
    public function __construct(string $uri, public readonly bool $allowInsecure) {
        parent::__construct($uri);

        if ($this->getScheme() !== 'https') {
            if ($allowInsecure) {
                if ($this->getScheme() !== 'http') {
                    throw new ValueError('Scheme must be https or http');
                }
            } else {
                throw new ValueError('Scheme must be https');
            }
        }
    }
}

$httpUrl = (new HttpUrl('https://example.com', false))->withPath('/foo');
get_class($httpUrl); // \Uri\Rfc3986\Uri
Best regards
Tim Düsterhus
On Mon, 24 Feb 2025 at 14:57, Gina P. Banyard internals@gpb.moe wrote:
On Monday, 24 February 2025 at 11:08, Nicolas Grekas <
nicolas.grekas+php@gmail.com> wrote:

I'm seeing a push to make the classes final. Please don't!
This would badly break the open/closed principle to me.

When shipping a new class, one ships two things: a behavior and a type.
The behavior is what some want to close by making the class final. But the
result is that the type will also be final. And this would lead to a
situation where people tightly couple their code to one single
implementation - the internal one.

The situation I'm talking about is when one will accept an argument
described as
function (\Uri\WhatWg\Url $url)

If the Url class is final, this signature means only one possible
implementation can ever be passed: the native one. Composition cannot be
achieved because there's no type to compose.

Fine-tuning the behavior provided by the RFC is what we might be most
interested in, but we should not forget that we also ship a type. By making
the type non-final, we keep things open enough for userland to build on it.
If not, we're going to end up with a fragmented community: some will
tightly couple to the native Url implementation, some others will define a
UriInterface of their own and will compose it with the native
implementation, all these with non-interoperable base types of course,
because interop is hard.

By making the classes non-final, there will be one base type to build upon
for userland.
(the alternative would be to define native UrlInterface, but that'd
increase complexity for little to no gain IMHO - although that'd solve my
main concern).

The open/closed principle does not mean "open to inheritance".
Just pulling in the Wikipedia definition: [1]

In object-oriented programming, the open–closed principle (OCP) states
"software entities (classes, modules, functions, etc.) should be open for
extension, but closed for modification";You can extend a class by using a decorator or the delegation pattern.
Yes.
Without a base interface, decoration itself requires a non-final class -
that's my point.
But most importantly, I would like to focus on the "closed for
modification" part of the principle.
Unless we make all the methods final, inheritance allows you to modify
the behaviour of the methods, which is in opposition to the principle.

Moreover, if you extend a WhatWg\Uri to behave differently to the WhatWg
spec, then you do not have a WhatWg URI.
Which means the type becomes meaningless.

Quoting Dijkstra:
The purpose of abstracting is not to be vague, but to create a new
semantic level in which one can be absolutely precise.

A concrete WhatWg\Uri type is abstracting over a raw string.
And it creates a new semantic level where when you are in possession of
such a type,
you know with absolute certainty how it behaves and what you can do
with it, and know that if a consumer needs a WhatWg URI it will not reject
it.
This also means consumers of said WhatWg\Uri type do not need to care
about validation of it.

If one is able to extend a WhatWg URI, then none of the above applies,
and you just have a raw string with fancy methods.
I.e. you are now vague, and any consumer of the type needs to do
validation because it cannot trust the type, and you have created a
useless abstraction.
A couple of non-final Url classes would still be absolutely useful: e.g. as
a consumer/callee, I would have stated very clearly that I need an object
that behaves like native Url objects. Then, if the implementation doesn't,
that's on the caller. The abstraction would do its job. I don't think the
extra guarantees you're describing would be useful in practice (but you
could still do an exact ::class comparison if you'd really want to).
It also seems you did not read the relevant "Why a common URI interface
is not supported?" [2] section of the RFC.
This sentence comes across to me as unnecessarily confrontational. I’d really like
to keep this discussion as constructive as possible so that php-internals
remains a welcoming space for everyone.
The major reason why this RFC has had so many iterations and been in
discussion for so long is because Máté tried, again and again, to have a
common interface.
But this just does not make any sense, you cannot make something extremely
concrete vague and abstract, unless you want to lose all the benefits of
the abstraction.
I was considering the alternative of providing TWO interfaces indeed. Sorry
if that wasn't clear enough.
Nicolas
Hi,
Thanks for all the efforts making this RFC happen, it'll be a game
changer in the domain!

I'm seeing a push to make the classes final. Please don't!
This would badly break the open/closed principle to me.

When shipping a new class, one ships two things: a behavior and a type.
The behavior is what some want to close by making the class final. But
the result is that the type will also be final. And this would lead to a
situation where people tightly couple their code to one single
implementation - the internal one.

The situation I'm talking about is when one will accept an argument
described as
function (\Uri\WhatWg\Url $url)

If the Url class is final, this signature means only one possible
implementation can ever be passed: the native one. Composition cannot be
achieved because there's no type to compose.

Fine-tuning the behavior provided by the RFC is what we might be most
interested in, but we should not forget that we also ship a type. By
making the type non-final, we keep things open enough for userland to
build on it. If not, we're going to end up with a fragmented community:
some will tightly couple to the native Url implementation, some others
will define a UriInterface of their own and will compose it with the
native implementation, all these with non-interoperable base types of
course, because interop is hard.

By making the classes non-final, there will be one base type to build
upon for userland.
(the alternative would be to define native UrlInterface, but that'd
increase complexity for little to no gain IMHO - although that'd solve
my main concern).

> 5 - Can the returned array from __debugInfo be used in a "normal"
> method like `toComponents`? Naming can be changed/improved to ease
> migration from parse_url, or is this left for userland library?

I would prefer not to expose this functionality for the same reason that there are no raw properties provided: The user must make an explicit choice whether they are interested in the raw or in the normalized version of the individual components.
The RFC is also missing whether __debugInfo returns raw or non-raw
components. Then, I'm wondering if we need this per-component breakdown for
debugging at all? It might be less confusing (on this encoding aspect)
to dump basically what __serialize() returns (under another key than
__uri of course).
This would also close the avenue of calling __debugInfo() directly (at
the cost of making it possibly harder to move away from parse_url(), but
I don't think we need to make this simpler - getting familiar with the
new API first would be required and welcome, actually.)

It can make sense to normalize a hostname, but not the path. My usual example against normalizing the path is that SAML signs the *encoded* URI instead of the payload and changing the case in percent-encoded characters is sufficient to break the signature
I would be careful with this argument: signature validation should be
done on raw bytes. Requiring an object to preserve byte-level accuracy
while the very purpose of OOP is to provide abstractions might be
conflicting. The signing topic can be solved by keeping the raw signed
payload around.
Hi Nicolas,
> 5 - Can the returned array from __debugInfo be used in a "normal"
> method like `toComponents`? Naming can be changed/improved to ease
> migration from parse_url, or is this left for userland library?

I would prefer not to expose this functionality for the same reason that there are no raw properties provided: The user must make an explicit choice whether they are interested in the raw or in the normalized version of the individual components.
I only mention this because I saw the debugInfo method being
implemented. TBH I would be more in favor of removing the method
altogether; I fail to see the added value of such a method, unless we
want to hide the class internal property, in which case it should then
"just" show the raw URL and nothing more.
Hi Tim,
Thank you again for the thorough review!
The naming of these methods seems to be a little inconsistent. It should
either be:

->getHostForDisplay() ->toStringForDisplay()

or

->getDisplayHost() ->toDisplayString()

but not a mix between both of them.
Yes, I completely agree with your concern. I'm just not sure yet which
combination I'd prefer.
Probably the latter one?
Yes. Besides the remark above, my previous arguments still apply (e.g.
with*()ers not being able to construct instances for subclasses,
requiring overriding all of them). I'm also noticing that serialization
is unsafe with subclasses that add a $__uri property (or perhaps any
property at all?).
Hm, yes, you are right indeed that withers cannot really create new
instances on
their own because the whole URI string is needed to instantiate a
new object... which is only
accessible if it's reconstructed by swapping the relevant component with
its new value.
Please note that trying to serialize a $__uri property will result in an
exception.
The toDisplayString() method that you mentioned above is not in the
RFC. Did you mean toHumanFriendlyString()? Which one is correct?
The toHumanFriendlyString() method was left over from a previous version
of the proposal;
since then I have converted it to toDisplayString().
The example output of the $errors array does not match the stub. It
contains a failure property, should that be softError instead?
The $softError property is also an outdated name: I recently changed it to
$failure
to be consistent with the wording that the WHATWG specification uses.
The RFC states "When trying to instantiate a WHATWG Url via its
constructor, a Uri\InvalidUriException is thrown when parsing results in
a failure."What happens for Rfc3986 when passing an invalid URI to the constructor?
Will an exception be thrown? What will the error array contain? Is it
perhaps necessary to subclass Uri\InvalidUriException for use with
WhatWgUrl, since $errors is not applicable for 3986?
The first two questions are answered right at the top of the parsing
section:
"the constructor: It expects a URI, and optionally, a base URL in order to
support reference resolution.
When parsing is unsuccessful, a Uri\InvalidUriException is thrown."
The $errors property will contain an empty array though, as you supposed. I
don't see much problem
with using the same exception in both cases, however I'm also fine
with making the $errors property
nullable in order to indicate that returning errors is not supported by the
implementation triggering
the error.
The RFC does not specify when
UninitializedUriException
is thrown.
That's a very good catch! I completely forgot about some exceptions. This
one is used
for indicating that a URI is not correctly initialized: when a URI
instance is created
without actually invoking the constructor, or the parse method, or
__unserialize(),
then any methods that try to use the internally stored URI will trigger
this exception.
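For example, reflection can create such an instance without ever running
the constructor (a sketch, assuming the extension permits
newInstanceWithoutConstructor() on its classes):

$reflector = new ReflectionClass(\Uri\Rfc3986\Uri::class);
$uri = $reflector->newInstanceWithoutConstructor(); // no constructor ran

$uri->getScheme(); // would throw Uri\UninitializedUriException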
The RFC does not specify when
UriOperationException
is thrown.
Generally speaking I believe it would help understanding if you would
add a /** @throws InvalidUriException */ to each of the methods in the
stub to make it clear which ones are able to throw (e.g. resolve(), or
the withers). It's harder to find this out from “English” rather than
“code” :-)
Good idea! I've added the PHPDoc as well as created a dedicated "Exceptions"
section.
In the “Component retrieval” section: Please add even more examples of
what kind of percent-decoding will happen. For example, it's important
to know if %26 is decoded to & in a query-string. Or if %3D is
decoded to =. This really is the same case as with %2F in a path.
The explanation
Thanks for calling these cases out, I've significantly reworked the
relevant sections.
First of all, I added much more details to the general overview about
percent-encoding:
https://wiki.php.net/rfc/url_parsing_api#percent-encoding_decoding as well
as extended
the https://wiki.php.net/rfc/url_parsing_api#component_retrieval section
with more information
about the two component representations, and added a general clarification
related to reserved
characters. Additionally, the
https://wiki.php.net/rfc/url_parsing_api#component_modification
section
makes it clear how percent-encoding is performed when the withers are used.
After thinking about the question a lot, the current encoding-decoding
rules finally seem
logical to me, but please double-check them. It's easy to misinterpret such
long and complex
specifications.
Long story short: when parsing a URI or modifying a component, RFC 3986
fails hard if an invalid character is found, while the WHATWG
implementation automatically percent-encodes it while also triggering a
soft error.
While retrieving the "normalized-decoded" representation of a URI component,
percent-decoding is
performed when possible:
- in case of RFC 3986: reserved and invalid characters are not
  percent-decoded (only unreserved ones are)
- in case of WHATWG: invalid characters and characters with special meaning
  (that fall into the percent-encode set of the given component) are not
  percent-decoded
The relevant sections will give a little more reasoning why I went with
these rules.
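A quick illustration of my reading of these rules (the exact outputs should
be double-checked against the examples in the RFC):

$uri = new \Uri\Rfc3986\Uri('https://example.com/%61/b%2Fc');

echo $uri->getRawPath(); // "/%61/b%2Fc" - returned exactly as given
echo $uri->getPath();    // "/a/b%2Fc"   - the unreserved %61 is decoded, but
                         // the reserved %2F is kept, so a "/" inside a segment
                         // stays distinguishable from the path delimiter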
"the URI is normalized (when applicable), and then the reserved
characters in the context of the given component are percent-decoded.
This means that only those reserved characters are percent-decoded that
are not allowed in a component. This behavior is needed to be able to
unambiguously retrieve components."alone is not clear to me. “reserved characters that are not allowed in a
component”. I assume this means that%2F
(/) in a path will not be
decoded, but%3F
(?) will, because a bare?
can't appear in a path?
I hope that this question is also clear after my clarifications + the
reconsidered logic.
In the “Component retrieval” section: You compare the behavior of
WhatWgUrl and Rfc3986Uri. It would be useful to add something like:

$url->getRawScheme() // does not exist, because WhatWgUrl always
normalizes the scheme
Done.
to better point out the differences between the two APIs with regard to
normalization (it's mentioned, but having it in the code blocks would
make it more visible).
Done.
In the “Component Modification” section, the RFC states that WhatWgUrl
will automatically encode ? and # as necessary. Will the same happen
for Rfc3986? Will the encoding of # also happen for the query-string
component? The RFC only mentions the path component.
The above referenced sections will give a clear answer for this question as
well.
TL;DR: after your message, I realized that automatic percent-encoding also
triggers a (soft) error case for WHATWG, so I changed my mind with regards
to Uri\Rfc3986\Uri: it won't do any automatic percent-encoding. It's
unfortunate, because this behavior is not consistent with WHATWG, but it's
more consistent with the parsing rules of its own specification, where
there are only hard errors, and there's no such thing as "automatic
correction".
I'm also wondering if there are cases where the withers would not
round-trip, i.e. where $url->withPath($url->getPath()) would not
result in the original URL?
I am currently not aware of any such situation... I even wrote about this
aspect at some length, because I think "round-trippability" is a very
important attribute. Thank you for
raising awareness of this!
raising awareness of this!
Can you add examples where the authority / host contains IPv6 literals?
It would be useful to specifically show whether or not the square
brackets are returned when using the getters. It would also be
interesting to see whether or not IPv6 addresses are normalized (e.g.
shortening 2001:db8:0:0:0:0:0:1 to 2001:db8::1).
Good idea again! I've added an example containing an IPv6 host at the very
end of the component retrieval section. And yes, they will be
enclosed within a [] pair as
per the spec.
It also surprised me, but IP address normalization is only performed by
WHATWG
during recomposition! But nowhere else...
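A sketch of what that means in practice (the return values are my reading
of the above, not verified output):

$url = \Uri\WhatWg\Url::parse('https://[2001:db8:0:0:0:0:0:1]/', null);

echo $url->getHost();  // "[2001:db8:0:0:0:0:0:1]" - brackets kept, address untouched
echo $url->toString(); // "https://[2001:db8::1]/" - shortened only during recomposition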
In “Component Recomposition” the RFC states "The
Uri\Rfc3986\Uri::toString() returns the unnormalized URI string".Does this mean that toString() for Rfc3986 will always return the
original input?
Yes, effectively that's the case, only WHATWG modifies the input according
to my knowledge.
In the past, I had the impression that RFC 3986 also made a few changes,
but then I realized that this was not the case after I had dug deep
into the code of uriparser.
It would be useful to know whether or not the classes implement
__debugInfo() / how they appear when var_dump()ing them.
I've added an example.
That's all I managed to write for now, but I'll try to answer the rest of
the messages and feedback
as soon as possible. :)
Regards,
Máté
Hi Dennis,
I only harp on the WhatWG spec so much because for many people this will be the only one they are aware of, if they are aware of any spec at all, and this is a sizable vector of attack targeting servers from user-supplied content. I’m curious to hear from folks here what fraction of the actual PHP code deals with RFC3986 URLs, and of those, if the systems using them are truly RFC3986 systems or if the common-enough URLs are valid in both specs.
I think Ignace's examples already highlighted that the two specifications differ in nuances so much that even I had to admit after months of trying to squeeze them into the same interface that doing so would be irresponsible.
The Uri\Rfc3986\Uri will be useful for many use-cases (e.g. representing URNs or URIs with scheme-specific behavior - like ldap apparently), but even the UriInterface of PSR-7 can build upon it. On the other hand, Uri\WhatWg\Url will be useful for representing browser links and any other URLs for the web (e.g. an HTTP application router component should use this class).
Just to enlighten me and possibly others with less familiarity, how and when are RFC3986 URLs used and what are those systems supposed to do when an invalid URL appears, such as when dealing with percent-encodings as you brought up in response to Tim?
I am not 100% sure what I brought up to Tim, but certainly, the biggest difference between the two specs regarding percent-encoding was recently documented in the RFC: https://wiki.php.net/rfc/url_parsing_api#percent-encoding
. The other main difference is how the host component is stored: WHATWG automatically percent-decodes it, while RFC3986 doesn't. This is summarized in the https://wiki.php.net/rfc/url_parsing_api#component_retrieval
section (a bit below).
This would be fine, knowing in hindsight that it was originally a relative path. Of course, this would mean that it’s critical that
https://example.com
does not replace the actual host part if one is provided in $url. For example, this code should work.

$url = Uri\WhatWgUri::parse('https://wiki.php.net/rfc', 'https://example.com');
$url->domain === 'wiki.php.net';
Yes, that's the case. Both classes only use the base URL for relative URIs.
Hopefully this won’t be too controversial, even though the concept was new to me when I started having to reliably work with URLs. I chose the example I did because of human risk factors in security exploits. "xn--google.com" is not in fact a Google domain, but an IDNA domain decoding to "䕮䕵䕶䕱.com".

I got your point, so I implemented your suggestion. Actually, I made yet another larger API change in the meanwhile, but in any case, the WHATWG implementation now supports IDNA the following way:
$url = Uri\WhatWg\Url::parse("https://🐘.com/🐘?🐘=🐘", null);echo $url->getHost(); // xn--go8h.com
echo $url->getHostForDisplay(); // 🐘.com
echo $url->toString(); // https://xn--go8h.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98echo $url->toDisplayString(); / https://🐘.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98
Unfortunately, RFC 3986 doesn't support IDNA (as Ignace already pointed out at the end of https://externals.io/message/126182#126184), and adding support for RFC 3987 (therefore IRIs) would be a very heavy amount of work; it's just not feasible within this RFC :( To make things worse, its code would have to be written from scratch, since I haven't found any suitable C library yet for this purpose. That's why I'll leave them for a follow-up.

On other notes, let me share some of the changes since my previous message to the mailing list:
- First and foremost, I removed the Uri\Rfc3986\Uri::normalize() method from the proposal after Arnaud's feedback. Now, both the normalized (and decoded), as well as the non-normalized representation can equally be retrieved from the same URI instance. This was necessary to change in order for users to be able to consistently use URIs. Now, if someone needs an exact URI component value, they can use the getRaw*() getter. If they want the normalized and percent-decoded form then a get*() getter should be used. For more information, the
https://wiki.php.net/rfc/url_parsing_api#component_retrieval
section should be consulted.
This seems like a good change.
- I made a few less important API changes, like converting the WhatWgError class to an enum, adding a Uri\Rfc3986\IUri::getUserInfo() method, changing the return type of some getters (removing nullability) etc.
Love this.
- I fixed quite some smaller details of the implementation along with a very important spec incompatibility: until now, the "path" component didn't contain the leading "/" character when it should have. Now, both classes conform to their respective specifications with regards to path handling.
This is a late thought, and surely amenable to a later RFC, but I was thinking about the get/set path methods and the issue of the / and %2F.
- If we exposed getPathIterator()
or getPathSegments()
could we not report these in their fully-decoded forms? That is, because the path segments are separated by some invocation or array element, they could be decoded?
- Probably more valuably, if withPath() accepted an array, could we not
allow fully non-escaped PHP strings as path segments, letting the URL
class safely handle the escaping for the caller by default?
Right now, if someone haphazardly joins path segments in order to set withPath()
they will likely be unaware of that nuance and get the path wrong. On the grand scale of things, I suspect this is a really minor risk. However, if they could send in an array then they would never need to be aware of that nuance in order to provide a fully-reliable URL, up to the class rejecting path segments which cannot be represented.
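Hypothetically, that could look like this (the array-accepting withPath()
is Dennis's suggestion, not part of the RFC):

// Each segment is escaped by the class itself, so a "/" inside a segment
// reliably becomes %2F instead of a new path separator.
$url = $url->withPath(['reports', '2024/Q1']);
echo $url->getRawPath(); // "/reports/2024%2FQ1"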
I think the RFC is now mature enough to consider voting in the foreseeable future, since most of the concerns which came up until now are addressed some way or another. However, the only remaining question that I still have is whether the Uri\Rfc3986\Uri and the Uri\WhatWg\Url classes should be final? Personally, I don't see much problem with opening them for extension (other than some technical challenges that I already shared a few months ago), and I think people will have legitimate use cases for extending these classes. On the other hand, having final classes may allow us to make slightly more significant changes without BC concerns until we have a more battle-tested API, and of course completely eliminate the need to overcome the said technical challenges. According to Tim, it may also result in safer code because spec-compliant base classes cannot be extended by possibly non-spec compliant/buggy children. I don't necessarily fully agree with this specific concern, but here it is.
I’ve taken another fresh and full review of the RFC and I just want to share my appreciation for how well-written it seems, and how meticulously you have taken everyone’s feedback and incorporated it. It seems mature enough to me as well, and I think it’s in a good place. Still, here are some additional thoughts (and a previous one again) related to some of aspects, mostly naming.
The HTML5 library has ::createFromString() instead of parse(). Did you
consider following this form? It doesn’t seem that important, but could be
a nice improvement in consistency among the newer spec-compliant APIs.
Further, I think createFromString() is a little more obvious in intent, as
parse() is so generic.
Given the issues around equivalence, what about isEquivalent() instead of
equals()? In the RFC I think you have been careful to use the
“equivalence” terminology, but then in the actual interface we fall back
to equals() and lose some of the nuance.
Something about not implementing getRawScheme()
and friends in the WHATWG class seems off. Your rationale makes sense, but then I wonder what the problem is in exposing the raw untranslated components, particularly since the “raw” part of the name already suggests some kind of danger or risk in using it as some semantic piece.
Tim brought up the naming of getHost()
and getHostForDisplay()
as well as the correspondence with the toString()
methods. I’m not sure if it was overlooked or I missed the followup, but I wonder what your thoughts are on passing an enum to these methods indicating the rendering context. Here’s why: I see developers reach for the first method that looks right. In this case, that would almost always be getHost()
, yet getHost()
or toString()
or whatever is going to be inappropriate in many common cases. I see two ways of baking in education into the API surface: creating two symmetric methods (e.g. getDisplayableHost()
and getNonDisplayableHost()
); or requiring an enum forcing the choice (e.g. getHost( ForDisplay | ForNonDisplay )
). In the case on an enum this could be equally applied across all of the relevant methods where such a distinction exists. On one hand this could be seen as forcing callers to make a choice, but on the other hand it can also be seen as a safeguard against an extremely-common foot-gun, making such an easy oversight impossible.
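The enum variant could be shaped roughly like this (hypothetical, not in
the RFC):

enum RenderingContext
{
    case ForDisplay;
    case ForNonDisplay;
}

// $url->getHost(RenderingContext::ForDisplay); // the caller must choose explicitly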
Just this week I stumbled upon an issue with escaping the hash/fragment part of a URL. I think that browsers used to decode percent-encodings in the fragment, but they all stopped, and this was removed from the WHATWG HTML spec (no-percent-escaping). The RFC currently shows getFragment()
decoding percent-encoded fragments. However, I believe that the WHATWG URL spec only indicates percent-encoding when setting the fragment. You can test this in a browser with the following example: Chrome, Firefox, and Safari exhibit the same behavior.
u = new URL(window.location);
u.hash = 'one and two';
u.hash === '#one%20and%20two';
u.toString() === '….#one%20and%20two';
So I think it may be more accurate and consistent to handle
Whatwg\Url::getFragment in the same way as getScheme(). When setting a
fragment we should percent-encode the appropriate characters, but when
reading it, we should never interpret those characters — it should always
return the “raw” value of the fragment.
Once again, thank you for the great work you’ve put into this. I’m so excited to have it. All my comments should be understood exclusively within the WHATWG domain as I don’t have the same experience with the RFC3986 side.
Dennis Snell
Regards,
Máté
Hi Ignace, Niels,
Sorry for being silent for so long, I was working hard on the
implementation besides some summer activities :) I can say that I had
really good progress in the last month and now I think (hope) that I
managed to address most of the concerns/suggestions people mentioned
in this thread. To summarize the most important changes:
I'm not fluent enough in the different parsing styles to comment on the difference there.
I do have concerns about the class design, though. Given the improvements to the language, the accessor methods offer zero benefit at all. Public-read properties (readonly or otherwise) would be faster and offer no less of a guarantee. If you want to allow someone to extend the class and provide some custom logic, use aviz instead of readonly and extenders can use hooks instead of the methods. The getters don't offer any value anymore.
It took me a while to realize that, I think, the fromWhatWg() method is using an in/out parameter for error handling. That is an insta-no on my part. in/out reference parameters make sense in C, maybe C++, and basically nowhere else. I view them as a code smell everywhere they're used in PHP. Better alternatives include exceptions or union returns.
It looks like you've removed the with*() methods. Why? That means it cannot be used as a builder mechanism, which is plenty valuable. (Though could be an issue with query as a string vs array.)
The WhatWgError looks to me like it's begging to be an Enum.
I am confused by the new ini value. It's for use in cases where you're NOT parsing the URL yourself, but relying on some other extension that does URL parsing internally as a side effect?
As usual, I am not a fan of an ini setting, but I cannot think of a different alternative off hand.
--Larry Garfield
Hi Larry,
I do have concerns about the class design, though. Given the improvements
to the language, the accessor methods offer zero benefit at all.
Public-read properties (readonly or otherwise) would be faster and offer no
less of a guarantee. If you want to allow someone to extend the class and
provide some custom logic, use aviz instead of readonly and extenders can
use hooks instead of the methods. The getters don't offer any value
anymore.
Yes, I knew you wouldn't like my traditional style with private properties
- getters... :) So let me try to answer your suggestions: first of all, I
believe the readonly class modifier serves its purpose, and I definitely
want to keep it because it can ensure that all URI instances are immutable.
That's why I cannot use property hooks, since they are incompatible with
readonly. So only the possibility of using asymmetric visibility remains:
however, since extenders still cannot hook them, this idea should also be
rejected. Otherwise, I would consider using readonly with public read,
although I believe traditional methods are better suited for overriding
(easier syntax, decades of experience) than property hooks (my 2 cents).
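For comparison, the two shapes under discussion, reduced to a single
component (illustrative only; the aviz variant uses PHP 8.4 syntax):

// What the RFC proposes: a readonly class with classic getters.
final readonly class UrlWithGetters
{
    public function __construct(private string $host) {}

    public function getHost(): string { return $this->host; }
}

// Larry's alternative: a public-read property via asymmetric visibility
// instead of a getter.
final class UrlWithAviz
{
    public private(set) string $host;

    public function __construct(string $host) { $this->host = $host; }
}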
It took me a while to realize that, I think, the fromWhatWg() method is
using an in/out parameter for error handling. That is an insta-no on my
part. in/out reference parameters make sense in C, maybe C++, and
basically nowhere else. I view them as a code smell everywhere they're
used in PHP. Better alternatives include exceptions or union returns.
Yes, originally the RFC used a reference parameter to return the error
during parsing. I knew it was controversial, but it was a consistent
choice with other internal functions/methods.
After your feedback, I changed this behavior to a union return type:
public static function parse(string $uri, ?string $baseUrl = null):
static|array {}
So that in case of failure, an array of Uri\WhatWgError objects is
returned. This practice is not really idiomatic with PHP, so personally I'm
not sure I like it, but neither did I particularly like passing a parameter
by reference...
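For illustration, consuming that interim signature would have looked
roughly like this (sketch):

$result = WhatWgUri::parse($someUri, null);

if (is_array($result)) {
    // a list of Uri\WhatWgError objects describing the failure
    foreach ($result as $error) {
        // report $error
    }
} else {
    // $result is the parsed URL instance
}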
It looks like you've removed the with*() methods. Why? That means it
cannot be used as a builder mechanism, which is plenty valuable. (Though
could be an issue with query as a string vs array.)
As I answered to Dennis, they have been brought back in the meanwhile.
The WhatWgError looks to me like it's begging to be an Enum.
It's probably not that visible at first glance, but Uri\WhatWgError has
2 properties: an error code and a position, so it's not feasible to make
it an enum. I could, however, create a separate Uri\WhatWgErrorCode enum
containing all the error codes, so that the class constants could be
removed from Uri\WhatWgError, but I felt it's overengineering so I decided
not to do this.
Regards,
Máté
Hi
On 2024-11-24 21:40, Máté Kocsis wrote:
It took me a while to realize that, I think, the fromWhatWg() method
is
using an in/out parameter for error handling. That is an insta-no on
my
part. in/out reference parameters make sense in C, maybe C++, and
basically nowhere else. I view them as a code smell everywhere
they're
used in PHP. Better alternatives include exceptions or union returns.

Yes, originally the RFC used a reference parameter to return the error
during parsing. I knew it was controversial, but that's what was a
consistent choice with other internal functions/methods.
After your feedback, I changed this behavior to a union return
type:

public static function parse(string $uri, ?string $baseUrl = null):
static|array {}

So that in case of failure, an array of Uri\WhatWgError objects is
returned. This practice is not really idiomatic with PHP, so personally
I'm
not sure I like it, but neither did I particularly like passing a
parameter
by reference...
I disagree with this change and believe that with the current
capabilities of PHP the out-parameter is the correct API design choice,
because then the “failure” case would be returning a falsy value, which
IMO is pretty idiomatic PHP:
if (($uri = WhatWgUri::parse($someUri, errors: $errors)) !== null) {
    printf("Your URI '%s' is valid. Here it is: %s\n", $someUri,
        $uri->toString());
} else {
    // $errors is an array, so report its size rather than the array itself
    printf("Your URI '%s' is invalid, there were %d errors.\n",
        $someUri, count($errors));
}
It would also unify the API between Rfc3986Uri and WhatWgUri.
Best regards
Tim Düsterhus
Hi
On 2024-08-26 09:40, Máté Kocsis wrote:
Please re-read the RFC as it shares a bit more details than my quick
summary above: https://wiki.php.net/rfc/url_parsing_api
I have now finally found the time to go through the discussion thread
and make a first pass through the RFC and have the following remarks.
The RFC is not listed in the overview page: https://wiki.php.net/rfc
I agree with Dennis' remark that the Rfc3986Uri
and WhatWgUri
classes must be final. The RFC makes the argument that:
Having separate classes for the two standards makes it possible to
indicate explicit intent at the type level that one specific standard
is required.
Developers extending the classes could accidentally violate the
respective standard, which nullifies the benefit of making invalid
states unrepresentable at the type-level.
This also means that the return type of the “withers” should be self
instead of static, which also means that the “withers” in the
interface must be self. Perhaps this means that they should not exist
on the interface at all. DateTimeInterface only provides the getters,
likely for a similar reason.
I believe the UriException class as the base exception should not be
abstract. There is no real benefit to it, especially since it doesn't
specify any additional abstract methods.
See also the PR introducing the Exception hierarchy for ext/random for
some opinions / arguments regarding the Exception class design:
https://github.com/php/php-src/pull/9220
I'm not sure I like the Interface suffix on the UriInterface
interface. Just Uri\Uri would be equally expressive.
I am not sure about the *User()
and *Password()
methods existing on
the interface. As the RFC acknowledges, RFC 3986 only specifies a
“userinfo” segment. Should the *User()
and *Password()
methods
perhaps be specific to the WhatWgUri
class?
I'll give the RFC another read later and expect some additional
commentary when I think about this more.
Best regards
Tim Düsterhus
Hi Tim,
Thanks for your feedback!
The RFC is not listed in the overview page: https://wiki.php.net/rfc
Uh, indeed! I've just fixed it.
I agree with Dennis' remark that the
Rfc3986Uri and WhatWgUri
classes must be final. The RFC makes the argument that:

Having separate classes for the two standards makes it possible to
indicate explicit intent at the type level that one specific standard
is required.Developers extending the classes could accidentally violate the
respective standard, which nullifies the benefit of making invalid
states unrepresentable at the type-level.
On the one hand, I also have some concerns about making these classes final
or non-final
as you probably saw in my last email (the concern came up with a question
about an implementation
detail: https://github.com/php/php-src/pull/14461#discussion_r1847316607).
On the other hand though,
if someone overrides a URI implementation, then I assume there's
definitely a purpose for doing
so (i.e. the child class has additional capabilities, or it can handle
additional protocols). If developers cannot
achieve this via inheritance, then they will do so otherwise (by using
composition, or putting the custom logic
in a helper class etc.). It's just not realistic to prevent logical bugs by
making classes final.
I would rather ask whether it's possible to make the 2 built-in URI
implementations - which have
quite some special internal behavior - behave consistently with userland
classes, even if they are overridden.
For now, the answer seems to be yes (especially after hearing Niels'
solution in the GitHub thread linked above),
but of course new issues may arise later which we don't know about yet. And
of course, it's much easier to make
a class final first and relax the inheritance rules later, than the other
way around... So these are the only reasons
why I'd make the classes final, but otherwise it would be useful to be able
to extend them.
This also means that the return type of the “withers” should be self
instead of static, which also means that the “withers” in the
interface must be self. Perhaps this means that they should not exist
on the interface at all. DateTimeInterface only provides the getters,
likely for a similar reason.
Using the self return type over static would be counterproductive in my
opinion:
it's mostly because static is the correct type semantically, and it can be
useful for
forward compatibility later if we ever want to remove the final modifier.
Regarding the analogy with DateTimeInterface, I think this one is wrong:
the ext/uri API is
completely immutable, while ext/date has the mutable DateTime
implementation,
so it's not possible to include setters in the interface, otherwise one
couldn't know
what to expect after modification.
I believe the UriException class as the base exception should not be
abstract. There is no real benefit to it, especially since it doesn't
specify any additional abstract methods.
I have no hard feelings regarding this. If I make it a concrete class, then
likely
implementations will start to throw it instead of more specific subclasses.
That's
probably not an issue, people are not usually interested in the exact
reason of an exception.
Since ext/date also added a generic parent exception (DateError) recently
which wasn't abstract,
then I'm fine with doing the same with ext/uri.
I'm not sure I like the Interface suffix on the UriInterface
interface. Just Uri\Uri would be equally expressive.
Yes, I was expecting this debate :) To be honest, I never liked interfaces
without an "Interface"
suffix, and my dislike didn't go away when I had to use such an interface
somewhere, because it
was difficult for me to find out what the symbol I was typing actually
referred to. But apart from my personal
experiences, I prefer to stay with "UriInterface" because the 2 most well
known internal PHP interfaces
also have the same suffix (DateTimeInterface, SessionInterface), and this
name definitely conveys that
people should not try to instantiate it.
I am not sure about the *User() and *Password() methods existing on the interface. As the RFC acknowledges, RFC 3986 only specifies a “userinfo” segment. Should the *User() and *Password() methods perhaps be specific to the WhatWgUri class?
Really good question, and I hesitated a lot about the same (even in some of my messages to the mailing list). In fact, RFC 3986 has some notion of user/password, because the specification mentions the "user:password" format as deprecated [in favor of passing authentication information in other places]. So I think the *User() and *Password() methods are legitimately part of the interface. And it's not even without precedent to have them in an interface: PSR-7 made use of the "user" and "password" notions in the UriInterface::withUserInfo() method, which accepts a $user and a $password parameter. I know people on this list generally don't like PSR-7, but it would be useful to know why PHP FIG chose to use these two parameters.

Due to the reasons above, the question for me is really whether we want to add the *UserInfo() methods to the interface or at least to Uri\Rfc3986Uri. Since WHATWG doesn't even mention user info (apart from the "userinfo percent-encode set", which refers to something else), I'd prefer not to add the methods in question to Uri\UriInterface. If people insist on it, then I'm fine with adding the methods to Uri\Rfc3986Uri though.
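To make the distinction concrete, here is a sketch using the accessor names from the RFC draft (they may still change); RFC 3986 itself only defines the combined userinfo component:

<?php

$uri = new Uri\Rfc3986Uri('https://user:secret@example.com/');

// RFC 3986 only defines the combined component:
//   userinfo = "user:secret"
// The proposed getters split it at the first colon, mirroring the
// deprecated "user:password" form mentioned in the specification:
echo $uri->getUser();     // "user"
echo $uri->getPassword(); // "secret"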
I disagree with this change and believe that with the current
capabilities of PHP the out-parameter is the correct API design choice,
because then the “failure” case would be returning a falsy value, which
IMO is pretty idiomatic PHP:
Yes, I can live with any of the solutions, I'm just not sure which is less bad. :) If only we had out parameters... But wishful thinking aside, I am fine with whatever the majority of people prefer. Being able to unify the API of the two implementations is probably a good argument (that no one has raised so far) for passing errors by reference...
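For reference, the by-reference variant under discussion could look roughly like this; the signature is illustrative, not the RFC's final design:

<?php

$errors = [];
$uri = Uri\WhatWgUri::parse('http://exa mple.com/', null, $errors);

if ($uri === null) {
    // The falsy return value signals failure, and $errors now holds
    // the WhatWgError objects describing why parsing failed.
    var_dump($errors);
}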
Regards,
Máté
I'm not sure I like the Interface suffix on the UriInterface interface. Just Uri\Uri would be equally expressive.

Yes, I was expecting this debate :) To be honest, I never liked interfaces without an "Interface" suffix, and my dislike didn't go away when I had to use such an interface somewhere, because it was difficult for me to find out what the symbol I was typing actually referred to.
By the same argument, you could come up with code like
<?php
class User {
    const defaultGroupNameConstant = "users";
    private string $nameVariable;
    public function getNameMethod() {…}
    …
}
?>
But apart from my personal experiences, I prefer to stay with "UriInterface" because the two most well-known internal PHP interfaces also have the same suffix (DateTimeInterface, SessionHandlerInterface), and this name definitely conveys that people should not try to instantiate it.
DateTimeInterface was introduced only after DateTime already existed. Otherwise, we would likely have DateTime, DateTimeMutable and DateTimeImmutable. (Or only DateTime as an immutable class.)
SessionHandler/SessionHandlerInterface have been bad naming choices, in
my opinion. The interface could have been SessionHandler, and the class
DefaultSessionHandler (and should have been final). I dislike these
Interface and Implementation (or abbreviations of these) suffixes.
Christoph
I used to be in favor of *Interface, but over time realized how useless it was. :-) I have stopped doing it in my own code and my code reads way better. Also, the majority of PHP's built-in interfaces (Traversable, Countable, etc.) are not suffixed, AFAIK, so it's better to avoid it for consistency. As noted, DateTimeInterface is a special-case outlier.
--Larry Garfield
Hi Máté,
I've read the latest version of the RFC, and while I very much like it, I have some remarks.
The paragraph at the beginning of the RFC in the Relevant URI specifications > WHATWG URL section seems to be incomplete.

I don't really understand how the UninitializedUriException exception can be thrown? Is it somehow possible to create an instance of a URI without initializing it? This seems unwise in general.
I'm not really convinced by using the constructor to create a URI object. I think it would be better for it to be private/throwing, and to have two static constructors, parse and tryParse, mimicking the API that exists for creating an instance of a backed enum from a scalar.
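The backed-enum API being referenced, alongside what the URI equivalent might look like (the URI method names are hypothetical here):

<?php

enum Status: string
{
    case Active = 'active';
}

Status::from('active');   // Status::Active
Status::from('nope');     // throws ValueError
Status::tryFrom('nope');  // null

// A hypothetical URI equivalent following the same pattern:
// Uri\WhatWgUri::parse('https://example.com');    // object, or throws
// Uri\WhatWgUri::tryParse('http://exa mple.com'); // null on failure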
I think changing the name of the toString method to toRawString better matches the rest of the proposed API, and also removes the question as to why it isn't the magic method __toString.
I will echo Tim's concerns about the non-final-ity of the URI classes. This seems like a recipe for disaster. I can maybe see the usefulness of extending Rfc3986\Uri by a subclass Ldap\Uri, but being able to extend the WhatWg URI makes absolutely no sense. The point of these classes is that if you have an instance of one of them, you know that you have a valid URI. Being able to subclass a URI and mess with the equals, toString, and toNormalizedString methods throws away all the safety guarantees provided by possessing a Uri instance. Moreover, as Tim previously mentioned, if you subclass you need to override all the methods, and you might end up in a situation similar to the one which led to the removal of the common Uri interface in the first place. Which basically suggests creating a new Uri class instead of extending anyway. Making these classes final just removes a lot of edge cases, some that I don't think we can anticipate, while also simplifying other aspects, like serialization, as you won't need that weird __uri property any longer.
Similarly, I don't understand why the WhatWgError is not final. Even if subclassing of the Uri classes is allowed, any error it would have would not be a WhatWg one, so why should you be able to extend it?

On the parsing API, and why monads wouldn't solve the soft error case anyway: this is just a remark, but you wouldn't be able to really implement a monad if you want to support partial success. So I'm not sure mentioning the lack of monadic support in PHP is the best argument against them for this RFC.
Best regards,
Gina P. Banyard
Hi all,
In earlier discussions on the Server-Side Request and Response objects RFC and the after-action summary, one of the common non-technical objections was that it would better be handled in userland.
Seeing as there are at least two other WHATWG-URL projects in userland now ...
... does the same objection continue to hold?
-- pmj
Hi all,
In earlier discussions on the Server-Side Request and Response objects RFC and the after-action summary, one of the common non-technical objections was that it would better be handled in userland.
Seeing as there are at least two other WHATWG-URL projects in userland now ...
... does the same objection continue to hold?
Considering that one of the other stated goals of this RFC is to provide this API to other core extensions, the previous objections do not apply here.
Best regards,
Gina P. Banyard
Hi
On 2025-02-23 18:57, Paul M. Jones wrote:
In earlier discussions on the Server-Side Request and Response objects RFC and the after-action summary, one of the common non-technical objections was that it would better be handled in userland.
I did not read through the entire discussion, but had a look at the
“after-action summary” thread and specifically Côme’s response, which
you apparently agreed with:
My take on that is more that functionality in core needs to be
«perfect», or at least near unanimous.
Or perhaps phrased differently, like I did just a few days ago in:
https://externals.io/message/126350#126355
The type of functionality that is nowadays added to PHP’s standard
library is “building block” functionality: Functions that a userland
developer would commonly need in their custom library or application.
Correctly processing URIs is a common need for developers and it’s
complicated to do right, thus it qualifies as a “building block”.
PHP also already has this functionality in parse_url(), but it's severely broken. To me it clearly makes sense to gradually provide better-designed and safer replacement functionality for broken parts of the standard library. This worked for the randomness functionality in PHP 8.2, for DOM in PHP 8.4, and hopefully for URIs in PHP 8.5.
Best regards
Tim Düsterhus
Hi there,
...
but had a look at the “after-action summary” thread and specifically Côme’s response, which you apparently agreed with:
My take on that is more that functionality in core needs to be «perfect», or at least near unanimous.
Côme Chilliet's full quote goes on to say "And of course it’s way easier to find a solution which pleases everyone when it’s for something quite simple" -- the example was of str_contains().
Or perhaps phrased differently, like I did just a few days ago in: https://externals.io/message/126350#126355
The type of functionality that is nowadays added to PHP’s standard library is “building block” functionality: Functions that a userland developer would commonly need in their custom library or application.
Correctly processing URIs is a common need for developers and it’s complicated to do right, thus it qualifies as a “building block”.
Agreed. Add to that:
Considering that one of the other stated goals of this RFC is to provide this API to other core extensions, the previous objections do not apply here.
(The previous objections being that this ought to be left in userland.)
I'm repeatedly on record as saying that PHP, as a web-centric language, ought to have more web-centric objects available in core. A Request would be one of those; a Response another; and as being discussed here, a Url.
However, if it is true that ...

- "it’s way easier to find a solution which pleases everyone when it’s for something quite simple"

- "The type of functionality that is nowadays added to PHP’s standard library is “building block” functionality: Functions that a userland developer would commonly need in their custom library or application."

- "one of the other stated goals of this RFC is to provide this API to other core extensions"

- "Parsing is the single most important operation to use with URIs where a URI string is decomposed into multiple components during the process." (from the RFC)

... then an extensive set of objects and exceptions is not strictly necessary.
Something like function parse_url_whatwg(string $url_string, ?string $base_url = null) : array, with an array of returned components, would meet all of those needs. Similarly, something like a function parse_url_rfc3986(string $uri_string, ?string $base_url = null) : array does the same for RFC 3986 parsing.

Those things combined provide solid parsing functionality to userland, and make that parsing functionality available to other core extensions.
-- pmj
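Hypothetical usage of the function-based API sketched above; neither function exists, and the component keys shown are illustrative:

<?php

// WHATWG parsing normalizes while it parses (lowercased scheme/host,
// dot segments removed), so the components come back canonicalized:
$components = parse_url_whatwg('HTTPS://EXAMPLE.com/a/../b?q=1');
// e.g. ['scheme' => 'https', 'host' => 'example.com', 'path' => '/b', ...]

// The optional base URL would support resolving relative references:
$components = parse_url_rfc3986('../c', 'https://example.com/a/b');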
Hi Paul,
The problem with your suggestion is that the specifications from WHATWG and RFC 3986/3987 are so different that the function you are proposing won't be able to cover the outcome correctly (i.e. give the developer all the needed information). This is why, for instance, Máté added the getRaw* methods alongside the normalized getters (the methods without the Raw prefix).

Also keep in mind that URL construction may also differ between specifications, so instead of just 2 functions you may end up with 4, not counting error handling. So using an OOP approach, while more complex, is IMHO the better approach.
The problem with your suggestion is that the specifications from WHATWG and RFC 3986/3987 are so different that the function you are proposing won't be able to cover the outcome correctly (i.e. give the developer all the needed information). This is why, for instance, Máté added the getRaw* methods alongside the normalized getters (the methods without the Raw prefix).
The two functions need not return an identical array of components; e.g., the 3986 parsing function might return an array much like parse_url()
does now, and the WHATWG function might return a completely different array of components (one that includes the normalized and/or raw components).
All of this is to say that the parsing functionality does not have to be in an object to be useful both to the internal API and to userland.
Recall that I'm responding at least in part to the comment that "Considering that one of the other stated goals of this RFC is to provide this API to other core extensions, the previous objections [to the Request/Response objects going into core] do not apply here." If the only reason they don't apply is that the core extensions need a parsing API, that reason becomes obviated by using just functions for the parsing elements.
Unless I'm missing something; happy to hear what that might be.
-- pmj
Hi,
All of this is to say that the parsing functionality does not have to be in an object to be useful both to the internal API and to userland.
It most definitely needs to be an object. Arrays are awful DX-wise: there are array shapes, which modern IDEs like PhpStorm support, and so does static analysis, but the overall experience remains subpar compared to classes (and objects).
Imho Request and Response objects do belong in core, but with a very good API, something which would replace HttpFoundation/PSR-7 altogether.
Hi,
It most definitely needs to be an object. Arrays are awful DX-wise: there are array shapes, which modern IDEs like PhpStorm support, and so does static analysis, but the overall experience remains subpar compared to classes (and objects).
I’m curious why you say this, other than as an opinion about developer experience? Arrays are values, objects are not. A parsed URI seems more like a value and less like an object. Just reading through the comments so far, it appears that whatever is used will just be wrapped in library code regardless, for userland code, but the objective is to be useful for other extensions and core code. In that case, a hashmap is much easier to work with than a class.

Looking at the objectives of the RFC and the comments here, it almost sounds like it is begging to be a simple array instead of an object.
— Rob
Hi,
I’m curious why you say this, other than as an opinion about developer experience? Arrays are values, objects are not. A parsed URI seems more like a value and less like an object. Just reading through the comments so far, it appears that whatever is used will just be wrapped in library code regardless, for userland code, but the objective is to be useful for other extensions and core code. In that case, a hashmap is much easier to work with than a class.

Looking at the objectives of the RFC and the comments here, it almost sounds like it is begging to be a simple array instead of an object.

— Rob
Depends on there being the intention to have it as a parameter type. If it's designed to be passed around to functions, I really don't want it to be an array. I am maintaining a legacy codebase where arrays are used as hashmaps pretty much everywhere, and it's error-prone. We lose all kinds of features like "find usages" and refactoring of key/property names. Silly typos in array keys, with no actual validation of any kind, cause null values and annoying-to-find bugs.

I agree that hashmaps can be really easy to use, but not as data structures outside of the function/method scope they were defined in. If value vs object semantics are important here, then something that is forward compatible with whatever structs may hold in the future could be interesting.
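A small illustration of the failure mode described above (the component names are made up):

<?php

$url = ['scheme' => 'https', 'host' => 'example.com'];

$host = $url['hosts'] ?? null; // typo: silently yields null

// The object equivalent fails loudly and survives refactoring:
// $url->getHosts(); // Error: Call to undefined method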
Hi,
I agree that hashmaps can be really easy to use, but not as data structures outside of the function/method scope they were defined in. If value vs object semantics are important here, then something that is forward compatible with whatever structs may hold in the future could be interesting.
I meant hashmaps from within C, not within PHP. If it is just going to be wrapped in userland libraries, as people seem to be suggesting in this thread, then you only have to get it right once, and it is easy to use from C.
— Rob
Hi
On 2025-02-23 18:30, Gina P. Banyard wrote:
I don't really understand how the UninitializedUriException exception
can be thrown?
Is it somehow possible to create an instance of a URI without
initializing it?
It's mentioned in the RFC (it was not yet, when I first read through the RFC):

This can happen for example when the object is instantiated via ReflectionClass::newInstanceWithoutConstructor().

Incidentally, this is also something that would be fixed by making the classes final, since it's illegal to bypass the constructor for final internal classes:
<?php
$r = new ReflectionClass(Random\Engine\Mt19937::class);
$r->newInstanceWithoutConstructor();
results in:
Fatal error: Uncaught ReflectionException: Class
Random\Engine\Mt19937 is an internal class marked as final that cannot
be instantiated without invoking its constructor
This seems unwise in general.
I agree. This exception is not really actionable by the user and more of
a “should never happen” case. It should be prevented from appearing.
The same is true for UriOperationException. The RFC says that it can happen for memory issues. Can this actually happen? My understanding is that the engine bails out when an allocation fails. In any case, if more graceful handling is desired, it should be some generic OutOfMemoryError rather than an extension-specific exception.
With regard to unserialization, let me refer to https://externals.io/message/118311. ext/random uses \Exception and I suggest ext/uri do the same. This should also be handled in a consistent way across extensions, e.g. by reproposing https://wiki.php.net/rfc/improve_unserialize_error_handling.

And with “Theoretically, URI component reading may also trigger this exception” being a theoretical issue only, the UriOperationException is not actually necessary at all.
I'm not really convinced by using the constructor to create a URI object. I think it would be better for it to be private/throwing, and to have two static constructors, parse and tryParse, mimicking the API that exists for creating an instance of a backed enum from a scalar.
Enums are a little different in that they are singletons. The Dom\HTMLDocument class, with only named constructors, might be a better comparison. But I don't have a strong opinion on constructor vs named constructor here.
Best regards
Tim Düsterhus
Hi Tim,
The same is true for UriOperationException. The RFC says that it can happen for memory issues. Can this actually happen? My understanding is that the engine bails out when an allocation fails. In any case, if more graceful handling is desired, it should be some generic OutOfMemoryError rather than an extension-specific exception.
After checking the code of emalloc et al., I agree with you: the exception won't actually be thrown for memory errors. Therefore, I removed this part of the RFC.
With regard to unserialization, let me refer to https://externals.io/message/118311. ext/random uses \Exception and I suggest ext/uri do the same. This should also be handled in a consistent way across extensions, e.g. by reproposing https://wiki.php.net/rfc/improve_unserialize_error_handling.
Thanks for bringing this RFC to my attention. I agree with its motivation, so I changed this aspect of the RFC as well: unserialization errors now throw an \Exception.
And with “Theoretically, URI component reading may also trigger this exception” being a theoretical issue only, the UriOperationException is not actually necessary at all.
I wanted to reserve the right for any 3rd party internal URI implementation to fail for whatever reason that prevents reading. The built-in implementations certainly don't fail, but that doesn't mean that 3rd party implementations never will. Since potential errors can be handled in some way, I think it makes sense to keep this exception, especially because it's now basically non-triggerable.
I'm not sure if I'm entirely correct, but it's possible that a 3rd party URI implementation won't (or cannot) use PHP's memory manager and relies on regular malloc: in this case, even memory errors could lead to failures.
Regards,
Máté
Hi Gina,
The paragraph at the beginning of the RFC in the Relevant URI specifications > WHATWG URL section seems to be incomplete.
Hopefully it's good now, although I know this section doesn't include much information.
I don't really understand how the UninitializedUriException exception can
be thrown?
Is it somehow possible to create an instance of a URI without initializing
it?
This seems unwise in general.
I think I've already answered this since then in my previous email (and in
the RFC as well), but yes, it's possible via reflection.
I don't really have an idea how this possibility could be avoided without
also making the classes final.
I'm not really convinced by using the constructor to create a URI object. I think it would be better for it to be private/throwing, and to have two static constructors, parse and tryParse, mimicking the API that exists for creating an instance of a backed enum from a scalar.
I'm not completely against using parse() and tryParse(), but I think the
constructor already makes it clear that it either returns
a valid object or throws.
I think changing the name of the toString method to toRawString better matches the rest of the proposed API, and also removes the question as to why it isn't the magic method __toString.
For RFC 3986, we could go with toString() instead of toNormalizedString(), and toRawString() instead of toString(), so that we use the same convention as for the getters. Recently I learnt that for some reason WHATWG normalizes the IP address during component recomposition, so its toString() is not really the most raw (at least not in the same way as the "raw getters" are). So for WHATWG, I think keeping toString() and toDisplayString() probably still makes sense.
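An example of the WHATWG behavior mentioned here, assuming the method names from the RFC draft; WHATWG's host parser rewrites shorthand and hexadecimal IPv4 notations to canonical dotted decimal:

<?php

$uri = new Uri\WhatWgUri('http://0x7F.1/');

// Even the plain serialization reflects the normalized host, so
// toString() is not as "raw" as the raw getters elsewhere in the API:
echo $uri->toString(); // "http://127.0.0.1/"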
I will echo Tim's concerns about the non-final-ity of the URI classes. This seems like a recipe for disaster. I can maybe see the usefulness of extending Rfc3986\Uri by a subclass Ldap\Uri, but being able to extend the WhatWg URI makes absolutely no sense. The point of these classes is that if you have an instance of one of these, you know that you have a valid URI. Being able to subclass a URI and mess with the equals, toString, toNormalizedString methods throws away all the safety guarantees provided by possessing a Uri instance.
I'm sure that people will find their use-cases for subclassing all these new classes, including the WHATWG implementation. As Nicolas mentioned, his main use-case is mainly adding convenience and new factory methods, which doesn't specifically need all methods to be reimplemented. While I share your opinion that leaving the URI classes open for extension is somewhat risky and it's difficult to assess its impact right now, I can also sympathise with what Nicolas wrote in a later message (https://externals.io/message/123997#126489): that we shouldn't close the door on the public using interchangeable implementations.
I know that going final without any interfaces is the most "convenient" option for the PHP project itself, because the solution has much less BC surface to maintain, so we are relatively free and safe to make future changes. This is useful for an API as big as this one in its early days. Besides the interests of the maintainers, we should also take two important things into account:

- Heterogeneous use-cases: it's beyond question that the current API won't fit all use-cases, especially because we have already identified some followup tasks that should be implemented (see the "Future Scope" section in the RFC).

- Interoperability: since URI handling is a very widespread problem, many people and libraries will start to use the new extension once it's available. But because of the above reason, many of them will want to use their own abstraction, and that's exactly why a common ground is needed: there's simply no single right implementation - everyone has their own, given the complexity of the topic.

So we should try to be considerate about these factors one way or another. So far, we have four options:
- Making the classes open for extension: this solution has acknowledged technical challenges (https://github.com/php/php-src/pull/14461#discussion_r1847316607), and it limits our possibilities for future changes the most, but users can effectively add any behavior that they need. Of course, they are free to introduce bugs and spec-incompatible behavior into their own implementation, but none of the other solutions could prevent such bugs either, since people will write their custom code wherever they can: if they can't have it in a child class, then they will have it in MyUri, or in UriHelper, or just in a 200-line function. Being able to extend the built-in classes also means that child classes can use the behavior of their parent by default - there's no need to create wrapper classes around the built-in ones (aka using composition), which is tedious to implement and would also incur some performance penalty because of the extra method calls.
- Making the classes open for extension, but making some methods final: same benefits as above, without the said technical challenges - in theory. I am currently trying to figure out whether there is a combination of methods that could be made final so that the known challenges become impossible to trigger, although I haven't managed to come up with a sensible solution yet.

- Making the classes final: it avoids some edge cases for the built-in classes (the uninitialized state most prominently), while it leaves the most room for making future changes. Projects that want to ship their own abstractions for the two built-in classes can use composition to create their own URI implementations (see the sketch after this list). They can instantiate these implementations however they want (i.e. $myUri = new MyUri($uri)), and if they need to pass a URI to other libraries, they can extract the wrapped built-in class (i.e. $myUri->getUri()).
On the flipside, backporting methods added in future PHP versions (aka polyfills) will become impossible to implement for URIs as far as I know, and mocking in PHPUnit will also be a lost feature (I'm not sure if that's a good or a bad thing, but it may be worth pointing out). Also, the current built-in implementations may have alternative implementations that couldn't be used instead of them. For example, the Ada URL library (which is mentioned in the RFC) also implements the WHATWG specification - possibly in the very same way as Lexbor, the currently used library, does. These alternative implementations may have different performance characteristics, platform requirements, or levels of maintenance/support, which may qualify them as more suitable for some use-cases than what the built-in ones can offer. If we make these classes final, there's no way to use alternative implementations as a replacement for the default ones, even though they all implement the same specification with mostly clear semantics.
- Making the classes final, but adding a separate interface for each: the impact of making the built-in classes final would be mitigated by adding one interface for each specification (I didn't like this idea in the past, but it now looks much more useful in light of the final vs non-final debate). Because of the interfaces, there would be a common denominator for the different possible implementations. I'm sure someone would suggest that the community (aka PHP-FIG) should come up with such an interface, but I think we shouldn't expect someone else to do the work when we are in the best position to do it, as those interfaces should be internal ones, since the built-in URI classes should also implement them. If we had these interfaces, projects could use whatever abstraction they want via composition, but they could more conveniently pass around the same object everywhere.
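A minimal sketch of the composition approach from the third option; the MyUri wrapper and the method names on the built-in class are illustrative:

<?php

final class MyUri
{
    public function __construct(private Uri\WhatWgUri $uri) {}

    // Convenience behavior lives on the wrapper...
    public function isHttps(): bool
    {
        return $this->uri->getScheme() === 'https';
    }

    // ...and the built-in object can be extracted for libraries that
    // expect the native type.
    public function getUri(): Uri\WhatWgUri
    {
        return $this->uri;
    }
}

$myUri = new MyUri(new Uri\WhatWgUri('https://example.com/'));
$myUri->isHttps(); // true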
I intentionally don't try to draw a conclusion for now, first of all because it already took me a lot of time to compare the different possibilities mostly objectively, and I hope that we can find more pros and cons (or fix my reasoning if I made mistakes somewhere) in order to finally reach some kind of consensus.
Similarly, I don't understand why the WhatWgError is not final. Even if subclassing of the Uri classes is allowed, any error it would have would not be a WhatWg one, so why should you be able to extend it?
I made it final now.
Thank you for your comments!
Máté
Thought: make the class non-final, but all of the defined methods final, and any internal data properties private. That way we know that a child class cannot break any of the existing guarantees, but can still add convenience methods or static constructors on top of the existing API, without the need for an interface and a very verbose composing class.
--Larry Garfield
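A rough sketch of what this could look like; the names are hypothetical and validation is elided:

<?php

class WhatWgUri
{
    public function __construct(private string $host) {}

    final public function getHost(): string
    {
        return $this->host;
    }

    final public function withHost(string $host): static
    {
        $clone = clone $this;
        $clone->host = $host; // a real implementation would validate here
        return $clone;
    }
}

class TrackingUri extends WhatWgUri
{
    // Convenience additions remain possible...
    public static function localhost(): static
    {
        return new static('localhost');
    }

    // ...but redeclaring getHost() or withHost() in a subclass is a
    // fatal "Cannot override final method" error.
}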