Hi Everyone,
I've been working on a new RFC for a while now, and time has come to
present it to a wider audience.
Last year, I learnt that PHP doesn't have built-in support for parsing URLs
according to any well-established standard (RFC 1738 or the WHATWG URL
living standard), since the parse_url() function is optimized for
performance instead of correctness.
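To make this concrete (an illustration of mine, not an example from the RFC):
parse_url() splits its input without applying the normalization a browser would.

    var_dump(parse_url('https:example.com/path'));
    // ['scheme' => 'https', 'path' => 'example.com/path']
    // parse_url() sees no "//", so everything after the scheme becomes
    // the path. A WHATWG-compliant parser treats https: as a "special"
    // scheme and interprets the same input as https://example.com/path.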
In order to improve compatibility with external tools consuming URLs (like
browsers), my new RFC would add WHATWG-compliant URL parsing functionality
to the standard library. The API itself is not final by any means; the RFC
only represents how I first imagined it.
You can find the RFC at the following link:
https://wiki.php.net/rfc/url_parsing_api
Regards,
Máté
Hey Máté,
So far, amazing! 👏
This is a great addition to have! I see there's nothing specifically about
__toString in the RFC; is this aiming to do the same as PSR-7?
Hi Máté

+1 from me, I'm all for modern web-related APIs as you know.

Some questions/remarks:
- Why did you choose UrlParser to be a "static" class? Right now it's just a fancy namespace. I can see the point of having a UrlParser class that you can e.g. configure with which URL standard you want, but as it is now there is no such capability.
- It's a bit of a shame that the PSR interface treats queries as strings. In JavaScript we have the URLSearchParams class that we can use as a key-value storage for query parameters. This JavaScript class also handles escaping them nicely. (See the snippet after this list for what PHP offers today.)
- Why is UrlComponent a backed enum?
- A nit: we didn't bundle the entire Lexbor engine, only select parts of it. Just thought I'd make it clear.
- About edge cases: e.g. what happens if I call the Url constructor and leave every string field empty?
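On the query-string point: today the closest built-in equivalents already treat
the query as a key-value structure, which is roughly what a URLSearchParams-like
API would formalize (a quick sketch, not part of the RFC):

    parse_str('tags[]=php&tags[]=url&q=a b', $params);
    // $params === ['tags' => ['php', 'url'], 'q' => 'a b']
    echo http_build_query($params);
    // tags%5B0%5D=php&tags%5B1%5D=url&q=a+b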
Overall seems good.
Kind regards
Niels
- Why did you choose UrlParser to be a "static" class?
Because "static class" is the hip new cool ;)
Bilge
- It's a bit of a shame that the PSR interface treats queries as strings.
In Javascript we have the URLSearchParams class that we can use as a key-value storage for query parameters.
This Javascript class also handles escaping them nicely.
Agreed, this is a weird choice to me, but I'm also not surprised by weird choices via php-fig (log level constants, I'm looking at you).

We hear all the time how userland is more flexible and can change quicker, and yet here we see a potential built-in class having a worse API because it wants to be compatible with an existing userland interface with the same bad API...
Cheers
Stephen
I personally ignore PSR when it doesn't make sense to use it. PSRs are nice for library compatibility, but I will happily toss compatibility when it doesn't make sense to be compatible. This might be one of those cases, as there is no reason it has to be PSR compliant. In fact, a wrapper may be written to make it compliant, if one so chooses. I suspect it is better to be realistic and learn from the shortcomings of PSR and apply those learnings here, vs. reiterating them and "engraving them in stone" (so to speak).
— Rob
While I do not think the debate should be about compatibility with PSR-7,
some historical context should be brought to light for a fair discussion:

- parse_url and parse_str predate RFC 3986.
- URLSearchParams was ratified before PSR-7, BUT the first implementation landed a year AFTER PSR-7 was released and already implemented.
- PHP's historical query parser parse_str logic is so bad (mangled parameter names, for instance; illustrated below) that PSR-7 was right not to embed that parsing algorithm in its specification.
- If you take aside the URITemplate specification and now URLSearchParams, there is no official, referenced and/or agreed-upon rules/document on how a query string MUST or SHOULD be parsed.
- Last but not least, URLSearchParams encoding/decoding rules follow NEITHER RFC 1738 nor RFC 3986 (they follow form data encoding, which is kind of a mix between both RFCs).

This means that just adding a method or a class that mimics URLSearchParams
100% will constitute a major departure in how PHP treats query strings: you
will no longer have a 1:1 relation between the data you have inside your
$_GET array and the one in URLSearchParams, for better or for worse.

For all these arguments I would keep the proposed Url free of all
these concerns and lean toward a nullable string for the query string
representation, and defer this debate to its own RFC regarding query
string parsing handling in PHP.
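To illustrate the parameter-name mangling mentioned above (current PHP
behaviour, example mine):

    parse_str('a.b=1&a b=2', $out);
    // Dots and spaces in top-level parameter names are both converted
    // to underscores, so the two names collide:
    var_dump($out); // ['a_b' => '2']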
Hi Ignace,
As far as I understand it, if this RFC were to pass as is, it will model
PHP URLs on the WHATWG specification. While this specification is
getting a lot of traction lately, I believe it will restrict URL usage in
PHP instead of making developers' lives easier. While PHP started as a
"web" language, it is first and foremost a server side general purpose
language. The WHATWG spec, on the other hand, is created by browser
vendors and is geared toward browsers (client side), and because of
browser history it restricts by design a lot of what PHP developers can
currently do using parse_url. In my view the Url class in
PHP should allow dealing with any IANA registered scheme, which is not
the case for the WHATWG specification.
Supporting IANA registered schemes is a valid request, and is definitely
useful. However, I think this feature is not strictly required in the
current RFC. Anyone who needs to support features that are not offered by
the WHATWG standard can still rely on parse_url(). And of course, we can
(and should) add support for other standards later. If we wanted to do all
of this in the same RFC, then the scope of the RFC would become way too
large IMO. That's why I opt for incremental improvements.

Besides, I fail to see why a WHATWG compliant parser wouldn't be useful in
PHP: yes, PHP is server side, but it still interacts with browsers very
heavily. Among other use-cases I cannot yet imagine, the major one is most
likely validating user-supplied URLs for opening in the browser. As far as
I see the situation, there is currently no acceptably reliable way to
decide whether a URL can be opened in browsers or not.
- parse_url and parse_str predate RFC 3986
- URLSearchParams was ratified before PSR-7 BUT the first implementation landed a year AFTER PSR-7 was released and already implemented.
Thank you for the historical context!
Based on your and others' feedback, it has now become clear to me that
parse_url() is still useful and ext/url needs quite some additional
capabilities before this function really becomes superfluous. That's why it
now seems to me that the behavior of parse_url() could be leveraged in
ext/url so that it would work with a Url\Url class (e.g. we had a
PhpUrlParser class extending Url\UrlParser, or a Url\Url::fromPhpParser()
method, depending on which object model we choose; of course the names are
TBD).
For all these arguments I would keep the proposed Url free of all
these concerns and lean toward a nullable string for the query string
representation, and defer this debate to its own RFC regarding query
string parsing handling in PHP.

My WIP implementation still uses nullable properties and return types; I
only changed those when I wrote the RFC. Since I see that PSR-7
compatibility is a very low priority for everyone involved in the
discussion, I think making these types nullable is fine. It wasn't my top
priority either, but I had to start the object design somewhere, so I went
with this.
Again, thank you for your constructive criticism.
Regards,
Máté
Supporting IANA registered schemes is a valid request, and is definitely
useful. However, I think this feature is not strictly required in the
current RFC. Anyone who needs to support features that are not offered by
the WHATWG standard can still rely on parse_url(). And of course, we can
(and should) add support for other standards later. If we wanted to do all
of this in the same RFC, then the scope of the RFC would become way too
large IMO. That's why I opt for incremental improvements.
It's also worth pointing out (as another reason not to do this) that IANA registration may or may not be valid on the current network. For example, TOR, Handshake, IPFS, Freenet, etc. all have their own DNS schemes and do not (usually) use IANA registered schemes, and many people create sites that cater to those networks.
Besides, I fail to see why a WHATWG compliant parser wouldn't be useful in
PHP: yes, PHP is server side, but it still interacts with browsers very
heavily. Among other use-cases I cannot yet imagine, the major one is most
likely validating user-supplied URLs for opening in the browser. As far as
I see the situation, there is currently no acceptably reliable way to
decide whether a URL can be opened in browsers or not.

Looking at the WHATWG spec, it looks like example%2Ecom will be parsed as a
valid URL and transformed to example.com, while this doesn't currently
happen in parse_url(). I don't know if that is an issue, but it might be if
you are expecting the string to remain URL-encoded.
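For reference, current behaviour (example mine):

    var_dump(parse_url('https://example%2Ecom/'));
    // ['scheme' => 'https', 'host' => 'example%2Ecom', 'path' => '/']
    // parse_url() keeps the host byte-for-byte; a WHATWG parser
    // percent-decodes and normalizes it to "example.com".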
My WIP implementation still uses nullable properties and return types; I
only changed those when I wrote the RFC. Since I see that PSR-7
compatibility is a very low priority for everyone involved in the
discussion, I think making these types nullable is fine.
The spec contains elements and their types. It would be good to adhere to the spec (it simplifies documentation); a sketch follows the list:
- scheme may be null or an empty string
- port may be null
- path is never null, but may be an empty string
- query may be null
- fragment may be null
- user/password may be null (to differentiate between an empty password and no password)
- host may be null (for relative URLs)
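A hypothetical constructor shape matching that list (all names are
placeholders, not the RFC's API):

    final class Url
    {
        public function __construct(
            public readonly ?string $scheme,
            public readonly ?string $user,
            public readonly ?string $password,
            public readonly ?string $host,     // null for relative URLs
            public readonly ?int    $port,
            public readonly string  $path,     // never null, may be ''
            public readonly ?string $query,
            public readonly ?string $fragment,
        ) {}
    }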
— Rob
Here's a list of examples worth adding to the RFC:
//example.com?
ftp://user@example.com/path/to/file
https://user:@example.com
https://user:pass@example%2Ecom/?something=other&bool#heading
etc.
— Rob
Hi Máté,
Supporting IANA registered schemes is a valid request, and is
definitely useful. However, I think this feature is not strictly
required to have in the current RFC.
True. Having a WHATWG compliant parser in PHP source code is a big +1
from me; I have nothing against that inclusion.
Based on your and others' feedback, it has now become clear for me
that parse_url() is still useful and ext/url needs quite some additional
capabilities until this function really becomes superfluous.

parse_url can only be deprecated when an RFC 3986 compliant parser is
added to php-src, hence why I insist on having that parser present too.

I will also add that everything up to now in PHP uses RFC 3986 as the
basis for generating or representing URLs (the cURL extension, streams,
etc.). Having the first and only OOP representation of a URL in the
language not follow that same specification seems odd to me. It opens the
door to inconsistencies that will only be resolved once an equivalent
RFC 3986 URL object makes its way into the source code.
On the public API side I would recommend the following:

- If you are to strictly follow the WHATWG specification, no URI component can be null; they must all be strings. If we plan to use the same object for an RFC 3986 compliant parser, then all components should be nullable except for the path component, which can never be null as it is always present.
- As others have mentioned, we should add a method to resolve a URI against a base URI, something like Url::resolve(string $url, Url|string|null $baseUrl), where the $baseUrl argument should be an absolute URL if present. If absent, the $url argument must be absolute, otherwise an exception should be thrown (see the sketch after this list).
- Last but not least, the WHATWG specification is not only a URL parser but also a URL validator, and it can apply some "corrections" to malformed URLs and report them. The specification has a provision for a structure to report malformed URL errors. I failed to see this mechanism mentioned anywhere in the RFC. Will the URL only trigger exceptions, or will it also trigger warnings? For inspiration, the excellent PHP userland WHATWG URL parser from Trevor Rowbotham (https://github.com/TRowbotham/URL-Parser) allows using a PSR-3 logger to record those errors.
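A sketch of how the suggested resolution method could look in use
(Url::parse() here is a placeholder factory name, not part of the RFC):

    $base = Url::parse('https://example.com/a/b/c');
    $url  = Url::resolve('../d?e=f', $base);
    // RFC 3986 reference resolution yields https://example.com/a/d?e=f

    Url::resolve('../d?e=f', null); // relative input with no base: throws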
Best regards,
Ignace
If you are to strictly follow the WHATWG specification, no URI component
can be null; they must all be strings.
This isn't true. It's just that in the language the spec is written in, any element can be null (i.e., there are no nullable types). It specifies what may be null here: https://url.spec.whatwg.org/#url-representation
— Rob
Hi Máté,
Fantastic RFC :)
On Sun, Jul 7, 2024 at 11:17, Máté Kocsis kocsismate90@gmail.com wrote:
Supporting IANA registered schemes is a valid request, and is definitely
useful. However, I think this feature is not strictly required in the
current RFC. Anyone who needs to support features that are not offered by
the WHATWG standard can still rely on parse_url().
If I may, parse_url is showing its age, and issues like
https://github.com/php/php-src/issues/12703 make it unreliable. We need an
escape plan from it.

FYI, we're discussing whether a Uri component should make it into Symfony
precisely to work around parse_url's issues in
https://github.com/php/php-src/issues/12703

Your RFC would be the perfect answer to this discussion, but IRI would need
to be part of it.

I agree with everything Ignace said. Supporting RFC 3986 from day 1 would
be absolutely great!

Note that we use parse_url for http URLs, but also to parse DSNs like
redis://localhost and the likes.
Hey Ignace, Nicolas,
Based on your request for adding support for RFC 3986 spec compatible
parsing, I evaluated another library
(https://github.com/uriparser/uriparser/) in recent days in order to add
support for the requested functionality. As far as I can tell, the results
were very promising, so I'm OK with including this in my proposal (I
haven't pushed my changes yet and haven't updated the RFC yet).
Regarding the reference resolution feature
(https://uriparser.github.io/doc/api/latest/#resolution) which has also
already been asked for, I'm genuinely wondering what the use-case is. But
in any case, I'm fine with incorporating this into the RFC as well, since
apparently both Lexbor and uriparser support it (naturally).
What I became puzzled about is the correct object structure and naming. Now
that uriparser, which can deal with URIs, came into the picture, while
Lexbor can parse URLs, I don't know if it's a good idea to have a dedicated
URI class and a URL class extending the former... If it is, then in my
opinion the logical behavior would be that Lexbor always instantiates URL
classes, while uriparser would have to decide whether the passed-in URI is
actually a URL, and choose the instantiated class based on this factor...
But in this case the differences between the RFC 3986 and WHATWG
specifications couldn't be spelled out, since URL objects could hold URLs
parsed based on both specs (and therefore having a unified interface is
required).
Or rather, should we have a separate URI and a WhatwgUrl class, so that the
former would always be created by uriparser, while the latter by Lexbor?
This way we could have a dedicated object interface for both standards
(e.g. the RFC 3986 related one could have a getUserInfo() method, while the
WHATWG related one could have both getUser() and getPassword() methods).
But then the question is how interchangeable these classes should be, i.e.
should we be able to convert them back and forth, or should there be an
interface that is implemented by the two classes?
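A minimal sketch of the latter option, assuming a shared interface (all
names illustrative, not a proposal):

    interface Uri
    {
        public function getScheme(): ?string;
        public function toString(): string;
    }

    // Each spec then gets its own implementation with spec-specific
    // accessors: e.g. an Rfc3986Uri adding getUserInfo(), and a
    // WhatwgUrl adding getUser() and getPassword().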
I'd appreciate any suggestions regarding these questions.
P.S. due to its bad reception, I got rid of the UrlParser class as well as
the UrlComponent enum from my implementation in the meantime.
Regards,
Máté
I apologize if I missed this up-thread somewhere, but what precisely are the differences between URI and URL? My understanding was that URL is a subset of URI (all URLs are URIs, but not all URIs are URLs). You're saying they're slightly disjoint sets? Can you give some concrete examples of where the parsing rules would produce different results? That may give us a better sense of what the logic should be.
--Larry Garfield
Hi Máté,
As far as I can tell, the results were very promising, so I'm ok to
include this into my proposal (I haven't pushed my changes yet and
haven't updated the RFC yet).

This is great news: if it is indeed possible to release both
specifications at the same time, that would be really great.
Regarding the reference resolution feature
(https://uriparser.github.io/doc/api/latest/#resolution) which has also
already been asked for, I'm genuinely wondering what the use-case is?

Resolution is common when using an HTTP client: you define a base URI and
then construct subsequent URIs from that base URI using resolution.
What I became puzzled about is the correct object structure and naming.
Now that uriparser, which can deal with URIs, came into the picture,
while Lexbor can parse URLs, I don't know if it's a good idea to have a
dedicated URI class and a URL class extending the former...

Both specifications parse URLs that can be represented by a URL value
object. The main differences between the two implementations are around
normalization and encoding. RFC 3986 only allows non-destructive
normalization, which is not true of the WHATWG spec. Here's a simple
example to illustrate the differences:

HttPs://0300.0250.0000.0001/path?query=foo%20bar

- with RFC 3986 you will end up with https://0300.0250.0000.0001/path?query=foo%20bar
- with WHATWG you will end up with https://192.168.0.1/path?query=foo+bar

In the case of WHATWG the host is changed and the query string follows a
distinctive encoding spec.
From my POV you have two choices: either you use one URL object for both
specifications with distinctive named constructors fromRFC3986 and
fromWhatwg, or you have one interface and two distinctive implementations.
I do not think that one can be extended to create the other, at least
that's my POV.
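Expressed with those hypothetical named constructors (and assuming the
objects are stringable; outputs taken from the example above):

    $input = 'HttPs://0300.0250.0000.0001/path?query=foo%20bar';
    echo Url::fromRFC3986($input);
    // https://0300.0250.0000.0001/path?query=foo%20bar
    echo Url::fromWhatwg($input);
    // https://192.168.0.1/path?query=foo+bar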
Hope this helps you in your implementation.
Best regards,
Ignace
Hi Niels,
First of all, thank you for your support!
Why did you choose UrlParser to be a "static" class? Right now it's just a
fancy namespace.

That's a good question, let me explain the reason: one of my major design
goals was to make the UrlParser class extendable and configurable (e.g. via
an "engine" property, similar to what Random\Randomizer has). Of course,
UrlParser doesn't support any of this yet, but at least the possibility is
there for follow-up RFCs due to the class being final. Since I knew it
would be overkill to require instantiating a UrlParser instance for a task
which is stateless (URL parsing), I finally settled on using static methods
for the purpose. Later, if the need arises, the static methods could be
converted to non-static ones with minimal BC impact.
It's a bit of a shame that the PSR interface treats queries as strings.
In Javascript we have the URLSearchParams class that we can use as a
key-value storage for query parameters.
Hm, yes, that's an observation I can agree with. However, this restriction
shouldn't prevent follow-ups from adding key-value storage support for
query parameters. Although, as far as I could determine, Lexbor isn't
currently capable of such a thing either.
Why is UrlComponent a backed enum?
To be honest, it has no specific reason apart from that's what I am used
to. I'm fine with whatever choice, even with getting rid of
UrlComponent completely. I added the UrlParser::parseUrlComponent() method
(and hence the UrlComponent enum) to the
proposal in order to have a direct replacement for parse_url()
when it's
called with the $component parameter set, but I wasn't
really sure whether this is needed at all... So I'm eager to hear any
recommendations regarding this problem.
A nit: We didn't bundle the entire Lexbor engine, only select parts of it.
Just thought I'd make it clear.
Yes, my wording was slightly misleading. I'll clarify this in the RFC.
About edge cases: e.g. what happens if I call the Url constructor and leave
every string field empty?
Nothing :) The Url class in its current form can store invalid URLs. I know
that URLs are generally modeled as value objects (that's also why the
proposed class is immutable), and generally speaking, value objects should
protect their invariants. However, due to separating the parser into its
own class, I abandoned this "rule". So this is one more downside of the
current API.
Regards,
Máté
I am all for proper data modeling of all the things, so I support this effort.

Comments:

- There's no need for UrlComponent to be backed.
- I don't understand why UrlParser is a static class. We just had a whole big debate about that. :-) There's a couple of ways I could see it working, and I'm not sure which I prefer:
  - Better if we envision the parser getting options or configuration in the future: $url = new UrlParser()->parseUrl(): Url;
  - The named-constructor pattern is quite common: $url = Url::parseFromString(); $url = Url::parseToArray();
- I... do not understand the point of having public properties AND getters/withers. A readonly class with withers, OK, a bit clunky to implement, but it would be your problem in C, not mine, so I don't care. :-) But why getters AND public properties? If going that far, why not finish up clone-with, and then we don't need the withers, either? :-)
- Making all the parameters to Url required except port makes little sense to me. User/pass is more likely to be omitted 99% of the time than port. In practice, most components are optional, in which case it would be inaccurate not to make them nullable. Empty string wouldn't be quite the same, as that is still a value, and code that knows to skip empty strings when doing something is basically the same as code that knows to skip nulls. We should assume people are going to instantiate this class themselves often, not just get it from the parser, so it should be designed to support that.
- I would not make Url final. "OMG but then people can extend it!" Exactly. I can absolutely see a case for an HttpUrl subclass that enforces scheme as http/https, or an FtpUrl that enforces a scheme of ftp, etc. Or even an InternalUrl that assumes the host is one particular company, or something. (If this sounds like scope creep, it's because I am confident that people will want to creep in this direction and we should plan ahead for it.)
- If the intent of the withers is to mimic PSR-7, I don't think it does so effectively. Without the interface, it couldn't be a drop-in replacement for UriInterface anyway. And we cannot extend it to add the interface if it's final. Widening the parameters in PSR-7 interfaces to support both wouldn't work, as that would be a hard BC break for any existing implementations. So I don't really see what the goal is here.
- If we ever get "data classes", this would be a good candidate. :-)
- Crazy idea: new UriParser(HttpUrl::class)->parse($string); to allow a more restrictive set of rules, or even just to cast the object to that child class.

--Larry Garfield
Hi Larry,
Thank you very much for your feedback! I think I have already partially
answered some of your questions in my previous email to Niels,
but let me answer your other questions below:
- I... do not understand the point of having public properties AND
getters/withers. A readonly class with withers, OK, a bit clunky to
implement but it would be your problem in C, not mine, so I don't care.
:-) But why getters AND public properties? If going that far, why not
finish up clone-with and then we don't need the withers, either? :-)
I know it's disappointing, but the public modifiers are just a typo,
forgotten there from the very first iteration of the API :) However, I'm
fine with having public readonly properties without getters as well, as
long as we declare this a policy that we are going to adopt... Withers are
indeed a must for now (and their implementation indeed requires some magic
in C...).
- Making all the parameters to Url required except port makes little sense
to me. User/pass is more likely to be omitted 99% of the time than port.
In practice, most components are optional, in which case it would be
inaccurate to not make them nullable. Empty string wouldn't be quite the
same, as that is still a value and code that knows to skip empty string
when doing something is basically the same as code that knows to skip
nulls. We should assume people are going to instantiate this class
themselves often, not just get it from the parser, so it should be designed
to support that.
I may have misunderstood what you wrote, but all the parameters, including
port, are required. If you really meant "nullable" instead of "required",
then you are right. Apart from this, I'm completely fine with making these
parameters optional, especially if we decide not to have the UrlParser (my
initial assumption was that the Url class is going to be instantiated via
UrlParser::parseUrl() calls).
- I would not make Url final. "OMG but then people can extend it!"
Exactly. I can absolutely see a case for an HttpUrl subclass that enforces
scheme as http/https, or an FtpUrl that enforces a scheme of ftp, etc. Or
even an InternalUrl that assumes the host is one particular company, or
something. (If this sounds like scope creep, it's because I am confident
that people will want to creep this direction and we should plan ahead for
it.)
Without having thought much about its consequences on the implementation,
I'm fine with removing the final modifier.
- If the intent of the withers is to mimic PSR-7, I don't think it does so
effectively. Without the interface, it couldn't be a drop-in replacement
for UriInterface anyway. And we cannot extend it to add the interface if
it's final. Widening the parameters in PSR-7 interfaces to support both
wouldn't work, as that would be a hard-BC break for any existing
implementations. So I don't really see what the goal is here.
I've just answered this to Ben, but let me reiterate: PSR-7's UriInterface
is only needed because PHP doesn't have an internal Url class. :)
Máté
The RFC states:

<snip> The Url\Url class is intentionally compatible with the PSR-7 UriInterface. </snip>

It mirrors the interface, but it can’t be swapped out for a UriInterface instance, especially since it can’t be extended, so I wouldn’t consider it compatible. I would still need to write a compatibility layer that composes Url\Url and implements UriInterface.

<snip> This makes it possible for a next iteration of the PSR-7 standard to use Url\Url directly instead of requiring implementations to provide their own Psr\Http\Message\UriInterface implementation. </snip>

Since PSRs are concerned with shared interfaces and this class is final and does not implement any interfaces, I’m not sure how you envision “a next iteration” of PSR-7 using this directly, unless what you mean is that UriInterface would be deprecated and applications would type directly against Url\Url.
Cheers,
Ben
As a maintainer of a PHP userland URI toolkit I have a couple of
questions/remarks on the proposal. First, I look forward to finally having
a real URL parser AND validator in PHP core. Any effort in that direction
is always welcome news.
As far as I understand it, if this RFC were to pass as is, it will model
PHP URLs on the WHATWG specification. While this specification is getting a
lot of traction lately, I believe it will restrict URL usage in PHP instead
of making developers' lives easier. While PHP started as a "web" language,
it is first and foremost a server side general purpose language. The WHATWG
spec, on the other hand, is created by browser vendors and is geared toward
browsers (client side), and because of browser history it restricts by
design a lot of what PHP developers can currently do using parse_url. In my
view the Url class in PHP should allow dealing with any IANA registered
scheme, which is not the case for the WHATWG specification.
Therefore, I would rather suggest we ALSO include support for the RFC 3986
and RFC 3987 specifications properly and give both specs a go (at the same
time!), with a clear way to instantiate your Url with one or the other
spec. In clear terms, my ideal situation would be to add at least 2 named
constructors to the parser, UrlParser::fromRFC3986 and
UrlParser::fromWHATWG, or something similar (names can be changed or
improved).

While this is an old article by Daniel Stenberg
(https://daniel.haxx.se/blog/2017/01/30/one-url-standard-please/), it
conveys with more in-depth analysis my issues with the WHATWG spec and its
usage in PHP if it were to be used as the ONLY available parser in PHP core
for URLs.
The PSR-7 relation is also unfortunate from my POV: PSR-7's UriInterface is
designed to be, at its core, an HTTP URI representation (so it shares the
same type of issue as the WHATWG spec!), meaning in the absence of a scheme
it falls back to HTTP scheme validation. This is why the interface can
forgo any nullable components: the HTTP spec allows it, other schemes do
not. For instance, the FTP scheme prohibits the presence of the query and
fragment components, which means they MUST be null in that case.
By removing PSR-7 constraints we could add:

- a Url::(get|to)Components method: it would mimic the parse_url return value and as such ease migration from parse_url
- Url::getUsername and Url::getPassword to access the username and password components individually. You would still use the withUserInfo method to update them, but you give the developer the ability to access both components directly from the Url object.

These additions would remove the need for:

- UrlParser::parseUrlToArray
- UrlParser::parseUrlComponent
- the UrlComponent enum
Cheers,
Ignace
Therefore, I would rather suggest we ALSO include support for the RFC 3986
and RFC 3987 specifications properly and give both specs a go (at the same
time!), with a clear way to instantiate your Url with one or the other
spec. In clear terms, my ideal situation would be to add at least 2 named
constructors to the parser, UrlParser::fromRFC3986 and
UrlParser::fromWHATWG, or something similar (names can be changed or
improved).
I agree that I would love to see a more general IRI parser, with maybe a URI parser being a subtype of an IRI parser.
Cheers,
Ben
Hey,

That's great that you've made the Url class readonly. Immutability is
reliable. And I fully agree that a better parser is needed.

I agree with the others that:
- the enum might be fine without the backing, if it's needed at all
- I'm not convinced a separate UrlParser is needed; Url::someFactory($str) should be enough
- getters seem unnecessary; they should only be added if you can be sure they are going to be used for compatibility with PSR-7
- treating $query as a single string is clumsy; having some kind of bag or at least an array to represent it would be cooler and easier to build and manipulate

I wanted to add that it might be more useful to make all the Url
constructor arguments optional, either nullable or with reasonable
defaults. So you could do things like:

    $url = new Url(path: 'robots.txt');
    foreach ($domains as $d) $r[] = file_get_contents($url->withHost($d));

Similar modifiers would be very useful for the query stuff, e.g.:

    $u = Url::current();
    return $u->withQueryParam('page', $u->queryParam->page + 1);

Sure, all of that can be done in userland as long as you drop final :)

BR,
Juris
[…] add a WHATWG compliant URL parser functionality to the standard library. The API itself is not final by any means, the RFC only represents how I imagined it first.
You can find the RFC at the following link: https://wiki.php.net/rfc/url_parsing_api
First-pass comments/thoughts.

As others have mentioned, it seems the class would/could not actually satisfy PSR-7. Realistically, the PSR-7 interface package or someone else would need to create a new class that combines the two, potentially as part of a transition away from it to the built-in class, with future PSRs building directly on Url. If we take that as a given, we might as well design for the end state and accept that there will be a (minimal) transition. This end state would benefit from being designed with the logical constraints of PSR-7 (so that migration is possible without major surprises), but without restricting us to its exact API shape, since an intermediary class would come into existence either way.

For example, Url could be a value class with merely 8 public properties. Possibly with a UrlImmutable subclass, akin to DateTime, where the properties are read-only (or instead a clone method could return Url?).

It might be more ergonomic to leave the parser as an implementation detail, allowing the API to be accessed from a single import rather than requiring two. This could look like Url::parse() or Url::parseFromString().

For the Url::parseComponent() method, did you consider accepting the existing PHP_URL_* constants? They appear to fit exactly, in naming, description, and associated return types.

Without UrlParser/UrlComponent, I'd adopt it directly in applications and frameworks. With them, further wrapping seems likely for improved usability. This is sometimes beneficial when exposing low-level APIs, but it seems like this is close to fitting in a single class, as demonstrated by the WHATWG URL API.

One thing I feel is missing is a method to parse a (partial) URL relative to another, e.g. to expand or translate paths between two URLs. Consider expanding "/w/index.php" or "index.php" relative to "https://wikipedia.org/w/", or expanding "//example.org" relative to either "https://wikipedia.org" vs "http://wikipedia.org". The WHATWG URL API does this in the form of a second optional string|Stringable parameter to Url::parse(). Implementing "expand URL" with parsing of incomplete URLs is error-prone and hard to get right. Including this would be valuable.

See also Net_URL2 and its resolve() method: https://pear.php.net/package/Net_URL2 https://github.com/pear/Net_URL2
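A sketch of what that two-argument form could look like in PHP, mirroring
new URL(input, base) from the WHATWG URL API (Url::parse() is a placeholder
name):

    $u1 = Url::parse('index.php', 'https://wikipedia.org/w/');
    // https://wikipedia.org/w/index.php
    $u2 = Url::parse('//example.org', 'https://wikipedia.org');
    // https://example.org/ (protocol-relative: scheme taken from the base)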
--
Timo Tijhof
https://timotijhof.net/
I was exploring wrapping ada_url for PHP
(https://github.com/lnear-dev/ada-url). It works, but it's a bit slower,
likely due to the implementation of the objects. I was planning to embed
the zvals directly in the object, similar to PhpToken, but I haven't had
the chance and don't really need it anymore. It shouldn't be too much work
to clean it up, though.
I’ve updated the implementation, and with Ada 2.9.0, the performance is now
closer to parse_url
for short URLs and even outperforms it for longer
URLs. You can see the benchmarks in the "Run benchmark script" section of
this GitHub Actions run.
cheers,
Lanre
Hi Máté
Something that I thought about lately is how the existing URL parser in PHP is used in various different places.
So for example, in the http fopen wrapper or in the filter extension we rely on the built-in URL parser.
I think it would be beneficial if a URL parser was "pluggable" and the url extension could be used instead of the current one for those usages (opt-in).
Kind regards
Niels
Hi Niels,
As mentioned before, I believe the "pluggable" system can only be applied once an RFC3986 URL object is available; using the WHATWG URL would constitute a major BC break. I would even go a step further and state that even with the RFC3986 URL object you would still face some issues, for instance with file-scheme URLs: those are not parsed the same way by the parse_url function as under RFC3986 rules.
That change may land in PHP 9, or the behaviour may be deprecated and removed in PHP 10, whenever that one happens.
Hi Ignace, Niels,
Sorry for being silent for so long; I was working hard on the implementation besides some summer activities :) I can say that I made really good progress in the last month, and now I think (hope) that I managed to address most of the concerns/suggestions people mentioned in this thread. To summarize the most important changes:
- The uriparser library is now used for parsing URIs based on RFC 3986.
- I renamed the extension from "url" to "uri" in order to make the name more generic and to express the new use case.
- There is no Url\UrlParser class anymore. The Uri\Uri class now includes the relevant factory methods.
- Uri\Uri is now an abstract class which is implemented by 2 concrete classes: Uri\Rfc3986Uri and Uri\WhatwgUri.
- WhatWG URL parsing now returns the exact error code according to the specification (although a reference parameter is used for now - but this is TBD).
- As suggested by Niels, it's now possible to plug a URI parsing implementation into PHP. A new uri.default_handler INI option is also added. Currently, integration is only implemented for FILTER_VALIDATE_URL though. The approach also makes it possible to register additional 3rd-party libraries for parsing URIs (like Ada URL).
- Performance seems to have improved significantly according to the rough benchmarks performed in CI.
Please re-read the RFC as it shares a bit more detail than my quick summary above: https://wiki.php.net/rfc/url_parsing_api
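To illustrate, basic usage now looks roughly like this (exact method names and signatures are still in flux and may differ from the final RFC):

    $uri = Uri\Rfc3986Uri::parse('https://example.com:8080/foo?bar=baz');
    echo $uri->getHost(); // example.com

    // WhatWg parsing reports the spec-defined error code, currently through
    // a by-ref parameter (still TBD, as noted above):
    $url = Uri\WhatwgUri::parse('https://exa mple.org/', null, $errors);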
There are some questions I still didn't manage to find an answer for, though. Most importantly, the URI parser libraries used don't support modification of the URI. That's why I had to get rid of the "wither" methods for now, which were originally part of the API. I think it's unfortunate, and I'll try my best to reclaim them.
Additionally, for technical reasons, extending the Uri\Uri class in userland is only possible if all the methods are overridden by the child. This is because I had to use "computed" properties in the implementation (roughly, they are stored in an internal C struct, unlike regular properties). That's why it may be better if userland code could use (and possibly implement) a Uri\Uri interface instead.
In one of my previous emails, I raised concerns about whether the RFC 3986 and WhatWg specs can really share the same interface (they do in my current implementation, despite being different classes). I still have this concern, because WhatWg specifies the "user" and "password" URL components, while RFC 3986 only specifies the notion of "userinfo" (which is usually just user:password, but not necessarily, as far as I understood). My implementation of the RFC 3986 parser currently splits the "userinfo" component at the ":" character, but doing so doesn't seem very spec-compliant.
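To illustrate with a concrete (hypothetical) example:

    // RFC 3986 only defines a single "userinfo" component; this one contains
    // two colons, so splitting at the first ":" is just a guess:
    $uri = Uri\Rfc3986Uri::parse('ftp://alice:se:cret@example.com/');
    // userinfo = "alice:se:cret"           (what RFC 3986 actually specifies)
    // user = "alice", password = "se:cret" (what the current split yields)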
Arnaud suggested that it would be better if the query parameters could be retrieved both escaped and unescaped after parsing. I haven't had time to investigate the possibilities, but my gut feeling is that this can only be achieved with some custom code. Arnaud also had questions regarding canonization. Currently, it's not performed when calling the __toString() method, because only the uriparser library supports this feature, and I didn't want the two implementations to diverge. I'm not even sure that always doing it is a good idea, so I'm thinking about making this feature selectively enabled (i.e. adding a separate "toCanonizedString" method).
Regards,
Máté
Máté, thanks for putting this together.
Whenever I need to work with URLs there are a few things missing that I would love to see incorporated into any change in PHP that brings us a spec-compliant parsing class.
First of all, I typically care most about WhatWG URLs because the PHP code I'm working with is making decisions about HTML that a browser will interpret. Paramount above all other concerns is that code on the server can understand content in the same way that browsers will; otherwise we invite security issues. People may have valid critiques of the WhatWG specification, but it's also the most relevant specification for users of much or most of the PHP code we write, and it's valuable because it allows us to talk about URLs in the same way a browser would.
I’m worried about the side-effects that having a global uri.default_handler could have with code running differently for no apparent reason, or differently based on what is calling it. If someone is writing code for a controlled system I could see this being valuable, but if someone is writing a framework like WordPress and has no control over the environments in which code runs, it seems dangerous to hope that every plugin and every host runs compatible system configurations. Nobody is going to check ini_get( ‘uri.default_handler’ )
before every line that parses URLs. Beyond this, even just allowing a pluggable parser invites broken deployments because PHP code that is reading from a browser or sending output to one needs to speak the language the browser is speaking, not some arbitrary language that’s similar to it.
One thing I feel is missing is a method to parse a (partial) URL relative to another
Being able to parse a relative URL and know if a URL is relative or absolute would help WordPress, which often makes decisions differently based on this property (for instance, when reading an href property of a link). I know these aren't spec-compliant URLs, but they still represent valid values for URL fields in HTML, and knowing if they are relative or not requires some amount of parsing specific details everywhere, vs. in a class that already parses URLs. Effectively, this would imply that PHP's new URL parser decodes document.querySelector( 'a' ).getAttribute( 'href' ), which should be the same as document.querySelector( 'a' ).href, and indicates whether it found a full URL or only a portion of one.
- $url->is_relative or $url->is_absolute
- $url->specificity = URL::Relative | URL::Absolute
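A minimal sketch of how either shape might read in practice (names are illustrative only):

    $url = URL::parse( '/wp-content/uploads/image.png' );
    if ( $url->is_relative ) {
        // resolve against the document's base URL before comparing hosts, etc.
    }

    // Or, with the second shape:
    if ( URL::Relative === $url->specificity ) { /* ... */ }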
the URI parser libraries used don't support modification of the URI
Having methods to add query arguments, change the path, etc. would be a great way to simplify user-space code working with URLs. For instance, read a URL and then add a query argument if some condition within the URL warrants it (for example, the path ends in .png).
Was it intended to add this to the RFC before it's finalized?
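For example, with hypothetical wither methods (names invented here for illustration), the .png case could read:

    $url = URL::parse( 'https://example.com/images/photo.png' );
    if ( str_ends_with( $url->path, '.png' ) ) {
        $url = $url->withQuery( 'format=webp' ); // returns a new instance
    }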
I would not make Url final. "OMG but then people can extend it!" Exactly.
My counter-point to this argument is that I see security exploits appear wherever functions that implement specifications are made pluggable and extendable. It's easy to see the need to create a class that limits possible URLs, but that also doesn't require extending a class: a class can wrap a URL parser just as it could extend one. Magic methods would make it even easier.
A problem that can arise with adding additional rules onto a specification like this is that the subclass gets used in more places than it should, and then somewhere some PHP code allows a malicious URL because it failed to parse and the inspection rules weren't applied.
Finally, I frequently find the need to be able to consider a URL in both the display context and the serialization context. With Ada we have normalize_url(), parse_search_params(), and the IDNA functions to convert between the two representations. In order to keep strong boundaries between security domains, it would be nice if PHP could expose the two variations: one is an encoded form of a URL that machines can easily parse, while the other is a "plain string" in PHP that's easier for humans to parse but which might not even be a valid URL. Part of the reason for this need is that I often see user-space code treating an entire URL as a single text span that requires one set of rules for full decoding; it's multiple segments that each have their own decoding rules.
- Original: https://xn--google.com/secret/../search?q=🍔
- $url->normalize(): https://xn--google.com/search?q=%F0%9F%8D%94
- $url->for_display(): https://䕮䕵䕶䕱.com/search?q=🍔
Having this in the RFC would give everyone the tools they need to effectively and safely set links within an HTML document.
All the best,
Dennis Snell
Hi Dennis,
Even though I didn't answer for a long time, I was improving my RFC implementation in the meantime, as well as evaluating your suggestions.
I’m worried about the side-effects that having a global
uri.default_handler could
have with code running differently for no apparent reason, or differently
based on what is calling it. If someone is writing code for a controlled
system I could see this being valuable, but if someone is writing a
framework like WordPress and has no control over the environments in which
code runs, it seems dangerous to hope that every plugin and every host runs
compatible system configurations. Nobody is going to check ini_get( 'uri.default_handler' )
before every line that parses URLs. Beyond this,
even just allowing a pluggable parser invites broken deployments
because PHP code that is reading from a browser or sending output to one
needs to speak the language the browser is speaking, not some arbitrary
language that’s similar to it.
You convinced me with your arguments about the issues a global uri.default_handler INI config can cause, especially after having read a blog post by Daniel Stenberg on the topic (https://daniel.haxx.se/blog/2022/01/10/dont-mix-url-parsers/). That's why I removed this from the RFC in favor of relying on configuring the parser at the individual feature level. However, I don't agree with removing the pluggable parser, for the following reasons:
- the current parse_url()-based parser is already doomed: it isn't compliant with any spec, so it already doesn't speak the language the browser is speaking
- even though the majority does, not everyone builds a browser application with PHP, especially because URIs are not necessarily accessible on the web
- in addition, there are tools which aren't compliant with the WhatWg spec but with some other one; most prominently, cURL is mostly RFC3986-compliant with some additional flavour of WhatWg, according to https://everything.curl.dev/cmdline/urls/browsers.html
That's why I intend to keep support for pluggability.
Being able to parse a relative URL and know if a URL is relative or absolute would help WordPress, which often makes decisions differently based on this property (for instance, when reading an href property of a link). I know these aren't spec-compliant URLs, but they still represent valid values for URL fields in HTML, and knowing if they are relative or not requires some amount of parsing specific details everywhere, vs. in a class that already parses URLs. Effectively, this would imply that PHP's new URL parser decodes document.querySelector( 'a' ).getAttribute( 'href' ), which should be the same as document.querySelector( 'a' ).href, and indicates whether it found a full URL or only a portion of one.
- $url->is_relative or $url->is_absolute
- $url->specificity = URL::Relative | URL::Absolute
The Uri\WhatWgUri::parse() method accepts a (relative) URI parameter when the 2nd (base URI) parameter is provided. So essentially you need to use this variant of the parse() method if you want to parse a WhatWg-compliant URL, and then WhatWgUri should let you know whether the originally passed-in URI was relative or not; did I get you right? This feature is certainly possible with RFC3986 URIs (even without the base parameter), but WhatWg requires the above-mentioned workaround for parsing, plus I have to look into how this can be implemented...
Having methods to add query arguments, change the path, etc. would be a great way to simplify user-space code working with URLs. For instance, read a URL and then add a query argument if some condition within the URL warrants it (for example, the path ends in .png).
I managed to retain support for the "wither" methods that were originally part of the proposal. This required custom code for the uriparser library, while the maintainer of Lexbor was kind enough to add native support for modification after I submitted a feature request. However, convenience methods for manipulating query parameters are still not part of the RFC, because they would increase the scope of the RFC even more, and due to other issues highlighted by Ignace in his prior email: https://externals.io/message/123997#124077. As I really want such a feature, I'd be eager to create a follow-up RFC dedicated to handling query strings.
My counter-point to this argument is that I see security exploits appear wherever functions that implement specifications are made pluggable and extendable. It's easy to see the need to create a class that limits possible URLs, but that also doesn't require extending a class: a class can wrap a URL parser just as it could extend one. Magic methods would make it even easier.
Right now, it's only possible to plug internal URI implementations into PHP; userland classes cannot be used, so this probably reduces the issue. However, I recently bumped into a technical issue with URIs not being final, which I am still assessing how to solve. More information is available in one of my comments on my PR:
https://github.com/php/php-src/pull/14461/commits/8e21e6760056fc24954ec36c06124aa2f331afa8#r1847316607
As I currently see the situation, it would probably be better to make these classes final so that similar unforeseen issues and inconsistencies cannot happen again (we can unfinalize them later anyway).
Finally, I frequently find the need to be able to consider a URL in both the display context and the serialization context. With Ada we have normalize_url(), parse_search_params(), and the IDNA functions to convert between the two representations. In order to keep strong boundaries between security domains, it would be nice if PHP could expose the two variations: one is an encoded form of a URL that machines can easily parse, while the other is a "plain string" in PHP that's easier for humans to parse but which might not even be a valid URL. Part of the reason for this need is that I often see user-space code treating an entire URL as a single text span that requires one set of rules for full decoding; it's multiple segments that each have their own decoding rules.
- Original: https://xn--google.com/secret/../search?q=🍔
- $url->normalize(): https://xn--google.com/search?q=%F0%9F%8D%94
- $url->for_display(): https://䕮䕵䕶䕱.com/search?q=🍔
Even though I didn't entirely implement this suggestion, I added normalization support:
- the normalize() method can be used to create a new URI instance whose components are normalized based on the current object
- the toNormalizedString() method can be used when only the normalized string representation is needed
- the newly added equalsTo() method also makes use of normalization to better identify equal URIs
For more information, please refer to the relevant section of the RFC: https://wiki.php.net/rfc/url_parsing_api#api_design. The forDisplay() method also seems useful at first glance, but since it may be a controversial optional feature, I'd defer it for later...
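Usage could look roughly like this (output shown as I would expect it, pending the final implementation):

    $a = Uri\WhatwgUri::parse('https://EXAMPLE.com/a/../b');
    $b = Uri\WhatwgUri::parse('https://example.com/b');

    echo $a->toNormalizedString(); // presumably "https://example.com/b"
    var_dump($a->equalsTo($b));    // presumably bool(true), via normalization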
Regards,
Máté
I'm not fluent enough in the different parsing styles to comment on the difference there.
I do have concerns about the class design, though. Given the improvements to the language, the accessor methods offer zero benefit at all. Public-read properties (readonly or otherwise) would be faster and offer no less of a guarantee. If you want to allow someone to extend the class and provide some custom logic, use asymmetric visibility instead of readonly, and extenders can use property hooks instead of the methods. The getters don't offer any value anymore.
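A minimal sketch of that approach with PHP 8.4 asymmetric visibility and property hooks (the class shape here is illustrative, not the RFC's):

    class Uri
    {
        // Public read, private write: no getter method needed.
        public private(set) string $scheme;

        // Custom logic via a property hook instead of an accessor method;
        // inside the hook, $this->host reads the backing value.
        public string $host {
            get => strtolower($this->host);
        }
    }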
It took me a while to realize that, I think, the fromWhatWg() method is using an in/out parameter for error handling. That is an insta-no on my part. In/out reference parameters make sense in C, maybe C++, and basically nowhere else; I view them as a code smell everywhere they're used in PHP. Better alternatives include exceptions or union returns.
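To contrast the two styles (all names hypothetical, sketching the alternatives rather than the RFC's actual API):

    // in/out reference parameter:
    $url = Uri\WhatwgUri::fromWhatWg($input, $errors);

    // exception-based alternative:
    try {
        $url = Uri\WhatwgUri::fromWhatWg($input);
    } catch (Uri\WhatwgParseException $e) {
        // $e->errors carries the spec-defined error codes
    }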
It looks like you've removed the with*() methods. Why? That means it cannot be used as a builder mechanism, which is plenty valuable. (Though query as a string vs. an array could be an issue.)
The WhatWgError looks to me like it's begging to be an Enum.
I am confused by the new ini value. It's for use in cases where you're NOT parsing the URL yourself, but relying on some other extension that does URL parsing internally as a side effect?
As usual, I am not a fan of an ini setting, but I cannot think of a different alternative off hand.
--Larry Garfield