[RFC] [Discussion] Add WHATWG compliant URL parsing API

7 months ago by ignace nyamagana butera — view source

Hi Maté,

I finally found the time to review the proposed API, and I experimented with a PHP userland polyfill for RFC3986Uri to test the waters and check that I understood everything.

First things first: the API is really well thought out, and at least for me and my League/Uri package it is easy to
find my way around it and use it if needed. Having said that, a few questions came up during implementation.

Specifically for RFC3986Uri, I see that the only difference between the parse named constructor and the constructor is that the former returns null instead of throwing an exception. But it is not clear whether both methods can work with partial URIs. What is the expected result of

new Rfc3986Uri('?query#fragment');

Will the class throw an exception because of the missing base URI, or will parsing still occur and return a new URI instance? Whatever the answer, I think it should be clearly stated in the RFC, because from the look of it one may think that partial parsing, which is used a lot, is no longer supported and that the URI must always be absolute. If partial parsing is in fact no longer supported, maybe a distinct method should be added to support that scenario. AFAIK, WHATWGUri does not support partial URIs and always requires an absolute URI. So maybe adding a distinct named constructor specifically for RFC3986 would make the code easier to understand and reduce surprises in usage?
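To make the use case concrete, here is a minimal sketch of the "partial parsing" that is common today. It relies on the existing parse_url() function only; it does not use the proposed Rfc3986Uri class, whose behavior for this input is exactly the open question.

```php
<?php
declare(strict_types=1);

// parse_url() is the existing built-in, used here as a stand-in.
// A relative reference yields only the components that are present.
$parts = parse_url('?query#fragment');

var_dump(isset($parts['scheme'])); // bool(false): no scheme was given
var_dump($parts['query']);         // string(5) "query"
var_dump($parts['fragment']);      // string(8) "fragment"
```

Code written against this behavior would need an equivalent entry point in the new API, whichever constructor ends up providing it.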

I also think that the RFC should emphasize that the RFC3986 URI class only parses the URI and does not validate it the way its WHATWGUri counterpart does. The following URI will pass without issue:

new Rfc3986Uri('https:example.com');

This is a valid RFC3986 URI, but it is clearly not a valid HTTP URL.
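The difference can be seen with the existing parse_url() function as a stand-in. This is an assumption made for illustration: parse_url() is not a strict RFC 3986 parser and the proposed class is not used here, but on these two inputs it splits the components the way the generic syntax does. Without "//" there is no authority, so example.com lands in the path and there is no host to connect to.

```php
<?php
declare(strict_types=1);

// Stand-in for a generic parser: no "//" means no authority component.
$withoutAuthority = parse_url('https:example.com');
$withAuthority    = parse_url('https://example.com');

var_dump($withoutAuthority); // scheme "https", path "example.com", no host
var_dump($withAuthority);    // scheme "https", host "example.com"
```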

7 months ago by kocsismate90@gmail.com — view source

Hi Ignace,

Thank you for your efforts!

> Specifically for RFC3986Uri I see that the only difference between the
> parse named constructor and the constructor is that the former will
> return null instead of throwing an exception. But it is not clear if both
> methods can work with partial URI. What is the expected result of
>
> new Rfc3986Uri('?query#fragment');

As you supposed, Uri\Rfc3986Uri can parse such a relative URI no matter
which method is used, while Uri\WhatWgUri will throw an exception/return
null. That's why I'm still evaluating the possibility of calling the latter
class "URL" in order to make it clear that the scheme is required.

The naming question initially came up during an internal PHP Foundation
discussion where Tim proposed that the auxiliary WHATWG related classes
(WhatWgError, WhatWgErrorType) should be put into a separate Uri\WhatWg sub
namespace. However, it was not clear to me whether it's a good idea to
also put the main URI representations into their respective sub
namespaces (so that we would have Uri\Rfc3986\Uri and Uri\WhatWg\Uri),
because then one would need an alias to use both classes in the same
file. Nor do I like the idea of using Uri\Rfc3986\Rfc3986Uri
and Uri\WhatWg\WhatWgUri, because that is completely inconsistent with the
latest practices. That's why I'm now leaning towards using Uri\Rfc3986\Uri
and Uri\WhatWg\Url: this way, there's a very clear distinction about the
expected URI format, while the classes can be put into
separate namespaces without a class name clash. Additionally, class names
would become shorter and easier to write and comprehend.

> I also think that the RFC should emphasize that the RFC3986 URI is only
> parsing the URI and not validating the URI like the WHATWGUri
> counterpart. The following URI will pass without issue
>
> new Rfc3986Uri('https:example.com');
>
> This is a valid RFC3986 URI but it is clearly not a valid HTTP URL.

Hm, thanks again for finding this gotcha. Yes, this is also a difference
between the two specifications: RFC3986 treats example.com as a
path (only "//" after the scheme would indicate that example.com is part
of the authority component), whereas WHATWG automatically resolves the input
URI to "https://example.com/", which in fact makes it a valid HTTP URL.
Fortunately, the behavior of both classes is in line with their respective
specifications. In case of RFC 3986, the spec says:

A parser of the generic URI syntax can parse any URI reference into
its major components. Once the scheme is determined, further
scheme-specific parsing can be performed on the components. In other
words, the URI generic syntax is a superset of the syntax of all URI
schemes.

So the underlying parser doesn't do the scheme-specific processing, which
is understandable. IMO that's why it's useful to allow the extension of URI
classes, so that child implementations can do further processing at
will. Alternatively, I could imagine adding support for scheme-specific
processors: i.e. an array of Uri\SchemeProcessor instances
could be passed to URIs, and the methods of the processor matching the
URI's scheme would be executed when necessary (during parsing,
normalization, etc). This is a possible rabbit hole again, so I don't want
to include this in the current proposal, but I think it's an
interesting possibility.
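A rough sketch of how such a two-step design could look, purely as an illustration. Everything here is hypothetical: the thread only names Uri\SchemeProcessor, so the method names, the HttpsProcessor class and both helper functions are assumptions. The generic step uses the regular expression from RFC 3986, Appendix B; the scheme-specific step then rejects an https URI without an authority.

```php
<?php
declare(strict_types=1);

/** Hypothetical contract; the thread names the interface but defines no methods. */
interface SchemeProcessor
{
    public function scheme(): string;

    /** @param array<string, ?string> $components */
    public function validate(array $components): void;
}

/** Illustrative rule: the https scheme requires a non-empty authority. */
final class HttpsProcessor implements SchemeProcessor
{
    public function scheme(): string
    {
        return 'https';
    }

    public function validate(array $components): void
    {
        if (($components['authority'] ?? '') === '') {
            throw new InvalidArgumentException('An https URI requires a host.');
        }
    }
}

/** Generic split using the regular expression from RFC 3986, Appendix B. */
function splitGeneric(string $uri): array
{
    preg_match(
        '~^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?$~s',
        $uri,
        $m,
        PREG_UNMATCHED_AS_NULL
    );

    return [
        'scheme'    => $m[1] ?? null,
        'authority' => $m[2] ?? null,
        'path'      => $m[3] ?? '',
        'query'     => $m[4] ?? null,
        'fragment'  => $m[5] ?? null,
    ];
}

/** Generic parsing first; then only the processor matching the scheme runs. */
function parseWithProcessors(string $uri, SchemeProcessor ...$processors): array
{
    $components = splitGeneric($uri);
    $scheme = strtolower($components['scheme'] ?? '');
    foreach ($processors as $processor) {
        if ($processor->scheme() === $scheme) {
            $processor->validate($components);
        }
    }

    return $components;
}

// Without a processor the generic parser accepts the input as a path.
var_dump(parseWithProcessors('https:example.com')['path']); // string(11) "example.com"
```

Passing new HttpsProcessor() as a second argument makes the same call throw an InvalidArgumentException, while 'https://example.com/' still parses.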

Another topic I wanted to bring up is encoding and decoding of URI
components. This problem was found by Arnaud during an offline discussion.
Let me quote my interpretation of his words that I added to the RFC a few
days ago
(https://wiki.php.net/rfc/url_parsing_api#how_special_characters_are_handled):

Encoding and decoding special characters is a crucial aspect of URI parsing.

For this purpose, both RFC 3986 and WHATWG use percent-encoding
(https://en.wikipedia.org/wiki/Percent-encoding), i.e. the % character is
encoded as %25. However, the two standards differ significantly in this
regard:

RFC 3986 defines that “URIs that differ in the replacement of an
unreserved character with its corresponding percent-encoded US-ASCII octet
are equivalent”, which means that percent-encoded characters and their
decoded form are equivalent. On the contrary, WHATWG defines URL equivalence
by the equality of the serialized URLs, and never decodes percent-encoded
characters, except in the host. This implies that percent-encoded
characters are not equivalent to their decoded form (except in the host).

The difference between RFC 3986 and WHATWG comes from the fact that, in the
view of a maintainer of the WHATWG specification, webservers
may legitimately choose to consider encoded and decoded paths distinct, and
a standard cannot force them not to do so
(https://github.com/whatwg/url/issues/606#issuecomment-926395864). This
is a substantial BC break compared to RFC 3986, and, judging by the
large number of tickets related to this question, it is a
source of confusion among users of the WHATWG specification.
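A minimal sketch of the two notions of equivalence, under stated assumptions: normalizePercentEncoding() is a hypothetical userland helper (not part of the proposal) that decodes only unreserved characters as RFC 3986 describes, and a plain string comparison stands in for WHATWG's "equality of the serialized URLs", since the WHATWG parser leaves these path escapes untouched.

```php
<?php
declare(strict_types=1);

/**
 * Decodes percent-encoded octets only when they represent an unreserved
 * character (ALPHA / DIGIT / "-" / "." / "_" / "~") and uppercases the hex
 * digits of the remaining escapes (RFC 3986, section 6.2.2).
 */
function normalizePercentEncoding(string $component): string
{
    return (string) preg_replace_callback(
        '/%([0-9A-Fa-f]{2})/',
        static function (array $m): string {
            $char = chr((int) hexdec($m[1]));

            return preg_match('/^[A-Za-z0-9._~-]$/D', $char) === 1
                ? $char
                : '%' . strtoupper($m[1]);
        },
        $component
    );
}

$encoded = '/%7Euser/a%2fb';
$decoded = '/~user/a%2Fb';

// RFC 3986 view: equivalent once unreserved characters are decoded.
var_dump(normalizePercentEncoding($encoded) === normalizePercentEncoding($decoded)); // bool(true)

// Serialized-string view (standing in for the WHATWG rule): not equal.
var_dump($encoded === $decoded); // bool(false)
```

Note that the reserved "/" stays encoded as %2F in both views; only the unreserved "~" is treated as interchangeable with %7E.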

Currently, we are brainstorming how to best resolve this problem. It is
very important to specify exactly what kind of representation people should
expect when they invoke a getter, so Arnaud suggested that we should have a
fine-grained API by adding a $mode enum parameter to the getters with the
following possible values:

ComponentMode::Raw: return the raw value, exactly as the component is
represented in the URL (as if we just returned a substr() of the URL)

ComponentMode::PercentDecoded: Raw, but every percent-encoded character
is decoded

ComponentMode::WhatWGNormalized and RFC3986Normalized: The value
normalized exactly as specified in the specs. This may or may not
percent-decode (or do so partially), it depends on the spec. There are two
different modes for that because the specs do not agree on how to
normalize, and the consumer may want to rely on one or the other. Although
the URI could infer which mode to use based on what parser was used. I
don't know which is more useful.

ComponentMode::PercentDecodedNormalized: This one is wrong if we have
more than one normalization mode, but I think that we should at least have a
mode that combines percent-decoding and normalization.
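To illustrate the idea, here is a reduced sketch covering only the Raw and PercentDecoded modes, which need no spec-specific normalizer. It assumes PHP 8.1+ for enums; the readComponent() helper is hypothetical and simply takes the component as it appears in the URI, rather than being a getter on a URI object.

```php
<?php
declare(strict_types=1);

// Hypothetical, reduced to two of the modes listed above.
enum ComponentMode
{
    case Raw;
    case PercentDecoded;
}

/** Hypothetical helper: $raw is the component exactly as it appears in the URI. */
function readComponent(string $raw, ComponentMode $mode): string
{
    return match ($mode) {
        ComponentMode::Raw => $raw,
        ComponentMode::PercentDecoded => rawurldecode($raw),
    };
}

var_dump(readComponent('/a%20b%2Fc', ComponentMode::Raw));            // string(10) "/a%20b%2Fc"
var_dump(readComponent('/a%20b%2Fc', ComponentMode::PercentDecoded)); // string(6) "/a b/c"
```

The second result shows why the mode has to be explicit: once decoded, an encoded "%2F" can no longer be told apart from a literal path separator.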

I'm not yet sure I prefer this idea, and there are surely technical issues
with it (as far as I can see now, doing so would require double
the amount of memory per object compared to what is currently needed). Of
course, if we didn't have a common interface, then this would be much less
of a problem... So getting rid of the interface would also be an option,
because trying to align both specifications behind
the same interface seems more and more difficult as I gain more
insight into the edge cases. On the other hand, I'm not sure it's a good
outcome that PHP users would have to explicitly choose whether their code
uses RFC 3986 or WHATWG (and possibly have to convert URIs back
and forth between the two specifications).

Regards,
Máté

7 months ago by ignace nyamagana butera — view source

Hi Máté,
Thanks for the thorough explanation of where the RFC is at the moment.
My biggest takeaway from your explanation is that we are currently
trying to unify something that IMHO cannot be unified.

RFC3986 is a parsing RFC whose goal is to lay down the foundation for
other RFCs to validate scheme-specific URIs. It needs to be generic, with
all the caveats that come from being a generic specification.

The goal of the WHATWG URL standard is to almost always succeed, or to never fail,
depending on how one sees it. The specification is geared toward
parsing, validating and normalizing URLs.
The goal is to allow the HTTP client to rescue most if not all HTTP
calls, even those that were badly formed.

Since the end goals of the two specifications differ, they are bound to treat
URIs, their means to those goals, differently. Creating a
system which can, at the same time, serve the goals of servers (RFC3986) and
clients (WHATWG) is impossible at the moment, IMHO. In the end we will
either displease one side or both, most certainly confuse the PHP
developer, and provide an unnecessarily complex feature that no one will
want to use.

IMHO the current single shared interface is a premature optimization. RFC3986
should have its own interface or base class, and the same is true for the
WHATWG URL.

Once this is clear, I think all the other issues are resolved:

Comparing URLs of different types becomes meaningless (it should throw or
always return false).

Comparing two URLs of the same type should no longer suffer from the
encoding/decoding issues (you no longer need to deploy a complex and
somewhat hard to debug/understand encoding system that does not exist in
any other language).

use Uri\Rfc3986Uri;
use Uri\WhatWgUri;

new Rfc3986Uri("http://example.com")->equals(new WhatWgUri("http://ExAMple.com"));
// should return false or throw
new Rfc3986Uri("http://example.com")->equals(new Rfc3986Uri("http://ExAMple.com"));
// should return true
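As a runnable illustration of the same-type case, here is a minimal sketch that uses the existing parse_url() as a stand-in for the proposed parser (an assumption; the equals() method above is not used). It lowercases only the scheme and host, which RFC 3986 (section 6.2.2.1) treats as case-insensitive, and compares everything else as-is. The equalsAfterCaseNormalization() name is hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Same-type comparison sketch: scheme and host are case-insensitive,
 * every other component is compared byte for byte.
 */
function equalsAfterCaseNormalization(string $a, string $b): bool
{
    $normalize = static function (string $uri): array {
        $parts = parse_url($uri);
        if ($parts === false) {
            throw new InvalidArgumentException("Cannot parse: $uri");
        }
        foreach (['scheme', 'host'] as $key) {
            if (isset($parts[$key])) {
                $parts[$key] = strtolower($parts[$key]);
            }
        }

        return $parts;
    };

    return $normalize($a) === $normalize($b);
}

var_dump(equalsAfterCaseNormalization('http://example.com', 'http://ExAMple.com'));     // bool(true)
var_dump(equalsAfterCaseNormalization('http://example.com/A', 'http://example.com/a')); // bool(false)
```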

Last but not least, one of the biggest issues with parse_url is that it
lacks support for i18n domain names, and it seems that the current
implementation of Rfc3986Uri does not support them either. I would
expect a class that supports RFC3986/RFC3987 out of the box or that uses
an enum to specify which RFC needs to be followed.

new Uri\Rfc3986\Uri(uri: "path?query", base: "https://example.com", version: Uri\Ietf::Rfc3987);

What do you think? IMHO RFC3987 should be the default value, to allow
most people on earth to safely use URLs without having to explicitly
specify the RFC used for parsing. Again, the WHATWG URL does not suffer
from this issue, but that's because it was built with this in mind, which
was not the case for RFC3986 until RFC3987 came along!

Best regards,
Ignace