From: kocsismate90@gmail.com (Máté Kocsis)
Date: Sat, 28 Dec 2024 14:42:29 +0100
Subject: Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API
To: ignace nyamagana butera
Cc: internals@lists.php.net

Hi Ignace,

Thank you for your efforts!

> Specifically for RFC3986Uri I see that the only difference between the
> `parse` named constructor and the constructor is that the former will
> return `null` instead of throwing an exception. But it is not clear if both
> methods can work with partial URI. What is the expected result of
>
> new Rfc3986Uri('?query#fragment');

As you supposed, Uri\Rfc3986Uri can parse such a relative URI no matter
which method is used, while Uri\WhatWgUri will throw an exception/return
null. That's why I'm still evaluating the possibility of calling the latter
class "URL" in order to make it clear that the scheme is required.

The naming question initially came up during an internal PHP Foundation
discussion where Tim proposed that the auxiliary WHATWG related classes
(WhatWgError, WhatWgErrorType) should be put into a separate Uri\WhatWg
sub-namespace. However, it was not clear to me whether it's a good idea to
also put the main URI representations into their respective sub-namespaces
(so that we would have Uri\Rfc3986\Uri and Uri\WhatWg\Uri), because one
would then need an alias to use both classes in the same file. Nor do I
like the idea of using Uri\Rfc3986\Rfc3986Uri and Uri\WhatWg\WhatWgUri,
because that is completely inconsistent with the latest practices.
That's why I'm now leaning towards using Uri\Rfc3986\Uri and
Uri\WhatWg\Url: this way, there's a very clear distinction about the
expected URI format, while the classes can be put into separate namespaces
without a class name clash. Additionally, class names would become shorter
and easier to write and comprehend.

> I also think that the RFC should emphasize that the RFC3986 URI is only
> **parsing** the URI and not validating the URI like the WHATWGUri
> counterpart. the following URI will pass without issue
>
> new Rfc3986('https:example.com');
>
> this is a valid RFC3986 URI but it is clearly not a valid http URL.

Hm, thanks again for finding this gotcha. Yes, this is also a difference
between the two specifications: while RFC 3986 will resolve example.com as
a path (since only a "//" after the scheme would indicate that example.com
is part of the authority component), WHATWG will automatically resolve the
input URI as "https://example.com/", making it a valid HTTP URL in fact.
Fortunately, the behavior of both classes is in line with their respective
specifications. In case of RFC 3986, the spec says:

    A parser of the generic URI syntax can parse any URI reference into
    its major components.  Once the scheme is determined, further
    scheme-specific parsing can be performed on the components.  In other
    words, the URI generic syntax is a superset of the syntax of all URI
    schemes.

So the underlying parser doesn't do the scheme-specific processing -- which
is understandable. IMO that's why it's useful to allow the extension of URI
classes so that the child implementations can do further processing at
will. Alternatively, I could imagine adding support for scheme-specific
processors: i.e. an array of Uri\SchemeProcessor instances could be passed
to URIs, and the methods of the processor matching the URI's scheme would
be executed when necessary (during parsing, normalization, etc.).
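To illustrate the generic split for the two inputs discussed above, here is a minimal sketch built on the regular expression given in RFC 3986, Appendix B. It is not the proposed implementation: splitUriReference() and hasHttpAuthority() are hypothetical helpers, and the second one only stands in for the kind of check a scheme-specific step could perform.

```php
<?php
// Generic component split using the regular expression from RFC 3986,
// Appendix B. Hypothetical helper, not part of the proposed API.
function splitUriReference(string $reference): array
{
    $pattern = '~^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?$~';
    preg_match($pattern, $reference, $m, PREG_UNMATCHED_AS_NULL);

    return [
        'scheme'    => $m[2] ?? null,
        'authority' => $m[4] ?? null,
        'path'      => $m[5] ?? '',
        'query'     => $m[7] ?? null,
        'fragment'  => $m[9] ?? null,
    ];
}

// Hypothetical scheme-specific step: treat an http(s) URI as usable only
// if the generic split found a non-empty authority component.
function hasHttpAuthority(array $parts): bool
{
    $scheme = strtolower($parts['scheme'] ?? '');
    if ($scheme !== 'http' && $scheme !== 'https') {
        return false;
    }

    return ($parts['authority'] ?? '') !== '';
}

$relative = splitUriReference('?query#fragment');
// scheme: null, authority: null, path: '', query: 'query', fragment: 'fragment'

$noSlashes = splitUriReference('https:example.com');
// scheme: 'https', authority: null, path: 'example.com'

var_dump(hasHttpAuthority($noSlashes));                                 // bool(false)
var_dump(hasHttpAuthority(splitUriReference('https://example.com/')));  // bool(true)
```

The generic split accepts both inputs; only the extra scheme-aware check notices that "https:example.com" has no authority.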
Scheme-specific processors are a possible rabbit hole again, so I don't
want to include them in the current proposal, but I think it's an
interesting possibility.

Another topic I wanted to bring up is encoding and decoding of URI
components. This problem was found by Arnaud during an offline discussion.
Let me quote my interpretation of his words that I added to the RFC a few
days ago
(https://wiki.php.net/rfc/url_parsing_api#how_special_characters_are_handled):

> Encoding and decoding special characters is a crucial aspect of URI
> parsing. For this purpose, both RFC 3986 and WHATWG use percent-encoding
> (i.e. the % character is encoded as %25). However, the two standards
> differ significantly in this regard:
>
> RFC 3986 defines that “URIs that differ in the replacement of an
> unreserved character with its corresponding percent-encoded US-ASCII
> octet are equivalent”, which means that percent-encoded characters and
> their decoded form are equivalent. On the contrary, WHATWG defines URL
> equivalence by the equality of the serialized URLs, and never decodes
> percent-encoded characters, except in the host. This implies that
> percent-encoded characters are not equivalent to their decoded form
> (except in the host).
>
> The difference between RFC 3986 and WHATWG comes from the fact that the
> point of view of a maintainer of the WHATWG specification is that
> webservers may legitimately choose to consider encoded and decoded paths
> distinct, and a standard cannot force them not to do so. This is a
> substantial BC break compared to RFC 3986, and it is actually a source of
> confusion among users of the WHATWG specification based on the large
> number of tickets related to this question.

Currently, we are brainstorming how to best resolve this problem.
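To make the equivalence difference concrete, here is a small sketch. It assumes a hypothetical helper, normalizePercentEncoding(), that applies the RFC 3986 rule quoted above (decode percent-encoded unreserved characters, uppercase the hex digits of the rest). The WHATWG side is only approximated by comparing the input strings as-is; no WHATWG parser is invoked.

```php
<?php
// Hypothetical helper: decodes percent-encoded unreserved characters
// (ALPHA / DIGIT / "-" / "." / "_" / "~") and uppercases the hex digits
// of every other percent-encoded octet.
function normalizePercentEncoding(string $component): string
{
    return preg_replace_callback(
        '/%([0-9A-Fa-f]{2})/',
        static function (array $m): string {
            $char = chr(hexdec($m[1]));

            return preg_match('/^[A-Za-z0-9._~-]$/', $char) === 1
                ? $char
                : '%' . strtoupper($m[1]);
        },
        $component
    );
}

$a = '/%7Euser/a%2fb';
$b = '/~user/a%2Fb';

// RFC 3986 view: the two paths are equivalent once normalized.
var_dump(normalizePercentEncoding($a) === normalizePercentEncoding($b)); // bool(true)

// Stand-in for the WHATWG view: compare the strings without decoding.
var_dump($a === $b); // bool(false)
```

Note that "%2F" stays encoded in both views: "/" is a reserved character, so decoding it would change the path structure.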
It is very important to specify exactly what kind of representation people
should expect when they invoke a getter, so Arnaud suggested that we should
have a fine-grained API by adding a $mode enum parameter to the getters
with the following possible values:

> - ComponentMode::Raw: return the raw value, exactly as the component is
>   represented in the URL (as if we just returned a substr() of the url)
> - ComponentMode::PercentDecoded: Raw, but every percent-encoded character
>   is decoded
> - ComponentMode::WhatWGNormalized and RFC3986Normalized: The value
>   normalized exactly as specified in the specs. This may or may not
>   percent-decode (or do so partially), it depends on the spec. There are
>   two different modes for that because the specs do not agree on how to
>   normalize, and the consumer may want to rely on one or the other.
>   Although the URI could infer which mode to use based on what parser was
>   used. I don't know which is more useful.
> - ComponentMode::PercentDecodedNormalized: This one is wrong if we have
>   more than one normalization mode, but I think that we should at least
>   have a mode that combines percent-decoding and normalization.

I'm not yet sure I prefer this idea, and there are surely technical issues
with it (as far as I see now, doing so would require double the amount of
memory per object compared to what is currently needed). Of course, if we
didn't have a common interface, then this would be much less of a
problem... So getting rid of the interface would also be an option, because
aligning both specifications behind the same interface seems more and more
difficult as I gain more insight into the edge cases. On the other hand,
I'm not sure it's a good outcome that PHP users would have to explicitly
choose whether their code uses RFC 3986 or WHATWG (and possibly have to
convert URIs back and forth between the two specifications).
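To make the Raw and PercentDecoded modes quoted above a bit more tangible, here is a rough sketch. Everything in it is hypothetical: the enum below only borrows the suggested name, readComponent() is an illustrative function rather than a proposed getter, and the normalization modes are left out because their behavior depends on the respective spec.

```php
<?php
// Hypothetical sketch only: neither this enum nor readComponent() is the
// API of the RFC. Only two of the suggested modes are modelled.
enum ComponentMode
{
    case Raw;
    case PercentDecoded;
}

function readComponent(string $rawComponent, ComponentMode $mode = ComponentMode::Raw): string
{
    return match ($mode) {
        // Return the component exactly as it appears in the input.
        ComponentMode::Raw => $rawComponent,
        // Decode every percent-encoded octet, reserved characters included.
        ComponentMode::PercentDecoded => rawurldecode($rawComponent),
    };
}

$path = '/foo%2Fbar%20baz';

echo readComponent($path), "\n";                                 // /foo%2Fbar%20baz
echo readComponent($path, ComponentMode::PercentDecoded), "\n";  // /foo/bar baz
```

The decoded form can no longer distinguish the encoded "%2F" from a real path separator, which is one reason the caller needs to know exactly which representation a getter returns.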

Regards,
Máté