Newsgroups: php.internals
Xref: news.php.net php.internals:126546
From: kocsismate90@gmail.com (Máté Kocsis)
Date: Sun, 2 Mar 2025 23:00:08 +0100
Subject: Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API
To: Tim Düsterhus
Cc: Internals

Hi Tim,

Thank you again for the thorough review!

> The naming of these methods seems to be a little inconsistent. It should
> either be:
>
>      ->getHostForDisplay()
>      ->toStringForDisplay()
>
> or
>
>      ->getDisplayHost()
>      ->toDisplayString()
>
> but not a mix between both of them.

Yes, I completely agree with your concern. I'm just not sure yet which
combination I'd prefer. Probably the latter one?

> Yes. Besides the remark above, my previous arguments still apply (e.g.
> `with()`ers not being able to construct instances for subclasses,
> requiring to override all of them). I'm also noticing that serialization
> is unsafe with subclasses that add a `$__uri` property (or perhaps any
> property at all?).

Hm, yes, you are indeed right that withers cannot really create new instances
on their own, because the whole URI string is needed to instantiate a new
object, and that string is only accessible if it is reconstructed by swapping
the relevant component with its new value.

Please note that trying to serialize a $__uri property will result in an exception.

1.

> The `toDisplayString()` method that you mentioned above is not in the
> RFC. Did you mean `toHumanFriendlyString()`? Which one is correct?

The toHumanFriendlyString() method is a leftover from a previous version of the
proposal; I have since renamed it to toDisplayString().

2.

> The example output of the `$errors` array does not match the stub. It
> contains a `failure` property, should that be `softError` instead?

The $softError property is also an outdated name: I recently changed it to $failure
to be consistent with the wording that the WHATWG specification uses.

3.

> The RFC states "When trying to instantiate a WHATWG Url via its
> constructor, a Uri\InvalidUriException is thrown when parsing results in
> a failure."
>
> What happens for Rfc3986 when passing an invalid URI to the constructor?
> Will an exception be thrown? What will the error array contain? Is it
> perhaps necessary to subclass Uri\InvalidUriException for use with
> WhatWgUrl, since `$errors` is not applicable for 3986?

The first two questions are answered right at the top of the parsing section:

"the constructor: It expects a URI, and optionally, a base URL in order to support reference resolution.
When parsing is unsuccessful, a Uri\InvalidUriException is thrown."

The $errors property will contain an empty array though, as you supposed. I don't see much of a problem
with using the same exception in both cases; however, I'm also fine with making the $errors property
nullable in order to indicate that returning errors is not supported by the implementation triggering
the error.

4.

> The RFC does not specify when `UninitializedUriException` is thrown.

That's a very good catch! I completely forgot about some exceptions. This one is used
to indicate that a URI is not correctly initialized: when a URI instance is created
without actually invoking the constructor, the parse method, or __unserialize(),
any method that tries to use the internally stored URI will trigger this exception.
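
The uninitialized case can be illustrated outside PHP. What follows is a minimal Python sketch, not the proposed API: the `Uri` class, its `to_string()` method and `UninitializedUriError` are hypothetical names invented for this illustration. It shows an object allocated without running its constructor, and a method that refuses to operate on the missing internal state.

```python
class UninitializedUriError(Exception):
    """Hypothetical stand-in for the RFC's UninitializedUriException."""


class Uri:
    """Hypothetical URI wrapper; only the initialization check matters here."""

    def __init__(self, uri: str) -> None:
        self._uri = uri  # the internally stored URI

    def to_string(self) -> str:
        # Any method that needs the stored URI first checks that it exists.
        if not hasattr(self, "_uri"):
            raise UninitializedUriError("the URI was never initialized")
        return self._uri


# Allocating the object without calling __init__ leaves it uninitialized.
blank = Uri.__new__(Uri)
try:
    blank.to_string()
except UninitializedUriError as exc:
    print(f"caught: {exc}")
```

In PHP the analogous situation would presumably arise through mechanisms such as reflection-based instantiation without a constructor; the sketch only mirrors the guard described above.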


5.

> The RFC does not specify when `UriOperationException` is thrown.

6.

> Generally speaking I believe it would help understanding if you would
> add a `/** @throws InvalidUriException */` to each of the methods in the
> stub to make it clear which ones are able to throw (e.g. resolve(), or
> the withers). It's harder to find this out from “English” rather than
> “code” :-)

Good idea! I've added the PHPDoc as well as created a dedicated "Exceptions"
section.

7.

> In the “Component retrieval” section: Please add even more examples of
> what kind of percent-decoding will happen. For example, it's important
> to know if `%26` is decoded to `&` in a query-string. Or if `%3D` is
> decoded to `=`. This really is the same case as with `%2F` in a path.
> The explanation

Thanks for calling these cases out; I've significantly reworked the relevant sections.
First of all, I added many more details to the general overview of percent-encoding
(https://wiki.php.net/rfc/url_parsing_api#percent-encoding_decoding), extended the
https://wiki.php.net/rfc/url_parsing_api#component_retrieval section with more information
about the two component representations, and added a general clarification related to reserved
characters. Additionally, the https://wiki.php.net/rfc/url_parsing_api#component_modification
section makes it clear how percent-encoding is performed when the withers are used.

After thinking about the question a lot, the current encoding-decoding rules finally seem
logical to me, but please double-check them. It's easy to misinterpret such long and complex
specifications.

Long story short: when parsing a URI or modifying a component, RFC 3986 fails hard if
an invalid character is found, while the WHATWG implementation automatically percent-encodes
it and also triggers a soft error.

While retrieving the "normalized-decoded" representation of a URI component, percent-decoding is
performed when possible:
- in case of RFC 3986: reserved and invalid characters are not percent-decoded (only unreserved ones are)
- in case of WHATWG: invalid characters and characters with special meaning (those that fall into the
  percent-encode set of the given component) are not percent-decoded
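
The RFC 3986 rule in the first bullet can be sketched in a few lines. This is a minimal Python illustration, not the uriparser or PHP implementation; the function name `decode_unreserved` is made up for the example. It decodes a percent-escape only when the decoded character is in the RFC 3986 "unreserved" set, and otherwise keeps the escape (with uppercase hex digits, as RFC 3986 normalization recommends).

```python
import re

# RFC 3986 section 2.3: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(component: str) -> str:
    """Decode only escapes of unreserved characters; keep all other escapes."""

    def repl(match: "re.Match[str]") -> str:
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        # Reserved or non-ASCII byte: leave it encoded, normalizing hex case.
        return match.group(0).upper()

    return _PCT.sub(repl, component)


print(decode_unreserved("/a%2Fb%7Ec"))       # /a%2Fb~c
print(decode_unreserved("x%3D1%26y%3d%41"))  # x%3D1%26y%3DA
```

Under this rule `%26`, `%3D` and `%2F` all stay encoded, which is what keeps the retrieved component unambiguous.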

The relevant sections give a little more reasoning for why I went with these rules.

> "the URI is normalized (when applicable), and then the reserved
> characters in the context of the given component are percent-decoded.
> This means that only those reserved characters are percent-decoded that
> are not allowed in a component. This behavior is needed to be able to
> unambiguously retrieve components."
>
> alone is not clear to me. “reserved characters that are not allowed in a
> component”. I assume this means that `%2F` (/) in a path will not be
> decoded, but `%3F` (?) will, because a bare `?` can't appear in a path?

I hope that this question is also answered by my clarifications and the reconsidered logic.


8.

> In the “Component retrieval” section: You compare the behavior of
> WhatWgUrl and Rfc3986Uri. It would be useful to add something like:
>
>      $url->getRawScheme() // does not exist, because WhatWgUrl always normalizes the scheme

Done.

> to better point out the differences between the two APIs with regard to
> normalization (it's mentioned, but having it in the code blocks would
> make it more visible).

Done.


9.

> In the “Component Modification” section, the RFC states that WhatWgUrl
> will automatically encode `?` and `#` as necessary. Will the same happen
> for Rfc3986? Will the encoding of `#` also happen for the query-string
> component? The RFC only mentions the path component.

The sections referenced above give a clear answer to this question as well.
TL;DR: after your message, I realized that automatic percent-encoding also triggers a (soft)
error case for WHATWG, so I changed my mind with regard to Uri\Rfc3986\Uri:
it won't do any automatic percent-encoding. It's unfortunate, because this behavior is not
consistent with WHATWG, but it's more consistent with the parsing rules of its own specification,
where there are only hard errors and there's no such thing as "automatic correction".
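
The contrast between the two behaviors when a path is modified can be sketched as follows. This is a Python approximation under stated assumptions, not either implementation: `with_path_strict` and `with_path_lenient` are hypothetical helpers, the strict check follows the RFC 3986 path grammar (pchar and "/"), and the lenient variant uses `urllib.parse.quote`, which encodes more characters than the exact WHATWG path percent-encode set.

```python
import re
from typing import List, Tuple
from urllib.parse import quote

# RFC 3986 path characters: unreserved / sub-delims / ":" / "@" / "/" / pct-encoded
_VALID_PATH = re.compile(r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@/]|%[0-9A-Fa-f]{2})*")
# Characters left untouched by the lenient variant ("%" keeps existing escapes).
_PATH_SAFE = "/:@!$&'()*+,;=%"


def with_path_strict(path: str) -> str:
    """RFC 3986 style: an invalid character is a hard error, nothing is corrected."""
    if _VALID_PATH.fullmatch(path) is None:
        raise ValueError(f"invalid character in path: {path!r}")
    return path


def with_path_lenient(path: str) -> Tuple[str, List[str]]:
    """WHATWG style (approximated): percent-encode and report a soft error."""
    encoded = quote(path, safe=_PATH_SAFE)
    soft_errors = [] if encoded == path else ["path needed percent-encoding"]
    return encoded, soft_errors


print(with_path_lenient("/a b?c#d"))  # ('/a%20b%3Fc%23d', ['path needed percent-encoding'])
try:
    with_path_strict("/a b?c#d")
except ValueError as exc:
    print(f"hard error: {exc}")
```

The strict variant never alters its input, which matches the "only hard errors" model described above; the lenient variant always returns a usable value and reports the correction separately.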


> I'm also wondering if there are cases where the withers would not
> round-trip, i.e. where `$url->withPath($url->getPath())` would not
> result in the original URL?

I am currently not aware of any such situation... I even wrote about this aspect at some
length, because I think "roundtripability" is a very important attribute. Thank you for
raising awareness of this!
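
The round-trip property in question can be expressed as a small check. The sketch below uses Python's `urllib.parse` purely as a stand-in, since it implements neither specification exactly; `path_round_trips` is a hypothetical helper. It reads the path, writes the same value back, recomposes the URL and compares the result with the input.

```python
from urllib.parse import urlsplit


def path_round_trips(url: str) -> bool:
    """Return True if writing the path back unchanged reproduces the input URL."""
    parts = urlsplit(url)
    # Equivalent in spirit to $url->withPath($url->getPath()).
    rebuilt = parts._replace(path=parts.path).geturl()
    return rebuilt == url


print(path_round_trips("https://example.com/a%2Fb/c?x=1#frag"))
```

A property test of this shape, run over a corpus of URLs and over every getter/wither pair, would be one way to look for inputs that fail to round-trip.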

10.

> Can you add examples where the authority / host contains IPv6 literals?
> It would be useful to specifically show whether or not the square
> brackets are returned when using the getters. It would also be
> interesting to see whether or not IPv6 addresses are normalized (e.g.
> shortening `2001:db8:0:0:0:0:0:1` to `2001:db8::1`).

Good idea again! I've added an example containing an IPv6 host at the very
end of the component retrieval section. And yes, they will be enclosed within a [] pair as
per the spec.

It also surprised me, but IP address normalization is only performed by WHATWG
during recomposition, and nowhere else...
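
The kind of IPv6 shortening asked about in the question can be reproduced with Python's standard `ipaddress` module. This is only an illustration of the normalization itself, not of when either PHP class applies it; `normalize_ipv6_host` is a hypothetical helper name.

```python
import ipaddress


def normalize_ipv6_host(host: str) -> str:
    """Compress a bracketed IPv6 literal; return any other host unchanged."""
    if host.startswith("[") and host.endswith("]"):
        address = ipaddress.IPv6Address(host[1:-1])
        # .compressed collapses the longest run of zero groups into "::".
        return "[" + address.compressed + "]"
    return host


print(normalize_ipv6_host("[2001:db8:0:0:0:0:0:1]"))  # [2001:db8::1]
print(normalize_ipv6_host("example.com"))             # example.com
```

The square brackets are kept around the compressed form, matching the statement above that IPv6 hosts are returned enclosed in a [] pair.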

11.

> In “Component Recomposition” the RFC states "The
> Uri\Rfc3986\Uri::toString() returns the unnormalized URI string".
>
> Does this mean that toString() for Rfc3986 will always return the
> original input?

Yes, effectively that's the case; to my knowledge, only WHATWG modifies the input.
In the past, I had the impression that RFC 3986 also made a few changes,
but after digging deep into the code of uriparser I realized that this was not the case.

12.

> It would be useful to know whether or not the classes implement
> `__debugInfo()` / how they appear when `var_dump()`ing them.

I've added an example.

That's all I managed to write for now, but I'll try to answer the rest of the messages and feedback
as soon as possible. :)

Regards,
Máté