Newsgroups: php.internals
Xref: news.php.net php.internals:127102
From: kocsismate90@gmail.com (Máté Kocsis)
Date: Sun, 13 Apr 2025 14:10:52 +0200
Subject: Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API
To: Tim Düsterhus
Cc: Internals
In-Reply-To: <8df04e01-deac-404b-beb7-cd982423db63@bastelstu.be>

Hi Tim,

> I think I would prefer:
>
>      namespace Uri {
>          class InvalidUriException extends \Uri\UriException
>          {
>          }
>      }
>
>      namespace Uri\WhatWg {
>          class InvalidUrlException extends \Uri\InvalidUriException {
>              /** @var list<UrlValidationError> */
>              public readonly array $errors;
>          }
>      }
>
> (note the use of Url in the name of the sub-exception)
>
> While this would result in a little more boilerplate, it would make
> static analysis tools more useful, since the `$errors` array could be
> properly typed instead of being just `array<mixed>`.

OK, this makes sense to me, and I've just implemented it.

> > 7.
> >>
> >> In the “Component retrieval” section: Please add even more examples of
> >> what kind of percent-decoding will happen. For example, it's important
> >> to know if `%26` is decoded to `&` in a query-string. Or if `%3D` is
> >> decoded to `=`. This really is the same case as with `%2F` in a path.
> >> The explanation
> >>
> > […]
> > The relevant sections will give a little more reasoning why I went with
> > these rules.
>
> I've tested some of the examples against the implementation, but it does
> not match the description. Is the implementation up to date?
>
>      $url = new Uri\WhatWg\Url("https://example.com/foo/bar%2Fbaz");
>
>      var_dump($url->getPath());    // /foo/bar%2Fbaz
>      var_dump($url->getRawPath()); // /foo/bar%2Fbaz
>
> results in:
>
>      string(12) "/foo/bar/baz"
>      string(14) "/foo/bar%2Fbaz"

Yes, it is currently up to date, but I made some changes to the WHATWG
encoding not long ago, and I didn't notice that the chosen behavior
negatively affects this case... Let me share the details, because decoding
of WHATWG URLs seems very problematic.

Originally, my intention was to percent-decode characters based on the
individual components' "percent-encode set" (e.g.
https://url.spec.whatwg.org/#fragment-percent-encode-set for the fragment).
These are the characters that are automatically percent-encoded when
encountered. One of my problems with this behavior was that the characters
in the "percent-encode sets" are not entirely in line with the "URL code
points" (basically the valid characters in a URL:
https://url.spec.whatwg.org/#url-code-points). Most notably, the "#", "[",
and "]" characters are present in some percent-encode sets, while they are
missing from the valid URL code points.

If characters were percent-decoded based on the "percent-encode sets", then
there would be some issues when the result is passed to a wither: the WHATWG
setter algorithms emit a soft error in these cases (e.g. in case of the
query string, the https://url.spec.whatwg.org/#dom-url-search steps trigger
https://url.spec.whatwg.org/#query-state, where step 3.1 comes into play).
To be fair, soft errors are not exposed by the WHATWG withers, so it's
currently more of a theoretical problem than an actual one (but I'm still
considering adding a `$softErrors` parameter to the WHATWG withers).

In any case, I believe the end of the "Component modification" section of
the RFC shares some background information regarding the percent-decoding
behavior.
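To make the mismatch between the percent-encode sets and the URL code points
a bit more tangible, here is a minimal sketch that uses the `URL` class of
browsers and Node.js instead of the proposed PHP API. `applySetter` is a
hypothetical helper that exists only for this illustration. It shows that the
setters encode "?" and "#" depending on the component, while "[" and "]"
(which are not URL code points) pass through the query setter unchanged and
without any visible error:

```javascript
// Hypothetical helper: parse `href`, assign `value` through the given
// WHATWG setter, and return the component as it is serialized afterwards.
function applySetter(href, component, value) {
  const url = new URL(href);
  url[component] = value;
  return url[component];
}

const base = "https://example.com/";

// The path percent-encode set contains "?" and "#", so the pathname setter
// encodes them instead of starting a query or a fragment.
console.log(applySetter(base, "pathname", "/foo?bar#baz")); // "/foo%3Fbar%23baz"

// The query percent-encode set contains "#".
console.log(applySetter(base, "search", "a#b")); // "?a%23b"

// "[" and "]" are not in the query percent-encode set, so they are kept
// as-is, and the JavaScript API gives no indication of a validation error.
console.log(applySetter(base, "search", "a[0]=1")); // "?a[0]=1"
```

The JavaScript API has no way to surface validation errors, which is the same
situation as with the WHATWG withers described above.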
Finally, when I changed the RFC so that only those characters are
percent-decoded which are "URL code points", I didn't notice that the
example you referred to above would become outdated: as "/" is a URL code
point, it's currently percent-decoded by getPath(). Unfortunately, I still
don't know what the best approach would be.

> Please also give an explicit example for `%3F` in a path. I know that it
> is reserved from reading the Rfc3986, but I think it's a little
> unintuitive. You can adjust the last example in the component retrieval
> section to make it show all cases. So:
>
>      $uri = new Uri\Rfc3986\Uri("https://[2001:0db8:0001:0000:0000:0ab9:C0A8:0102]/foo/bar%3Fbaz?foo=bar%26baz%3Dqux");
>
>      echo $uri->getHost();     // [2001:0db8:0001:0000:0000:0ab9:C0A8:0102]
>      echo $uri->getRawHost();  // [2001:0db8:0001:0000:0000:0ab9:C0A8:0102]
>      echo $uri->getPath();     // /foo/bar%3Fbaz
>      echo $uri->getRawPath();  // /foo/bar%3Fbaz
>      echo $uri->getQuery();    // foo=bar%26baz%3Dqux
>      echo $uri->getRawQuery(); // foo=bar%26baz%3Dqux

Why is this behavior unintuitive? I think the examples that have already
been added should make it clear that percent-encoded characters are never
percent-decoded (the component modification part also has one example).

> During testing I also noticed that the Rfc3986 implementation removes
> trailing slashes from the path when using the normalized version. This
> was a little unexpected, because to me this is the difference between a
> directory and a file. I don't think there are clear examples showing
> that. So:
>
>      $uri = new Uri\Rfc3986\Uri("https://example.com/foo/bar/");
>
>      echo $uri->getPath();    // /foo/bar
>      echo $uri->getRawPath(); // /foo/bar/

Yes, I agree it's weird. I'll have another look at the code to see whether
the normalizer removes the trailing slash, or whether I messed something up.

> >> In the “Component Modification” section, the RFC states that WhatWgUrl
> >> will automatically encode `?` and `#` as necessary.
> >> Will the same happen
> >> for Rfc3986? Will the encoding of `#` also happen for the query-string
> >> component? The RFC only mentions the path component.

I think the question for RFC 3986 is answered in the PHP RFC by the
following paragraph:

> In order to offer consistent behavior with the parsing rules of RFC 3986,
> withers of Uri\Rfc3986\Uri also only accept properly formatted input, meaning characters
> that are not allowed to be present in a component must be
> percent-encoded. Let's see what this means in practice through the following example

Effectively, RFC 3986 has different behavior than what WHATWG does.

The latter question ("Will the encoding of `#` also happen for the
query-string component?") was supposed to be answered by the RFC, because of
this sentence:

> WHATWG algorithm automatically percent-encodes characters that fall into the percent-encoding
> character set of the given component

It's possible that the "the given" part is misleading, but the behavior
actually follows the WHATWG spec for all components. In any case, I've
changed a few words to make this clear.

> Is the implementation already up to date with this change? When I try:
>
>      var_dump(
>         (new Uri\Rfc3986\Uri('https://example.com/foo/path'))
>                 ->withPath('some/path?foo=bar')
>                 ->toString()
>      );
>
> I get
>
>      string(36) "https://example.comsome/path?foo=bar"
>
> which is completely wrong.

I haven't completely implemented the withers for RFC 3986 yet (first and
foremost, validation is missing), so that's why you experienced this
behavior. I would fix this later, but only if the vote succeeds. I've
already worked a lot on the implementation without any promise that the RFC
will succeed.

> I think this might be a misunderstanding of the WHATWG specification. It
> seems to be also normalized during parsing:
>
> When I do the following in my Google Chrome:
>
>      (new URL('https://[0:0::1]')).host;
>
> I get `[::1]`, which indicates the normalization happening.
> And likewise will:
>
>      (new URL('https://[2001:db8:0:0:0:0:0:1]')).host;
>
> result in `[2001:db8::1]`.

Yes, I realized that you are right. IPv6 support was indeed incomplete or
buggy until now, but I took some time and corrected the behavior.

> My expectation would be `[2001:db8:0:0:0:0:0:1]` for Rfc3986 and
> `[2001:db8::1]` for WhatWg. I have also tested the behavior of
> `withHost()` when leaving out the square brackets. The Rfc3986 correctly
> throws an Exception, but WhatWg silently does nothing:
>
>      $url = 'https://example.com/foo/path';
>
>      var_dump((new Uri\WhatWg\Url($url))->withHost('2001:db8:0:0:0:0:0:1')->toAsciiString());
>
> results in
>
>      string(28) "https://example.com/foo/path"

This looks like the result of WHATWG's host setter algorithm
(https://url.spec.whatwg.org/#dom-url-hostname). After debugging the
behavior, I noticed that "new Uri\WhatWg\Url('2001:db8:0:0:0:0:0:1')" only
fails when trying to parse the port after the first ":" character. However,
the setter algorithm obviously doesn't reach this point, since it only tries
to parse the host, and then it stops (because of the state override). So I'm
not sure this gotcha can be cured.

I tried to reproduce the problem in Chrome, but I realized that the URL
properties are not validated at all when they are set
(`url.hostname = "2001:db8:0:0:0:0:0:1";` changes the hostname without any
problem)...

Regards,
Máté
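For reference, the IPv6 normalization and the hostname setter discussed above
can be tried out with the `URL` class in a browser console or in Node.js. This
is only a sketch against the JavaScript API, not against the PHP
implementation. The result of the unbracketed assignment is deliberately left
without an expected value, since this thread already reports different
outcomes for it (no change via `withHost()`, a changed hostname in Chrome):

```javascript
// Normalization during parsing: the longest run of zero groups is compressed.
const parsed = new URL("https://[2001:db8:0:0:0:0:0:1]");
console.log(parsed.host); // "[2001:db8::1]"

const loopback = new URL("https://[0:0::1]");
console.log(loopback.host); // "[::1]"

// With square brackets, the hostname setter accepts the IPv6 literal and
// normalizes it the same way as the parser does.
const bracketed = new URL("https://example.com/foo/path");
bracketed.hostname = "[2001:db8:0:0:0:0:0:1]";
console.log(bracketed.href); // "https://[2001:db8::1]/foo/path"

// Without square brackets, the first ":" is handled by the host state with a
// state override. The setter never throws, so printing the URL is the only
// way to see whether the assignment was applied or ignored.
const unbracketed = new URL("https://example.com/foo/path");
unbracketed.hostname = "2001:db8:0:0:0:0:0:1";
console.log(unbracketed.href);
```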