Newsgroups: php.internals
From: kocsismate90@gmail.com (Máté Kocsis)
Date: Sun, 16 Feb 2025 23:01:36 +0100
Subject: Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API
To: Dennis Snell
Cc: Internals
In-Reply-To: <8E614C9C-BA85-45D8-9A4E-A30D69981C5D@automattic.com>

Hi Dennis,

> I only harp on the WhatWG spec so much because for many people this will
> be the only one they are aware of, if they are aware of any spec at all,
> and this is a sizable vector of attack targeting servers from user-supplied
> content. I’m curious to hear from folks here what fraction of the actual PHP
> code deals with RFC3986 URLs, and of those, if the systems using them are
> truly RFC3986 systems or if the common-enough URLs are valid in both specs.

I think Ignace's examples already highlighted that the two specifications
differ in nuances so much that even I had to admit, after months of trying to
squeeze them into the same interface, that doing so would be irresponsible.

The Uri\Rfc3986\Uri class will be useful for many use cases (e.g. representing
URNs or URIs with scheme-specific behavior, like ldap apparently), and even
the UriInterface of PSR-7 can build upon it. On the other hand, Uri\WhatWg\Url
will be useful for representing browser links and any other URLs for the web
(e.g. an HTTP application router component should use this class).

> Just to enlighten me and possibly others with less familiarity, how and
> when are RFC3986 URLs used and what are those systems supposed to do when
> an invalid URL appears, such as when dealing with percent-encodings as you
> brought up in response to Tim?

I am not 100% sure what I brought up to Tim, but certainly the biggest
difference between the two specs regarding percent-encoding was recently
documented in the RFC:
https://wiki.php.net/rfc/url_parsing_api#percent-encoding . The other main
difference is how the host component is stored: WHATWG automatically
percent-decodes it, while RFC3986 doesn't. This is summarized in the
https://wiki.php.net/rfc/url_parsing_api#component_retrieval section (a bit
below).

> This would be fine, knowing in hindsight that it was originally a relative
> path. Of course, this would mean that it’s critical that
> `https://example.com` does not replace the actual host part if one is
> provided in `$url`. For example, this code should work.
>
> ```
> $url = Uri\WhatWgUri::parse( 'https://wiki.php.net/rfc', 'https://example.com' );
> $url->domain === 'wiki.php.net'
> ```

Yes, that's the case. Both classes only use the base URL for relative URIs.

> Hopefully this won’t be too controversial, even though the concept was new
> to me when I started having to reliably work with URLs. I chose the
> example I did because of human risk factors in security exploits.
> "xn--google.com" is not in fact a Google domain, but an IDNA domain
> decoding to "䕮䕵䕶䕱.com"

I got your point, so I implemented your suggestion.
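
As an aside on the "xn--google.com" point, here is a minimal sketch of why such a host is misleading. It uses Python's standard `punycode` codec purely for illustration, so that it runs independently of the PHP classes proposed in the RFC. It does raw Punycode decoding label by label and is not a full IDNA (UTS #46) implementation. The helper names `decode_label` and `display_host` are hypothetical and are not part of any proposed API.

```python
ACE_PREFIX = "xn--"  # marks a Punycode-encoded (IDNA) DNS label


def decode_label(label: str) -> str:
    """Decode one DNS label; labels without the ACE prefix are returned unchanged."""
    if not label.lower().startswith(ACE_PREFIX):
        return label
    # Strip the prefix and run the remainder through the raw Punycode decoder.
    return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


def display_host(host: str) -> str:
    """Return a human-readable form of an ASCII host, decoding each label."""
    return ".".join(decode_label(label) for label in host.split("."))


if __name__ == "__main__":
    print(display_host("xn--google.com"))  # 䕮䕵䕶䕱.com, not a Google domain
    print(display_host("xn--go8h.com"))    # 🐘.com
    print(display_host("google.com"))      # google.com (no ACE prefix, unchanged)
```

The ASCII form is what goes over the wire, while the decoded form is what a person should be shown when judging whether a host is trustworthy.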
Actually, I made yet another larger API change in the meantime, but in any
case, the WHATWG implementation now supports IDNA in the following way:

$url = Uri\WhatWg\Url::parse("https://🐘.com/🐘?🐘=🐘", null);

echo $url->getHost();           // xn--go8h.com
echo $url->getHostForDisplay(); // 🐘.com
echo $url->toString();          // https://xn--go8h.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98
echo $url->toDisplayString();   // https://🐘.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98

Unfortunately, RFC3986 doesn't support IDNA (as Ignace already pointed out at
the end of https://externals.io/message/126182#126184), and adding support for
RFC3987 (and therefore IRIs) would be a very large amount of work; it's just
not feasible within this RFC :( To make things worse, the code would have to
be written from scratch, since I haven't found any suitable C library for this
purpose yet. That's why I'll leave them for

On other notes, let me share some of the changes since my previous message to
the mailing list:

- First and foremost, I removed the Uri\Rfc3986\Uri::normalize() method from
the proposal after Arnaud's feedback. Now, both the normalized (and decoded)
and the non-normalized representations can be retrieved from the same URI
instance. This change was necessary so that users can work with URIs
consistently. If someone needs an exact URI component value, they can use the
getRaw*() getter. If they want the normalized and percent-decoded form, they
should use a get*() getter. For more information, see the
https://wiki.php.net/rfc/url_parsing_api#component_retrieval section.
- I made a few less important API changes, like converting the WhatWgError
class to an enum, adding a Uri\Rfc3986\IUri::getUserInfo() method, changing
the return type of some getters (removing nullability) etc.
- I fixed quite a few smaller details of the implementation, along with a very
important spec incompatibility: until now, the "path" component didn't
contain the leading "/" character when it should have. Now, both classes
conform to their respective specifications with regard to path handling.

I think the RFC is now mature enough to consider voting in the foreseeable
future, since most of the concerns that have come up so far have been
addressed in one way or another. However, the one remaining question I still
have is whether the Uri\Rfc3986\Uri and Uri\WhatWg\Url classes should be
final. Personally, I don't see much of a problem with opening them for
extension (other than some technical challenges that I already shared a few
months ago), and I think people will have legitimate use cases for extending
these classes. On the other hand, having final classes may allow us to make
slightly more significant changes without BC concerns until we have a more
battle-tested API, and of course it would completely eliminate the need to
overcome said technical challenges. According to Tim, it may also result in
safer code, because spec-compliant base classes cannot be extended by possibly
non-spec-compliant or buggy children. I don't necessarily fully agree with
this specific concern, but here it is.

Regards,
Máté