Newsgroups: php.internals
From: kocsismate90@gmail.com (Máté Kocsis)
To: Dennis Snell
Cc: Internals
Date: Tue, 19 Nov 2024 09:49:41 +0100
Subject: Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API
In-Reply-To: <1BCB4144-231D-45EA-A914-98EE8F0F503A@automattic.com>

Hi Dennis,

Even though I didn't answer for a long time, I have been improving my RFC
implementation in the meantime, as well as evaluating your suggestions.

> I’m worried about the side-effects that having a global
> uri.default_handler could have with code running differently for no
> apparent reason, or differently based on what is calling it. If someone is
> writing code for a controlled system I could see this being valuable, but
> if someone is writing a framework like WordPress and has no control over
> the environments in which code runs, it seems dangerous to hope that every
> plugin and every host runs compatible system configurations. Nobody is
> going to check `ini_get( ‘uri.default_handler’ )` before every line that
> parses URLs. Beyond this, even just *allowing* a pluggable parser invites
> broken deployments because PHP code that is reading from a browser or
> sending output to one needs to speak the language the browser is speaking,
> not some arbitrary language that’s similar to it.
>

Your arguments about the problems a global uri.default_handler INI setting
can cause convinced me, especially after I read Daniel Stenberg's blog post
on the topic (https://daniel.haxx.se/blog/2022/01/10/dont-mix-url-parsers/).
That's why I removed it from the RFC in favor of configuring the parser at
the level of each individual feature.
However, I don't agree with removing the pluggable parser, for the
following reasons:

- the current method (the parse_url() based parser) is already doomed: it
  isn't compliant with any spec, so it already doesn't speak the language
  the browser is speaking
- even though the majority does, not everyone builds a browser application
  with PHP, especially because URIs are not necessarily accessible on the
  web
- in addition, there are tools which comply not with the WHATWG spec but
  with some other one. Most prominently, cURL is mostly RFC 3986 compliant
  with some additional WHATWG flavour, according to
  https://everything.curl.dev/cmdline/urls/browsers.html

That's why I intend to keep support for pluggability.

> Being able to parse a relative URL and know if a URL is relative or
> absolute would help WordPress, which often makes decisions differently
> based on this property (for instance, when reading an `href` property of
> a link). I know these aren’t spec-compliant URLs, but they still
> represent valid values for URL fields in HTML and knowing if they are
> relative or not requires some amount of parsing specific details
> everywhere, vs. in a class that already parses URLs. Effectively, this
> would imply that PHP’s new URL parser decodes
> `document.querySelector( ‘a’ ).getAttribute( ‘href’ )`, which should be
> the same as `document.querySelector( ‘a’ ).href`, and indicates whether
> it found a full URL or only a portion of one.
>
> * `$url->is_relative` or `$url->is_absolute`
> * `$url->specificity = URL::Relative | URL::Absolute`
>

The Uri\WhatWgUri::parse() method accepts a relative URI when the second
(base URI) parameter is provided. So essentially you need to use this
variant of parse() if you want to parse a WHATWG compliant URL, and
WhatWgUri should then let you know whether the URI originally passed in
was relative or not. Did I get you right?
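
To make the "relative or absolute" question concrete, here is a minimal sketch in Python. It is an illustration only and not the proposed PHP API: it relies on Python's urllib.parse, which follows RFC 3986 style rules rather than WHATWG, and the helper names is_relative_reference() and resolve_href() are hypothetical. It records whether the input carried a scheme before resolving it against a base URL.

```python
from urllib.parse import urljoin, urlsplit


def is_relative_reference(value: str) -> bool:
    """Return True when the value carries no scheme (a relative reference)."""
    return urlsplit(value).scheme == ""


def resolve_href(href: str, base: str) -> tuple[str, bool]:
    """Resolve href against base and report whether href was relative."""
    # The relative/absolute decision is made on the raw input, because the
    # resolved URL is always absolute and no longer carries that information.
    return urljoin(base, href), is_relative_reference(href)


if __name__ == "__main__":
    print(resolve_href("../img/logo.png", "https://example.com/blog/post/"))
    print(resolve_href("https://other.example/x", "https://example.com/"))
```

The point of the sketch is only that the flag has to be captured at parse time; once a reference has been resolved against a base, the result alone cannot tell you how the original was written.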

This feature is certainly possible with RFC 3986 URIs (even without the
base parameter), but WHATWG requires the above-mentioned workaround for
parsing, and I still have to look into how this can be implemented.

> Having methods to add query arguments, change the path, etc… would be a
> great way to simplify user-space code working with URLs. For instance,
> read a URL and then add a query argument if some condition within the URL
> warrants it (for example, the path ends in `.png`).
>

I managed to retain support for the "wither" methods that were originally
part of the proposal. This required custom code for the uriparser library,
while the maintainer of Lexbor was kind enough to add native support for
modification after I submitted a feature request. However, convenience
methods for manipulating query parameters are still not part of the RFC,
because they would increase its scope even more, and because of other
issues highlighted by Ignace in his prior email:
https://externals.io/message/123997#124077. As I really want such a
feature, I'd be eager to create a follow-up RFC dedicated to handling
query strings.

> My counter-point to this argument is that I see security exploits appear
> everywhere that functions which implement specifications are pluggable
> and extendable. It’s easy to see the need to create a class that *limits*
> possible URLs, but that also doesn’t require extending a class. A class
> can wrap a URL parser just as it could extend one. Magic methods would
> make it even easier.
>

Right now, only internal URI implementations can be plugged into PHP;
userland classes cannot be used, so this probably reduces the issue.
However, I recently bumped into a technical issue with the URI classes not
being final, and I am currently trying to assess how to solve it.
More information is available in one of my comments on my PR:
https://github.com/php/php-src/pull/14461/commits/8e21e6760056fc24954ec36c06124aa2f331afa8#r1847316607

As far as I can see at the moment, it would probably be better to make
these classes final so that similar unforeseen issues and inconsistencies
cannot happen again (we can always unfinalize them later).

> Finally, I frequently find the need to be able to consider a URL in both
> the *display* context and the *serialization* context. With Ada we have
> `normalize_url()`, `parse_search_params()`, and the IDNA functions to
> convert between the two representations. In order to keep strong
> boundaries between security domains, it would be nice if PHP could expose
> the two variations: one is an encoded form of a URL that machines can
> easily parse while the other is a “plain string” in PHP that’s easier for
> humans to parse but which might not even be a valid URL. Part of the
> reason for this need is that I often see user-space code treating an
> entire URL as a single text span that requires one set of rules for full
> decoding; it’s multiple segments that each have their own decoding rules.
>
> - Original [ https://xn--google.com/secret/../search?q=🍔 ]
> - `$url->normalize()` [ https://xn--google.com/search?q=%F0%9F%8D%94 ]
> - `$url->for_display()` Displayed [ https://䕮䕵䕶䕱.com/search?q=🍔 ]
>

Even though I didn't implement this suggestion entirely, I added
normalization support:
- the normalize() method creates a new URI instance whose components are
  normalized, based on the current object
- the toNormalizedString() method can be used when only the normalized
  string representation is needed
- the newly added equalsTo() method also makes use of normalization to
  better identify equal URIs
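
As a rough illustration of what normalization-based comparison means, here is a small Python sketch. It is not the RFC's implementation of normalize() or equalsTo(): the function names remove_dot_segments(), normalize_url() and urls_equal() are hypothetical, and the rules are a simplified subset of RFC 3986 syntax-based normalization (lower-casing the scheme and host, removing dot segments, percent-encoding characters outside a safe set, and upper-casing the hex digits of escapes).

```python
import re
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched when percent-encoding; "%" is kept so that
# existing escapes are not encoded a second time.
_SAFE = "/?:@!$&'()*+,;=%"


def remove_dot_segments(path: str) -> str:
    """Resolve "." and ".." segments, loosely following RFC 3986 section 5.2.4."""
    output = []
    for segment in path.split("/"):
        if segment == ".":
            continue
        if segment == "..":
            # Never pop the leading empty segment that marks an absolute path.
            if len(output) > 1 or (output and output[0] != ""):
                output.pop()
            continue
        output.append(segment)
    # A trailing "." or ".." leaves the path ending in a slash.
    if path.endswith(("/.", "/..")):
        output.append("")
    return "/".join(output)


def _encode(component: str) -> str:
    """Percent-encode characters outside the safe set, then upper-case escapes."""
    encoded = quote(component, safe=_SAFE)
    return re.sub(r"%[0-9a-fA-F]{2}", lambda m: m.group(0).upper(), encoded)


def normalize_url(url: str) -> str:
    parts = urlsplit(url)
    # Lower-case only the host and port, not any userinfo before the "@".
    userinfo, at, hostport = parts.netloc.rpartition("@")
    return urlunsplit((
        parts.scheme.lower(),
        userinfo + at + hostport.lower(),
        _encode(remove_dot_segments(parts.path)),
        _encode(parts.query),
        _encode(parts.fragment),
    ))


def urls_equal(a: str, b: str) -> bool:
    """Compare two URLs by their normalized string form."""
    return normalize_url(a) == normalize_url(b)
```

With these assumptions, the "Original" URL from the quoted example normalizes to the same string as the one shown next to `$url->normalize()`, so two spellings of one resource compare equal even though their raw strings differ.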

For more information, please refer to the relevant section of the RFC:
https://wiki.php.net/rfc/url_parsing_api#api_design

The forDisplay() method also seems useful at first glance, but since it
may be a controversial, optional feature, I'd defer it for later.

Regards,
Máté