From: nyamsprod@gmail.com (ignace nyamagana butera)
Date: Tue, 29 Apr 2025 22:08:24 +0200
Subject: Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API
To: "Paul M. Jones", Tim Düsterhus, Internals, Máté Kocsis
Newsgroups: php.internals

Hi Paul,

I will try to address your concerns. Keep in mind that I am not the author of the RFC, but I do like how it is currently shaped, with some caveats that can be left to future improvements.

> So, one value added by splitting the classes is to resolve that asymmetry.

First, I agree with you. The method naming in the Uri\Rfc3986\Uri class could be improved, even though it is not a showstopper for me. Adding the `raw` prefix, or indeed flipping the raw* methods and using normalized* instead, would perhaps clarify things, but I will leave that decision to Máté.
Apart from that, I believe the current RFC (especially around RFC3986) does address most if not all of the issues regarding the specification. RFC3986 covers 3 key URI features: parsing, resolution and equivalence. In order to offer resolution and equivalence, you have to address normalization and thus encoding. Any userland package that offers those features is required to handle component encoding/normalization first, before performing the expected operation. Hence I believe that if the new URI class offers equivalence, it can and should, as a consequence, expose URI component normalization out of the box. A separate class is IMHO not needed.

> For example, can you point out which projects mix "raw and half-normalized components"?

Laminas, for example, or any PSR-implementing class, will try to encode the input string regardless of whether it is already encoded, hence the wording about not double encoding the string that you often encounter in mutator method docblocks. The Uri class, on the other hand, only expects well-formed and encoded strings, which leaves no room for misinterpretation. This is an area that is left for URI packages to fill, for instance.
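
To make the double-encoding concern concrete, here is a minimal sketch. The `encodePathLeniently()` helper is hypothetical: it is not taken from Laminas or from any PSR-7 package, it only mimics the general "encode whatever is not already encoded" approach. Only `rawurlencode()` and `preg_replace_callback()` are real PHP functions.

```php
<?php
// Hypothetical helper (an assumption for illustration, not library code):
// percent-encode a path but leave existing %XX triplets untouched.
function encodePathLeniently(string $path): string
{
    return preg_replace_callback(
        // Either a run of characters not allowed in a path,
        // or a "%" that does not start a valid %XX triplet.
        '/(?:[^A-Za-z0-9\-._~!$&\'()*+,;=:@\/%]+|%(?![A-Fa-f0-9]{2}))/',
        static fn (array $m): string => rawurlencode($m[0]),
        $path
    );
}

// Naive encoding of an already encoded value double-encodes it.
echo rawurlencode('a%20b'), PHP_EOL;           // a%2520b

// The lenient helper accepts both forms and maps them to the same result,
// so it has to guess what the caller meant.
echo encodePathLeniently('/a%20b'), PHP_EOL;   // /a%20b
echo encodePathLeniently('/a b'), PHP_EOL;     // /a%20b
```

With such a lenient mutator the caller cannot tell from the API whether the input was treated as raw or as encoded, which is the ambiguity I am referring to. An API that only accepts well-formed, already encoded input does not have to guess.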

> For Rfc3986\Uri, it looks like there are only two that are recognized: raw and normalized. Are there other string representations you feel the Uri class should recognize?

If there are at least two possible representations, then a `__toString` method is still a bad design, because it may lead the developer to think that there is only one string representation, which is not true. Both representations are equivalent and represent the URI equally well. And as a bonus, not having a `__toString` method prevents accidental URI comparison using the `==` sign instead of the correct `equals` method. (I know that because I've seen codebases where PSR-7 URI instances are compared using the class `__toString` method, which is just wrong.)
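
A small sketch of that comparison pitfall, under stated assumptions: `TinyUri` is a hypothetical stand-in, not the RFC's Uri\Rfc3986\Uri class and not a PSR-7 implementation, and its normalization is deliberately simplified to letter case only (it assumes no userinfo and no percent-encoding in the authority).

```php
<?php
// Hypothetical value object, for illustration only.
final class TinyUri
{
    public function __construct(private string $raw) {}

    // The single string form that __toString forces us to pick.
    public function __toString(): string
    {
        return $this->raw;
    }

    // Simplified case normalization: uppercase the hex digits of
    // percent-encoded triplets, lowercase the scheme and the authority.
    public function normalized(): string
    {
        $uri = preg_replace_callback(
            '/%[0-9a-f]{2}/i',
            static fn (array $m): string => strtoupper($m[0]),
            $this->raw
        );

        return preg_replace_callback(
            '~^([a-z][a-z0-9+.-]*://)([^/?#]*)~i',
            static fn (array $m): string => strtolower($m[1] . $m[2]),
            $uri
        );
    }

    public function equals(self $other): bool
    {
        return $this->normalized() === $other->normalized();
    }
}

$a = new TinyUri('HTTPS://Example.COM/a%2fb');
$b = new TinyUri('https://example.com/a%2Fb');

var_dump((string) $a == (string) $b); // bool(false): string comparison
var_dump($a->equals($b));             // bool(true): equivalent after case normalization
```

Both strings identify the same resource once the case of the scheme, the host and the percent-encoded hex digits is normalized, yet comparing the `__toString` output reports them as different.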

PS1: I do appreciate the work you put into your study of URI packages in the PHP ecosystem, but we should not restrict the new API to only resolving or aligning with those existing solutions; instead, we should try to expose an API that allows more flexibility than what PHP currently offers.
PS2: I do not think the new API will replace the URI packages. We will still need them because, in the case of the RFC3986 URI class, parsing is just one aspect of URI consumption; we still need scheme-specific validation that only PHP userland packages can offer.

Best regards,
Ignace Nyamagana Butera

On Tue, Apr 29, 2025 at 3:55 PM Paul M. Jones <pmjones@pmjones.io> wrote:
Hi Ignace & Maté and all,

tl;dr: I argue against Ignace's objections to splitting the URI class into two classes (one that retains raw URI values and another that normalizes values as-it-goes). Jump to the very end for a discussion regarding the with() methods (search for the word "asymmetry" herein).

* * *

> On Apr 28, 2025, at 15:47, ignace nyamagana butera <nyamsprod@gmail.com> wrote:
>
> The current approach in userland mixes both raw and half normalized components as well as RFC3986 and RFC3987 specification with ambiguity around normalization, input, constructor, what needs to be encoded where and when

Based on my research into existing URI projects <https://github.com/uri-interop/interface/blob/1.x/README-RESEARCH.md> I don't think that's an accurate assessment of the ecosystem.

For example, can you point out which projects mix "raw and half-normalized components"? Nette is the only one that comes to mind, in that (during parsing) it applies rawurldecode() to the host, user, password, and fragment; but that's only one of the 18 projects.

Likewise, of the 15 URI-centric projects, only one of them (league/uri) offers both RFC3986 and 3987 parsing; the two IRI-centric projects (ml/iri and rmccue/requests) are explicitly IRIs; and rowbot is clearly WHATWG-URL centric. So I don't see much ambiguity in any projects there.

As far as normalization, only one project (opis) affords the ability to normalize at creation time, though five of them offer a normalize() method with various effects (<https://github.com/uri-interop/interface/blob/1.x/README-RESEARCH.md#normalizing>). So, again, I don't see much ambiguity there either; they don't do normalizing as-you-go, it's something you have to apply explicitly.

Regarding inputs, they all presume "raw" inputs. Regarding constructors, they mostly side with a full URI string. Regarding encoding, they mostly retain values in their encoded form (there are three outliers, cf. <https://github.com/uri-interop/interface/blob/1.x/README-RESEARCH.md#component-encoding>).

With all that in mind, we can see that the various authors of userland projects have settled on remarkably similar patterns of usage that they found valuable and useful for working with URIs.


> > - fulfill existing userland expectations;
>
> Existing userland expectations are mostly built around `parse_url`

That's kind of true; 9 of the 18 projects use parse_url(), and 7/18 implement the RFC 3986 parsing algorithm ...


> which is one of the reasons the RFC exists to improve the status quo and to introduce in PHP valid parsers against recognizable URI specifications. Yes some adaptation will be needed to use them in userland but I believe this work is easy to do, talking from the POV of a URI package maintainer.

... but I don't imagine that replacing parse_url() in those projects with the RFC 3986 algo would cause those projects to change any of their other design decisions. What adaptations do you think would be needed around that replacement?


> > - replace the toString()/toRawString() with a single idiomatic __toString() in each class;
>
> For all the reasons explained in the RFC, adding a `__toString` method is a bad architectural design for an URI. There are so many ways to represent an URI that having a `__toString` for string representation gives a false sense of "there can be only one true representation for a single URI" which is not true.

For Rfc3986\Uri, it looks like there are only two that are recognized: raw and normalized. Are there other string representations you feel the Uri class should recognize?

(For Whatwg\Url, it looks like there are also only two: as-parsed, and as ASCII, but I'm not addressing that part of the RFC here.)


> > - move normalization logic into the NormalizedUri class.
>
> The classes follow specifications that describe how normalization should be. Why would you split the responsibilities in other classes? What would be the added value?

For one, unless I am missing something, there is an asymmetry between the get() methods and the with() methods. What I'm seeing is that (e.g.) Uri::withPath() expects a raw path argument, but getPath() returns the normalized version. For symmetry, I would expect either:

- `Uri::withPath(raw_value) : self` and `Uri::getPath() : raw_value`, or
- `Uri::withRawPath(raw_value) : self` and `Uri::getRawPath() : raw_value`

Thus my first intuition that the "main" values in the URI need to be the raw ones, and that getting the normalized ones should be the more verbose case (e.g. `getNormalizedPath() : normalized_value`).

So, one value added by splitting the classes is to resolve that asymmetry. Consumers expecting to get back from the URI what they put into it can use the raw Uri variation; "API clients or signers fall in this category that want to avoid introducing any unnecessary changes to URIs, in order to avoid causing subtle bugs."

Other consumers, who want to do things this new and different way (normalized as-you-go, unlike anything currently in userland) can use the NormalizedUri.

(Or you could flip it around and say that the normalized variation is the Uri class, and the raw version is RawUri.)



-- pmj
