Newsgroups: php.internals
Subject: Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API
Date: Tue, 29 Apr 2025 08:55:31 -0500
To: ignace nyamagana butera
Cc: Máté Kocsis, Internals
Message-ID: <74F64DCB-3A10-4CF8-8DAA-6089BC34EA89@pmjones.io>
From: pmjones@pmjones.io ("Paul M. Jones")

Hi Ignace & Máté and all,

tl;dr: I argue against Ignace's objections to splitting the URI class
into two classes (one that retains raw URI values and another that
normalizes values as it goes). Jump to the very end for a discussion
regarding the with() methods (search for the word "asymmetry" herein).

* * *

> On Apr 28, 2025, at 15:47, ignace nyamagana butera wrote:
>
> The current approach in userland mixes both raw and half normalized
> components as well as RFC3986 and RFC3987 specification with ambiguity
> around normalization, input, constructor, what needs to be encoded
> where and when

Based on my research into existing URI projects, I don't think that's
an accurate assessment of the ecosystem.

For example, can you point out which projects mix "raw and
half-normalized components"? Nette is the only one that comes to mind,
in that (during parsing) it applies rawurldecode() to the host, user,
password, and fragment; but that's only one of the 18 projects.

Likewise, of the 15 URI-centric projects, only one (league/uri) offers
both RFC 3986 and RFC 3987 parsing; the two IRI-centric projects
(ml/iri and rmccue/requests) are explicitly about IRIs; and rowbot is
clearly WHATWG-URL centric. So I don't see much ambiguity in any of
those projects.

As far as normalization goes, only one project (opis) can normalize at
creation time, though five of them offer a normalize() method with
various effects.
So, again, I don't see much ambiguity there either; they don't do
normalizing as-you-go, it's something you have to apply explicitly.

Regarding inputs, they all presume "raw" inputs. Regarding constructors,
they mostly side with a full URI string. Regarding encoding, they mostly
retain values in their encoded form (there are three outliers).

With all that in mind, we can see that the various authors of userland
projects have settled on remarkably similar patterns of usage that they
found valuable and useful for working with URIs.

> > - fulfill existing userland expectations;
>
> Existing userland expectations are mostly built around `parse_url`

That's kind of true; 9 of the 18 projects use parse_url(), and 7 of the
18 implement the RFC 3986 parsing algorithm ...

> which is one of the reasons the RFC exists to improve the status quo
> and to introduce in PHP valid parsers against recognizable URI
> specifications. Yes some adaptation will be needed to use them in
> userland but I believe this work is easy to do, talking from the POV of
> a URI package maintainer.

... but I don't imagine that replacing parse_url() in those projects
with the RFC 3986 algo would cause those projects to change any of their
other design decisions. What adaptations do you think would be needed
around that replacement?

> > - replace the toString()/toRawString() with a single idiomatic
> > __toString() in each class;
>
> For all the reasons explained in the RFC, adding a `__toString` method
> is a bad architectural design for an URI. There are so many ways to
> represent an URI that having a `__toString` for string representation
> gives a false sense of "there can be only one true representation for a
> single URI" which is not true.

For Rfc3986\Uri, it looks like there are only two that are recognized:
raw and normalized. Are there other string representations you feel the
Uri class should recognize?

(For Whatwg\Url, it looks like there are also only two: as-parsed, and
as ASCII, but I'm not addressing that part of the RFC here.)

> > - move normalization logic into the NormalizedUri class.
>
> The classes follow specifications that describe how normalization
> should be. Why would you split the responsibilities in other classes ?
> What would be the added value ?

For one, unless I am missing something, there is an asymmetry between
the get() methods and the with() methods. What I'm seeing is that (e.g.)
Uri::withPath() expects a raw path argument, but getPath() returns the
normalized version. For symmetry, I would expect either:

- `Uri::withPath(raw_value) : self` and `Uri::getPath() : raw_value`, or
- `Uri::withRawPath(raw_value) : self` and `Uri::getRawPath() : raw_value`

Hence my first intuition: the "main" values in the URI need to be the
raw ones, and getting the normalized ones should be the more verbose
case (e.g. `getNormalizedPath() : normalized_value`).

So, one value added by splitting the classes is to resolve that
asymmetry. Consumers expecting to get back from the URI what they put
into it can use the raw Uri variation; "API clients or signers fall in
this category that want to avoid introducing any unnecessary changes to
URIs, in order to avoid causing subtle bugs."

Other consumers, who want to do things this new and different way
(normalized as-you-go, unlike anything currently in userland), can use
the NormalizedUri.

(Or you could flip it around and say that the normalized variation is
the Uri class, and the raw version is RawUri.)

--
pmj
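
A minimal sketch of the get()/with() asymmetry described above. The
SketchUri class is hypothetical (it is not the RFC's Uri class and models
only a path), and its normalization is deliberately simplified to
percent-encoding case and unreserved-character decoding, which is an
assumption made for illustration only:

```php
<?php
declare(strict_types=1);

// Hypothetical toy class, NOT the RFC's API: it models only a path so the
// get()/with() asymmetry can be seen in isolation.
final class SketchUri
{
    public function __construct(private string $rawPath) {}

    // Accepts a raw path and stores it untouched.
    public function withPath(string $rawPath): self
    {
        return new self($rawPath);
    }

    // Returns exactly what was put in.
    public function getRawPath(): string
    {
        return $this->rawPath;
    }

    // Returns a normalized view. Simplified assumption: only uppercases
    // percent-encoding hex digits and decodes unreserved characters;
    // there is no dot-segment removal.
    public function getPath(): string
    {
        return preg_replace_callback(
            '/%([0-9A-Fa-f]{2})/',
            static function (array $m): string {
                $char = chr((int) hexdec($m[1]));
                // RFC 3986 unreserved: ALPHA / DIGIT / "-" / "." / "_" / "~"
                return preg_match('/^[A-Za-z0-9\-._~]$/', $char) === 1
                    ? $char
                    : '%' . strtoupper($m[1]);
            },
            $this->rawPath
        ) ?? $this->rawPath;
    }
}

$uri = (new SketchUri('/'))->withPath('/%7euser/a%2fb');

echo $uri->getRawPath(), PHP_EOL; // the value that went in
echo $uri->getPath(), PHP_EOL;    // normalized, so it differs from the input
```

With this shape, a consumer that calls withPath() and then getPath() does
not get back what it put in; only getRawPath() round-trips the value.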