Newsgroups: php.internals
From: nyamsprod@gmail.com (ignace nyamagana butera)
To: internals@lists.php.net, Dennis Snell
Date: Mon, 13 Jan 2025 16:09:50 +0100
Subject: Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API
Message-ID: <7644bd02-5a0a-4914-bba5-eb42c54aa924@gmail.com>
In-Reply-To: <8E614C9C-BA85-45D8-9A4E-A30D69981C5D@automattic.com>

On 03/01/2025 08:18, Dennis Snell wrote:
> It seems that I’ve mucked up the mailing list again by deleting an old message I intended on replying to. Apologies all around for replying to an older message of my own.
> Máté, thanks for your continued work on the URL parsing RFC. I’m only now returning from a bit of an extended leave, so I appreciate your diligence and patience. Here are some thoughts in response to your message from Nov. 19, 2024.
>
>> even though the majority does, not everyone builds a browser application with PHP, especially because URIs are not necessarily accessible on the web
>
> This has largely been touched on elsewhere, but I will echo the idea that it seems valid to have two separate parsers for the two standards, and truly they diverge enough that it seems like it could be only a superficial thing for them to share an interface.
>
> I only harp on the WhatWG spec so much because for many people this will be the only one they are aware of, if they are aware of any spec at all, and this is a sizable vector of attack targeting servers from user-supplied content. I’m curious to hear from folks here what fraction of the actual PHP code deals with RFC3986 URLs, and of those, if the systems using them are truly RFC3986 systems or if the common-enough URLs are valid in both specs.
>
> Just to enlighten me and possibly others with less familiarity, how and when are RFC3986 URLs used and what are those systems supposed to do when an invalid URL appears, such as when dealing with percent-encodings as you brought up in response to Tim?
>
> Coming from the XHTML/HTML/XML side I know that there was substantial effort to enforce standards on browsers and that led to decades of security exploits and confusion, when the “official” standards never fully existed in the way people thought. I don’t mean to start any flame wars, but is the URL story at all similar here?
>
> I’m mostly worried that we could accidentally encourage risky behavior for developers who aren’t familiar with the nuances of having two URL specifications vs. having the simplest, least-specific interface point them in the right direction for what they will probably be doing. `parse_url()` is a great example of how the thing that looks _right_ is actually terribly prone to failure.
>
>> The Uri\WhatWgUri::parse() method accepts a (relative) URI parameter when the 2nd (base URI) parameter is provided. So essentially you need to use this variant of the parse() method if you want to parse a WhatWg compliant URL
>
> If this means passing something like the following then I suppose it’s okay. It would be nice to be able to know without passing the second parameter, as there are a multitude of cases where no such base URL would be available, and some dummy parameter would need to be provided.
>
> ```
> $url = Uri\WhatWgUri::parse( $url, 'https://example.com' );
> var_dump( $url->is_relative_or_something_like_that );
> ```
>
> This would be fine, knowing in hindsight that it was originally a relative path. Of course, this would mean that it’s critical that `https://example.com` does not replace the actual host part if one is provided in `$url`. For example, this code should work.
>
> ```
> $url = Uri\WhatWgUri::parse( 'https://wiki.php.net/rfc', 'https://example.com' );
> $url->domain === 'wiki.php.net';
> ```
>
>> The forDisplay() method also seems to be useful at the first glance, but since this may be a controversial optional feature, I'd defer it for later…
>
> Hopefully this won’t be too controversial, even though the concept was new to me when I started having to reliably work with URLs. I chose the example I did because of human risk factors in security exploits. "xn--google.com" is not in fact a Google domain, but an IDNA domain decoding to "䕮䕵䕶䕱.com"
>
> This is a misleading URL to human readers, which is why the WhatWG indicates that “browsers should render a URL’s host by running domain to Unicode with the URL’s host and false.” [https://url.spec.whatwg.org/#url-rendering-i18n].
>
> The lack of a standard method here means that (a) most code won’t render the URLs the way a human would recognize them, and (b) those who do will run to inefficient and likely-incomplete user-space code to try and decode/render these hosts.
>
> It may be something fine for a follow-up to this work, but it’s also something I personally consider essential for any native support of handling URLs that are destined for human review. If sending to an `href` attribute it should be the normalized URL; but if displayed as text it should be easy to prevent tricking people in this way.
>
> In my HTML decoding RFC I tried to bake in this decision in the type of the function using an enum.
> Since I figured most people are unaware of the role of the context in which HTML text is decoded, I found the enum to be a suitable convenience as well as an educational tool.
>
> ```
> $url->toString( Uri\WhatWg\RenderContext::ForHumans ); // 䕮䕵䕶䕱.com
> $url->toString( Uri\WhatWg\RenderContext::ForMachines ); // xn--google.com
> ```
>
> The names probably are terrible in all of my code snippets, but at this point I’m not proposing actual names, just code samples good enough to illustrate the point. By forcing a choice here (no default value) someone will see the options and probably make the right call.
>
> ----
>
> This is all looking quite nice. I’m happy to see how the RFC continues to develop, and I’m eagerly looking forward to being able to finally rely on PHP’s handling of URLs.
>
> Happy new year,
> Dennis Snell

Hi Dennis,

> I’m curious to hear from folks here what fraction of the actual PHP code deals with RFC3986 URLs, and of those, if the systems using them are truly RFC3986 systems or if the common-enough URLs are valid in both specs.

Here's my take on both specs. RFC3986/87 is a "parsing" RFC which leaves validation to each individual scheme. For instance, the following URL is valid under RFC3986 but will be problematic under the WHATWG URL spec:

```
ldap://ldap1.example.net:6666/o=University%20of%20Michigan,c=US??sub?(cn=Babs%20Jensen)
```

The LDAP URL is RFC3986 compliant, but the LDAP scheme adds its own validation rules on top of the RFC. This means that LDAP URL generation would be problematic if we only implemented the WHATWG spec, which is why having an RFC3986/87 URI in PHP is crucial.

Furthermore, the WHATWG spec not only parses but also validates at the same time, and it normalizes the URL more aggressively. RFC3986 does not do this itself; more precisely, it recognizes normalizations and sorts them into two categories, the non-destructive and the destructive ones. These normalizations affect the scheme, the path and also the host, which can have a real impact on your application.
```
For the following URL 'https://0073.0232.0311.0377/b'

RFC3986:    'https://0073.0232.0311.0377/b'
WHATWG URL: 'https://59.154.201.255/b'
```

So this can be a source of confusion for developers.

Last but not least, RFC3986 alone will never be able to parse IDN domain names; it requires RFC3987 support for IDN domains to do so.

Hopefully with those examples you will understand the strengths and weaknesses of each spec and why, IMHO, PHP needs both to be up to date.
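
To make the host example above more concrete, here is a rough userland sketch of how the WHATWG-style IPv4 handling turns `0073.0232.0311.0377` into `59.154.201.255`. The function names `parseIpv4NumberSketch()` and `normalizeIpv4HostSketch()` are hypothetical and are not part of the RFC or of any extension. The sketch only covers dotted octal, hexadecimal and decimal parts, assumes a 64-bit PHP 8 build, and skips everything else the WHATWG host parser does.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: parse one dotted part as a WHATWG-style IPv4 number.
 * "0x" prefix means hexadecimal, a leading "0" means octal, otherwise decimal.
 * Returns null when the part is not a valid number in its radix.
 */
function parseIpv4NumberSketch(string $part): ?int
{
    if ($part === '') {
        return null;
    }
    $radix = 10;
    $digits = $part;
    if (strlen($part) >= 2 && strncasecmp($part, '0x', 2) === 0) {
        $radix = 16;
        $digits = substr($part, 2);
    } elseif (strlen($part) >= 2 && $part[0] === '0') {
        $radix = 8;
        $digits = substr($part, 1);
    }
    if ($digits === '') {
        return 0; // a bare "0x" counts as zero
    }
    $pattern = match ($radix) {
        16 => '/^[0-9a-f]+$/Di',
        10 => '/^[0-9]+$/D',
        8  => '/^[0-7]+$/D',
    };
    if (preg_match($pattern, $digits) !== 1) {
        return null;
    }
    if (strlen(ltrim($digits, '0')) > 10) {
        return null; // far beyond 32 bits, reject before converting
    }
    return intval($digits, $radix);
}

/**
 * Hypothetical helper: returns the dotted-decimal form of a numeric host,
 * or null when the host is not an IPv4 address in the WHATWG sense.
 */
function normalizeIpv4HostSketch(string $host): ?string
{
    $parts = explode('.', $host);
    if (count($parts) > 1 && end($parts) === '') {
        array_pop($parts); // a single trailing dot is tolerated
    }
    if (count($parts) > 4) {
        return null;
    }
    $numbers = [];
    foreach ($parts as $part) {
        $number = parseIpv4NumberSketch($part);
        if ($number === null) {
            return null;
        }
        $numbers[] = $number;
    }
    // The last number may span all the remaining bytes of the address.
    $last = array_pop($numbers);
    foreach ($numbers as $number) {
        if ($number > 255) {
            return null;
        }
    }
    if ($last >= 256 ** (4 - count($numbers))) {
        return null;
    }
    $address = $last;
    foreach ($numbers as $index => $number) {
        $address += $number * 256 ** (3 - $index);
    }
    return implode('.', [
        ($address >> 24) & 255,
        ($address >> 16) & 255,
        ($address >> 8) & 255,
        $address & 255,
    ]);
}

echo normalizeIpv4HostSketch('0073.0232.0311.0377'), "\n"; // 59.154.201.255
```

An RFC3986 parser would leave this host string untouched, so two components that each follow a different spec can end up disagreeing about which host a URL points to.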