From: dennis.snell@automattic.com (Dennis Snell)
To: Máté Kocsis
Cc: Internals
Subject: Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API
Date: Fri, 3 Jan 2025 09:18:33 +0200
List-Id: internals.lists.php.net
Message-ID: <8E614C9C-BA85-45D8-9A4E-A30D69981C5D@automattic.com>
In-Reply-To: <1BCB4144-231D-45EA-A914-98EE8F0F503A@automattic.com>

It seems that I’ve mucked up the mailing list again by deleting an old message I intended on replying to. Apologies all around for replying to an older message of my own.

Máté, thanks for your continued work on the URL parsing RFC. I’m only now returning from a bit of an extended leave, so I appreciate your diligence and patience. Here are some thoughts in response to your message from Nov. 19, 2024.

> even though the majority does, not everyone builds a browser application with PHP, especially because URIs are not necessarily accessible on the web

This has largely been touched on elsewhere, but I will echo the idea that it seems valid to have two separate parsers for the two standards, and truly they diverge enough that it seems like it could be only a superficial thing for them to share an interface.

I only harp on the WhatWG spec so much because for many people this will be the only one they are aware of, if they are aware of any spec at all, and this is a sizable vector of attack targeting servers from user-supplied content. I’m curious to hear from folks here what fraction of the actual PHP code deals with RFC3986 URLs, and of those, whether the systems using them are truly RFC3986 systems or whether the common-enough URLs are valid in both specs.

Just to enlighten me and possibly others with less familiarity, how and when are RFC3986 URLs used and what are those systems supposed to do when an invalid URL appears, such as when dealing with percent-encodings as you brought up in response to Tim?

Coming from the XHTML/HTML/XML side I know that there was substantial effort to enforce standards on browsers and that led to decades of security exploits and confusion, when the “official” standards never fully existed in the way people thought. I don’t mean to start any flame wars, but is the URL story at all similar here?

I’m mostly worried that we could accidentally encourage risky behavior for developers who aren’t familiar with the nuances of having two URL specifications vs. having the simplest, least-specific interface point them in the right direction for what they will probably be doing. `parse_url()` is a great example of how the thing that looks _right_ is actually terribly prone to failure.

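To make the `parse_url()` point concrete, here is a small sketch using only the existing function. The input string is just an illustration I picked; the point is that a scheme-less string many people would call a URL is accepted without complaint, and the host-looking text ends up in `path`.

```php
<?php
// A string that "looks like" a URL but has no scheme.
$parts = parse_url('example.com/path?x=1');

// parse_url() does not signal a problem here. It returns an array in
// which there is no 'host' key and the domain-looking text is treated
// as the start of the path.
var_dump(isset($parts['host'])); // false
var_dump($parts['path']);        // "example.com/path"
var_dump($parts['query']);       // "x=1"
```

Code that then reads `$parts['host']` to make an allow-list or redirect decision is working from a value that was never there.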
> The Uri\WhatWgUri::parse() method accepts a (relative) URI parameter when the 2nd (base URI) parameter is provided. So essentially you need to use this variant of the parse() method if you want to parse a WhatWg compliant URL

If this means passing something like the following then I suppose it’s okay. It would be nice to be able to know without passing the second parameter, as there are a multitude of cases where no such base URL would be available, and some dummy parameter would need to be provided.

```
$url = Uri\WhatWgUri::parse( $url, 'https://example.com' );
var_dump( $url->is_relative_or_something_like_that );
```

This would be fine, knowing in hindsight that it was originally a relative path. Of course, this would mean that it’s critical that `https://example.com` does not replace the actual host part if one is provided in `$url`. For example, this code should work.

```
$url = Uri\WhatWgUri::parse( 'https://wiki.php.net/rfc', 'https://example.com' );
$url->domain === 'wiki.php.net'
```

> The forDisplay() method also seems to be useful at the first glance, but since this may be a controversial optional feature, I'd defer it for later…

Hopefully this won’t be too controversial, even though the concept was new to me when I started having to reliably work with URLs. I chose the example I did because of human risk factors in security exploits. "xn--google.com" is not in fact a Google domain, but an IDNA domain decoding to "䕮䕵䕶䕱.com".

This is a misleading URL to human readers, which is why the WhatWG indicates that “browsers should render a URL’s host by running domain to Unicode with the URL’s host and false.” [https://url.spec.whatwg.org/#url-rendering-i18n].

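As a rough sketch of what that rendering step can look like today, the intl extension’s `idn_to_utf8()` performs a similar conversion. The helper name `hostForDisplay()` is made up for illustration, it assumes intl may or may not be loaded, and it is not a claim about how the RFC would implement anything. Whether a given label decodes also depends on the ICU version in use, so I’m not asserting the output for the example host.

```php
<?php
// Hypothetical helper: return a host in the form a human should see.
// Falls back to the ASCII form when intl is missing or decoding fails.
function hostForDisplay(string $asciiHost): string
{
    if (!function_exists('idn_to_utf8')) {
        return $asciiHost; // intl extension not available
    }

    $unicode = idn_to_utf8($asciiHost, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);

    // idn_to_utf8() returns false on failure.
    return $unicode === false ? $asciiHost : $unicode;
}

echo hostForDisplay('xn--google.com'), "\n";
echo hostForDisplay('example.com'), "\n";
```

Note that this only handles the host; a caller still has to pull the host out of the URL correctly first, which is the part that tends to go wrong.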
The lack of a standard method here means that (a) most code won’t render the URLs the way a human would recognize them, and (b) those who do will reach for inefficient and likely-incomplete user-space code to try and decode/render these hosts.

It may be something fine for a follow-up to this work, but it’s also something I personally consider essential for any native support of handling URLs that are destined for human review. If sending to an `href` attribute it should be the normalized URL; but if displayed as text it should be easy to prevent tricking people in this way.

In my HTML decoding RFC I tried to bake in this decision in the type of the function using an enum. Since I figured most people are unaware of the role of the context in which HTML text is decoded, I found the enum to be a suitable convenience as well as educational tool.

```
$url->toString( Uri\WhatWg\RenderContext::ForHumans );   // 䕮䕵䕶䕱.com
$url->toString( Uri\WhatWg\RenderContext::ForMachines ); // xn--google.com
```

The names probably are terrible in all of my code snippets, but at this point I’m not proposing actual names, just code samples good enough to illustrate the point. By forcing a choice here (no default value) someone will see the options and probably make the right call.

----

This is all looking quite nice. I’m happy to see how the RFC continues to develop, and I’m eagerly looking forward to being able to finally rely on PHP’s handling of URLs.

Happy new year,
Dennis Snell