Newsgroups: php.internals
Xref: news.php.net php.internals:125299
From: dennis.snell@automattic.com (Dennis Snell)
To: Máté Kocsis
Cc: Internals
Subject: Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API
Date: Mon, 26 Aug 2024 17:25:35 -0500
Message-ID: <1BCB4144-231D-45EA-A914-98EE8F0F503A@automattic.com>

> Hi Everyone,
>
> I've been working on a new RFC for a while now, and time has come to
> present it to a wider audience.
>
> Last year, I learnt that PHP doesn't have built-in support for parsing URLs
> according to any well established standards (RFC 1738 or the WHATWG URL
> living standard), since the parse_url() function is optimized for
> performance instead of correctness.
>
> In order to improve compatibility with external tools consuming URLs (like
> browsers), my new RFC would add a WHATWG compliant URL parser functionality
> to the standard library. The API itself is not final by any means, the RFC
> only represents how I imagined it first.
>
> You can find the RFC at the following link:
> https://wiki.php.net/rfc/url_parsing_api
>
> Regards,
> Máté

Máté, thanks for putting this together.

Whenever I need to work with URLs there are a few things missing that I would love to see incorporated into any change in PHP that brings us a spec-compliant parsing class.

First of all, I typically care most about WHATWG URLs because the PHP code I’m working with is making decisions about HTML that a browser will interpret. Paramount above all other concerns is that code on the server understands content in the same way that browsers will; otherwise we invite security issues. People may have valid critiques of the WHATWG specification, but it’s also the most relevant specification for users of much or most of the PHP code we write, and it’s valuable because it allows us to talk about URLs in the same way a browser would.

I’m worried about the side effects of a global `uri.default_handler` setting: code could run differently for no apparent reason, or differently depending on what is calling it. If someone is writing code for a controlled system I could see this being valuable, but if someone is writing a framework like WordPress and has no control over the environments in which code runs, it seems dangerous to hope that every plugin and every host runs compatible system configurations. Nobody is going to check `ini_get( 'uri.default_handler' )` before every line that parses URLs. Beyond this, even just allowing a pluggable parser invites broken deployments, because PHP code that is reading from a browser or sending output to one needs to speak the language the browser is speaking, not some arbitrary language that’s similar to it.

> One thing I feel is missing, is a method to parse a (partial) URL relative to another

Being able to parse a relative URL and know if a URL is relative or absolute would help WordPress, which often makes decisions differently based on this property (for instance, when reading the `href` attribute of a link). I know these aren’t spec-compliant URLs, but they still represent valid values for URL fields in HTML, and knowing whether they are relative requires re-implementing some amount of parsing at every call site, rather than once in a class that already parses URLs. Effectively, this would imply that PHP’s new URL parser decodes `document.querySelector( 'a' ).getAttribute( 'href' )`, with a result that should be the same as `document.querySelector( 'a' ).href`, and indicates whether it found a full URL or only a portion of one.

  * `$url->is_relative` or `$url->is_absolute`
  * `$url->specificity = URL::Relative | URL::Absolute`
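
As a rough illustration of the kind of ad-hoc check user-space code ends up writing today, here is a minimal sketch. The function name `href_is_absolute()` is hypothetical and not part of the RFC, and the test only approximates the WHATWG rules: it looks for a leading scheme after trimming the characters a browser would strip. It is not a spec-compliant parser.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not from the RFC): guesses whether an href value is an
 * absolute URL by checking for a leading scheme. Everything else, including
 * scheme-relative values such as "//example.com/x", is reported as relative.
 */
function href_is_absolute(string $href): bool
{
    // Browsers strip leading and trailing C0 control characters and spaces
    // before parsing, so do the same here.
    $href = trim($href, "\x00..\x20");

    // A scheme is a letter followed by letters, digits, "+", "-" or ".",
    // and is terminated by ":".
    return 1 === preg_match('/^[a-z][a-z0-9+.\-]*:/i', $href);
}

var_dump(href_is_absolute('https://example.com/logo.png')); // bool(true)
var_dump(href_is_absolute('mailto:someone@example.com'));   // bool(true)
var_dump(href_is_absolute('/wp-admin/index.php'));          // bool(false)
var_dump(href_is_absolute('../images/logo.png'));           // bool(false)
```

Every project that carries its own copy of a check like this can drift from what the browser does, which is the argument for exposing the answer from the parser itself.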

> the URI parser libraries used don't support modification of the URI

Having methods to add query arguments, change the path, etc. would be a great way to simplify user-space code working with URLs. For instance, read a URL and then add a query argument if some condition within the URL warrants it (for example, the path ends in `.png`).

Is the intent to add this to the RFC before it’s finalized?
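
For reference, the `.png` case mentioned above looks roughly like this with the functions PHP ships today. This is only a sketch under assumptions: `add_query_arg_to_png()` is a hypothetical name, it relies on `parse_url()` (which is not WHATWG-compliant), and it needs PHP 8.0 or later for `str_ends_with()`.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: adds a query argument when the URL path ends in ".png".
 * Limitations: parse_str() rewrites dots and spaces in argument names to
 * underscores, and http_build_query() re-encodes every existing argument,
 * so the output may not be byte-identical to the input.
 */
function add_query_arg_to_png(string $url, string $name, string $value): string
{
    $parts = parse_url($url);
    if (!is_array($parts) || !str_ends_with($parts['path'] ?? '', '.png')) {
        return $url; // unparseable, or not a PNG path: leave untouched
    }

    // Decode the existing query, then add or overwrite the argument.
    parse_str($parts['query'] ?? '', $args);
    $args[$name] = $value;

    // Reassemble the authority (userinfo, host, port) if there was a host.
    $authority = '';
    if (isset($parts['host'])) {
        $userinfo = $parts['user'] ?? '';
        if (isset($parts['pass'])) {
            $userinfo .= ':' . $parts['pass'];
        }
        $authority = '//' . ('' !== $userinfo ? $userinfo . '@' : '')
            . $parts['host']
            . (isset($parts['port']) ? ':' . $parts['port'] : '');
    }

    return (isset($parts['scheme']) ? $parts['scheme'] . ':' : '')
        . $authority
        . $parts['path']
        . '?' . http_build_query($args)
        . (isset($parts['fragment']) ? '#' . $parts['fragment'] : '');
}

echo add_query_arg_to_png('https://example.com/images/logo.png?v=2#top', 'size', 'large'), "\n";
// https://example.com/images/logo.png?v=2&size=large#top
```

Most of the function is reassembly code that has nothing to do with the actual intent, which is the part a modification API on the URL class could absorb.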

> I would not make Url final. "OMG but then people can extend it!" Exactly.

My counterpoint to this argument is that I see security exploits appear wherever functions that implement specifications are pluggable and extensible. It’s easy to see the need to create a class that limits possible URLs, but that also doesn’t require extending a class. A class can wrap a URL parser just as it could extend one. Magic methods would make it even easier.

A problem that can arise when layering additional rules onto a specification like this is that the subclass gets used in more places than it should, and then somewhere some PHP code allows a malicious URL because it failed to parse and the inspection rules were never applied.
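
A minimal sketch of the wrapping approach, under loud assumptions: `ParsedUrl` is a hypothetical stand-in for whatever final class the RFC ends up with (here it is built on `parse_url()` only so the example runs), and `HttpsOnlyUrl` is an invented example of a class that limits possible URLs. Neither name comes from the RFC.

```php
<?php
declare(strict_types=1);

/** Hypothetical stand-in for a final, spec-compliant URL class. */
final class ParsedUrl
{
    private function __construct(private array $parts) {}

    public static function parse(string $url): ?self
    {
        $parts = parse_url($url);
        return is_array($parts) && isset($parts['scheme'], $parts['host'])
            ? new self($parts)
            : null;
    }

    public function getScheme(): string { return strtolower($this->parts['scheme']); }
    public function getHost(): string { return strtolower($this->parts['host']); }
}

/** Wraps the parser instead of extending it; the extra rule lives in one place. */
final class HttpsOnlyUrl
{
    private function __construct(private ParsedUrl $inner) {}

    public static function parse(string $url): ?self
    {
        $inner = ParsedUrl::parse($url);
        // A URL that fails to parse is rejected here, so it can never
        // bypass the inspection rule below.
        if (null === $inner || 'https' !== $inner->getScheme()) {
            return null;
        }
        return new self($inner);
    }

    /** Magic method: forwards every other public call to the wrapped parser. */
    public function __call(string $method, array $arguments): mixed
    {
        return $this->inner->$method(...$arguments);
    }
}

var_dump(HttpsOnlyUrl::parse('https://Example.com/x')?->getHost()); // string(11) "example.com"
var_dump(HttpsOnlyUrl::parse('http://example.com/'));               // NULL
```

Because the only way to obtain an `HttpsOnlyUrl` is through its own `parse()`, code that receives one can rely on the rule having been checked, while the underlying parser stays final.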

----

Finally, I frequently find the need to be able to consider a URL in both the display context and the serialization context. With Ada we have `normalize_url()`, `parse_search_params()`, and the IDNA functions to convert between the two representations. In order to keep strong boundaries between security domains, it would be nice if PHP could expose the two variations: one is an encoded form of a URL that machines can easily parse, while the other is a “plain string” in PHP that’s easier for humans to read but which might not even be a valid URL. Part of the reason for this need is that I often see user-space code treating an entire URL as a single text span that requires one set of rules for full decoding; in reality, a URL is multiple segments that each have their own decoding rules.

 - Original [ https://xn--google.com/secret/../search?q=🍔 ]
 - `$url->normalize()` [ https://xn--google.com/search?q=%F0%9F%8D%94 ]
 - `$url->for_display()` [ https://䕮䕵䕶䕱.com/search?q=🍔 ]

Having this in the RFC would give everyone the tools they need to effectively and safely set links within an HTML document.
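
To make the per-segment point concrete, here is a small sketch that handles only the query component. The helper names are hypothetical, `rawurlencode()`/`rawurldecode()` stand in for the spec's percent-encoding rules, and a host would need separate IDNA handling (for example via the intl extension) that is not shown.

```php
<?php
declare(strict_types=1);

/** Hypothetical helper: encodes one plain-text query value for the serialized URL. */
function encode_query_value(string $plain): string
{
    return rawurlencode($plain);
}

/**
 * Hypothetical helper: decodes a query string for display to a human.
 * Each name and value is decoded on its own, after splitting on the
 * delimiters, so an encoded "&" or "=" cannot change the structure.
 */
function query_for_display(string $query): string
{
    $pairs = [];
    foreach (explode('&', $query) as $pair) {
        if ('' === $pair) {
            continue; // skip empty pieces such as a trailing "&"
        }
        // Split into name and optional value, then decode each piece separately.
        $pieces  = explode('=', $pair, 2);
        $pairs[] = implode('=', array_map('rawurldecode', $pieces));
    }
    return implode('&', $pairs);
}

echo encode_query_value("\u{1F354}"), "\n";         // %F0%9F%8D%94
echo query_for_display('q=%F0%9F%8D%94'), "\n";     // q=🍔

// The display form is for reading only: here it is no longer a faithful
// query string, because the decoded "&" looks like a separator.
echo query_for_display('q=fish%26chips&lang=en'), "\n"; // q=fish&chips&lang=en
```

The last line shows why the two forms belong on opposite sides of a boundary: the display string reads well but cannot be parsed back into the same two arguments.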

----

All the best,
Dennis Snell
