Subject: Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API
From: nyamsprod@gmail.com (Ignace Nyamagana Butera)
To: Tim Düsterhus, Máté Kocsis
Cc: PHP Internals List
Date: Mon, 31 Mar 2025 21:15:47 +0200


On 30/03/2025 22:53, Ignace Nyamagana Butera wrote:


On 30/03/2025 14:42, Tim Düsterhus wrote:
Hi

On 2025-03-27 23:49, Ignace Nyamagana Butera wrote:
Hi Máté,

   for RFC 3986:
   https://datatracker.ietf.org/doc/html/rfc3986#section-5.3), and then
   this string is parsed and validated. Unfortunately, I recently
   realized that this approach may leave room for some kind of parsing
   confusion attack, namely when the scheme is for example "https", the
   authority is empty, and the path is "example.com".
   This will result in a https://example.com
   URI. I believe a similar bug is not possible with the rest of the
   components because they have their delimiters. So possibly some
   other solution will be needed, or maybe adding some additional
   validation (?).

This is not correct according to RFC 3986 (https://datatracker.ietf.org/doc/html/rfc3986#section-3):


*When authority is present, the path must either be empty or begin with a slash ("/") character. When authority is not present, the path cannot begin with two slash characters ("//").*

So your example should throw a Uri\InvalidUriException 🙂 for RFC 3986; for the WHATWG URL algorithm, it should trigger a soft error and correct the behaviour for the http(s) schemes.
This is also one of the many reasons why, at least for RFC 3986, the path component can never be `null`, but that's another discussion. As I said, a `fromComponents` named constructor would remove the need for a UriBuilder (in your future section) and would IMHO be useful outside the context of the http(s) schemes. I can understand it being left out of the current implementation; it might be brought back in a future improvement.

I just tested this with the implementation and it also appears to not yet be correct:

    var_dump((new Uri\Rfc3986\Uri("example.com"))->getHost()); // NULL
    var_dump((new Uri\Rfc3986\Uri("example.com"))->withScheme('https')->getHost()); // string(11) "example.com"
    var_dump((new Uri\Rfc3986\Uri("example.com"))->withScheme('https')->toRawString()); // string(19) "https://example.com"

and

    var_dump((new Uri\Rfc3986\Uri("foo/bar"))->withPath('//foo/bar')->getHost()); // string(3) "foo"
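
To make the expected behaviour concrete, here is a minimal sketch of component recomposition (RFC 3986 section 5.3) guarded by the authority/path rule quoted earlier in this thread. `recomposeUri` is a hypothetical helper written only for illustration; it is not part of the proposed extension, and it does not cover every RFC 3986 constraint (for example the rule about a colon in the first segment of a relative path).

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: recompose a URI string from its components and
 * reject the authority/path combinations that RFC 3986 forbids.
 */
function recomposeUri(
    ?string $scheme,
    ?string $authority,
    string $path,
    ?string $query = null,
    ?string $fragment = null
): string {
    if ($authority !== null && $path !== '' && $path[0] !== '/') {
        // With an authority, a rootless path would be glued onto the host.
        throw new InvalidArgumentException('With an authority, the path must be empty or start with "/".');
    }

    if ($authority === null && str_starts_with($path, '//')) {
        // Without an authority, "//foo/bar" would be read back as host "foo".
        throw new InvalidArgumentException('Without an authority, the path cannot start with "//".');
    }

    $uri = '';
    if ($scheme !== null) {
        $uri .= $scheme . ':';
    }
    if ($authority !== null) {
        // "//" is only emitted when an authority is actually defined.
        $uri .= '//' . $authority;
    }
    $uri .= $path;
    if ($query !== null) {
        $uri .= '?' . $query;
    }
    if ($fragment !== null) {
        $uri .= '#' . $fragment;
    }

    return $uri;
}
```

Under these assumptions, `recomposeUri('https', null, 'example.com')` yields `https:example.com` (no host), while `recomposeUri('https', '', 'example.com')` and `recomposeUri(null, null, '//foo/bar')` both throw instead of silently producing a host.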

Best regards
Tim Düsterhus

Hi Tim and Máté, upon further inspection and verification of RFC 3986, I also see an issue with the example used for normalization in your RFC. According to RFC 3986 (https://www.rfc-editor.org/rfc/rfc3986.html#section-3.2.2):

   The reg-name syntax allows percent-encoded octets in order to
   represent non-ASCII registered names in a uniform way that is
   independent of the underlying name resolution technology.  Non-ASCII
   characters must first be encoded according to UTF-8 [STD63], and then
   each octet of the corresponding UTF-8 sequence must be percent-
   encoded to be represented as URI characters.  URI producing
   applications must not use percent-encoding in host unless it is used
   to represent a UTF-8 character sequence.  When a non-ASCII registered
   name represents an internationalized domain name intended for
   resolution via the DNS, the name must be transformed to the IDNA
   encoding [RFC3490] prior to name lookup.

From this we can infer that:

- Percent-encoding in the host is only allowed to represent a UTF-8 character sequence, but your example uses "ex%61mple.com", which does not conform to that rule (i.e. IMHO it should throw an InvalidUriException for the Uri class). I presume that for the WHATWG URL it will be converted correctly with a soft error (??).

- When available, IDNA is preferred to percent-encoded sequences.
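
As a rough illustration of the first point, the check could look like the sketch below. It encodes one possible reading of the quoted paragraph: every percent-encoded octet in the host must be a non-ASCII byte, and the decoded host must be valid UTF-8. `hostUsesPercentEncodingOnlyForNonAscii` is a hypothetical name, and whether the implementation should be this strict is exactly what is up for discussion.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical check: percent-encoding in a reg-name is accepted only when
 * it represents non-ASCII octets that decode to a valid UTF-8 sequence.
 */
function hostUsesPercentEncodingOnlyForNonAscii(string $host): bool
{
    preg_match_all('/%([0-9A-Fa-f]{2})/', $host, $matches);

    foreach ($matches[1] as $hex) {
        if (hexdec($hex) < 0x80) {
            // An ASCII octet such as %61 ("a") is not part of a non-ASCII sequence.
            return false;
        }
    }

    // The "u" modifier makes PCRE reject subjects that are not valid UTF-8.
    return preg_match('//u', rawurldecode($host)) === 1;
}
```

With this reading, "ex%61mple.com" is rejected, while "%e4%bd%a0%e5%a5%bd.com" is accepted.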

Best regards

Ignace Nyamagana Butera


Hi Máté and all,

I spotted another inconsistency in the normalization under RFC 3986.

According to the RFC (https://www.rfc-editor.org/rfc/rfc3986.html#section-2.1):

For consistency, URI producers and normalizers should use uppercase hexadecimal
digits for all percent-encodings.

So during normalization, uppercase percent-encodings should be used for every component, which is not the case for the example in your RFC. See for instance:

$uri = Uri\Rfc3986\Uri::parse("https://%e4%bd%a0%e5%a5%bd%e4%bd%a0%e5%a5%bd.com"); // percent-encoded form of https://你好你好.com
echo $uri->toString();                             // https://%e4%bd%a0%e5%a5%bd%e4%bd%a0%e5%a5%bd.com

The `toString` method should return `https://%E4%BD%A0%E5%A5%BD%E4%BD%A0%E5%A5%BD.com` instead.
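
For reference, the case normalization itself is small. Below is a minimal sketch in plain PHP; `uppercasePercentEncoding` is a hypothetical helper, not something the extension exposes. It only touches the two hexadecimal digits of each percent-encoded triplet and leaves everything else as-is.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: uppercase the hex digits of every percent-encoded
 * triplet, as RFC 3986 section 2.1 recommends for normalizers.
 */
function uppercasePercentEncoding(string $component): string
{
    $normalized = preg_replace_callback(
        '/%[0-9A-Fa-f]{2}/',
        static fn (array $match): string => strtoupper($match[0]),
        $component
    );

    // preg_replace_callback() returns null on failure; fall back to the input.
    return $normalized ?? $component;
}
```

Applied to the string above, it returns `https://%E4%BD%A0%E5%A5%BD%E4%BD%A0%E5%A5%BD.com`.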


Best regards

Ignace Nyamagana Butera
