From: nyamsprod@gmail.com (Ignace Nyamagana Butera)
Date: Sun, 30 Mar 2025 22:53:57 +0200
Subject: Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API
To: Tim Düsterhus
Cc: Máté Kocsis, PHP Internals List
Newsgroups: php.internals


On 30/03/2025 14:42, Tim Düsterhus wrote:
> Hi
>
> On 2025-03-27 23:49, Ignace Nyamagana Butera wrote:
>> Hi Máté,
>>
>>    for RFC 3986:
>>    https://datatracker.ietf.org/doc/html/rfc3986#section-5.3), and then
>>    this string is parsed and validated. Unfortunately, I recently
>>    realized that this approach may leave room for some kind of parsing
>>    confusion attack, namely when the scheme is for example "https", the
>>    authority is empty, and the path is "example.com". This will result
>>    in a https://example.com URI. I believe a similar bug is not
>>    possible with the rest of the components because they have their
>>    delimiters. So possibly some other solution will be needed, or
>>    maybe adding some additional validation (?).
>>
>> This is not correct according to RFC 3986, section 3
>> (https://datatracker.ietf.org/doc/html/rfc3986#section-3):
>>
>>    When authority is present, the path must either be empty or begin
>>    with a slash ("/") character. When authority is not present, the
>>    path cannot begin with two slash characters ("//").
>>
>> So in your example it should throw a Uri\InvalidUriException 🙂 for
>> RFC 3986, and in the case of the WhatwgUrl algorithm it should trigger
>> a soft error and correct the behaviour for the http(s) schemes.
>> This is also one of the many reasons why, at least for RFC 3986, the
>> path component can never be `null`, but that's another discussion.
>> Like I said, having a `fromComponents` named constructor would remove
>> the need for a UriBuilder (in your future scope section) and would
>> IMHO be useful outside of the context of the http(s) schemes. I can
>> understand it being left out of the current implementation; it might
>> be brought back as a future improvement.
>
> I just tested this with the implementation and it also appears to not
> yet be correct:
>
>     var_dump((new Uri\Rfc3986\Uri("example.com"))->getHost()); // NULL
>     var_dump((new Uri\Rfc3986\Uri("example.com"))->withScheme('https')->getHost()); // string(11) "example.com"
>     var_dump((new Uri\Rfc3986\Uri("example.com"))->withScheme('https')->toRawString()); // string(19) "https://example.com"
>
> and
>
>     var_dump((new Uri\Rfc3986\Uri("foo/bar"))->withPath('//foo/bar')->getHost()); // string(3) "foo"
>
> Best regards
> Tim Düsterhus

Hi Tim and Máté,

Upon further inspection and verification of RFC 3986, I also see an issue with the example used for normalization in the URL parsing RFC. According to RFC 3986, section 3.2.2 (https://www.rfc-editor.org/rfc/rfc3986.html#section-3.2.2):

   The reg-name syntax allows percent-encoded octets in order to
   represent non-ASCII registered names in a uniform way that is
   independent of the underlying name resolution technology.  Non-ASCII
   characters must first be encoded according to UTF-8 [STD63], and then
   each octet of the corresponding UTF-8 sequence must be percent-
   encoded to be represented as URI characters.  URI producing
   applications must not use percent-encoding in host unless it is used
   to represent a UTF-8 character sequence.  When a non-ASCII registered
   name represents an internationalized domain name intended for
   resolution via the DNS, the name must be transformed to the IDNA
   encoding [RFC3490] prior to name lookup.

From this we can infer that:

- Percent-encoding in a host may only be used to represent a UTF-8 character sequence, but your example uses "ex%61mple.com", which does not conform to that rule. IMHO the Uri class should throw an InvalidUriException here; I presume that for the WHATWG URL class it will be converted correctly with a soft error (?).

- When available, IDNA is preferred to percent-encoded sequences.
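
To make the first point concrete, here is a minimal sketch of the check I have in mind, assuming the strict reading above (percent-encoding in a host is acceptable only when it spells out non-ASCII UTF-8). The function name `hostPercentEncodingIsUtf8Only` is hypothetical and not part of the proposed API, and the sketch does not validate the rest of the host syntax (a stray "%" that is not followed by two hex digits is simply ignored).

```php
<?php
declare(strict_types=1);

/**
 * Returns true when every run of percent-encoded octets in $host decodes
 * to well-formed UTF-8 consisting of non-ASCII characters only.
 * Hypothetical helper for illustration; not part of the proposed Uri API.
 */
function hostPercentEncodingIsUtf8Only(string $host): bool
{
    // Collect each maximal run of consecutive %XX triplets.
    preg_match_all('/(?:%[0-9A-Fa-f]{2})+/', $host, $matches);

    foreach ($matches[0] as $run) {
        $bytes = rawurldecode($run);

        // An empty pattern with the /u flag only matches valid UTF-8 subjects.
        if (preg_match('//u', $bytes) !== 1) {
            return false;
        }

        // A decoded ASCII byte means the percent-encoding was not needed.
        if (preg_match('/[\x00-\x7F]/', $bytes) === 1) {
            return false;
        }
    }

    return true;
}
```

Under these assumptions, "ex%61mple.com" is rejected because %61 decodes to the ASCII letter "a", while "b%C3%BCcher.example" is accepted because %C3%BC is the UTF-8 encoding of "ü".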

Best regards

Ignace Nyamagana Butera
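
For reference, the section 3 rule quoted earlier in this thread, combined with the component recomposition from RFC 3986 section 5.3, can be sketched as follows. This is only an illustration under stated assumptions: `recomposeUri` is a hypothetical helper, not the proposed implementation, and it checks only the path/authority combinations, not the characters allowed in each component.

```php
<?php
declare(strict_types=1);

/**
 * Recompose URI components into a string (RFC 3986, section 5.3),
 * rejecting the path/authority combinations forbidden by section 3.
 * Hypothetical helper for illustration; not part of the proposed Uri API.
 */
function recomposeUri(
    ?string $scheme,
    ?string $authority,
    string $path,
    ?string $query = null,
    ?string $fragment = null
): string {
    // With an authority, the path must be empty or start with "/".
    if ($authority !== null && $path !== '' && $path[0] !== '/') {
        throw new InvalidArgumentException(
            'With an authority, the path must be empty or begin with "/".'
        );
    }

    // Without an authority, a leading "//" would be re-parsed as an authority.
    if ($authority === null && str_starts_with($path, '//')) {
        throw new InvalidArgumentException(
            'Without an authority, the path cannot begin with "//".'
        );
    }

    $uri = '';
    if ($scheme !== null) {
        $uri .= $scheme . ':';
    }
    if ($authority !== null) {
        $uri .= '//' . $authority;
    }
    $uri .= $path;
    if ($query !== null) {
        $uri .= '?' . $query;
    }
    if ($fragment !== null) {
        $uri .= '#' . $fragment;
    }

    return $uri;
}
```

With this sketch, `recomposeUri('https', null, 'example.com')` yields "https:example.com" (no "//" is added because there is no authority), and `recomposeUri(null, null, '//foo/bar')` throws instead of silently turning "foo" into a host.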

