Newsgroups: php.internals
From: kocsismate90@gmail.com (Máté Kocsis)
Date: Wed, 2 Apr 2025 22:41:55 +0200
Subject: Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API
To: Ignace Nyamagana Butera
Cc: Tim Düsterhus, PHP Internals List

Hi Ignace,

> upon further inspection and verification of RFC3986 I also see an issue
> with the example used for normalization in the RFC. According to RFC3986
> (https://www.rfc-editor.org/rfc/rfc3986.html#section-3.2.2):
>
>    The reg-name syntax allows percent-encoded octets in order to
>    represent non-ASCII registered names in a uniform way that is
>    independent of the underlying name resolution technology.  Non-ASCII
>    characters must first be encoded according to UTF-8 [STD63], and then
>    each octet of the corresponding UTF-8 sequence must be percent-
>    encoded to be represented as URI characters.  URI producing
>    applications must not use percent-encoding in host unless it is used
>    to represent a UTF-8 character sequence.  When a non-ASCII registered
>    name represents an internationalized domain name intended for
>    resolution via the DNS, the name must be transformed to the IDNA
>    encoding [RFC3490] prior to name lookup.
>
> From this we can infer that:
>
> - Host encoding can only happen for UTF-8 sequence but in your example
>   "ex%61mple.com" is used which is not conforming to the rules (ie it
>   should throw an InvalidUriException IMHO for the Uri class) I presume
>   for WhatWg URL it will get correctly converted with a soft error (??).

Oh, that's a very interesting catch again. If your interpretation is
correct, then I think it must also be a bug in the parser library, but I
have to dig into the code first, or reach out to its author. :)
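
To make the host in question concrete, here is a minimal Python sketch. It
is not taken from the RFC or from any parser library, and the function name
host_pct_encoding_report is hypothetical. It only reports which
percent-encoded triplets in a host stand for plain ASCII octets and whether
the decoded bytes form valid UTF-8; whether a parser should reject such a
host is exactly the open question.

```python
import re
from urllib.parse import unquote_to_bytes

# Matches one percent-encoded octet such as "%61" or "%C3".
PCT_TRIPLET = re.compile(r"%[0-9A-Fa-f]{2}")


def host_pct_encoding_report(host: str) -> dict:
    """Describe the percent-encoded octets found in a host string.

    Hypothetical helper for illustration only; it does not validate the host.
    """
    triplets = PCT_TRIPLET.findall(host)
    # Triplets below 0x80 encode ASCII octets, e.g. "%61" is "a".
    ascii_triplets = [t for t in triplets if int(t[1:], 16) < 0x80]
    raw = unquote_to_bytes(host)
    try:
        decoded = raw.decode("utf-8")
    except UnicodeDecodeError:
        decoded = None  # the octets do not form a UTF-8 sequence
    return {
        "triplets": triplets,
        "ascii_triplets": ascii_triplets,
        "decoded": decoded,
        "valid_utf8": decoded is not None,
    }


report = host_pct_encoding_report("ex%61mple.com")
print(report["ascii_triplets"], report["decoded"])
```

For "ex%61mple.com" the only triplet is "%61", an ASCII "a", so the host
decodes to "example.com" without any non-ASCII character being involved.
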
I suspect, though, that the "URI producing applications" part may not
apply to this case; at least, I have a hard time deciding what this
expression really means. The RFC also uses "URI reference parsers", which
is a really straightforward name, while "URI producers" isn't. For example,
there is a paragraph in the RFC:

> URI producers and normalizers should omit the ":" delimiter that
> separates host from port if the port component is empty. Some schemes do
> not allow the userinfo and/or port subcomponents.

Clearly, omitting ":" is not done at parse time, but when a URI (reference)
is produced. So I find it possible that "URI producing" refers to when the
URI string is created, not when the URI is parsed, although the RFC usually
uses URI and URI reference consistently. So I'm not sure. Maybe it's a
typo, and it should have been "URI normalizers".
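
As a rough illustration of the producer-side reading, here is a small
Python sketch that builds an authority string from components that have
already been parsed. The helper name produce_authority and its parameters
are hypothetical; the only point is that the ":" is dropped when the string
is created, not while the input is parsed.

```python
def produce_authority(host: str, port: str = "", userinfo: str = "") -> str:
    """Build an authority string from already-parsed components.

    Hypothetical producer-side helper; it leaves out the ":" delimiter
    when the port component is empty.
    """
    authority = host
    if userinfo:
        authority = f"{userinfo}@{authority}"
    if port:
        # Only a non-empty port gets the ":" delimiter.
        authority = f"{authority}:{port}"
    return authority


print(produce_authority("example.com"))
print(produce_authority("example.com", "8080"))
```

A parser that accepted "example.com:" with an empty port could hand the
components to such a producer and get "example.com" back.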

Regards,
Máté
