Newsgroups: php.internals
From: kocsismate90@gmail.com (Máté Kocsis)
Date: Mon, 26 Aug 2024 09:40:56 +0200
Subject: Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API
To: ignace nyamagana butera
Cc: Niels Dossche, internals@lists.php.net

Hi Ignace, Niels,

Sorry for being silent for so long; I was working hard on the implementation, besides some summer activities :) I made really good progress in the last month, and I think (hope) that I have managed to address most of the concerns and suggestions people raised in this thread. To summarize the most important changes:

- The uriparser library is now used for parsing URIs based on RFC 3986.
- I renamed the extension from "url" to "uri" in order to make the name more generic and to express the new use-case.
- There is no Url\UrlParser class anymore. The Uri\Uri class now includes the relevant factory methods.
- Uri\Uri is now an abstract class which is extended by two concrete classes: Uri\Rfc3986Uri and Uri\WhatwgUri.
- WHATWG URL parsing now returns the exact error code according to the specification (a reference parameter is used for this for now, but this is TBD).
- As suggested by Niels, it's now possible to plug a URI parsing implementation into PHP. A new uri.default_handler INI option is also added. Currently, integration is only implemented for FILTER_VALIDATE_URL though. The approach also makes it possible to register additional 3rd party libraries for parsing URIs (like ADA URL).
- Performance appears to have improved significantly, according to the rough benchmarks performed in CI.

Please re-read the RFC, as it goes into more detail than my quick summary above: https://wiki.php.net/rfc/url_parsing_api

There are some questions I still didn't manage to find an answer for though.
Most importantly, the URI parser libraries used don't support modification of the URI. That's why I had to get rid of the "wither" methods for now, which were originally part of the API. I think it's unfortunate, and I'll do my best to reclaim them.

Additionally, due to technical reasons, extending the Uri\Uri class in userland is only possible if all the methods are overridden by the child. This is because I had to use "computed" properties in the implementation (roughly, they are stored in an internal C struct, unlike regular properties). That's why it may be better if userland code could use (and possibly implement) a Uri\Uri interface instead.

In one of my previous emails, I had some concerns about whether RFC 3986 and the WHATWG spec can really share the same interface (they do in my current implementation, even though they are different classes). I still have this concern, because WHATWG specifies the "user" and "password" URL components, while RFC 3986 only specifies the notion of "userinfo" (which is usually just user:password, but that's not necessarily the case as far as I understand). The implementation of the RFC 3986 parser in my RFC currently splits the "userinfo" component at the ":" character, but doing so doesn't seem very spec compliant.

Arnaud suggested that it would be better if the query parameters could be retrieved both escaped and unescaped after parsing. I haven't had time to investigate the possibilities, but my gut feeling is that it's only possible to achieve with some custom code. Arnaud also had questions regarding canonicalization. Currently, it's not performed when calling the __toString() method, because only the uriparser library supports this feature, and I didn't want the two implementations to diverge. I'm not even sure that it's a good idea to always do it, so I'm thinking about the possibility of selectively enabling this feature (e.g. by adding a separate "toCanonizedString" method).

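To make the userinfo concern above more concrete, here is a minimal sketch of what "splitting at the ':' character" could look like. It is written in Python purely for illustration; split_userinfo and Credentials are hypothetical names, and splitting at the first ":" is an assumption, not necessarily what the RFC's implementation does.

```python
from typing import NamedTuple, Optional


class Credentials(NamedTuple):
    """Hypothetical user/password pair derived from an RFC 3986 userinfo."""
    user: str
    password: Optional[str]


def split_userinfo(userinfo: str) -> Credentials:
    """Split a userinfo string at its first ':' (illustrative helper only)."""
    user, sep, password = userinfo.partition(":")
    # Without any ':' there is no password component at all.
    return Credentials(user, password if sep else None)


# A few inputs that are all valid userinfo under the RFC 3986 grammar.
for raw in ("alice:secret", "alice", "alice:", "a:b:c"):
    print(f"{raw!r} -> {split_userinfo(raw)}")
```

The RFC 3986 grammar allows any number of ":" characters inside userinfo, so an input like "a:b:c" only gets a user/password reading because of the convention chosen here, which is the part that doesn't feel spec compliant.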

Regards,
Máté