Newsgroups: php.internals
From: kocsismate90@gmail.com (Máté Kocsis)
Date: Mon, 5 May 2025 23:32:33 +0200
Subject: Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API
To: "Paul M. Jones"
Cc: PHP Internals List
Hi Paul,

> I would not presume that the dedicated value objects are what "makes the
> [Rowbot] library much slower" than the RFC -- instead, my first intuition
> is that the *parsing* operations are slower in userland than in C, and are
> primarily responsible for the comparative slowness. Speedwise, creation of
> multiple objects from the parsed results would be a rounding error compared
> to the parsing itself.

Yes, I may have arrived at the wrong conclusion based on the right factors:
the Rowbot library uses objects not just for representing the components,
but even for the parser states and other things, whereas in the C library,
parsing is just an enormous switch-case. I know that instantiating objects
doesn't take a lot of time, but I guess the performance difference between
very nicely written, fully OO PHP code and optimized C code becomes very
noticeable with a larger number of iterations. Anyway, I shouldn't have
tried to compare the performance of the two solutions, since it's really
not a fair comparison, and not the main point.

> I think that's fair. The main thing that stands out to me is not the
> Scheme, Host, etc. value objects, but that the RFC presents no UrlRecord --
> which is very definitely part of the WHATWG-URL specification. That is,
> from reading the spec, I'd expect to see a UrlRecord, and a Url composed
> from it.

I believe the UrlRecord is a minor detail of the specification that can be
omitted without sacrificing anything useful: having a record in addition to
the URL class doesn't bring much to the table. For similar reasons, the RFC
doesn't implement the WHATWG getters either, and the pure components are
exposed instead (the "Component retrieval" section discusses this). So the
RFC does not entirely implement the API prescribed by the WHATWG URL spec;
however, it accurately follows the parsing details -- which is the main
benefit in my opinion.
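For readers less familiar with the distinction, the following is a minimal sketch of a record holding the pure components and a URL class composed from it that exposes WHATWG-style getters. It is written in Python purely for illustration; the class names `UrlRecord` and `Url`, and the reduced set of fields, are assumptions made for this example and are not the API proposed in the RFC.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UrlRecord:
    """Hypothetical, trimmed-down record holding the pure components."""
    scheme: str
    host: Optional[str] = None
    query: Optional[str] = None
    fragment: Optional[str] = None


class Url:
    """Hypothetical wrapper exposing WHATWG-style getters over a record."""

    def __init__(self, record: UrlRecord) -> None:
        self._record = record

    @property
    def protocol(self) -> str:
        # WHATWG-style getter: the scheme followed by ":".
        return self._record.scheme + ":"

    @property
    def search(self) -> str:
        # WHATWG-style getter: "" for a missing or empty query, else "?" + query.
        query = self._record.query
        return "?" + query if query else ""

    @property
    def hash(self) -> str:
        # WHATWG-style getter: "" for a missing or empty fragment, else "#" + fragment.
        fragment = self._record.fragment
        return "#" + fragment if fragment else ""


record = UrlRecord(scheme="https", host="example.com", query="a=1")
url = Url(record)
print(record.scheme, url.protocol)  # pure component vs. getter
print(record.query, url.search)
```

The pure components (`scheme`, `query`) carry no delimiters, while the getters add them back, which is the difference the "Component retrieval" section is concerned with.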
> Meanwhile, AFAICT, neither Rowbot nor the RFC provide a percent *en*coding
> mechanism, for consumers to put together properly-encoded values. Have I
> missed it in the RFC, or is it somehow not necessary, or something else?

Percent-encoding is usually done automatically for WHATWG URLs (even if soft
errors may be triggered during the process), so it has not been a top
priority for me just yet. But I definitely want to include some sort of
percent-encoding support in the follow-up I plan. In any case, thanks for
raising awareness of this topic.
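As a rough illustration of what a consumer-facing percent-encoding mechanism does, here is a minimal sketch using Python's standard `urllib.parse.quote`. This is not the RFC's API, and `quote` leaves only its own fixed set of unreserved characters unescaped rather than applying the WHATWG percent-encode sets; the sample values and the `example.com` URL are made up for the example.

```python
from urllib.parse import quote, unquote

# Hypothetical raw values a consumer wants to place into a URL.
raw_segment = "my file ü.txt"
raw_query_value = "a&b=c"

# safe="" makes quote() escape every reserved character, including "/".
encoded_segment = quote(raw_segment, safe="")
encoded_query_value = quote(raw_query_value, safe="")

url = f"https://example.com/files/{encoded_segment}?q={encoded_query_value}"
print(url)
# https://example.com/files/my%20file%20%C3%BC.txt?q=a%26b%3Dc

# Decoding restores the original value.
assert unquote(encoded_segment) == raw_segment
```

Encoding each component before assembling the URL keeps characters such as "&" and "=" inside a value from being mistaken for delimiters.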

> Because it is part of the WHATWG-URL spec, I think it deserves first-class
> treatment in this RFC ...

Having yet another class in the proposal would open the possibility for a
whole lot of new discussion. We should draw the line somewhere in order not
to waste everyone's time, or the PHP Foundation's budget, any longer, should
the RFC fail for any reason. And I am drawing the line here, since it's a
nice-to-have feature, and we have a meaningful set of functionality even
without it.
> Which leads to my last point: I would really like to see at least two
> separate RFCs here. They would be a lot easier to review and critique
> that way:
>
> - one for dealing with URIs as they exist now, especially one that honors
>   the ways-of-working that exist in userland; and,
> - one for dealing with WHATWG-URL in its entirety, with all its
>   differences (some subtle, some not) from URIs.
>
> I can see arguments for either one being the "base" on which the other
> would build.

I might have agreed to pursue two separate RFCs a few months ago, but not
anymore, so close to the very end. I should mention, though, that the
original RFC tried to deal with WHATWG URLs only; RFC 3986 URIs were added
later, due to public demand. Possibly I should have stepped in around the
time when I included RFC 3986 support. However, I have to mention that
working on both specifications in parallel helped me understand a lot of the
subtle differences between them, and after bringing these differences to the
surface, the final API design could reflect and tackle them.

Regards,
Máté
