Newsgroups: php.internals
From: Dennis Snell <dennis.snell@automattic.com>
To: Máté Kocsis
Cc: Internals
Date: Tue, 25 Mar 2025 16:53:08 -0700 (PDT)
Subject: Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API




On Mar 25, 2025, at 4:06 PM, Dennis Snell <dennis.snell@automattic.com> wrote:

On Mar 25, 2025, at 3:23 PM, Máté Kocsis <kocsismate90@gmail.com> wrote:

Hi Dennis,

I am myself also a bit lost on the countless names that I tried out in the implementation, but I think I had toHumanFriendlyString() and toDisplayFriendlyString() methods at some point. These then ended up being toString() and toDisplayString() after some iterations. I would be ok with renaming getHost() and toString() so that their names suggest they don't use IDNA, but I'd clearly need a good enough suggestion, since neither "MachineFriendly", nor "NonDisplayable" sound like the best alternative for me. I was also considering using getIdnaHost() and toIdnaString(), but I realized these are the worst looking names I have come up with so far.

What about getPunycodeHost(), getUnicodeHost(), toPunycodeString(), toUnicodeString()? Or getAsciiHost() and toAsciiString() may also work. These are the best names I managed to come up with so far.

In the meantime, I renamed RFC 3986's toString() methods too according to another suggestion:
- toString() became toRawString()
- toNormalizedString() became toString()

The new names mirror exactly what their getter counterparts do.

Máté

Hi Máté,

I’ve been pondering these names for the past week and a half and I couldn’t think of anything, but at first glance I like getUnicodeHost() and getAsciiHost(). These communicate a little of the nuance, though they aren’t totally in-your-face (in this case I wish there were a more obvious pair that is).

Other pairs I was toying with but don’t like are:
 - getPrintHost() / getDataHost()
 - getDisplayHost() / getAPIHost()
 - getDisplayHost() / getEncodedHost()
 - getDisplayHost() / getEscapedHost()

(the same pairs would apply to the other methods, like toDisplayString() / toEncodedString())

This seems to be taking a lot of effort and time, but thank you still for engaging with it. Naming is hard, but it’s worth it.

Just for fun, I tossed this into DeepSeek-R1 671B:

> WHATWG URLs have two representations: one for humans and one for machines. The reason for having two is that URLs may have IDNA domains which are punycode encoded and there are security issues around showing that to humans. For example, if a person reads "https://xn--google.com" they may assume that the domain belongs to Google, when in fact it points to "https://䕮䕵䕶䕱.com". You are a modern programming language designer working on a standard library to expose a URL parser and you want the interface of this library to educate developers on where to use the appropriate representation. Given a URL object $u of class URL, propose two methods for converting that URL to a string. The name of the methods should communicate their use, and when a developer searches for the right method to get the string form, they should not be presented with a non-prefixed and prefixed pair like toString() and toHumanString(). Instead, the method names should form a kind of symmetric pair like toEncodedString() and toDisplayString(). Use your knowledge of WHATWG URL nuances, browser security issues, human developers making typical mistakes, and propose at least ten pairs of words that could be used for returning these two different representations.

A few of the ideas that it returned which stuck out were:

 - toDataString() / toViewString() and getDataHost() / getViewHost()
 - toSerializedString() / toReadableString() and getSerializedHost() / getReadableHost()
 - toProcessingString() / toSafeDisplayString() and getProcessingHost() / getSafeDisplayHost()
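
To make the shape of such a symmetric pair concrete, here is a minimal sketch in JavaScript on Node, chosen only because its built-in WHATWG parser is readily available. This is not the RFC's PHP API: the `DemoUrl` class is hypothetical, and the method names are simply borrowed from the first pair above. The point is that the wrapper offers no bare `toString()` or host accessor, so a caller has to pick a representation on purpose.

```javascript
const { domainToUnicode, format } = require('node:url');

// Hypothetical wrapper, for illustration only. It deliberately exposes no
// unprefixed toString() or host getter, so every call site names its intent.
class DemoUrl {
  #url;

  constructor(input) {
    // The built-in WHATWG parser keeps the host in its ASCII (Punycode) form.
    this.#url = new URL(input);
  }

  // Machine-facing host: ASCII form, for comparison, storage, and requests.
  getDataHost() {
    return this.#url.hostname;
  }

  // Human-facing host: IDNA labels decoded back to Unicode.
  getViewHost() {
    return domainToUnicode(this.#url.hostname);
  }

  // Machine-facing serialization of the whole URL.
  toDataString() {
    return this.#url.href;
  }

  // Human-facing serialization, with the host shown in Unicode.
  toViewString() {
    return format(this.#url, { unicode: true });
  }
}

const u = new DemoUrl('https://español.com/path');
console.log(u.getDataHost()); // xn--espaol-zwa.com
console.log(u.getViewHost()); // español.com
```

Whether dropping the unprefixed form entirely is acceptable for a standard library is a separate design question; the sketch only shows what the call sites look like when both names carry equal weight.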

After checking in the Gecko source code, I sadly only found helper methods which take a URL/URI and transform them:

 - prepareUrlForDisplay()
 - unEscapeURIForUI()

Node seems to punt on this by providing `url.format()` with a `{ unicode: boolean }` option. These all seem to miss the mark, in my opinion, because of how easy it is to assume that `toString()` or `.host` is what you’re after.
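
For reference, the Node behavior looks roughly like this. The snippet uses only the built-in `URL` class and `url.format()` from `node:url`; the example domain is an arbitrary choice for illustration.

```javascript
const url = require('node:url');

const parsed = new URL('https://español.com/search?q=1');

// The "obvious" accessors all return the ASCII (Punycode) form.
const asciiHost = parsed.host;
const asciiHref = parsed.toString();

// The Unicode form is opt-in, through a separate helper and an options flag.
const unicodeHref = url.format(parsed, { unicode: true });

console.log(asciiHost);   // xn--espaol-zwa.com
console.log(asciiHref);   // https://xn--espaol-zwa.com/search?q=1
console.log(unicodeHref); // https://español.com/search?q=1
```

Nothing at the call site of `toString()` or `.host` hints that a second, display-oriented form exists, which is the asymmetry described above.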

Thanks for entertaining the extra follow-up here.

Warmly,
Dennis Snell