[RFC] [Under Discussion] New Curl URL API

3 years ago by Craig Francis — view source

unread

Here is the RFC that reflects our current conversations.

https://wiki.php.net/rfc/curl-url-api

Feel free to give any feedback, concern or support :-)

Thanks Pierrick,

I think this is a good approach to add the URL functionality to PHP 8.2 with a fairly simple CurlUrl object, just like CurlFile. This would allow developers to avoid the parsing vulnerability issues (as in, using two different libraries that can parse a URL in different ways)... then the completely new, and fully OOP version, would be done at a later date, after a lot more discussion / planning.

Craig

3 years ago by Derick Rethans — view source

unread

Hi,

Following our discussions we had on the subject of the new Curl URL API,
and other curl improvements. I decided to only focus on adding the new Curl
URL API and put aside all other improvements. Here is the RFC that reflects
our current conversations.

https://wiki.php.net/rfc/curl-url-api

Feel free to give any feedback, concern or support :-)

I've two things:

The new CurlUrl class should probably be immutable from the start. It was my biggest mistake not to do that with DateTime.
What happens if the curl library available on the system does not have the features and functions that this new class relies on? I would expect the class to not be available either, but the RFC does not mention that.

cheers
Derick

3 years ago by Pierrick Charron — view source

unread

Hi Derick,

The new CurlUrl class should probably be immutable from the start. It
was my biggest mistake not to do that with DateTime.

Thanks for sharing your lessons learned. But I still see some use cases
where mutable objects are easier to use. From the experience you had with
DateTime, do you think that having CurlUrl being immutable and providing
a MutableCurlUrl would make sense ? I see some cases where you "navigate"
on a website using the same CurlHandle and just want to manipulate the
MutableCurlUrl handle to change urls.

What happens if the curl library available on the system does not have
the features and functions that this new class relies on? I would expect
the class to not be available either, but the RFC does not mention that.

Good point. As you expected, if the functions are not available in libcurl,
the class will not be available. Same thing for each constant/feature. The
extension will "adapt" to the curl version. I will add this to the RFC.

Pierrick

3 years ago by Hans Henrik Bergan — view source

unread

any particular reason CurlUrl::getPort() defaults to 0 rather than one of
the valid options? (that being CurlUrl::DEFAULT_PORT
and CurlUrl::NO_DEFAULT_PORT )

Hi Derick,

The new CurlUrl class should probably be immutable from the start. It
was my biggest mistake not to do that with DateTime.

Thanks for sharing your lessons learned. But I still see some use cases
where mutable objects are easier to use. From the experience you had with
DateTime, do you think that having CurlUrl being immutable and providing
a MutableCurlUrl would make sense ? I see some cases where you "navigate"
on a website using the same CurlHandle and just want to manipulate the
MutableCurlUrl handle to change urls.

What happens if the curl library available on the system does not have
the features and functions that this new class relies on? I would expect
the class to not be available either, but the RFC does not mention that.

Good point. As you expected, if the functions are not available in libcurl,
the class will not be available. Same thing for each constant/feature. The
extension will "adapt" to the curl version. I will add this to the RFC.

Pierrick

3 years ago by Hans Henrik Bergan — view source

unread

nitpicking but I kind-of doubt the description for CurlUrl::NO_DEFAULT_PORT
is correct, quote:

Instructs the method to return null if the port matches the default port
for the scheme.

makes it sound like these would return null: http://localhost:80/
https://localhost:443/ ftps://localhost:22/

Is that correct? I would imagine it returns null if the port isn't
specified, rather than null if the port when specified matches the default
port?

On Wed, 22 Jun 2022 at 16:46, Hans Henrik Bergan divinity76@gmail.com
wrote:

any particular reason CurlUrl::getPort() defaults to 0 rather than one of
the valid options? (that being CurlUrl::DEFAULT_PORT
and CurlUrl::NO_DEFAULT_PORT )

Hi Derick,

The new CurlUrl class should probably be immutable from the start. It
was my biggest mistake not to do that with DateTime.

Thanks for sharing your lessons learned. But I still see some use cases
where mutable objects are easier to use. From the experience you had with
DateTime, do you think that having CurlUrl being immutable and providing
a MutableCurlUrl would make sense ? I see some cases where you
"navigate"
on a website using the same CurlHandle and just want to manipulate the
MutableCurlUrl handle to change urls.

What happens if the curl library available on the system does not have
the features and functions that this new class relies on? I would expect
the class to not be available either, but the RFC does not mention that.

Good point. As you expected, if the functions are not available in
libcurl,
the class will not be available. Same thing for each constant/feature. The
extension will "adapt" to the curl version. I will add this to the RFC.

Pierrick

3 years ago by Hans Henrik Bergan — view source

unread

(dammit, mixed sftp:// with ftps:// there, ignore that, i meant sftp:// )

On Wed, 22 Jun 2022 at 16:53, Hans Henrik Bergan divinity76@gmail.com
wrote:

nitpicking but I kind-of doubt the description
for CurlUrl::NO_DEFAULT_PORT is correct, quote:

Instructs the method to return null if the port matches the default port
for the scheme.

makes it sound like these would return null: http://localhost:80/
https://localhost:443/ ftps://localhost:22/

Is that correct? I would imagine it returns null if the port isn't
specified, rather than null if the port when specified matches the default
port?

On Wed, 22 Jun 2022 at 16:46, Hans Henrik Bergan divinity76@gmail.com
wrote:

any particular reason CurlUrl::getPort() defaults to 0 rather than one of
the valid options? (that being CurlUrl::DEFAULT_PORT
and CurlUrl::NO_DEFAULT_PORT )

Hi Derick,

The new CurlUrl class should probably be immutable from the start. It
was my biggest mistake not to do that with DateTime.

Thanks for sharing your lessons learned. But I still see some use cases
where mutable objects are easier to use. From the experience you had with
DateTime, do you think that having CurlUrl being immutable and
providing
a MutableCurlUrl would make sense ? I see some cases where you
"navigate"
on a website using the same CurlHandle and just want to manipulate the
MutableCurlUrl handle to change urls.

What happens if the curl library available on the system does not
have
the features and functions that this new class relies on? I would
expect
the class to not be available either, but the RFC does not mention
that.

Good point. As you expected, if the functions are not available in
libcurl,
the class will not be available. Same thing for each constant/feature.
The
extension will "adapt" to the curl version. I will add this to the RFC.

Pierrick

3 years ago by Pierrick Charron — view source

unread

HI Hans,

any particular reason CurlUrl::getPort() defaults to 0 rather than one of

the valid options? (that being CurlUrl::DEFAULT_PORT
and CurlUrl::NO_DEFAULT_PORT )

This is because the default is none of those 2 behaviours, If the port
wasn't set it will return null, but if the port is the default port for the
scheme it will still return it.

makes it sound like these would return null: http://localhost:80/

https://localhost:443/ ftps://localhost:22/

Is that correct? I would imagine it returns null if the port isn't
specified, rather than null if the port when specified matches the default
port?

That's correct, if you use CurlUrl::NO_DEFAULT_PORT. The behaviour you
were expecting is the default one ($flags = 0)

3 years ago by Pierrick Charron — view source

unread

Hi all,

The new CurlUrl class should probably be immutable from the start. It
was my biggest mistake not to do that with DateTime.

After thinking about it and some discussions, I followed Derick's
recommendation and therefore changed the RFC to make the CurlUrl class
immutable. All the setters were replaced by new with* methods.
For example setHost is now withHost and will return a new object with the
host modified. This will prevent confusing behavior where the CurlUrl
object would be unintentionally modified after being attached to a
CurlHandle.

Pierrick

3 years ago by Levi Morrison via internals — view source

unread

Hi all,

The new CurlUrl class should probably be immutable from the start. It
was my biggest mistake not to do that with DateTime.

After thinking about it and some discussions, I followed Derick's
recommendation and therefore changed the RFC to make the CurlUrl class
immutable. All the setters were replaced by new with* methods.
For example setHost is now withHost and will return a new object with the
host modified. This will prevent confusing behavior where the CurlUrl
object would be unintentionally modified after being attached to a
CurlHandle.

Pierrick

It's clear people do not agree on how we should be designing the APIs
for 3rd party extensions. However, let me redraw attention to the
introduction of the RFC:

One of the goal of this API is to tighten a problematic vulnerable area
for applications where the URL parser library would believe one thing
and libcurl another. This could and has sometimes led to security
problems.

Designing another API on top of what libcurl provides could make the
problem worse. I am fine with these kinds of adjustments:

Using exceptions instead of return codes.
Using enums instead of constants if it makes sense (it may not if
they are bitwise-or'd together, which is pretty common for C libs).
Renaming things that have keyword or reserved word conflicts.

I am not fine with designing an immutable, with* style API that
doesn't mirror the underlying library at all. At least, not by itself;
I'd be okay with having both, where the nicer API is built on top of
the lower level, but what is nicer is subjective. As this thread
shows, designing a nicer API will have quite a bit more discussion and
disagreement than "exposing" or "porting" libcurl's API.

3 years ago by Rowan Tommins — view source

unread

On Thu, 30 Jun 2022 at 23:21, Levi Morrison via internals <
internals@lists.php.net> wrote:

On Thu, Jun 30, 2022 at 10:48 AM Pierrick Charron pierrick@php.net
wrote:

One of the goal of this API is to tighten a problematic vulnerable area
for applications where the URL parser library would believe one thing
and libcurl another. This could and has sometimes led to security
problems.

Designing another API on top of what libcurl provides could make the
problem worse.

My concern is rather the opposite: if it is not obvious to someone writing
PHP code how the API should be used, there is a higher chance of them
using it incorrectly, and introducing errors.

A good example of this mindset is the password_* functions: they do not
expose the options of underlying implementations, they present an
easy-to-use opinionated set of behaviours, that solves a common use case
for their audience. They allow users to "fall into the pit of success",
because the easy thing is also the correct thing.

However, to me at least, it's not entirely clear who the target audience
for this API is, what tasks it should be making easy, and what correct
usage looks like. Just providing a bunch of functions, in whatever form,
doesn't provide security unless users can understand how to use them
securely.

Regards,

Rowan Tommins
[IMSoP]

3 years ago by tim@bastelstu.be — view source

unread

Hi

My concern is rather the opposite: if it is not obvious to someone writing
PHP code how the API should be used, there is a higher chance of them
using it incorrectly, and introducing errors.

A good example of this mindset is the password_* functions: they do not
expose the options of underlying implementations, they present an
easy-to-use opinionated set of behaviours, that solves a common use case
for their audience. They allow users to "fall into the pit of success",
because the easy thing is also the correct thing.

The difference to me is that "curl" is a well-established library with
an existing API. By faithfully representing the original API (e.g. the
naming of the options/flags), I can leverage the existing documentation
or Stack Overflow Q&A to learn what exactly a specific option does.

This would be different for an "url" or "http" or "uri" extension. That
one might use cURL currently and might use
https://github.com/uriparser/uriparser in a future version if there are
good arguments to change the underlying library. It would just guarantee
you that it can process URIs correctly according to whatever API one
creates.

The password_* family uses such a generic term: "password". It does not
pretend to be "libargon2".

In other words: If it says "curl" on the box, then I expect to have
"curl" within the box.

However, to me at least, it's not entirely clear who the target audience
for this API is, what tasks it should be making easy, and what correct
usage looks like. Just providing a bunch of functions, in whatever form,
doesn't provide security unless users can understand how to use them
securely.

Best regards
Tim Düsterhus

3 years ago by Levi Morrison via internals — view source

unread

Hi,

Following our discussions we had on the subject of the new Curl URL API,
and other curl improvements. I decided to only focus on adding the new Curl
URL API and put aside all other improvements. Here is the RFC that reflects
our current conversations.

https://wiki.php.net/rfc/curl-url-api

Feel free to give any feedback, concern or support :-)

Regards,
Pierrick

IMO, this should mirror the low-level curl url API very directly. The
basis of my opinion is that we do not own libcurl; we are merely
adapting it for use in PHP. We cannot anticipate changes in their
design, nor do we have authority to do so if we feel something should
change. Touching it as little as possible makes it easier to track
upstream changes, etc.

Based on that, I think the naming should be closer to libcurl.:

CurlUrl::URL_ENCODE should be CurlUrl::URLENCODE
CurlUrl::URL_DECODE should be CurlUrl::URLDECODE

And so on, only differing if necessary because something is a reserved
word. The API should be as exact as possible to what libcurl provides.
The "helpers" getHost, getPassword, etc should be removed and should
expose curl_url_get more directly.

Of course, it should be object based instead of resource based, but that's it.

A nicer API can be built on top of it, but I don't think that's the
role this particular API should play.

3 years ago by Rowan Tommins — view source

unread

IMO, this should mirror the low-level curl url API very directly. The
basis of my opinion is that we do not own libcurl; we are merely
adapting it for use in PHP. We cannot anticipate changes in their
design, nor do we have authority to do so if we feel something should
change. Touching it as little as possible makes it easier to track
upstream changes, etc.

Based on that, I think the naming should be closer to libcurl.:

CurlUrl::URL_ENCODE should be CurlUrl::URLENCODE

CurlUrl::URL_DECODE should be CurlUrl::URLDECODE

And so on, only differing if necessary because something is a reserved
word. The API should be as exact as possible to what libcurl provides.
The "helpers" getHost, getPassword, etc should be removed and should
expose curl_url_get more directly.

Of course, it should be object based instead of resource based, but that's it.

A nicer API can be built on top of it, but I don't think that's the
role this particular API should play.

For the record, I disagree with literally everything you've said here.

PHP indeed does not own libcurl, and nor does libcurl own PHP. We are targeting a different audience, and have a different set of facilities available to us. We have our own documentation, so there is no reason a user of PHP should know or care what the underlying implementation looks like, any more than they should know how the memory allocation works.

If the underlying library adds a feature, it will be as easy to add a new method as a new constant. If the underlying library changes behaviour, we will want to make our own decision on whether that meets our compatibility policy, and whether to emulate the older behaviour (or indeed emulate the newer behaviour on systems with an older library).

Twenty years ago, maybe PHP programmers were used to it being a veneer over C. I don't think that is or should be the expectation today, unless you're using FFI.

Regards,

--
Rowan Tommins
[IMSoP]

3 years ago by Pierrick Charron — view source

unread

Hi all, and thanks Levi for your feedback,

If you look at the first thread (Discussion about new Curl URL API and
ext/curl improvements) you'll see that this was my first approach. I even
proposed to "OOPify" the actual CurlHandle & co objects with "simple"
methods like $ch->setOpt(). Anyone who reads the actual API can see this is
the actual "philosophy" of the extension. It was designed (on purpose or
not) as a thin wrapper over curl. But lets focus on URL API only !

With previous thread answers, I was under the impression that I really was
the only one liking the approach of following the cURL api as much as
possible and leaving the more high level API to things like Guzzle & co
(for example by adding an Implementation of PSR-7 UriInterface using this
API) since all the others parts of the API are done this way. I think it's
important to have this new Curl URL API in 8.2 since it fixes security
issues and that's of course not ideal but I would rather have an
implementation that would not be my first choice, than not having it at all.

What do you think ? I would love more people to give feedback on what they
are expecting, if they don't care if they prefer one approach or the other
and of course why ? I was thinking about doing a vote on this, but I'm not
sure it's a good idea. What do you all think ?

Regards,
Pierrick

Le jeu. 23 juin 2022, à 12 h 49, Levi Morrison levi.morrison@datadoghq.com
a écrit :

On Tue, Jun 21, 2022 at 10:38 PM Pierrick Charron pierrick@php.net
wrote:

Hi,

Following our discussions we had on the subject of the new Curl URL API,
and other curl improvements. I decided to only focus on adding the new
Curl
URL API and put aside all other improvements. Here is the RFC that
reflects
our current conversations.

https://wiki.php.net/rfc/curl-url-api

Feel free to give any feedback, concern or support :-)

Regards,
Pierrick

IMO, this should mirror the low-level curl url API very directly. The
basis of my opinion is that we do not own libcurl; we are merely
adapting it for use in PHP. We cannot anticipate changes in their
design, nor do we have authority to do so if we feel something should
change. Touching it as little as possible makes it easier to track
upstream changes, etc.

Based on that, I think the naming should be closer to libcurl.:

CurlUrl::URL_ENCODE should be CurlUrl::URLENCODE

CurlUrl::URL_DECODE should be CurlUrl::URLDECODE

And so on, only differing if necessary because something is a reserved
word. The API should be as exact as possible to what libcurl provides.
The "helpers" getHost, getPassword, etc should be removed and should
expose curl_url_get more directly.

Of course, it should be object based instead of resource based, but that's
it.

A nicer API can be built on top of it, but I don't think that's the
role this particular API should play.

3 years ago by Jeffrey Dafoe — view source

unread

Hello Pierrick,

A thin wrapper would be the most flexible. Someone can always write a "friendlier" class using your thin wrapper, as you mentioned, but one cannot easily go the other direction.

-Jeff

-----Original Message-----
From: Pierrick Charron pierrick@php.net
Sent: Friday, June 24, 2022 9:49 AM
To: Levi Morrison levi.morrison@datadoghq.com
Cc: PHP internals internals@lists.php.net
Subject: Re: [PHP-DEV] [RFC] [Under Discussion] New Curl URL API

Hi all, and thanks Levi for your feedback,

If you look at the first thread (Discussion about new Curl URL API and ext/curl improvements) you'll see that this was my first approach. I even proposed to "OOPify" the actual CurlHandle & co objects with "simple"
methods like $ch->setOpt(). Anyone who reads the actual API can see this is the actual "philosophy" of the extension. It was designed (on purpose or
not) as a thin wrapper over curl. But lets focus on URL API only !

With previous thread answers, I was under the impression that I really was the only one liking the approach of following the cURL api as much as possible and leaving the more high level API to things like Guzzle & co (for example by adding an Implementation of PSR-7 UriInterface using this
API) since all the others parts of the API are done this way. I think it's important to have this new Curl URL API in 8.2 since it fixes security issues and that's of course not ideal but I would rather have an implementation that would not be my first choice, than not having it at all.

What do you think ? I would love more people to give feedback on what they are expecting, if they don't care if they prefer one approach or the other and of course why ? I was thinking about doing a vote on this, but I'm not sure it's a good idea. What do you all think ?

Regards,
Pierrick

Le jeu. 23 juin 2022, à 12 h 49, Levi Morrison levi.morrison@datadoghq.com a écrit :

On Tue, Jun 21, 2022 at 10:38 PM Pierrick Charron pierrick@php.net
wrote:

Hi,

Following our discussions we had on the subject of the new Curl URL
API, and other curl improvements. I decided to only focus on adding
the new
Curl
URL API and put aside all other improvements. Here is the RFC that
reflects
our current conversations.

https://wiki.php.net/rfc/curl-url-api

Feel free to give any feedback, concern or support :-)

Regards,
Pierrick

IMO, this should mirror the low-level curl url API very directly. The
basis of my opinion is that we do not own libcurl; we are merely
adapting it for use in PHP. We cannot anticipate changes in their
design, nor do we have authority to do so if we feel something should
change. Touching it as little as possible makes it easier to track
upstream changes, etc.

Based on that, I think the naming should be closer to libcurl.:

CurlUrl::URL_ENCODE should be CurlUrl::URLENCODE

CurlUrl::URL_DECODE should be CurlUrl::URLDECODE

And so on, only differing if necessary because something is a reserved
word. The API should be as exact as possible to what libcurl provides.
The "helpers" getHost, getPassword, etc should be removed and should
expose curl_url_get more directly.

Of course, it should be object based instead of resource based, but
that's it.

A nicer API can be built on top of it, but I don't think that's the
role this particular API should play.

3 years ago by Rowan Tommins — view source

unread

A thin wrapper would be the most flexible. Someone can always write a
"friendlier" class using your thin wrapper, as you mentioned, but one
cannot easily go the other direction.

I think this argument has some validity when talking about the entire API
of a library like curl. But we're not, we're talking about a tiny bounded
context.

The underlying implementation consists of 6 functions, 4 of which are
trivial:

An argument-less "constructor" https://curl.se/libcurl/c/curl_url.html
An equivalent to PHP "clone" https://curl.se/libcurl/c/curl_url_dup.html
A cleanup method, which would be automatic in any PHP implementation
https://curl.se/libcurl/c/curl_url_cleanup.html
A function for looking up error strings from codes
https://curl.se/libcurl/c/curl_url_strerror.html

That leaves the main getter and setter functions
https://curl.se/libcurl/c/curl_url_get.html and
https://curl.se/libcurl/c/curl_url_set.html which take the same arguments:

A resource handle
The URL part to handle
The new value for set, or an output parameter for get
A set of flags

If we really believe the goal is to expose the C API in PHP, the get
function would look like this:

/**

@return int Error code
*/
function curl_url_get(resource $url, int $what, string &$part, int $flags):
int

That would be ... awful, and I hope nobody's really suggesting that. So we
have to map each part to some concept in PHP, which is what the proposal
does:

The resource handle becomes $this
The $what argument becomes the method name
The output parameter becomes the return value
The error code becomes an exception

There's no "flexibility" which has been lost, and no "future changes" which
are being prevented.

The only other change is a few renamed constants, which I suggested on the
PR. I can see an argument for making them match the library exactly; but
it's the values that actually matter, so I don't see why we shouldn't
choose our own names if they're descriptive of the functionality.

Regards,

Rowan Tommins
[IMSoP]

3 years ago by tim@bastelstu.be — view source

unread

Hi

If you look at the first thread (Discussion about new Curl URL API and
ext/curl improvements) you'll see that this was my first approach. I even
proposed to "OOPify" the actual CurlHandle & co objects with "simple"
methods like $ch->setOpt(). Anyone who reads the actual API can see this is
the actual "philosophy" of the extension. It was designed (on purpose or
not) as a thin wrapper over curl. But lets focus on URL API only !

With previous thread answers, I was under the impression that I really was
the only one liking the approach of following the cURL api as much as
possible and leaving the more high level API to things like Guzzle & co
(for example by adding an Implementation of PSR-7 UriInterface using this
API) since all the others parts of the API are done this way. I think it's
important to have this new Curl URL API in 8.2 since it fixes security
issues and that's of course not ideal but I would rather have an
implementation that would not be my first choice, than not having it at all.

What do you think ? I would love more people to give feedback on what they
are expecting, if they don't care if they prefer one approach or the other
and of course why ? I was thinking about doing a vote on this, but I'm not
sure it's a good idea. What do you all think ?

I agree with Levi. The curl extension should be a thin wrapper over libcurl.

A high level API that is more convenient to use may be provided by PHP,
but it should not be called curl. It should have a generic name with the
fact that curl does the heavy lifting only being an implementation detail.

Best regards
Tim Düsterhus

3 years ago by Kamil Tekiela — view source

unread

Please do not add yet another thin wrapper of an underlying C API. PHP is
not a drop-in replacement of C. PHP is a much higher-level language.
Developers should not have to understand how the underlying library works
to be able to use PHP. We need to move away from thin C wrappers as a
language. PHP should abstract more, not less, of C code.

The new API doesn't have to have exactly the same names as the C library.
Please follow PHP naming conventions and implement OOP-based API.

3 years ago by Larry Garfield — view source

unread

Hi all, and thanks Levi for your feedback,

If you look at the first thread (Discussion about new Curl URL API and
ext/curl improvements) you'll see that this was my first approach. I even
proposed to "OOPify" the actual CurlHandle & co objects with "simple"
methods like $ch->setOpt(). Anyone who reads the actual API can see this is
the actual "philosophy" of the extension. It was designed (on purpose or
not) as a thin wrapper over curl. But lets focus on URL API only !

With previous thread answers, I was under the impression that I really was
the only one liking the approach of following the cURL api as much as
possible and leaving the more high level API to things like Guzzle & co
(for example by adding an Implementation of PSR-7 UriInterface using this
API) since all the others parts of the API are done this way. I think it's
important to have this new Curl URL API in 8.2 since it fixes security
issues and that's of course not ideal but I would rather have an
implementation that would not be my first choice, than not having it at all.

What do you think ? I would love more people to give feedback on what they
are expecting, if they don't care if they prefer one approach or the other
and of course why ? I was thinking about doing a vote on this, but I'm not
sure it's a good idea. What do you all think ?

Regards,
Pierrick

The root question, I think, is who the consumer is for this RFC?

Is the consumer PHP developers? Then it should be a good PHP API.

Is the consumer the Guzzle team and everyone else just uses Guzzle and ignores Curl/CurlUrl? Then it should probably adhere as closely as feasible to Curl's API, awful as it is, and be documented out the wazoo.

There likely isn't time for 8.2 to do the first option, but probably is for the second.

--Larry Garfield

3 years ago by Mark Randall — view source

unread

Feel free to give any feedback, concern or support :-)

This should preferably be namespaced.

Mark Randall

3 years ago by michal.brzuchalski@gmail.com — view source

unread

Hi Pierrick

śr., 22 cze 2022 o 06:38 Pierrick Charron pierrick@php.net napisał(a):

Hi,

Following our discussions we had on the subject of the new Curl URL API,
and other curl improvements. I decided to only focus on adding the new Curl
URL API and put aside all other improvements. Here is the RFC that reflects
our current conversations.

https://wiki.php.net/rfc/curl-url-api

TBH I have a problem with the proposed Curl* classes.
IMO they present a mix of responsibilities that I don't like.

CurlUrl is for me is a mix of Url/Uri object properties well known from
other userland implementations like the PSR one
with Uri specific like the host, scheme, port, path, query, fragment, etc.
but on the other hand, we have flags and options which
purpose is to pass Curl specific things but in IMO wrong place instead (for
instance Guzzle use separate argument in all methods for options and flags.
Secondly, it has some default logic like setScheme allows to put a scheme
but requires the supported one, with an exception that you can ignore
support checking by a flag - IMO this logic belongs to its consumer logic
and is not part of URL class logic and the setPort like above.
Next thing is that it again has all the getters and setters and is not
immutable. Why can't it be a simple constructor with read-only properties,
no getters, no setters? Maybe I miss some discussion about why here.

The same goes for CurlException which looks like is an exception for
handling some encoding of Url and not exactly to the Url properties itself.

But with all that said, if the target audience is only a user of low lever
curl_* functions rather than users of higher abstraction libraries like
Guzzle or Httplug then, I guess I don't care that much.
I'm only afraid that such a mix of responsibilities can lead to bad habits
when it gets to the separation of concerns by developers who get used to it
and won' see the problem as I do.

Cheers,
Michał Marcin Brzuchalski

3 years ago by Rowan Tommins — view source

unread

CurlUrl is for me is a mix of Url/Uri object properties well known from
other userland implementations like the PSR one
with Uri specific like the host, scheme, port, path, query, fragment, etc.
but on the other hand, we have flags and options which
purpose is to pass Curl specific things but in IMO wrong place

Rather than a representation of a URL, think of the class as a
builder for URLs. There are multiple methods because you might want to
build the URL in different orders ("start with this URL but replace the
port", "start with this domain and I'll add the path later", etc). All
the flags are related to how the input should be interpreted, and the
output manipulated, in order to build a correctly formatted URL string.

Maybe it should even be called CurlUrlBuilder? That also fits with the
design of having mutable setters; as Derick pointed out, mutable value
objects are generally a bad idea, so it would make sense to encourage
users to think of this as a way to get one or more strings, rather than
as a result in itself.

Regards,

--
Rowan Tommins
[IMSoP]

3 years ago by Pierrick Charron — view source

unread

Hi Rowan

Rather than a representation of a URL, think of the class as a
builder for URLs. There are multiple methods because you might want to
build the URL in different orders ("start with this URL but replace the
port", "start with this domain and I'll add the path later", etc). All
the flags are related to how the input should be interpreted, and the
output manipulated, in order to build a correctly formatted URL string.

Sorry if it was really not clear in the RFC since I didn't even talk about
the CURLOPT_CURLU option, but this class is not only there to parse/build
strings for Curl but to give this specific object to Curl instead of a
string representation of an URL.

Maybe it should even be called CurlUrlBuilder? That also fits with the
design of having mutable setters; as Derick pointed out, mutable value
objects are generally a bad idea, so it would make sense to encourage
users to think of this as a way to get one or more strings, rather than
as a result in itself.

Since you can give this object to curl instead of an URL string, I would
not call it CurlUrlBuilder.

Regards,
Pierrick

3 years ago by Pierrick Charron — view source

unread

Hi Michał,

Thanks for your comments. You made me realise that the RFC missed
information on a crucial part which is the new CURLOPT_CURLU option that
tells curl to use the CurlUrl object instead of the "standard" CURLOPT_URL
option. I just added some information on it in the RFC.

The purpose of the CurlUrl class is to be able to build/see/modify an URL
as it is seen by libcurl before/after passing it to your CurlHandle. It is
not meant to be used alone as a representation of an URL.

For example, you may want to do some checks to make sure that the URL you
gave to your CurlHandle is well formatted and that the host or any other
parts are what you want them to be, and that it was not something else
because of some differences on how the URL was parsed. Or you may want to
add/delete/overwrite some parts to sanitize your URL. That's why you have
those setters/getters. This class is definitively not there to replace
PSR-7 UriInterface. I can imagine some Guzzle or other implementation to
build a CurlUrl handle from a UriInterface before giving it to curl.

So this RFC is really targeted to users of curl_* functions.

Le lun. 27 juin 2022, à 02 h 29, Michał Marcin Brzuchalski <
michal.brzuchalski@gmail.com> a écrit :

Hi Pierrick

śr., 22 cze 2022 o 06:38 Pierrick Charron pierrick@php.net napisał(a):

Hi,

Following our discussions we had on the subject of the new Curl URL API,
and other curl improvements. I decided to only focus on adding the new
Curl
URL API and put aside all other improvements. Here is the RFC that
reflects
our current conversations.

https://wiki.php.net/rfc/curl-url-api

TBH I have a problem with the proposed Curl* classes.
IMO they present a mix of responsibilities that I don't like.

CurlUrl is for me is a mix of Url/Uri object properties well known from
other userland implementations like the PSR one
with Uri specific like the host, scheme, port, path, query, fragment, etc.
but on the other hand, we have flags and options which
purpose is to pass Curl specific things but in IMO wrong place instead
(for instance Guzzle use separate argument in all methods for options and
flags.
Secondly, it has some default logic like setScheme allows to put a
scheme but requires the supported one, with an exception that you can
ignore support checking by a flag - IMO this logic belongs to its consumer
logic and is not part of URL class logic and the setPort like above.
Next thing is that it again has all the getters and setters and is not
immutable. Why can't it be a simple constructor with read-only properties,
no getters, no setters? Maybe I miss some discussion about why here.

The same goes for CurlException which looks like is an exception for
handling some encoding of Url and not exactly to the Url properties itself.

But with all that said, if the target audience is only a user of low lever
curl_* functions rather than users of higher abstraction libraries like
Guzzle or Httplug then, I guess I don't care that much.
I'm only afraid that such a mix of responsibilities can lead to bad habits
when it gets to the separation of concerns by developers who get used to it
and won' see the problem as I do.

Cheers,
Michał Marcin Brzuchalski

3 years ago by Rowan Tommins — view source

unread

The purpose of the CurlUrl class is to be able to build/see/modify an URL
as it is seen by libcurl before/after passing it to your CurlHandle. It is
not meant to be used alone as a representation of an URL.

For example, you may want to do some checks to make sure that the URL you
gave to your CurlHandle is well formatted and that the host or any other
parts are what you want them to be, and that it was not something else
because of some differences on how the URL was parsed. Or you may want to
add/delete/overwrite some parts to sanitize your URL. That's why you have
those setters/getters. This class is definitively not there to replace
PSR-7 UriInterface. I can imagine some Guzzle or other implementation to
build a CurlUrl handle from a UriInterface before giving it to curl.

This leaves me a bit baffled, frankly.

If I've got a URL, which is already a string, what code would I write to
"do some checks" on it, outside of a unit test?

If I'm using CurlUrl to "add/delete/overwrite some parts" how is that
not "using it alone as a representation of an URL"?

If I'm writing a PSR-7 object, am I only supposed to use CurlUrl when
interfacing with curl, and generate the string myself for other
purposes? If the implementation I come up with differs from curl's, how
does the user know which is the "real" URL?

Regards,

--
Rowan Tommins
[IMSoP]

3 years ago by Pierrick Charron — view source

unread

Hi Rowan

If I've got a URL, which is already a string, what code would I write to
"do some checks" on it, outside of a unit test?

That's just an example with an old version of PHP, but let's say you have
some code that makes requests but only to a specific list of servers, so
you want to analyze the URL and check if the host is in a whitelist. If the
provided URL is "http://127.0.0.1:11211#@google.com:80/"; and that you used
PHP <= 7.0.13 your parse_url function would tell you that the domain you're
trying to request is google.com so everything is fine, but in fact when the
call to curl is made, curl would call 127.0.0.1. This one was fixed but the
problem could still occur if the parser is not the same as the one used in
the requester.

If I'm using CurlUrl to "add/delete/overwrite some parts" how is that
not "using it alone as a representation of an URL"?

What I meant here was that if you're not using curl, you have no advantage
of using this class alone to parse since the requester you're using could
handle the URL differently.

If I'm writing a PSR-7 object, am I only supposed to use CurlUrl when
interfacing with curl, and generate the string myself for other
purposes? If the implementation I come up with differs from curl's, how
does the user know which is the "real" URL?

You can use CurlUrl within your implementation of UriInterface but for the
same reason if you're using another request engine than curl, you may have
the same security problem where curl will not parse the same data. If you
want to make sure that your CurlUrl object represents the same thing as
your UriInterface you could build the CurlUrl object part by part using
your UriInterface. When you assign your CurlUrl to your CurlHandle with the
CURLOPT_CURLU option, curl will use the parts directly instead of parsing
the URL again, so you're sure that the host will be the one you set with
CurlUrl::setHost() and so on.

Pierrick

[1]
https://www.blackhat.com/docs/us-17/thursday/us-17-Tsai-A-New-Era-Of-SSRF-Exploiting-URL-Parser-In-Trending-Programming-Languages.pdf

3 years ago by Rowan Tommins — view source

unread

That's just an example with an old version of PHP, but let's say you have
some code that makes requests but only to a specific list of servers, so
you want to analyze the URL and check if the host is in a whitelist. If the
provided URL is "http://127.0.0.1:11211#@google.com:80/"; and that you
used PHP <= 7.0.13 your parse_url function would tell you that the domain
you're trying to request is google.com so everything is fine, but in fact
when the call to curl is made, curl would call 127.0.0.1. This one was
fixed but the problem could still occur if the parser is not the same as
the one used in the requester.

...

You can use CurlUrl within your implementation of UriInterface but for the
same reason if you're using another request engine than curl, you may have
the same security problem where curl will not parse the same data. If you
want to make sure that your CurlUrl object represents the same thing as
your UriInterface you could build the CurlUrl object part by part using
your UriInterface. When you assign your CurlUrl to your CurlHandle with the
CURLOPT_CURLU option, curl will use the parts directly instead of parsing
the URL again, so you're sure that the host will be the one you set with
CurlUrl::setHost() and so on.

This is actually a lot trickier than it sounds.

Imagine this code, with the bug you gave as an example still present in one
of the libraries we're interacting with:

$url = new MyUrlObject("http://127.0.0.1:11211#@google.com:80/");
var_dump($url->getHostName()); // ???

This won't work, because we don't know which parser to call; it needs to be
something like this:

var_dump($url->getHostName(UrlContext::CURL)); // '127.0.0.1'
var_dump($url->getHostName(UrlContext::BROKEN_PHP)); // 'google.com'

Then we call this:

$url->setHostName("duckduckgo.com");
var_dump($url->getFullUrl(); // ???

Again, the result depends on context:

var_dump($url->getFullUrl( UrlContext::CURL )); // "
http://duckduckgo.com#@google.com:80/";
var_dump($url->getFullUrl( UrlContext::BROKEN_PHP )); // "
http://127.0.0.1:11211#@duckduckgo.com:80/";

But that means we can't actually apply the setHostName change until the
call to getFullUrl(), because we don't know how the original URL will be
parsed. Instead, the object has to internally store a queue of applied
modifications, and then reproduce them, in order, on the underlying
implementation.

Alternatively, we can build our implementation to assume it will always be
used in the context of curl, or for debugging curl, so can just have
getHostName() and setHostName() directly use curl's implementation. Which
leads us back to where we started, of using CurlUrl directly or via a thin
wrapper, as either a URL parser+builder, or as a value object in its own
right.

I don't really know how to make all this nuance clear to users, but that
makes me a bit wary of adding the object to PHP in its current design.

Regards,

Rowan Tommins
[IMSoP]