[RFC] Bug #72811 - Replacing parse_url()

8 years ago by David Walker

Hi all,

A couple of weeks back I took a look at bug 72811[1]. The bug being that
parse_url() didn't accept IPv6 addresses without a scheme, as it did for
IPv4 addresses. I attempted to patch that specific bug within the scope of
how parse_url() was processing URIs. After I opened a PR for the
resolution, Yasuo and Christoph both chimed in that perhaps replacing the
implementation with an re2c-based parser would be better. We found a
parser[2] that did almost everything necessary. I took it and made it
adhere more strictly to RFC 3986[3].
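
To make the report concrete, here is a small probe script. It is only a sketch: the sample inputs and the helper name probe_hosts() are assumptions for illustration, not the exact strings from the bug report. What it prints depends on the PHP version and on whether a fix is applied, so no expected output is shown.

```php
<?php
// Probe how parse_url() treats scheme-less hosts. The inputs below are
// illustrative assumptions, not taken from the bug report itself.
function probe_hosts(array $inputs): array
{
    $results = [];
    foreach ($inputs as $input) {
        // parse_url() returns an array of components, or false when it
        // considers the input seriously malformed.
        $results[$input] = parse_url($input);
    }
    return $results;
}

$samples = [
    '127.0.0.1:80/path',    // IPv4 literal, no scheme
    '[::1]:80/path',        // IPv6 literal, no scheme
    'http://[::1]:80/path', // IPv6 literal with a scheme, for comparison
];

foreach (probe_hosts($samples) as $input => $parsed) {
    echo $input, ' => ', var_export($parsed, true), PHP_EOL;
}
```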

I have updated my original PR[4] and created an RFC[5] that aims to replace
the parsing in parse_url() with one that is stricter about RFC 3986. This is
a BC break, as explained in the RFC, and that at the very least warrants some
discussion. On the PR we had kicked around the idea of deprecating
parse_url() and creating a new function with the more compliant parser, but
opted against it.

I'm looking for discussion on whether a total replacement is the preferred
way to go about this, and whether we should be making parse_url() more
standards strict at all, since today it deviates from RFC 3986 in many ways
that still provide semi-reasonable parsing patterns.

--
Dave

[1] - https://bugs.php.net/bug.php?id=72811
[2] - https://github.com/staskobzar/url_parser_re2c
[3] - https://tools.ietf.org/html/rfc3986
[4] - https://github.com/php/php-src/pull/2079
[5] - https://wiki.php.net/rfc/replace_parse_url

8 years ago by Christoph M. Becker

Thanks for the RFC, Dave!

I'm all for having a properly implementable URI parser that exactly
follows a specific standard. However, I don't think we can replace
parse_url() with such a parser for BC reasons before PHP 8 (at least).
The parse_url() man page explicitly states:

| Partial URLs are also accepted, parse_url() tries its best to parse
| them correctly.

I'm quite sure that a lot of code relies on this behavior.

So, I basically see two options:

- wait until PHP 8 (whenever that'll be released) and switch the
  implementation of parse_url() then, which might delay the adoption
  of PHP 8
- add a new function in PHP 7.2 (maybe called parse_uri()), and
  perhaps deprecate parse_url() at the same time

--
Christoph M. Becker

8 years ago by David Walker

On Thu, Oct 6, 2016 at 11:19 AM Christoph M. Becker cmbecker69@gmx.de
wrote:

So, I basically see two options:

- wait until PHP 8 (whenever that'll be released) and switch the
  implementation of parse_url() then, which might delay the adoption
  of PHP 8
- add a new function in PHP 7.2 (maybe called parse_uri()), and
  perhaps deprecate parse_url() at the same time

I'd probably side more with the former, but as a hybrid? As a stickler for
naming: parse_url() also seems to parse URNs correctly, so IMO the proper
name would be parse_uri() for something that can correctly parse any URI
per the RFC.

Would it be plausible to blend the options as follows? [Sorry if this is a
faux pas I'm not familiar with yet.]

PHP 7.2+
- Adding parse_uri() as the new RFC compliant parser
- Deprecating the functionality of parse_url() with notice that parser
  will change to that of parse_uri()
PHP 8.0
- Alias parse_url() to be parse_uri()
- Deprecate parse_url() for the name (or just let it exist as an alias
  forever)

Obviously having two UR* parsers for the long term would be a poor option,
but this might enable people to migrate to the new name and parser before
PHP 8 becomes a reality.
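
As a rough userland illustration of the "PHP 7.2+" stage (the real change would live in C; the function name parse_url_legacy() and the notice text are hypothetical), the old behaviour could keep working while announcing the upcoming change:

```php
<?php
// Hypothetical stand-in for a deprecated-but-still-working parse_url().
function parse_url_legacy(string $url, int $component = -1)
{
    // E_USER_DEPRECATED is the userland counterpart of an engine
    // deprecation notice; it does not stop execution.
    trigger_error(
        'The parsing behaviour of parse_url() is scheduled to change; ' .
        'see parse_uri() for the stricter parser',
        E_USER_DEPRECATED
    );
    return parse_url($url, $component);
}

// Collect the notices instead of printing them, so the effect is visible.
$notices = [];
set_error_handler(function (int $errno, string $errstr) use (&$notices): bool {
    $notices[] = $errstr;
    return true; // mark the notice as handled
}, E_USER_DEPRECATED);

$parts = parse_url_legacy('http://example.org/test.php');
restore_error_handler();

echo $parts['host'], PHP_EOL;  // example.org
echo count($notices), PHP_EOL; // 1
```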

--
Dave

8 years ago by Stephen Reay

Could the new URL parser be exposed via a third parameter to parse_url(), which defaults to false/off in 7.2 (or whenever it's added) but then defaults to true in 8.0?

Introducing a new core function to effectively fix a bug seems like the wrong approach to me (and what happens if a new URL/URI-related RFC is published: do we get a third function?).

Additionally, would (or could) this same URL parser be used for FILTER_VALIDATE_URL, which currently states (in the docs) that it matches RFC 2396, and indeed appears not to accept an IPv6 host segment?
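
The third-parameter idea could look roughly like the userland sketch below. Everything here is an assumption for illustration: parse_url() itself only takes ($url, $component), the name parse_url_switchable() and the $strict flag are made up, and the "new parser" is stood in for by the component-splitting regular expression from RFC 3986, Appendix B, which splits a reference into parts but does not validate it.

```php
<?php
// Hypothetical wrapper; not an existing PHP function or signature.
function parse_url_switchable(string $url, int $component = -1, bool $strict = false)
{
    if (!$strict) {
        return parse_url($url, $component); // current, lenient behaviour
    }

    // Stand-in for the new parser: the regex from RFC 3986, Appendix B.
    // For brevity this branch ignores $component and returns all parts.
    $pattern = '~^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?~s';
    preg_match($pattern, $url, $m);

    // Groups that did not take part in the match are empty or missing.
    $parts = [];
    if (($m[2] ?? '') !== '') { $parts['scheme'] = $m[2]; }
    if (($m[3] ?? '') !== '') { $parts['authority'] = $m[4]; }
    if (($m[5] ?? '') !== '') { $parts['path'] = $m[5]; }
    if (($m[6] ?? '') !== '') { $parts['query'] = $m[7]; }
    if (($m[8] ?? '') !== '') { $parts['fragment'] = $m[9]; }
    return $parts;
}

var_dump(parse_url_switchable('http://[::1]:80/path?x=1', -1, true));
```

Flipping the default of $strict from false to true would then be the only visible change between the two releases.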

Thanks for all you do guys!

Cheers

Stephen

8 years ago by David Walker

On Thu, Oct 6, 2016 at 10:13 PM Stephen Reay php-lists@koalephant.com
wrote:

Could the new URL parser be exposed via a third parameter to parse_url,
which defaults to false/off in 7.2 (or whenever its added) but then
defaults to true in 8.0?

I, personally, would be opposed to this. Firstly, it doesn't alert users
to the previous functionality being non-standards-compliant. Secondly, it
allows the previous parser to exist in its current state for a longer
period of time, and then from 8.0 on to live on behind a function
parameter. The goal should be to drop the non-compliant version at some
point.

Additionally, would(could) this same url parser be used for
FILTER_VALIDATE_URL, which currently states (in the docs) that it matches
RFC2396, and indeed it appears not to accept an IPv6 host segment.

I haven't yet looked at how filter validation works in conjunction with
parse_url(). However, as noted
(http://php.net/manual/en/filter.filters.validate.php#110411),
FILTER_VALIDATE_URL may be strict to URLs, not URNs. Maybe a future
scope could be to look at filter_var() and how it uses that filter type,
and perhaps to add a proper FILTER_VALIDATE_URI or something.

However, that would be outside the scope of this RFC for right now, I
believe.

--
Dave

8 years ago by Marco Pivetta

Like with any software rewrite project, remember that software rewrites
usually fail.

It's probably better to deprecate the function, make a new one, then code
the newer implementation there, and let users migrate.

This is necessary to mitigate the risk of BC breaks.

Marco Pivetta

http://twitter.com/Ocramius

http://ocramius.github.com/

8 years ago by michal@brzuchalski.com

How about a complete rewrite with OOP? It could be implemented using objects,
as DateTime does.
I've got a working implementation in userland, https://github.com/madkom/uri.
It may not be finished yet, but it supports parsing URIs with IPv4, IPv6 and
hostnames.
It was also going to parse query arguments from the URI, depending on how
multiple arguments should be parsed:

(some languages parse ?arg1=1&arg1=2 as an array, while PHP keeps only the
last value; for ?arg1[]=1&arg1=3 some append element 3 to the "arg1" array
and some replace it)

This implementation also supports UriTemplates
("http://localhost/{module}/action") and UriReferences
("/some/reference?arg1=2#fragment").
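
For reference, PHP's own handling of repeated query arguments can be checked with the built-in parse_str(). The wrapper name parse_query() is just a helper for this sketch.

```php
<?php
// Thin wrapper around PHP's built-in query-string parser.
function parse_query(string $query): array
{
    $result = [];
    parse_str($query, $result);
    return $result;
}

// Repeated plain key: the later value overwrites the earlier one.
var_export(parse_query('arg1=1&arg1=2'));     // array('arg1' => '2')
echo PHP_EOL;

// Array syntax followed by a plain key: the scalar replaces the array.
var_export(parse_query('arg1[]=1&arg1=3'));   // array('arg1' => '3')
echo PHP_EOL;

// Array syntax on both: the values are collected.
var_export(parse_query('arg1[]=1&arg1[]=3')); // array('arg1' => array('1', '3'))
echo PHP_EOL;
```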

--
regards / pozdrawiam,

Michał Brzuchalski
brzuchalski.com

8 years ago by michal@brzuchalski.com

My proposed implementation supports both parsing and building. I can see
that it would be possible to move the functionality of a few global
functions into one OO package.

--
regards / pozdrawiam,

Michał Brzuchalski
brzuchalski.com

8 years ago by Yasuo Ohgaki

Hi Michal,

This is about the internal C module implementation. re2c is one of the
best ways to write a manageable implementation of a BNF definition. We
already use re2c in many places. Therefore, an re2c implementation is the
way to go, IMHO.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

8 years ago by michal@brzuchalski.com

Yes, I understand that an implementation in C would be different. All I
wanted to point out is a nice OO implementation with type hints and return
types, so that there is no mixed return value as with parse_url(), and so
that there is always a class for each component of the URL rather than just
a string you don't know what to do with.
Avoiding a mixed return type also allows static code analyzers to detect
errors in code, along with all the other OO benefits.

--
regards / pozdrawiam,

Michał Brzuchalski
brzuchalski.com

8 years ago by Yasuo Ohgaki

Hi David,

PHP 7.2+
- Adding parse_uri() as the new RFC compliant parser
- Deprecating the functionality of parse_url() with notice that parser
  will change to that of parse_uri()
PHP 8.0
- Alias parse_url() to be parse_uri()
- Deprecate parse_url() for the name (or just let it exist as an alias
  forever)

Obviously having two UR* parsers for the long-term would be a poor option,
this might enable people to migrate to the new name & parser, before PHP 8
becomes a reality.

The current URL parsing code is not maintainable, and not standards compliant
either.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

8 years ago by Nikita Popov

Are you aware of the WHATWG URL standard [1]? Quoting the first goal
statement:

Align RFC 3986 and RFC 3987 with contemporary implementations and
obsolete them in the process. (E.g., spaces, other "illegal" code points,
query encoding, equality, canonicalization, are all concepts not entirely
shared, or defined.) URL parsing needs to become as solid as HTML parsing.

Basically this is the standard that describes how URL parsing actually
works in the wild, in browser implementations. In particular it also
includes a description of URL parsing in algorithmic form, including
specific directions as to which errors are fatal and which are not.

Also quoting from the goals:

Standardize on the term URL. URI and IRI are just confusing. In practice
a single algorithm is used for both so keeping them distinct is not helping
anyone. URL also easily wins the search result popularity contest.

For this reason, I would recommend against introducing the term "URI"
anywhere. In particular the suggestion from this thread to use parse_uri()
for this functionality seems like it will cause a lot of confusion.

The URL standard also specifies the interface of the URL object used by
JavaScript and I think we should consider whether we may want to simply
adopt this (object-oriented) interface (potentially with adjustments for
PHP specifics).

I think an important part of this interface is that the URL is constructed
using URL(url [, base]), where "base" is the base URL against which
relative URLs are resolved. This base URL is required for parsing
non-absolute URLs. To me this makes a lot of sense and I think it makes it
much clearer how "incomplete" URLs are being treated.

While we're at it, what's the state of IDN? May this be the time to
properly support it?

Nikita

8 years ago by David Walker

Hi Nikita,

Are you aware of the WHATWG URL standard [1]? Quoting the first goal
statement:

Align RFC 3986 and RFC 3987 with contemporary implementations and
obsolete them in the process. (E.g., spaces, other "illegal" code points,
query encoding, equality, canonicalization, are all concepts not entirely
shared, or defined.) URL parsing needs to become as solid as HTML parsing.

I was not. I assume that WHATWG ought to supersede the IETF standards on
the subject. I can obviously make an implementation follow the standards
and algorithms set out in this doc.

Also quoting from the goals:

Standardize on the term URL. URI and IRI are just confusing. In practice
a single algorithm is used for both so keeping them distinct is not helping
anyone. URL also easily wins the search result popularity contest.

For this reason, I would recommend against introducing the term "URI"
anywhere. In particular the suggestion from this thread to use parse_uri()
for this functionality seems like it will cause a lot of confusion.

Duly noted.

The URL standard also specifies the interface of the URL object used by
JavaScript and I think we should consider whether we may want to simply
adopt this (object-oriented) interface (potentially with adjustments for
PHP specifics).

I think an important part of this interface is that the URL is constructed
using URL(url [, base]), where "base" is the base URL against which
relative URLs are resolved. This base URL is required for parsing
non-absolute URLs. To me this makes a lot of sense and I think it makes it
much clearer how "incomplete" URLs are being treated.

If we go the route of making URL its own object, and expose an
object-oriented interface, are we leading it to be more of a total URL
builder, of sorts? Like:

$url = new URL();
$url->setScheme('http');
$url->setHost('example.org');
$url->setPath('/test.php');
var_dump($url->build()); // outputs: http://example.org/test.php

Or would it, at the end of the day, be an object that takes a string, and
you just call getters on it that would be akin to the current flags you
pass into parse_url()?

On both accounts, if we're to go forward with the object model of URL,
would this want to be broken out into its own ext/url module, like how date
exists? Or should it stay in ext/standard?

Cheers

Dave

8 years ago by guilhermeblanco@gmail.com

I'd suggest URL to be immutable and have a URLBuilder (obtainable through
URL::createBuilder()) for that...
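
A minimal sketch of that split, assuming only three components for brevity. Neither class exists in PHP core; the method names follow the setScheme()/setHost()/setPath()/build() calls used earlier in the thread, and the rest is hypothetical.

```php
<?php
// Hypothetical immutable value object: no setters, state fixed at construction.
final class URL
{
    private $scheme;
    private $host;
    private $path;

    public function __construct(string $scheme, string $host, string $path)
    {
        $this->scheme = $scheme;
        $this->host = $host;
        $this->path = $path;
    }

    public static function createBuilder(): URLBuilder
    {
        return new URLBuilder();
    }

    public function getScheme(): string { return $this->scheme; }
    public function getHost(): string { return $this->host; }
    public function getPath(): string { return $this->path; }

    public function __toString(): string
    {
        return $this->scheme . '://' . $this->host . $this->path;
    }
}

// Hypothetical mutable builder: all mutation happens here, not on URL.
final class URLBuilder
{
    private $scheme = 'http';
    private $host = '';
    private $path = '/';

    public function setScheme(string $scheme): self { $this->scheme = $scheme; return $this; }
    public function setHost(string $host): self { $this->host = $host; return $this; }
    public function setPath(string $path): self { $this->path = $path; return $this; }

    public function build(): URL
    {
        return new URL($this->scheme, $this->host, $this->path);
    }
}

$url = URL::createBuilder()
    ->setScheme('http')
    ->setHost('example.org')
    ->setPath('/test.php')
    ->build();

echo $url, PHP_EOL; // http://example.org/test.php
```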

--
Guilherme Blanco
Lead Architect at E-Block

8 years ago by Larry Garfield

Be aware that a user-space definition for a URL object already exists as
part of PSR-7:

http://www.php-fig.org/psr/psr-7/#3-5-psr-http-message-uriinterface

A core-provided mutable and incompatible object would be problematic.

What would be useful would be to have a C-level function (parse_url() or
otherwise) that can generate a very well-known and standardized array
structure (i.e., better than what parse_url() produces now) that a
UriInterface implementation could trivially wrap. Basically, a way to
simplify this existing code:

https://github.com/zendframework/zend-diactoros/blob/master/src/Uri.php#L435

And move the conditionals and filter*() sub-calls to C. (Right now they
play games with regexes and hope.)

--Larry Garfield

8 years ago by David Walker

Hi Larry,

I guess I'm not sure why having an RFC/WHATWG-compliant parser would be
problematic with regard to PSR-7. It would be the application developer's
responsibility to take a standardized output and populate their object that
implements UriInterface. WHATWG does seem to mitigate the need for some of
the filter*() calls, but certain ones would still need to be
application-specific.

Although WHATWG does not specify that the URL object has a getAll()-esque
method, it could be beneficial to have something that returns a structure
similar to what parse_url() does today. It could also be beneficial to
just have URL implement ArrayAccess so you wouldn't have to bother with
getting a specific array back, and can just access what you need.
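
A rough sketch of the ArrayAccess idea, backed by today's parse_url() purely for illustration. The class name ParsedUrl, the read-only behaviour and the toArray() accessor are assumptions, not part of WHATWG or of any existing PHP API.

```php
<?php
// Hypothetical read-only view over the components parse_url() returns.
final class ParsedUrl implements ArrayAccess
{
    private $parts;

    public function __construct(string $url)
    {
        $parts = parse_url($url);
        if ($parts === false) {
            throw new InvalidArgumentException("Cannot parse URL: $url");
        }
        $this->parts = $parts;
    }

    public function offsetExists($offset): bool
    {
        return isset($this->parts[$offset]);
    }

    // Attribute silences the PHP 8.1+ return-type notice; PHP 7 reads it as a comment.
    #[\ReturnTypeWillChange]
    public function offsetGet($offset)
    {
        return $this->parts[$offset] ?? null;
    }

    public function offsetSet($offset, $value): void
    {
        throw new LogicException('ParsedUrl is read-only');
    }

    public function offsetUnset($offset): void
    {
        throw new LogicException('ParsedUrl is read-only');
    }

    // The getAll()-style accessor mentioned above.
    public function toArray(): array
    {
        return $this->parts;
    }
}

$url = new ParsedUrl('http://example.org/test.php?x=1');
echo $url['host'], PHP_EOL;        // example.org
var_dump(isset($url['fragment'])); // bool(false)
```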

--
Dave

8 years ago by Larry Garfield

It's not that having an RFC-compliant parser in C is problematic. Quite
the opposite. It's the representation it produces back to user-land
code. Viz, right now the most common PSR-7 implementation uses
parse_url() internally, which as noted is somewhat buggy and
incomplete. If PHP natively provided a better parser that a PSR-7
implementation could use, that's good for everyone.

What would not be helpful is for PHP to natively provide, essentially, a
competitor to PSR-7's Uri object. The raw data parsing can/should live
in C, while the main user-space representation is defined in
user-space. That's the same point that was made for HTTP headers
overall a while back; PHP already has the ability in C to read a stream
and parse it out into headers, a GET array, a POST array, etc. It uses
it for the super-globals. Exposing that capability to user-space would
allow for more efficient and flexible implementations of PSR-7 or similar.
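
A rough sketch of that split, assuming PHP 7.4+ for typed properties. The
class `SimpleUri` is hypothetical: it borrows method names from PSR-7's
UriInterface but does not implement the real interface, so that it runs
without the psr/http-message package. parse_url() again stands in for a
better native parser:

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical immutable value object, loosely modeled on the getters and
 * with*() methods of PSR-7's UriInterface. Not a real PSR-7 implementation.
 */
final class SimpleUri
{
    private string $scheme;
    private string $host;
    private ?int $port;
    private string $path;
    private string $query;

    /** @param array<string, int|string> $parts Array shaped like parse_url() output. */
    public function __construct(array $parts)
    {
        // Normalization lives in user-land here; the suggestion in the
        // thread is to move checks like these into the C-level parser.
        $this->scheme = strtolower((string) ($parts['scheme'] ?? ''));
        $this->host   = strtolower((string) ($parts['host'] ?? ''));
        $this->port   = isset($parts['port']) ? (int) $parts['port'] : null;
        $this->path   = (string) ($parts['path'] ?? '');
        $this->query  = (string) ($parts['query'] ?? '');
    }

    public static function fromString(string $uri): self
    {
        $parts = parse_url($uri); // stand-in for a stricter native parser
        if ($parts === false) {
            throw new InvalidArgumentException("Unparseable URI: $uri");
        }
        return new self($parts);
    }

    public function getScheme(): string { return $this->scheme; }
    public function getHost(): string { return $this->host; }
    public function getPort(): ?int { return $this->port; }
    public function getPath(): string { return $this->path; }
    public function getQuery(): string { return $this->query; }

    /** Returns a modified copy; the original instance is left unchanged. */
    public function withHost(string $host): self
    {
        $copy = clone $this;
        $copy->host = strtolower($host);
        return $copy;
    }
}

$uri   = SimpleUri::fromString('HTTP://Example.COM:8080/a/b?x=1');
$other = $uri->withHost('php.net');
echo $uri->getHost(), ' ', $other->getHost(), "\n"; // example.com php.net
```

The constructor only consumes a plain array, so swapping parse_url() for a
different native parser would not change the user-space object.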

I fully expect that in a few years PSR-7 will be updated and supplanted
with something that leverages newer PHP features, and we would want to
make that transition as smooth as possible. That means having a clear
stack of complementary functionality, not competing "polished"
functionality that would then have to be mapped back and forth in a
clumsy fashion.

--Larry Garfield

8 years ago by Christoph M. Becker

Are you aware of the WHATWG URL standard [1]? Quoting the first goal
statement:

Align RFC 3986 and RFC 3987 with contemporary implementations and
obsolete them in the process. (E.g., spaces, other "illegal" code points,
query encoding, equality, canonicalization, are all concepts not entirely
shared, or defined.) URL parsing needs to become as solid as HTML parsing.

I was not. I assume that WHATWG ought to supersede the IETF standards on
the subject. I can obviously make an implementation follow the standards
and algorithms set out in this doc.

That might be hard, because WHATWG has "living standards", i.e. they can
change over time. If we state that our new functionality conforms to
WHATWG's URL standard we have to always apply the latest changes even
into revisions, potentially causing BC breaks.

--
Christoph M. Becker

8 years ago by David Walker

On Thu, Oct 13, 2016 at 10:54 AM Christoph M. Becker cmbecker69@gmx.de
wrote:

On Fri, Oct 7, 2016 at 4:37 AM Nikita Popov nikita.ppv@gmail.com
wrote:

Are you aware of the WHATWG URL standard [1]? Quoting the first goal
statement:

Align RFC 3986 and RFC 3987 with contemporary implementations and
obsolete them in the process. (E.g., spaces, other "illegal" code points,
query encoding, equality, canonicalization, are all concepts not entirely
shared, or defined.) URL parsing needs to become as solid as HTML parsing.

I was not. I assume that WHATWG ought to supersede the IETF standards on
the subject. I can obviously make an implementation follow the standards
and algorithms set out in this doc.

That might be hard, because WHATWG has "living standards", i.e. they can
change over time. If we state that our new functionality conforms to
WHATWG's URL standard we have to always apply the latest changes even
into revisions, potentially causing BC breaks.

We could say that we support WHATWG@f88f96 plus the previous two or three
revisions, which would preserve some BC. But yes, it would be a nuisance
and overly complex to keep updating the parser for frivolous changes (like
that commit) which re-makes fragments an optional part of a URL.

I'd be more apt to stick to a single 3986/3987 compatible parser, which
ought to be future compatible with WHATWG. It would just lack any of the
standardization terms, object models, and rigid-parser definitions. That's
to say that WHATWG maintains the requirements of the two RFCs and just
expands on them.

This also fits with what Larry is saying by keeping the parsing and the
object side of things separate. WHATWG does have a very nice layout of the
URL/SearchParams, and how they can play with each other. I'd think that end
of the spec should be left for userland classes, in the event a new PSR
wants to propose implementing the WHATWG format of URL/SearchParams/Hosts,
etc.

--
Dave
