Newsgroups: php.internals
From: nikita.ppv@gmail.com (Nikita Popov)
Date: Fri, 7 Oct 2016 12:37:23 +0200
Subject: Re: [PHP-DEV] [RFC] Bug #72811 - Replacing parse_url()
To: David Walker
Cc: PHP internals

On Tue, Oct 4, 2016 at 8:14 PM, David Walker wrote:

> Hi all,
>
> A couple weeks back I took a look at 72811[1]. The bug being that
> parse_url() didn't accept IPv6 addresses without a scheme, like it did for
> IPv4 addresses. I attempted to patch the specific bug within the scope of
> how parse_url() was processing URIs. After opening a PR for the
> resolution, Yasuo and Christoph both chimed in that perhaps replacing the
> implementation with an re2c based parser would be better. We found a
> parser[2] that did almost everything necessary. I took it and made it more
> strictly adhere to RFC3986[3].
>
> I have updated my original PR[4] and created a RFC[5] that aims to replace
> the parsing of parse_url() to be more strict to RFC3986. This will provide
> a BC break, as explained in the RFC that at very least warrants some
> discussion. We had kicked around the idea on the PR of deprecating
> parse_url, and creating a new function with the more-compliant parser, but
> opted against it.
>
> I'm looking for discussion on if a total replacement is the preferred way
> to go about this, and if, we should be making parse_url() more standards
> strict. Since it today has many breaks with RFC3986 that provide
> semi-reasonable parsing patterns.
>
> --
> Dave
>
> [1] - https://bugs.php.net/bug.php?id=72811
> [2] - https://github.com/staskobzar/url_parser_re2c
> [3] - https://tools.ietf.org/html/rfc3986
> [4] - https://github.com/php/php-src/pull/2079
> [5] - https://wiki.php.net/rfc/replace_parse_url
>

Are you aware of the WHATWG URL standard [1]? Quoting the first goal statement:

> Align RFC 3986 and RFC 3987 with contemporary implementations and obsolete them in the process. (E.g., spaces, other "illegal" code points, query encoding, equality, canonicalization, are all concepts not entirely shared, or defined.) URL parsing needs to become as solid as HTML parsing.

Basically this is the standard that describes how URL parsing actually works in the wild, in browser implementations. In particular it also includes a description of URL parsing in algorithmic form, including specific directions as to which errors are fatal and which are not.

Also quoting from the goals:

> Standardize on the term URL. URI and IRI are just confusing. In practice a single algorithm is used for both so keeping them distinct is not helping anyone. URL also easily wins the search result popularity contest.

For this reason, I would recommend against introducing the term "URI" anywhere. In particular the suggestion from this thread to use parse_uri() for this functionality seems like it will cause a lot of confusion.

The URL standard also specifies the interface of the URL object used by JavaScript and I think we should consider whether we may want to simply adopt this (object-oriented) interface (potentially with adjustments for PHP specifics).
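
For illustration, here is a rough sketch of how that URL object behaves on the JavaScript side, written in TypeScript. It assumes a runtime that provides the WHATWG URL class as a global (browsers, Node.js). The example URLs and the urlParts() helper are made up for this sketch; nothing here is a proposed PHP API.

```typescript
// Sketch of the WHATWG URL object as exposed to JavaScript/TypeScript.
// Assumes a runtime with a global URL class (browsers, Node.js).
// All URLs and the urlParts() helper are hypothetical examples.

// Collect the components roughly comparable to what parse_url() returns.
function urlParts(u: URL): Record<string, string> {
  return {
    scheme: u.protocol, // includes the trailing ":"
    host: u.hostname,   // IPv6 literals keep their brackets
    port: u.port,       // empty string when absent or the scheme default
    path: u.pathname,
    query: u.search,    // includes the leading "?" when non-empty
    fragment: u.hash,   // includes the leading "#" when non-empty
  };
}

// An absolute URL parses on its own.
const absoluteParts = urlParts(new URL("https://example.com:8080/a/b?x=1#top"));

// A relative reference is resolved against the optional second argument.
const resolved = new URL("../c?y=2", "https://example.com/a/b/");

// Scheme-less input with an IPv6 host: the scheme is taken from the base.
const ipv6 = new URL("//[::1]:8080/path", "http://example.com/");

// Without a base, the same scheme-less input is a fatal error (TypeError).
let failedWithoutBase = false;
try {
  new URL("//[::1]:8080/path");
} catch (e) {
  failedWithoutBase = e instanceof TypeError;
}

console.log(absoluteParts, resolved.href, ipv6.href, failedWithoutBase);
```

Note that the getters keep the delimiters (":" on the scheme, "?" on the query, "#" on the fragment), which is one of the places where a PHP version might want adjustments.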

I think an important part of this interface is that the URL is constructed using URL(url [, base]), where "base" is the base URL against which relative URLs are resolved. This base URL is required for parsing non-absolute URLs. To me this makes a lot of sense and I think it makes it much clearer how "incomplete" URLs are being treated.

While we're at it, what's the state of IDN? May this be the time to properly support it?

Nikita

[1]: https://url.spec.whatwg.org/