Potential RFC: mb_rawurlencode() ?

3 months ago by Paul M. Jones

Hi all,

The discussion around WHATWG-URL on this list, as well as my work coordinating Uri-Interop https://github.com/uri-interop/interface, has led me to think PHP needs a multibyte equivalent of rawurlencode().

Broadly speaking, as far as I can tell:

For an RFC 3986 URI, delimiters need to be percent-encoded, as well as non-ASCII characters.
For an RFC 3987 IRI, delimiters need to be percent-encoded, but UCS characters do not.

(There are other details but I think you get the idea.)

The rawurlencode() function does fine for URIs, but not for IRIs. Using rawurlencode() for an IRI will encode multibyte characters when it should leave them alone. For example:

$val = 'fü bar';

$uriPath = '/heads/' . rawurlencode($val) . '/tails/';
assert($uriPath === '/heads/f%C3%BC%20bar/tails/'); // true

$iriPath = '/heads/' . rawurlencode($val) . '/tails/';
assert($iriPath === '/heads/fü bar/tails/'); // false: rawurlencode() percent-encodes the ü and the space

(This might apply to WHATWG-URL component construction as well.)

Have I missed something, either in the specs or in PHP itself?

If not, how do we feel about an RFC for mb_rawurlencode()? A naive userland implementation might look something like the code below.

Thoughts?

function mb_rawurlencode(string $string) : string
{
    $encoded = '';

    foreach (mb_str_split($string) as $char) {
        $encoded .= match ($char) {
            chr(0) => "%00",
            chr(1) => "%01",
            chr(2) => "%02",
            chr(3) => "%03",
            chr(4) => "%04",
            chr(5) => "%05",
            chr(6) => "%06",
            chr(7) => "%07",
            chr(8) => "%08",
            chr(9) => "%09",
            chr(10) => "%0A",
            chr(11) => "%0B",
            chr(12) => "%0C",
            chr(13) => "%0D",
            chr(14) => "%0E",
            chr(15) => "%0F",
            chr(16) => "%10",
            chr(17) => "%11",
            chr(18) => "%12",
            chr(19) => "%13",
            chr(20) => "%14",
            chr(21) => "%15",
            chr(22) => "%16",
            chr(23) => "%17",
            chr(24) => "%18",
            chr(25) => "%19",
            chr(26) => "%1A",
            chr(27) => "%1B",
            chr(28) => "%1C",
            chr(29) => "%1D",
            chr(30) => "%1E",
            chr(31) => "%1F",
            chr(127) => "%7F",
            "!" => '%21',
            "#" => '%23',
            "$" => '%24',
            "%" => '%25',
            "&" => '%26',
            "'" => '%27',
            "(" => '%28',
            ")" => '%29',
            "*" => '%2A',
            "+" => '%2B',
            "," => '%2C',
            "/" => '%2F',
            ":" => '%3A',
            ";" => '%3B',
            "=" => '%3D',
            "?" => '%3F',
            "[" => '%5B',
            "]" => '%5D',
            default => $char,
        };
    }

    return $encoded;
}

-- pmj

3 months ago by tim@bastelstu.be

On 2025-03-18 18:48, Paul M. Jones wrote:

> $iriPath = '/heads/' . rawurlencode($val) . '/tails/';
> assert($iriPath === '/heads/fü bar/tails/'); // false

From my reading of RFC 3987 that result is incorrect. The space is
listed neither as iunreserved nor as sub-delims, and thus isn't a valid
ipchar. The space therefore needs to be encoded as %20 for IRIs as well.
The same mistake applies to the reference userland implementation below.

Best regards
Tim Düsterhus

3 months ago by Paul M. Jones

Hi Tim & all,

> On 2025-03-18 18:48, Paul M. Jones wrote:
>
>> $iriPath = '/heads/' . rawurlencode($val) . '/tails/';
>> assert($iriPath === '/heads/fü bar/tails/'); // false
>
> From my reading of RFC 3987 that result is incorrect. The space is listed neither as iunreserved nor as sub-delims, and thus isn't a valid ipchar. The space therefore needs to be encoded as %20 for IRIs as well. The same mistake applies to the reference userland implementation below.

Agreed; the naive implementation would need to be less naive and pay closer attention to the ABNF for ucschar and ipchar in the spec.
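
As a rough sketch of what "less naive" might mean, the version below keeps only the iunreserved set from the RFC 3987 ABNF (ALPHA, DIGIT, "-", ".", "_", "~", and ucschar) and percent-encodes everything else, so the space becomes %20 while ü is left alone. The name iri_encode_segment() is hypothetical, the input is assumed to be valid UTF-8, and sub-delims are deliberately encoded the way rawurlencode() encodes them; none of this is a settled design.

```php
<?php

// Hypothetical helper, not part of PHP: percent-encodes a single IRI
// path segment. Assumes (and checks) that the input is valid UTF-8.
function iri_encode_segment(string $segment): string
{
    // The empty pattern with the /u flag only matches valid UTF-8 subjects.
    if (preg_match('//u', $segment) !== 1) {
        throw new InvalidArgumentException('Expected a valid UTF-8 string.');
    }

    // ucschar ranges as listed in the RFC 3987 ABNF.
    $ucschar = '\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}';
    for ($plane = 1; $plane <= 13; $plane++) {
        // Produces e.g. \x{10000}-\x{1FFFD} for plane 1.
        $ucschar .= sprintf('\x{%X0000}-\x{%XFFFD}', $plane, $plane);
    }
    $ucschar .= '\x{E1000}-\x{EFFFD}';

    // Any character outside iunreserved is percent-encoded byte by byte.
    $encoded = preg_replace_callback(
        '/[^A-Za-z0-9\-._~' . $ucschar . ']/u',
        static fn (array $m): string => rawurlencode($m[0]),
        $segment
    );

    if ($encoded === null) {
        throw new RuntimeException('Encoding failed.');
    }

    return $encoded;
}

$iriPath = '/heads/' . iri_encode_segment("f\u{FC} bar") . '/tails/';
// $iriPath is '/heads/fü%20bar/tails/'
```

Whether sub-delims should stay encoded depends on the component being built; this sketch simply mirrors what rawurlencode() does for them.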

Along those lines, I think there might need to be two additional changes/additions to help with encoding for RFC 3987 and WHATWG-URL component values:

http_build_query() would need PHP_QUERY_3987 and PHP_QUERY_WHATWG flags and corresponding logic (or entirely new functions); and
parse_str() would need a corresponding mb_parse_str().

-- pmj

3 months ago by youkidearitai

---------- Forwarded message ---------
From: youkidearitai <youkidearitai@gmail.com>
Date: Thu, 20 Mar 2025 14:41
Subject: Re: [PHP-DEV] Potential RFC: mb_rawurlencode() ?
To: Paul M. Jones <pmjones@pmjones.io>

On Wed, 19 Mar 2025 at 2:52, Paul M. Jones <pmjones@pmjones.io> wrote:

> If not, how do we feel about an RFC for mb_rawurlencode()? A naive userland implementation might look something like the code below.
>
> Thoughts?
> function mb_rawurlencode(string $string) : string
> [snip]

Hi, Paul.

I think the signature should be as follows:

function mb_rawurlencode(string $string, string $encode): string {}

That is because the mbstring functions also handle encodings other than
Unicode (ISO-8859-1 to ISO-8859-16, Shift_JIS, EUC-*, etc.).
Other than that, I'm not sure yet.

Oops, I forgot to send this to internals.
Sorry, resending.

Yuya

--

Yuya Hamada (tekimen)

3 months ago by Paul M. Jones

Hi all,

> ---------- Forwarded message ---------
> From: youkidearitai <youkidearitai@gmail.com>
> Date: Thu, 20 Mar 2025 14:41
> Subject: Re: [PHP-DEV] Potential RFC: mb_rawurlencode() ?
> To: Paul M. Jones <pmjones@pmjones.io>
>
> On Wed, 19 Mar 2025 at 2:52, Paul M. Jones <pmjones@pmjones.io> wrote:
>> If not, how do we feel about an RFC for mb_rawurlencode()? A naive userland implementation might look something like the code below.
>>
>> Thoughts?
>> function mb_rawurlencode(string $string) : string
>
> Hi, Paul.
>
> I think the signature should be as follows:
> function mb_rawurlencode(string $string, string $encode): string {}

Ah yes, you're right -- probably ?string $encode = null to match with mb_substr().

> Oops, I forgot to send this to internals.
> Sorry, resending.

Not to worry, thank you!

-- pmj

3 months ago by tim@bastelstu.be

On 2025-03-20 17:46, Paul M. Jones wrote:

>> function mb_rawurlencode(string $string, string $encode): string {}
> Ah yes, you're right -- probably ?string $encode = null to match with
> mb_substr().

I am not sure if that signature makes sense and if the proposed
functionality fits into mbstring for that reason. IRIs are defined as
UTF-8, any other encoding results in invalid output / results that are
not interoperable. As one example paragraph from RFC 3987:

Conversions from URIs to IRIs MUST NOT use any character encoding
other than UTF-8 in steps 3 and 4, even if it might be possible to
guess from the context that another character encoding than UTF-8 was
used in the URI.

The correct solution to me is to build a properly thought-through API as
part of the proposed new Uri namespace, rather than adding new standalone
functions without a clear vision.

Best regards
Tim Düsterhus

3 months ago by Rowan Tommins [IMSoP]

> I am not sure if that signature makes sense and if the proposed
> functionality fits into mbstring for that reason. IRIs are defined as
> UTF-8, any other encoding results in invalid output / results that are
> not interoperable.

This confirms a nagging feeling I had when I first saw the thread: the
name "mb_rawurlencode" implies "do the same things as rawurlencode, but
for multi-byte strings", but that's not what is being proposed.

Notably, a similar feature is actually slated for removal; to quote
https://www.php.net/manual/en/migration82.deprecated.php#migration82.deprecated.mbstring

Usage of the QPrint, Base64, Uuencode, and HTML-ENTITIES 'text
encodings' is deprecated for all MBString functions. Unlike all the
other text encodings supported by MBString, these do not encode a
sequence of Unicode codepoints, but rather a sequence of raw bytes. It
is not clear what the correct return values for most MBString functions
should be when one of these non-encodings is specified.

The same applies here: if you write mb_rawurlencode($my_string,
'SHIFT-JIS'), does that mean convert what you can to ASCII, and percent
encode the rest for a URI; or does it mean convert to UTF-8, and percent
encode as necessary for an IRI? If the input contains sequences which
are not valid SHIFT-JIS, are those bytes treated as unencodable
(producing errors or substitution characters), or are they directly
percent encoded?
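
To make the ambiguity concrete, here is a small sketch of the two readings using only existing functions. The sample string and variable names are illustrative assumptions; the point is only that the same Shift_JIS input yields two different, equally plausible outputs.

```php
<?php

// "日本" as UTF-8, then converted to Shift_JIS bytes (requires ext/mbstring).
$utf8 = "\u{65E5}\u{672C}";
$sjis = mb_convert_encoding($utf8, 'SJIS', 'UTF-8');

// Reading 1: percent-encode the Shift_JIS bytes exactly as they are.
$rawBytes = rawurlencode($sjis);
// %93%FA%96%7B

// Reading 2: convert to UTF-8 first, then percent-encode.
$viaUtf8 = rawurlencode(mb_convert_encoding($sjis, 'UTF-8', 'SJIS'));
// %E6%97%A5%E6%9C%AC
```

A receiver that assumes UTF-8 will only decode the second form back to the original text.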

> The correct solution to me is to build a properly thought-through API as
> part of the proposed new Uri namespace, rather than adding new standalone
> functions without a clear vision.

I completely agree.

For instance, the IRI standard does include an algorithm for converting
a non-Unicode IRI representation to a URI - but it requires a Unicode
Normalization step, which is a complex algorithm not included in
ext/standard or ext/mbstring, only ext/intl. However, a function in the
URI namespace that only handled the UTF-8 input case might still be useful.

> Along those lines, I think there might need to be two additional changes/additions to help with encoding for RFC 3987 and WHATWG-URL component values:
>
> http_build_query() would need PHP_QUERY_3987 and PHP_QUERY_WHATWG flags and corresponding logic (or entirely new functions); and
>
> parse_str() would need a corresponding mb_parse_str().

I haven't followed the other URI thread at all, but isn't replacing the
scattered standard library functions with a consistent API the whole
point of that effort?

parse_str() in particular has a non-descriptive name, and a weird
function signature because it used to directly overwrite variables by name.

As a comparison, we didn't extend the shuffle() function with an
algorithm parameter, we added a shuffleArray() method to the new
Randomizer class.
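
For anyone who hasn't used it, the contrast looks roughly like this (PHP 8.2+; the seed and sample data are arbitrary):

```php
<?php

$items = ['a', 'b', 'c', 'd'];

// Older style: shuffle() reorders the array in place using the global RNG.
$inPlace = $items;
shuffle($inPlace);

// Newer style: the algorithm is chosen by the engine object passed to the
// Randomizer, and shuffleArray() returns a new array instead of mutating.
$randomizer = new \Random\Randomizer(new \Random\Engine\Mt19937(1234));
$shuffled = $randomizer->shuffleArray($items);
```

The input array is left untouched in the second form, and the choice of algorithm lives in the object rather than in an extra function parameter.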

--
Rowan Tommins
[IMSoP]

3 months ago by Paul M. Jones

Hi Rowan & all,

>> I am not sure if that signature makes sense and if the proposed functionality fits into mbstring for that reason. IRIs are defined as UTF-8, any other encoding results in invalid output / results that are not interoperable.
>
> This confirms a nagging feeling I had when I first saw the thread: the name "mb_rawurlencode" implies "do the same things as rawurlencode, but for multi-byte strings", but that's not what is being proposed.

[snip]

No argument; my point is more "if we are going to do IRI and WHATWG-URL, we're going to need some additional support functionality around encoding component values for them." How that is achieved is up for grabs. If this discussion has revealed a tentative consensus that it needs to happen, I consider it a success.

Next up: what exactly should the API around this functionality look like? I suggested functions but that's clearly a non-starter; what do we feel is a good alternative, and can it be achieved independently from (but in support of) the URI+WHATWG-URL proposal?

-- pmj

3 months ago by Rowan Tommins [IMSoP]

> Next up: what exactly should the API around this functionality look like? I suggested functions but that's clearly a non-starter; what do we feel is a good alternative, and can it be achieved independently from (but in support of) the URI+WHATWG-URL proposal?

As I say, I haven't followed the previous conversation at all, but from a glance at the RFC, it seems the proposed classes are called "Url"/"Uri", not "UrlParser"/"UriParser", so could maybe be expanded to creating from parts. I don't know where exactly IRIs should fit in, but maybe as a new object in the same hierarchy?

There's also definitely a place for standalone functions for handling specific jobs on fragments of URIs. It would actually be really great to have a replacement for parse_str which didn't carry the baggage of old PHP versions - no by-reference output, no name mangling of keys (at least not by default). http_build_query isn't as urgently in need of replacement, but a clean start could default the separator to '&' rather than pulling from an INI setting.
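
As a very rough illustration of the parse_str side, a baggage-free replacement might look something like the sketch below. The name parse_query_pairs() is hypothetical, and splitting only on '&' and decoding '+' as a space via urldecode() are assumptions, not a proposal for the final behaviour.

```php
<?php

// Hypothetical helper, not part of PHP: returns the query string as a list
// of [key, value] pairs, with no by-reference output and no key mangling.
function parse_query_pairs(string $query): array
{
    $pairs = [];

    foreach (explode('&', $query) as $part) {
        if ($part === '') {
            continue; // skip empty segments such as a trailing "&"
        }

        // Split on the first "=" only; a missing value becomes ''.
        [$key, $value] = array_pad(explode('=', $part, 2), 2, '');
        $pairs[] = [urldecode($key), urldecode($value)];
    }

    return $pairs;
}

$pairs = parse_query_pairs('a.b=1&c+d=x%20y');
// [['a.b', '1'], ['c d', 'x y']]

// For comparison, parse_str() rewrites "." and " " in keys to "_".
parse_str('a.b=1&c+d=x%20y', $legacy);
// ['a_b' => '1', 'c_d' => 'x y']
```

Returning pairs rather than an associative array also keeps repeated keys intact, which parse_str() can only do through its bracket syntax.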

Whether each function should take an enum flag for encoding variants, or be split into a family of similar functions, I don't know. At the moment, http_build_query accepts constants, (raw)urlencode is split into two functions, and parse_str doesn't give any option.

I don't want to have to memorise a bunch of RFC numbers in order to know whether spaces will be encoded as plus signs, but maybe we can find something more descriptive than "raw" to distinguish them.
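
For reference, this is how the existing variants differ today on the example value from earlier in the thread; the only visible difference is how the space is encoded:

```php
<?php

$value = "f\u{FC} bar"; // 'fü bar'
$data = ['q' => $value];

// Form-style encoding (the default, PHP_QUERY_RFC1738): space becomes "+".
$formStyle = http_build_query($data);
// q=f%C3%BC+bar

// RFC 3986 style: space becomes "%20".
$rfc3986 = http_build_query($data, '', '&', PHP_QUERY_RFC3986);
// q=f%C3%BC%20bar

// The same split exists between the two standalone functions.
$plus = urlencode($value);    // f%C3%BC+bar
$raw = rawurlencode($value);  // f%C3%BC%20bar
```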

In short, I wouldn't start from the point of "how do we extend current functions to handle IRIs?", I'd start from the point of "what functions do we need for handling URI/URL/IRI parts, and what variations of each?"

Rowan Tommins
[IMSoP]