Hi all,
The discussion around WHATWG-URL on this list, as well as my work coordinating Uri-Interop https://github.com/uri-interop/interface, has led me to think PHP needs a multibyte equivalent of rawurlencode().
Broadly speaking, as far as I can tell:
- For an RFC 3986 URI, delimiters need to be percent-encoded, as well as non-ASCII characters.
- For an RFC 3987 IRI, delimiters need to be percent-encoded, but UCS characters do not.
(There are other details but I think you get the idea.)
The rawurlencode() function does fine for URIs, but not for IRIs. Using rawurlencode() for an IRI will encode multibyte characters when it should leave them alone. For example:
$val = 'fü bar';
$uriPath = '/heads/' . rawurlencode($val) . '/tails/';
assert($uriPath === '/heads/f%C3%BC%20bar/tails/'); // true
$iriPath = '/heads/' . rawurlencode($val) . '/tails/';
assert($iriPath === '/heads/fü bar/tails/'); // false
(This might apply to WHATWG-URL component construction as well.)
Have I missed something, either in the specs or in PHP itself?
If not, how do we feel about an RFC for mb_rawurlencode()? A naive userland implementation might look something like the code below.
Thoughts?
function mb_rawurlencode(string $string) : string
{
    $encoded = '';
    foreach (mb_str_split($string) as $char) {
        $encoded .= match ($char) {
            chr(0) => "%00",
            chr(1) => "%01",
            chr(2) => "%02",
            chr(3) => "%03",
            chr(4) => "%04",
            chr(5) => "%05",
            chr(6) => "%06",
            chr(7) => "%07",
            chr(8) => "%08",
            chr(9) => "%09",
            chr(10) => "%0A",
            chr(11) => "%0B",
            chr(12) => "%0C",
            chr(13) => "%0D",
            chr(14) => "%0E",
            chr(15) => "%0F",
            chr(16) => "%10",
            chr(17) => "%11",
            chr(18) => "%12",
            chr(19) => "%13",
            chr(20) => "%14",
            chr(21) => "%15",
            chr(22) => "%16",
            chr(23) => "%17",
            chr(24) => "%18",
            chr(25) => "%19",
            chr(26) => "%1A",
            chr(27) => "%1B",
            chr(28) => "%1C",
            chr(29) => "%1D",
            chr(30) => "%1E",
            chr(31) => "%1F",
            chr(127) => "%7F",
            "!" => '%21',
            "#" => '%23',
            "$" => '%24',
            "%" => '%25',
            "&" => '%26',
            "'" => '%27',
            "(" => '%28',
            ")" => '%29',
            "*" => '%2A',
            "+" => '%2B',
            "," => '%2C',
            "/" => '%2F',
            ":" => '%3A',
            ";" => '%3B',
            "=" => '%3D',
            "?" => '%3F',
            "[" => '%5B',
            "]" => '%5D',
            default => $char,
        };
    }
    return $encoded;
}
-- pmj
Hi
On 2025-03-18 18:48, Paul M. Jones wrote:
> $iriPath = '/heads/' . rawurlencode($val) . '/tails/';
> assert($iriPath === '/heads/fü bar/tails/'); // false
From my reading of RFC 3987 that result is incorrect. The space is listed neither as iunreserved nor as sub-delims, and thus isn't a valid ipchar. Thus the space needs to be encoded as %20 for IRIs as well.
The same mistake applies to the reference userland implementation below.
Best regards
Tim Düsterhus
Hi Tim & all,
> On 2025-03-18 18:48, Paul M. Jones wrote:
>> $iriPath = '/heads/' . rawurlencode($val) . '/tails/';
>> assert($iriPath === '/heads/fü bar/tails/'); // false
>
> From my reading of RFC 3987 that result is incorrect. The space is listed neither as iunreserved nor as sub-delims, and thus isn't a valid ipchar. Thus the space needs to be encoded as %20 for IRIs as well. The same mistake applies to the reference userland implementation below.
Agreed; the naive implementation would need to be less naive and pay closer attention to the ABNF for ucschar and ipchar in the spec.
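As a rough illustration of what following the ABNF more closely could look like, here is a minimal userland sketch. It assumes the goal is a rawurlencode() analogue that leaves only the RFC 3987 iunreserved characters (ALPHA, DIGIT, "-", ".", "_", "~", and ucschar) untouched and percent-encodes everything else, including the space and the sub-delims. The function name iri_rawurlencode_sketch() is hypothetical, and falling back to rawurlencode() for input that is not valid UTF-8 is an assumption, not something the spec prescribes.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: percent-encode everything except RFC 3987
 * iunreserved characters. Not an existing PHP function.
 */
function iri_rawurlencode_sketch(string $string): string
{
    // ucschar ranges in the Basic Multilingual Plane, per the RFC 3987 ABNF,
    // written as PCRE code point ranges.
    $ucschar = '\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}';

    // Planes 1 through 13: %x10000-1FFFD up to %xD0000-DFFFD.
    for ($plane = 0x1; $plane <= 0xD; $plane++) {
        $ucschar .= sprintf('\x{%X}-\x{%X}', $plane << 16, ($plane << 16) | 0xFFFD);
    }

    // Plane 14 starts at E1000 rather than E0000.
    $ucschar .= '\x{E1000}-\x{EFFFD}';

    // Match any single code point that is NOT iunreserved.
    $pattern = '/[^A-Za-z0-9\-._~' . $ucschar . ']/u';

    // rawurlencode() percent-encodes each UTF-8 byte of the matched code point.
    $encoded = preg_replace_callback(
        $pattern,
        static fn (array $m): string => rawurlencode($m[0]),
        $string
    );

    // Assumption: preg_replace_callback() returns null on invalid UTF-8
    // subjects, in which case every non-unreserved byte gets encoded.
    return $encoded ?? rawurlencode($string);
}

echo '/heads/' . iri_rawurlencode_sketch('fü bar') . '/tails/', PHP_EOL;
// prints: /heads/fü%20bar/tails/
```

Building the negated character class from the ABNF ranges keeps the list of exceptions in one place, instead of enumerating every character to encode as the match() version does. Private-use code points, which RFC 3987 permits only in the query (iprivate), are percent-encoded by this sketch.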
Along those lines, I think there might need to be two additional changes/additions to help with encoding for RFC 3987 and WHATWG-URL component values:
- http_build_query() would need PHP_QUERY_3987 and PHP_QUERY_WHATWG flags and corresponding logic (or entirely new functions); and
- parse_str() would need a corresponding mb_parse_str().
-- pmj