It was suggested I post this here.
In PHP, the character sequence "://" separates the protocol name from
the protocol-specific part of a stream name. Clearly, the intention is
that these stream names are URLs (i.e., URIs that actually provide a
location for the identified resource). However, the URI specification
(RFC 3986) states that the scheme delimiter is merely ":", and that
"://" is only applicable for some URI formats. There are URLs in
common use that do not use "://" (e.g., mailto:), and in fact support
for "zlib:" is a hardwired exception in the present code.
May I propose that the parser which parses out the scheme from the rest
of the URL look only for the initial ":" in the stream name, rather than
"://".
Existing uses of stream wrappers will continue to function, since the
name of the scheme won't actually change, and it's the wrapper author's
responsibility to parse the rest of the URL anyway; but it will become
possible to correctly write, e.g., "mailto:eric@example.com" instead of
"mailto://eric@example.com", or (to use the example used in the manual
to describe stream_wrapper_register()) "var:myvar" instead of "var://myvar".
This would also make use of parse_url() more consistent as, for example,
parse_url('var:myvar') will put the name of myvar into the path element
of the returned array, instead of mistakenly putting it in the host element.
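To make the proposed rule concrete, here is a minimal C sketch of splitting off the scheme at the first ':' using the RFC 3986 scheme grammar. The function name scheme_length is hypothetical and is not part of PHP's source; the sketch only shows the matching rule, not how the streams layer would use it.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of PHP): returns the length of the scheme
 * at the start of `name`, or 0 if `name` does not begin with
 * ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) followed by ':'.
 */
size_t scheme_length(const char *name)
{
    const char *p = name;

    /* RFC 3986 requires the scheme to start with a letter. */
    if (!isalpha((unsigned char)*p)) {
        return 0;
    }
    for (p++; isalnum((unsigned char)*p) || *p == '+' || *p == '-' || *p == '.'; p++) {
        /* keep scanning scheme characters */
    }
    /* Only the ':' is required; "//" belongs to the scheme-specific part. */
    return (*p == ':') ? (size_t)(p - name) : 0;
}
```

Under this rule "mailto:eric@example.com" has a scheme of length 6 and "var:myvar" one of length 3, but a Windows path such as "C:\dir" also reports a one-character scheme, so any real change would still have to resolve that ambiguity.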
MLO
You know, I could have sworn that I only looked for a : as the separator.
I seem to remember a discussion about this in the past, but don't
recall the details.
Would you mind searching the archives for the old php-dev mailing list
to see if you can find anything else on this matter?
--Wez.
You know, I could have sworn that I only looked for a : as the separator.
Not according to php_stream_locate_url_wrapper():
for (p = path; isalnum((int)*p) || *p == '+' || *p == '-' || *p == '.'; p++)
    n++;
if ((*p == ':') && (n > 1) && !strncmp("://", p, 3)) {
    protocol = path;
} else if (strncasecmp(path, "zlib:", 5) == 0) {
    ...
}
Which of course says: "If it ain't [a-zA-Z0-9+.-]+://, and it ain't zlib:,
then it ain't a wrapper."
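The check quoted above can be restated as a small standalone predicate, which makes it easy to try individual names against it. This is only a sketch: looks_like_wrapper is a hypothetical name, the function is not the actual PHP code, and it assumes a POSIX system for strncasecmp().

```c
#include <ctype.h>
#include <string.h>
#include <strings.h>  /* strncasecmp (POSIX) */

/*
 * Hypothetical restatement of the quoted check; not the real PHP function.
 * Returns 1 if `path` would be treated as a wrapper name, 0 otherwise.
 */
int looks_like_wrapper(const char *path)
{
    const char *p;
    int n = 0;

    /* Count the leading run of scheme characters. */
    for (p = path; isalnum((unsigned char)*p) || *p == '+' || *p == '-' || *p == '.'; p++) {
        n++;
    }
    if (*p == ':' && n > 1 && strncmp("://", p, 3) == 0) {
        return 1;   /* at least two scheme characters followed by "://" */
    }
    if (strncasecmp(path, "zlib:", 5) == 0) {
        return 1;   /* the hardwired exception */
    }
    return 0;
}
```

Because of the n > 1 test, a single letter followed by "://" is never taken as a scheme, and "mailto:eric@example.com" is rejected because "://" does not follow the scheme.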
I seem to remember a discussion about this in the past, but don't
recall the details.
The only thing that positively leaps to mind is the ability to
read-from/write-to alternate data streams under win32. But I suppose that
could be worked around easily enough. If nothing else an explicit file://
scheme when that's needed would remedy the problem. Remind me what the
*nixes do with ':' ?
-Sara
It's not 'nix you want to worry about, but systems where : identifies
a drive or device; win32, vms (?), (and Amiga, if that still counts :)
Could be more trouble than it's worth, BC wise.
--Wez.
The only thing that positively leaps to mind is the ability to
read-from/write-to alternate data streams under win32. But I suppose that
could be worked around easily enough. If nothing else an explicit file://
scheme when that's needed would remedy the problem. Remind me what the
*nixes do with ':' ?
-Sara
It's not 'nix you want to worry about, but systems where : identifies
a drive or device; win32, vms (?), (and Amiga, if that still counts :)
Doi, drive letters....
So yeah that can be worked around with a:
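#ifdef PHP_WIN32
/* p points at the ':' found by the scan; p[1] is the character after it */
if (p == path + 1 && isalpha((unsigned char)*path) && (p[1] == '/' || p[1] == '\\')) {
    /* win32 drive */
}
#endif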
But then you've also got the ADS workaround:
#ifdef PHP_WIN32
{
    char *q;
    /* skip any digits that follow the ':' */
    for (q = p + 1; isdigit((unsigned char)*q); q++);
    if (!*q) { /* filename with ADS identifier */ }
}
#endif
Could be more trouble than it's worth, BC wise.
And god knows what else for other platforms which give ':' meaning. It
comes down to "How ugly does one want to get?".
-Sara
If there's something that looks like a scheme (i.e., a well-formed
sequence of characters followed by ':'), see if it's registered; if it
is, the appropriate wrapper should be used. Otherwise, on platforms
where ':' has significance, try it again as a file path. Otherwise, it
fails due to an absent stream wrapper.
This limits problems to users who are trying to access a child directory
of the current path which happens to have the same name as a registered
scheme. The problems will consist of the stream failing because the URL
it's receiving is bogus. People in such a situation can use the file:
scheme explicitly to disambiguate (assuming they can't have a directory
whose name starts with a double slash!).
Alternately, on problem platforms, if the string is ambiguous, see if it
is well-formed as a file path. If it is, try it as such. If it's not, or
it fails, see if it starts with a registered scheme name and if so, try
that.
It would be easier to check if a string is a well-formed file path than
it is to check if it's a valid URL according to some arbitrary scheme
(impossible in general).
Assuming no-one tries to register a one-letter scheme, the Windows build
can get away with seeing if the "scheme" is only one letter long, and if
it is, assume that it's a drive letter.
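The one-letter heuristic is small enough to sketch directly. The helper name starts_with_drive_letter is hypothetical, and the sketch assumes, as the paragraph does, that nobody registers a one-letter scheme.

```c
#include <ctype.h>

/*
 * Hypothetical helper: returns 1 if `name` begins with a single letter
 * immediately followed by ':', which the heuristic treats as a drive
 * letter rather than a scheme; returns 0 otherwise.
 */
int starts_with_drive_letter(const char *name)
{
    /* name[1] is only read when name[0] is a letter, so "" is safe. */
    return isalpha((unsigned char)name[0]) && name[1] == ':';
}
```

A name such as "C:\dir" or "c:/dir" is classified as a drive path, while "var:myvar" is left for scheme lookup. The check says nothing about longer device names, which need separate handling.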
I dunno; Windows users use '/' as the directory separator in file://
URLs, since it's supposed to be up to the application to map a URL to
the actual resource ('\' is frequently tolerated however) in whatever
platform-specific manner is appropriate; does the same hold for
platforms that use ':' as the directory separator in their file: URLs?
Currently the standard for file URLs is still RFC1738; this is supposed
to be updated at some stage, however. It explicitly gives the VMS
example of mapping
DISK$USER:[MY.NOTES]NOTE123456.TXT
on vms.host.edu to the URL
file://vms.host.edu/disk$user/my/notes/note12345.txt
and noting that to refer to the local machine the host part can be either
'localhost' or the empty string.
If there's something that looks like a scheme (i.e., a well-formed
sequence of characters followed by ':'), see if it's registered; if it
is, the appropriate wrapper should be used. Otherwise, on platforms
where ':' has significance, try it again as a file path. Otherwise, it
fails due to an absent stream wrapper.
This is flawed, as we'd then need to introduce sanity checks to
prevent registering a handler over the top of an existing drive on
windows. Similarly, we'd then be touching the filesystem before
deciding if we should touch the filesystem, which is just wrong (and
slow).
This limits problems to users who are trying to access a child directory
of the current path which happens to have the same name as a registered
scheme. The problems will consist of the stream failing because the URL
it's receiving is bogus. People in such a situation can use the file:
scheme explicitly to disambiguate (assuming they can't have a directory
whose name starts with a double slash!).
Keep in mind that Windows tends to treat / as \, and that \\ is
certainly a valid prefix for accessing remote servers over SMB and
also acts as the prefix for such things as Windows block device
access.
Alternately, on problem platforms, if the string is ambiguous, see if it
is well-formed as a file path. If it is, try it as such. If it's not, or
it fails, see if it starts with a registered scheme name and if so, try
that.
It would be easier to check if a string is a well-formed file path than
it is to check if it's a valid URL according to some arbitrary scheme
(impossible in general).
Really?
Can you guarantee that your user-space code to sniff out the path is
going to work 100% of the time on all platforms?
Assuming no-one tries to register a one-letter scheme, the Windows build
can get away with seeing if the "scheme" is only one letter long, and if
it is, assume that it's a drive letter.
It's not just 1 letter. There are 3 letter special device names too.
DISK$USER:[MY.NOTES]NOTE123456.TXT
on vms.host.edu to the URL
file://vms.host.edu/disk$user/my/notes/note12345.txt
and noting that to refer to the local machine the host part can be either
'localhost' or the empty string.
The file:// protocol is a load of rubbish, because it neglects to
specify how remote file access should work.
What do we really gain from this added complexity?
All I foresee is a bunch of butt-ugly consistency checks all over the
place, to handle platforms and cases where : has significance to the
filesystem. Not only that, but it will take a lot of trial and error
to make sure things are working properly.
Is there a clear win for PHP, that outweighs the strong risk of
breaking PHP until all the edge cases have been resolved?
--Wez.
Wez Furlong wrote:
Really?
Can you guarantee that your user-space code to sniff out the path is
going to work 100% of the time on all platforms?
Who said user-space? I meant in the implementation of fopen().
It's not just 1 letter. There are 3 letter special device names too.
Natch; yeah. You still can't create a directory named "aux". Or "com1".
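To show what a check for those device names might involve, here is a rough C sketch. The helper name is_dos_device_name is hypothetical, and the list of names (CON, PRN, AUX, NUL, COM1 to COM9, LPT1 to LPT9) is an assumption based on the commonly documented reserved names rather than anything stated in this thread.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if the first `len` characters of `s`
 * spell one of the classic DOS device names (case-insensitively),
 * 0 otherwise. The name list is an assumption, see the text above.
 */
int is_dos_device_name(const char *s, size_t len)
{
    static const char *const fixed[] = { "con", "prn", "aux", "nul" };
    char buf[5];
    size_t i;

    if (len < 3 || len > 4) {
        return 0;
    }
    /* Lower-case a private copy so the comparisons are case-insensitive. */
    for (i = 0; i < len; i++) {
        buf[i] = (char)tolower((unsigned char)s[i]);
    }
    buf[len] = '\0';

    for (i = 0; i < sizeof(fixed) / sizeof(fixed[0]); i++) {
        if (strcmp(buf, fixed[i]) == 0) {
            return 1;
        }
    }
    /* com1..com9 and lpt1..lpt9 */
    if (len == 4 && (strncmp(buf, "com", 3) == 0 || strncmp(buf, "lpt", 3) == 0)) {
        return buf[3] >= '1' && buf[3] <= '9';
    }
    return 0;
}
```

The caller would pass the candidate "scheme" and its length, so "aux:" with length 3 and "COM1:" with length 4 are both flagged, while "var" and "http" are not.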
The file:// protocol is a load of rubbish, because it neglects to
specify how remote file access should work.
It explicitly states as much.
"The file URL scheme is unusual in that it does not specify an Internet
protocol or access method for such files; as such, its utility in
network protocols between hosts is limited."
Is there a clear win for PHP, that outweighs the strong risk of
breaking PHP until all the edge cases have been resolved?
Probably not. It would have been easier to fix it earlier on before
kludging file paths and URLs together in the same namespace without
thinking through the consequences, but that opportunity is long lost. I
guess it will just have to be another case where we have to say "Screw
the standards. They're too difficult to implement." Ah well, it's only a
couple of characters. tel://+1-816-555-1212 it is, then. And explain to
users why some URLs need to be massaged (in user space) before PHP can
recognise them. It's not a bug, it's a feature.
MLO
Really?
Can you guarantee that your user-space code to sniff out the path is
going to work 100% of the time on all platforms?
Who said user-space? I meant in the implementation of fopen().
fopen() is implemented in user-space, meaning, not kernel-space.
Only the kernel knows what logic it will really really use to resolve
a valid path.
Emulating that code for each supported platform on which PHP runs is
plain stupid.
The file:// protocol is a load of rubbish, because it neglects to
specify how remote file access should work.
It explicitly states as much.
"The file URL scheme is unusual in that it does not specify an Internet
protocol or access method for such files; as such, its utility in
network protocols between hosts is limited."
s/limited/useless/
Is there a clear win for PHP, that outweighs the strong risk of
breaking PHP until all the edge cases have been resolved?
Probably not. It would have been easier to fix it earlier on before
kludging file paths and URLs together in the same namespace without
thinking through the consequences, but that opportunity is long lost.
You might regard it as a kludge, but it's actually a corner-stone of
PHP development.
I guess it will just have to be another case where we have to say "Screw
the standards. They're too difficult to implement." Ah well, it's only a
couple of characters.
I suggest that you go and re-read RFC 1738, section 3.1, Common
Internet Scheme Syntax. We support that, because the original
wrappers implementation was solely for "URL schemes that involve the
direct use of an IP-based protocol to a specified host on the
Internet".
If you read on further, section 3.5 states "The mailto URL scheme is
used to designate the Internet mailing address of an individual or
service. No additional information other than an Internet mailing
address is present or implied."
In other words, there is no defined mapping to a streaming data
source, which is what the wrappers layer in PHP is built for.
tel://+1-816-555-1212 it is, then. And explain to
users why some URLs need to be massaged (in user space) before PHP can
recognise them. It's not a bug, it's a feature.
Do you think that by trying to insult PHP you'll motivate the
developers to try and change it to the way you think it should work?
The bottom line is that by searching for ://, we can completely
unambiguously detect Common Internet Scheme Syntax URLs, and do so very
simply and very quickly, which is important in high-traffic
web applications.
--Wez.
Wez Furlong wrote:
Really?
Can you guarantee that your user-space code to sniff out the path is
going to work 100% of the time on all platforms?
Who said user-space? I meant in the implementation of fopen().
fopen() is implemented in user-space, meaning, not kernel-space.
Only the kernel knows what logic it will really really use to resolve
a valid path.
Emulating that code for each supported platform on which PHP runs is
plain stupid.
I still don't see why emulation code would be necessary. What happens
when fopen() is passed an invalid file path now? It's obviously capable
of coping with failure. What is so unimplementable with it coping with
its initial failure by going "Drat. Okay, might it be a URL? Looks like
it has a scheme... let's try it". If you insist, you could wrap that in
#ifdefs so that the code is only compiled on platforms where 'foo:'
might validly appear at the start of a file path.
if(the string might be a URL)
{
#ifdef FILEPATHS_CAN_START_WITH_SCHEMELIKE_SEQUENCE
    if(attempt to open as local file succeeds)
        return handle
    else
#endif
    if(scheme is registered && attempt to open as stream succeeds)
        return handle
    else
        return fail
}
else
{
    if(attempt to open as local file succeeds)
        return handle
    else
        return fail
}
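The ordering in that pseudocode can also be written as a small compilable C function that only plans which attempts to make, leaving the actual opening to the caller. All names here (enum attempt, plan_attempts, and the three flags) are hypothetical; the colon_paths_possible flag stands in for the #ifdef in the pseudocode.

```c
/* Hypothetical names; nothing here exists in PHP. */
enum attempt { TRY_FILE, TRY_WRAPPER };

/*
 * Fills `order` with the attempts to make, in sequence, and returns how
 * many there are (0 means fail immediately). The caller stops at the
 * first attempt that succeeds.
 */
int plan_attempts(int might_be_url, int colon_paths_possible,
                  int scheme_registered, enum attempt order[2])
{
    int n = 0;

    if (!might_be_url) {
        order[n++] = TRY_FILE;          /* plain path: file only */
        return n;
    }
    if (colon_paths_possible) {
        order[n++] = TRY_FILE;          /* "foo:" may start a valid path here */
    }
    if (scheme_registered) {
        order[n++] = TRY_WRAPPER;       /* then fall back to the stream wrapper */
    }
    return n;
}
```

On a platform where paths can begin with a scheme-like prefix, a name with a registered scheme is tried as a file first and as a stream second; elsewhere it goes straight to the wrapper. The sketch does not address the cost of touching the filesystem before deciding, which is the objection raised earlier in the thread.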
I suggest that you go and re-read RFC 1738, section 3.1, Common
Internet Scheme Syntax. We support that, because the original
wrappers implementation was solely for "URL schemes that involve the
direct use of an IP-based protocol to a specified host on the
Internet"....
In other words, there is no defined mapping to a streaming data
source, which is what the wrappers layer in PHP is built for.
Sorry, I didn't see any mention of that in the documentation. Just saw
"URL" and assumed that it referred to RFC 2396, which IIRC would have
been the current standard at the time. Nothing about direct use of an
IP-based protocol to a specified host (like a local variable or a
gzip'ed file?). If I could search back through the development
discussion list over the relevant time period I'd've found this?
In short, what you're talking about supporting is (what is now) Section
3.2 of RFC 3986. Thanks for clearing that up.
Do you think that by trying to insult PHP you'll motivate the
developers to try and change it to the way you think it should work?
Just characterising the sort of response I can imagine people not
married to PHP making when they first run into this. Hey, none of this
is the result of any decision of mine. So I wasn't aware of which
interpretation of "URL" you had chosen to use. Now that you've gotten
around to saying which (to me, if not to those other users I mentioned),
I've got my answer.
MLO