Hi everyone
A while ago I wrote an e-mail about RFC1867 (multipart/form-data) not
being parsed by PHP for non-POST requests.
https://externals.io/message/120641
I'd like to announce an RFC that proposes adding a new function called
parse_post_data() to expose the existing functionality to userland, so
that the mechanism can be used for other HTTP verbs.
https://wiki.php.net/rfc/rfc1867-non-post
As opposed to the semantics I suggested in the previous thread, this
proposal returns the parsed result instead of populating it directly
to the superglobals, and it accepts an optional input stream.
Let me know if you have any feedback.
Ilija
Hi
Regarding the cleanup of the files, perhaps the files could be read into
a php://temp
stream
(https://www.php.net/manual/en/wrappers.php.php#wrappers.php.memory)?
While this would cause the function to be incompatible with $_FILES, I
think it would make for a much nicer API and it would also automatically
solve the cleanup problem.
Best regards
Tim Düsterhus
Hi Tim
Regarding the cleanup of the files, perhaps the files could be read into
aphp://temp
stream
(https://www.php.net/manual/en/wrappers.php.php#wrappers.php.memory)?While this would cause the function to be incompatible with $_FILES, I
think it would make for a much nicer API and it would also automatically
solve the cleanup problem.
php://temp would solve auto-cleanup of files nicely. However, whether
they are easier to work with will depend on what you're doing with the
file. The most common action after a file uploads is arguably to move
it to a permanent location using move_uploaded_file()
. With a stream
the obvious way to achieve the same is stream_copy_to_stream()
.
However, as the stream already has a file backing (if big enough, at
least) this copy is unnecessary. Please correct me if there's
something I have missed.
I also would really like to avoid subtle differences between the
automatically and manually invoked files. Given that the overwhelming
majority will not use PHP with something like RoadRunner, I think it
makes more sense to add the special casing (i.e. deleting the files
manually) in the uncommon case, than for everybody else to adapt their
code for the uncommon case.
Ilija
Hi
file. The most common action after a file uploads is arguably to move
it to a permanent location usingmove_uploaded_file()
. With a stream
I'm not sure if this is actually the most common action, at least in
modern applications. Generally there is some kind of processing
pipeline. You might want to strip exif metadata from images, check the
file contents with a virus scanner, recompress images to save disk usage
and in the end upload it to some cloud storage service (e.g. S3).
the obvious way to achieve the same is
stream_copy_to_stream()
.
However, as the stream already has a file backing (if big enough, at
least) this copy is unnecessary. Please correct me if there's
something I have missed.
Especially with the "upload to S3" part, folks will generally already
work with a stream abstraction, e.g. PSR-7 together with a PSR-18 HTTP
client.
Flysystem is probably the most widely used file system abstraction,
being included in Laravel, and it also has a "writeStream" functionality:
https://flysystem.thephpleague.com/docs/usage/filesystem-api/
I also would really like to avoid subtle differences between the
automatically and manually invoked files. Given that the overwhelming
Yes, that's fair. However there's already some inconsistency, because to
my understanding you would need to check for the HTTP method to
determine if calling the function is safe or not, no? As an application
developer, it would be great if I could safely call this new function
also for POST requests.
Best regards
Tim Düsterhus
Regarding the cleanup of the files, perhaps the files could be read into a
php://temp
stream
(https://www.php.net/manual/en/wrappers.php.php#wrappers.php.memory)?
I don't think that reading potentially large files into memory is a
great idea. It is precisely because of the memory issue that PHP's
existing POST handler writes files to disk, instead of leaving their
content available through an entry in $_FILES.
cheers,
Derick
Hi
Regarding the cleanup of the files, perhaps the files could be read into a
php://temp
stream
(https://www.php.net/manual/en/wrappers.php.php#wrappers.php.memory)?I don't think that reading potentially large files into memory is a
great idea. It is precisely because of the memory issue that PHP's
existing POST handler writes files to disk, instead of leaving their
content available through an entry in $_FILES.
To clarify: php://temp
spills over to disk at a certain size (2 MB by
default).
Best regards
Tim Düsterhus
Hi,
It should probably explicitly mention that it uses the same inis like
max_input_vars, max_file_uploads and max_multipart_body_parts.
It's kind of strange function as I can't decide where it should be placed.
I think it might be better as a stream function if it accepts only stream.
It means it could go to stream funcs and be called stream_parse_post_data
instead but not sure about. But not 100% sure about it as it doesn't
exactly fit there. But seems better than html functions (where it's placed
in the current PR) as it has nothing to do with html IMHO.
Cheers
Jakub
Regards
Jakub
Hi,
It should probably explicitly mention that it uses the same inis like
max_input_vars, max_file_uploads and max_multipart_body_parts.It's kind of strange function as I can't decide where it should be placed.
I think it might be better as a stream function if it accepts only stream.
It means it could go to stream funcs and be called stream_parse_post_data
instead but not sure about. But not 100% sure about it as it doesn't
exactly fit there. But seems better than html functions (where it's placed
in the current PR) as it has nothing to do with html IMHO.
Maybe it should go in main/rfc1867.c? That's where the
php_rfc1867_callback lives, and this seems somewhat related to that.
Cheers,
Ben
Hi Jakub
It should probably explicitly mention that it uses the same inis like max_input_vars, max_file_uploads and max_multipart_body_parts.
Indeed, I will mention that. Thank you.
It's kind of strange function as I can't decide where it should be placed. I think it might be better as a stream function if it accepts only stream. It means it could go to stream funcs and be called stream_parse_post_data instead but not sure about. But not 100% sure about it as it doesn't exactly fit there. But seems better than html functions (where it's placed in the current PR) as it has nothing to do with html IMHO.
TBH I have no idea how it landed there. When I created the PoC I just
threw it in somewhere, but it's indeed an odd place. I'll try to find
something better.
I don't think stream is a great place, as the common case of not
providing an input does not operate on streams, but on
sapi_module.read_post() directly.
I also don't think rfc1867.c is a great place as Ben suggested,
because the invoked function is actually agnostic to the exact format.
Instead, we use the existing functionality that chooses the parser
based on the content type, which includes
application/x-www-form-urlencoded.
Ilija
Hi
It should probably explicitly mention that it uses the same inis like
max_input_vars, max_file_uploads and max_multipart_body_parts.
That reminds me of this thread:
https://externals.io/message/118614
I'd love to see some functionality to handle those "input limits" in a
well-defined way (e.g. to be able to render a nice error page within my
application), but this is currently not (easily) possible, because the
error is emitted before my script boots. I'm currently using the
error_get_last()
workaround mentioned in that thread, but that is
everything but great.
Best regards
Tim Düsterhus
Hi Marco
Please note that you have accidentally created a new thread. I'm
responding from the main thread.
Just wanted to mention that maybe this is a great opportunity to create a request_ family and start with request_parse_post_data
Something like request_parse_body() could work. That should satisfy
both your and Michałs request. This naming extends nicely to other
formats if added later on.
This reminded me that it should be specified what happens when the
format is not supported (i.e. anything but multipart or urlencoded
formats). The automatically invoked behavior does nothing and leaves
the input stream unconsumed so that the PHP script can process it. For
the explicitly invoked version we should throw an exception to inform
the user that nothing happened.
Ilija
Hi everyone
Thank you for the feedback so far. I made a handful of changes to the RFC.
- The function is renamed to request_parse_body()
- The function will now throw instead of emitting warnings when hitting limits
- The Configuration section was added show how parsing limits may be
modified per endpoint - The php://input is explained better in relation to multipart parsing
Let me know if you have any more feedback.
Ilija
Hi Ilija,
pt., 13 paź 2023, 13:15 użytkownik Ilija Tovilo tovilo.ilija@gmail.com
napisał:
Hi everyone
On Fri, Oct 6, 2023 at 3:44 PM Ilija Tovilo tovilo.ilija@gmail.com
wrote:Thank you for the feedback so far. I made a handful of changes to the RFC.
- The function is renamed to request_parse_body()
- The function will now throw instead of emitting warnings when hitting
limits- The Configuration section was added show how parsing limits may be
modified per endpoint- The php://input is explained better in relation to multipart parsing
Let me know if you have any more feedback.
Considering the function supports two formats multipart/form-data or
the application/x-www-form-urlencodedand
the fact that the return type and required arguments differ in regards to
format: content-type needed for multipart which returns two arrays packed
into return array aka $_POST and $_FILES and the boundary that is part of
content-type header and the other format that doesn't require content-type
header and returns effectively only one array aka $_POST why not having two
separate functions?
The proposal is to include two separate functions that have clear semantics
and prevent from invoking multipart without boundary or only form with
boundary:
- request_parse_body($input_stream): array - returning just $_POST
and maybe - request_parse_multupart_body($input_stream, string $boundary): array
returning as proposed now two arrays packed $_POST and $_FILES
Why the $inpit_stream accepts null - does it make sense to invoke the
function then?
Hi
Considering the function supports two formats multipart/form-data or
the application/x-www-form-urlencodedand
the fact that the return type and required arguments differ in regards to
format: content-type needed for multipart which returns two arrays packed
into return array aka $_POST and $_FILES and the boundary that is part of
content-type header and the other format that doesn't require content-type
header and returns effectively only one array aka $_POST why not having two
separate functions?
Because then you can't just
[$fields, $files] = request_parse_body($request->getBody(),
$request->getHeader('content-type'));
$request = $request->withParsedBody($fields);
(using a PSR-7 ServerRequestInterface object for the example)
and instead would need to manually parse the 'content-type' header
instead of leaving the heavy lifting to PHP.
The proposal is to include two separate functions that have clear semantics
and prevent from invoking multipart without boundary or only form with
boundary:
- request_parse_body($input_stream): array - returning just $_POST
and maybe- request_parse_multupart_body($input_stream, string $boundary): array
returning as proposed now two arrays packed $_POST and $_FILESWhy the $inpit_stream accepts null - does it make sense to invoke the
function then?
Yes, because it effectively defaults to an optimized version of
'php://input'.
Best regards
Tim Düsterhus
Thank you for the feedback so far. I made a handful of changes to the RFC.
…
Let me know if you have any more feedback.
The only comment I would have is that I probably would be in favour of
not leaving the "config" argument (to over ride per call the
post_max_size settings etc) to a future scope.
Having to do ini_get/ini_set/ini_set(old) to override these settings
seems clunky.
cheers,
Derick
Hi Derick
The only comment I would have is that I probably would be in favour of
not leaving the "config" argument (to over ride per call the
post_max_size settings etc) to a future scope.Having to do ini_get/ini_set/ini_set(old) to override these settings
seems clunky.
As per request I've included an $options parameter to override the
relevant INI values for the duration of the function call.
Ilija
Hi everyone
Thank you for the feedback so far. I made a handful of changes to the RFC.
- The function is renamed to request_parse_body()
- The function will now throw instead of emitting warnings when hitting limits
- The Configuration section was added show how parsing limits may be
modified per endpoint- The php://input is explained better in relation to multipart parsing
Let me know if you have any more feedback.
Ilija
The functionality all seems reasonable to me. I have a few smaller concerns:
-
Like Derick, I think I'd favor including the config overrides now.
-
Lots of request bodies are not forms these days; they're frequently JSON or GraphQL. This function would be useless in those cases; that's fine, but should the name then suggest that it's for form data only? request_parse_form() or similar? I'm just concerned about misleading people into thinking it can parse their JSON bodies, when that's not a thing. (Unless we wanted to provide some kind of callback mechanism, which is probably overkill here.)
-
For an unsupported mime type, I'd recommend a more specific exception than InvalidArgumentException. Give that a custom sub-class that tracks what the actual mime type was that the request had and it rejected.
-
I don't quite grok the "input" section. So if I don't disable the automatic parsing, does that mean request_parse_body() will always fail? Or will it still work, but just be more memory-wasteful? That's not clear to me; I'd prefer if it works but is just memory-wasteful, personally, as that would be more portable for projects that want to use it.
--Larry Garfield
Hi Larry
The functionality all seems reasonable to me. I have a few smaller concerns:
- Like Derick, I think I'd favor including the config overrides now.
I will check if this can be implemented without too many changes.
- Lots of request bodies are not forms these days; they're frequently JSON or GraphQL. This function would be useless in those cases; that's fine, but should the name then suggest that it's for form data only? request_parse_form() or similar? I'm just concerned about misleading people into thinking it can parse their JSON bodies, when that's not a thing. (Unless we wanted to provide some kind of callback mechanism, which is probably overkill here.)
request_parse_body() is indeed not aimed at other content types as-is.
A generic name would allow extending the function to support other
content types in the future, although it's currently unclear whether
that's desirable. E.g. for JSON, people might be confused why there's
a file index in the returned array that is always empty.
- For an unsupported mime type, I'd recommend a more specific exception than InvalidArgumentException. Give that a custom sub-class that tracks what the actual mime type was that the request had and it rejected.
A custom exception class sounds reasonable. The mime-type is contained
in the exception message.
- I don't quite grok the "input" section. So if I don't disable the automatic parsing, does that mean request_parse_body() will always fail? Or will it still work, but just be more memory-wasteful? That's not clear to me; I'd prefer if it works but is just memory-wasteful, personally, as that would be more portable for projects that want to use it.
Whether request_parse_body() can work repeatedly depends on whether
the input has been buffered. application/x-www-form-urlencoded is
buffered and as such works multiple times. multipart/form-data can
be buffered if done manually by opening php://input and reading the
whole input before calling request_parse_body(). Yes, for post this
means one needs to disable enable_post_data_reading.
Since files are stored on disk, buffering files means doubling disk
load. Uploading a 2GB file would require at least 4GB of disk space. I
don't think that's a good trade-off as there shouldn't be a reason to
call this function twice.
Ilija
- Lots of request bodies are not forms these days; they're frequently JSON or GraphQL. This function would be useless in those cases; that's fine, but should the name then suggest that it's for form data only? request_parse_form() or similar? I'm just concerned about misleading people into thinking it can parse their JSON bodies, when that's not a thing. (Unless we wanted to provide some kind of callback mechanism, which is probably overkill here.)
request_parse_body() is indeed not aimed at other content types as-is.
A generic name would allow extending the function to support other
content types in the future, although it's currently unclear whether
that's desirable. E.g. for JSON, people might be confused why there's
a file index in the returned array that is always empty.
Is extending it to other mime types in the future the intent? That's not mentioned in future scope, IIRC. That would be an interesting conversation to have, though probably a distraction from the current scope.
- For an unsupported mime type, I'd recommend a more specific exception than InvalidArgumentException. Give that a custom sub-class that tracks what the actual mime type was that the request had and it rejected.
A custom exception class sounds reasonable. The mime-type is contained
in the exception message.
This is a pet peeve of mine in exception design. With a common exception, the catching code can do nothing other than log the message wholesale. It doesn't have any context beyond the exception type, which is often insufficient. I've had cases where I needed to sscanf()
a PHP exception message in order to get the data out of it that i needed in order to handle it correctly. That's... just kinda gross.
Now that we have readonly properties, any variable part of the message really should be exposed as either a readonly property of the exception, or via a method. One or the other, but it's inadequate to only provide that information in a pre-formatted string that requires fugly parsing.
- I don't quite grok the "input" section. So if I don't disable the automatic parsing, does that mean request_parse_body() will always fail? Or will it still work, but just be more memory-wasteful? That's not clear to me; I'd prefer if it works but is just memory-wasteful, personally, as that would be more portable for projects that want to use it.
Whether request_parse_body() can work repeatedly depends on whether
the input has been buffered. application/x-www-form-urlencoded is
buffered and as such works multiple times. multipart/form-data can
be buffered if done manually by opening php://input and reading the
whole input before calling request_parse_body(). Yes, for post this
means one needs to disable enable_post_data_reading.
Please clarify that explicitly in the RFC, so hopefully we remember to document that very very clearly when the docs are updated later.
--Larry Garfield
Hi everyone
I'd like to announce an RFC that proposes adding a new function called
parse_post_data() to expose the existing functionality to userland, so
that the mechanism can be used for other HTTP verbs.
I took a closer look at RoadRunner regarding file uploads and noticed
that $input_stream will not be useful to it after all. RoadRunner
handles files directly in its Go server by parsing the multipart body,
storing any files to disk, and only transferring the file handles
(along with any post data) to the PHP workers. New SAPIs could instead
tweak sapi_module.read_post() when handling a new request. We can add
these parameters if somebody can present a valid use-case, for now I
opted to remove the $input_stream and $content_type parameters.
Please let me know if you have any more feedback. I will wait at least
2 weeks before going forward with a vote, as this is a bigger change
to the RFC.
Ilija
Hey, I'm not sure if this is bikeshedding, but the concept of parsing
bodies for non-POST requests lands really close to a proposal for adding a
QUERY method to the HTTP standard.
Draft:
https://datatracker.ietf.org/doc/html/draft-ietf-httpbis-safe-method-w-body
Discussion:
https://github.com/httpwg/http-extensions/labels/safe-method-w-body
It's meant to address the recent need for complex querying (GraphQL /
Elastic Search) that necessitates using POST but loses the default caching
of GET.
I think this RFC could serve as the groundwork for supporting QUERY if it's
extended to other MIME types in the future as Larry suggested. But QUERY
probably still has years to go before there is a consensus on it (I think
it's been talked about for 6+ years now)
Hi everyone
On Fri, Oct 6, 2023 at 3:44 PM Ilija Tovilo tovilo.ilija@gmail.com
wrote:I'd like to announce an RFC that proposes adding a new function called
parse_post_data() to expose the existing functionality to userland, so
that the mechanism can be used for other HTTP verbs.I took a closer look at RoadRunner regarding file uploads and noticed
that $input_stream will not be useful to it after all. RoadRunner
handles files directly in its Go server by parsing the multipart body,
storing any files to disk, and only transferring the file handles
(along with any post data) to the PHP workers. New SAPIs could instead
tweak sapi_module.read_post() when handling a new request. We can add
these parameters if somebody can present a valid use-case, for now I
opted to remove the $input_stream and $content_type parameters.Please let me know if you have any more feedback. I will wait at least
2 weeks before going forward with a vote, as this is a bigger change
to the RFC.Ilija
--
To unsubscribe, visit: https://www.php.net/unsub.php
Hi Sam
Hey, I'm not sure if this is bikeshedding, but the concept of parsing bodies for non-POST requests lands really close to a proposal for adding a QUERY method to the HTTP standard.
Draft: https://datatracker.ietf.org/doc/html/draft-ietf-httpbis-safe-method-w-body
Discussion: https://github.com/httpwg/http-extensions/labels/safe-method-w-bodyIt's meant to address the recent need for complex querying (GraphQL / Elastic Search) that necessitates using POST but loses the default caching of GET.
I think this RFC could serve as the groundwork for supporting QUERY if it's extended to other MIME types in the future as Larry suggested. But QUERY probably still has years to go before there is a consensus on it (I think it's been talked about for 6+ years now)
Looking at the RFC, it doesn't seem like multipart is an intended
format for QUERY requests. application/x-www-form-urlencoded is
intended and should work as-is with request_parse_body(). Parsing
anything other than multipart and form-data is possible, but not at
all related to QUERY (and IMO not desirable).
You can implement a QUERY endpoint in PHP today (if your web server
supports it) by reading from php://input. It's worth noting that PHPs
built-in server does not support custom HTTP methods and will return a
501.
I don't think there's anything actionable here. Please clarify if I'm
missing something.
Ilija
Hi
I love this RFC so we can use it from CLI-SAPI
We can't use sapi_module.read_post() from CLI.
https://github.com/joanhey/AdapterMan
This runtime use the CLI-SAPI, but this SAPI is very limited. We can use
parse_str()
easily for 'application/x-www-form-urlencoded' but we need
to replicate in userland for 'multipart/form-data'.
https://github.com/joanhey/AdapterMan/blob/master/src/Http.php#L410-L416
https://github.com/joanhey/AdapterMan/blob/master/src/ParseMultipart.php
The CLI sapi need more access to internal PHP functions.
Amp, React, Revolt, Workerman, Adapterman, ... use only CLI-SAPI.
Also I created a issue https://github.com/php/php-src/issues/12304
Because the headers functions don't work in CLI. We can override that
functions, but later we need an spiral of changes: setcookie, Session,
http_response_code, ...
Best regards
Joan Miquel