I'd like to solicit opinions on how we should treat conversion failures
during HTTP input decoding. There are two issues at hand: fallback
mechanism and application-driven decoding in case of failure. Let's
look at the proposal for the latter one first.
If the decoding of HTTP input fails (and the failure state would be
achieved as soon as even one variable fails), PHP should set an error
flag somewhere that is accessible to the user, via either a global
variable or a function. It should also keep the original request data
around (query string, POST body, and cookie data). The application
should be able to access this data, since the encoding can be passed in
the query string [1]. The application can then check this error flag
and then call a function -- request_decode() perhaps -- to ask PHP to
re-decode the request data based on a specific encoding. For
example:
if (request_decoding_failed()) {
    request_decode(request_get_raw('ei'));
}
We might be able to tie this in with the input filter, but that means
that the input filter will have to be required by PHP. I am open to
other suggestions in this area.
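To make the flow concrete, here is a rough userland sketch of the same idea using functions that exist today. It is not the proposed request_decode(): redecode_query() is a hypothetical helper, the allow-list of charsets is an assumption, the mbstring extension is assumed to be loaded, and UTF-8 stands in for the Unicode string type.

```php
<?php
// Hypothetical helper: re-decode a raw query string using the charset
// named in a hint parameter such as 'ei'. Assumes ext/mbstring.
function redecode_query(
    string $rawQuery,
    string $hintParam = 'ei',
    array $allowed = ['UTF-8', 'ISO-8859-1']
): array {
    // parse_str() percent-decodes values but does no charset conversion.
    parse_str($rawQuery, $raw);

    // Trust the hint only if it names a charset on the allow-list.
    $charset = 'UTF-8';
    if (isset($raw[$hintParam]) && is_string($raw[$hintParam])) {
        foreach ($allowed as $candidate) {
            if (strcasecmp($candidate, $raw[$hintParam]) === 0) {
                $charset = $candidate;
                break;
            }
        }
    }

    $decoded = [];
    foreach ($raw as $name => $value) {
        if (!is_string($value)) {
            continue; // nested arrays (a[]=1) are skipped in this sketch
        }
        // Convert only values that are valid in the chosen charset;
        // anything else is passed through as the original bytes.
        $decoded[$name] = mb_check_encoding($value, $charset)
            ? mb_convert_encoding($value, 'UTF-8', $charset)
            : $value;
    }
    return $decoded;
}
```

For instance, redecode_query('ei=ISO-8859-1&p=caf%E9') would hand back 'p' as the UTF-8 bytes for "café". Restricting the hint to an allow-list keeps a client from picking an arbitrary converter.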
As for the first issue, PHP attempts to decode the input using the
value of the unicode.output_encoding setting, because that is the most
logical choice if we assume that the clients send the data back in the
encoding that the page with the form was in. We could implement a
fallback mechanism where PHP looks at the Accept-Charset header sent by
the client[2]. This header is supposed to indicate what character sets
are acceptable for the response. While this is not the same as
specifying the character set of the request, it might be a good enough
indicator of it. Or we could simply set the error state and let
the application figure out what charset it wants to use for decoding.
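As an illustration of the Accept-Charset fallback, the sketch below turns the header into an ordered list of charsets to try. accept_charset_candidates() is a hypothetical name, and the parsing is deliberately simplified; it only follows the q-value rules from RFC 2616 and treats the result as a hint, not as the request's actual charset.

```php
<?php
// Hypothetical helper: list the concrete charsets named in an
// Accept-Charset header, most preferred first.
function accept_charset_candidates(string $header): array
{
    $weights = [];
    foreach (explode(',', $header) as $position => $item) {
        $parts = array_map('trim', explode(';', $item));
        $charset = strtolower(array_shift($parts));
        if ($charset === '' || $charset === '*') {
            continue; // a wildcard names no concrete charset to try
        }
        $q = 1.0; // RFC 2616: a missing q parameter means q=1
        foreach ($parts as $param) {
            if (preg_match('/^q\s*=\s*([0-9.]+)$/i', $param, $m)) {
                $q = (float) $m[1];
            }
        }
        if ($q > 0) { // q=0 means "not acceptable"
            $weights[$charset] = [$q, $position];
        }
    }
    // Highest q first; ties keep the order the client listed them in.
    uasort($weights, fn($a, $b) => [$b[0], $a[1]] <=> [$a[0], $b[1]]);
    return array_keys($weights);
}
```

A header such as "ISO-8859-1,utf-8;q=0.7,*;q=0.7" yields iso-8859-1 followed by utf-8. PHP could then attempt each candidate in turn and set the error state only if none of them decodes cleanly.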
Thanks for your attention.
-Andrei
[1] http://search.yahoo.com/search?ei=UTF-8&p=php
[2] http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
-----Original Message-----
From: Andrei Zmievski [mailto:andrei@gravitonic.com]
Sent: 22 June 2006 22:46
To: PHP Internals
Cc: PHP I18N
Subject: [PHP-DEV] RFC: Error handling in HTTP input decoding
[...snip...]
We could implement a fallback mechanism where PHP looks at the
Accept-Charset header sent by the client[2]. This header is supposed
to indicate what character sets are acceptable for the response.
[...snip...]
https://bugzilla.mozilla.org/show_bug.cgi?id=18643
This may be of interest: it's the kludge for determining form charsets,
added after the charset in the Content-Type header broke too much.
Rasmus and I talked about this some more yesterday, and I think there
is an alternate, better approach.
PHP will attempt to decode the incoming request data as described
below. The variables that it decodes successfully will be put into the
request arrays as Unicode strings, those that fail -- as binary
strings. We will still set a flag indicating that there were problems
during the conversion, but this way the user has access to the raw
input in case of failure. Since we will be pushing the usage of the
input filter extension, we should use it to access the request
parameters (instead of the proposed request_get_raw() function below).
The input filter extension always looks in the raw input data and not
in the request arrays, and input_get_arg() has a 'charset' parameter
that can be specified to tell PHP what charset the incoming data is in.
I think this way we kill two birds with one stone: we give people
access to the request array data on successful decoding, and we also
give them a standard and secure way to get at the data in case of
failed decoding.
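The partial-decoding behaviour can be sketched in userland as follows. This is only an approximation under stated assumptions: decode_request_vars() is a hypothetical name, the mbstring extension is assumed, UTF-8 strings stand in for Unicode strings, and untouched byte strings stand in for binary strings.

```php
<?php
// Hypothetical helper: decode each variable on its own. Values that are
// valid in $charset are converted; the rest are kept as raw bytes and
// the failure flag is raised.
function decode_request_vars(array $raw, string $charset): array
{
    $vars = [];
    $failed = false;
    foreach ($raw as $name => $bytes) {
        if (is_string($bytes) && mb_check_encoding($bytes, $charset)) {
            $vars[$name] = mb_convert_encoding($bytes, 'UTF-8', $charset);
        } else {
            $vars[$name] = $bytes; // left exactly as received
            $failed = true;        // one bad variable sets the flag
        }
    }
    return [$vars, $failed];
}
```

The caller gets every variable back either way and checks the flag to learn whether any of them still needs an explicit charset.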
Please comment.
-Andrei
I'd like to solicit opinions on how we should treat conversion
failures during HTTP input decoding. There are two issues at hand:
fallback mechanism and application-driven decoding in case of failure.
Let's look at the proposal for the latter one first.
If the decoding of HTTP input fails (and the failure state would be
achieved as soon as even one variable fails), PHP should set an error
flag somewhere that is accessible to the user, via either a global
variable or a function. It should also keep the original request data
around (query string, POST body, and cookie data). The application
should be able to access this data, since the encoding can be passed
in the query string [1]. The application can then check this error
flag and then call a function -- request_decode() perhaps -- to ask
PHP to re-decode the request data based on a specific encoding.
For example:
if (request_decoding_failed()) {
    request_decode(request_get_raw('ei'));
}
We might be able to tie this in with the input filter, but that means
that the input filter will have to be required by PHP. I am open to
other suggestions in this area.
As for the first issue, PHP attempts to decode the input using the
value of the unicode.output_encoding setting, because that is the most
logical choice if we assume that the clients send the data back in the
encoding that the page with the form was in. We could implement a
fallback mechanism where PHP looks at the Accept-Charset header sent
by the client[2]. This header is supposed to indicate what character
sets are acceptable for the response. While this is not the same as
specifying the character set of the request, it might be a good enough
indicator of it. Or we could simply set the error state and let
the application figure out what charset it wants to use for decoding.
Thanks for your attention.
-Andrei
[1] http://search.yahoo.com/search?ei=UTF-8&p=php
[2] http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
--
PHP Unicode & I18N Mailing List (http://www.php.net/)
Rasmus and I talked about this some more yesterday, and I think there is
an alternate, better approach.
[...snip...]
Love it, but we need a few fairly big warnings: "When unicode_semantics is
enabled, the $_SERVER['UNICODE_HTTP_INPUT_FAILURE'] variable MUST be checked
blah blah blah" put into the manual.
Partial decoding of the request body/string should go a long way towards
easing the issue for those 20% of cases where input decoding will just fail.
-Sara
P.S. - Has anyone considered offering up an RFC to IETF or W3C about adding
a header to the spec? Or just asking the nice Firefox folks to blaze the
trail with an X-header? PHP can't be the only web-language dealing with
this issue.