Hello,
Yesterday, Ilia, Andrei and I discussed possible solutions to the
input encoding problem in PHP 6 (Unicode). I will try to describe them here.
I will not go too deep into the details; the goal is to choose one
solution and then propose a patch to test. Our preference is
solution #2.
--
Solution #1:
The idea here is to detect the encoding, then decode and register the
variables during request initialization (before the script gets control).
Apart from the encoding detection, this is how the current implementation
works (all PHP versions).
- Init
  - parse the request into an array
  - locate the charset or use unicode.request_encoding
  - filter/decode/register the variables as is done now
- Runtime
  Just like now, the auto_globals (with or without JIT) are declared and
  ready to be used.
This solution has one advantage: it requires only a few changes in
the engine. The request processing functions need to be changed
to detect the encoding.
The main disadvantages are:
- the lack of flexibility: the encoding must be set before the script gets
  control, using vhost config or htaccess
- a bad encoding detection will force the user to manually
  parse the raw request (when available).
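
A minimal sketch of the init-time flow of Solution #1, written in Python as a model and not as engine code. The helper names `parse_raw` and `init_request` are hypothetical, and the detected charset is simply passed in as an argument. It shows the second disadvantage: once detection guesses wrong, the script only ever sees the damaged value.

```python
from urllib.parse import unquote_to_bytes


def parse_raw(query: bytes) -> dict:
    """Split a raw query string into undecoded name/value byte pairs."""
    pairs = {}
    for field in query.split(b"&"):
        if not field:
            continue
        name, _, value = field.partition(b"=")
        name = unquote_to_bytes(name.replace(b"+", b" "))
        value = unquote_to_bytes(value.replace(b"+", b" "))
        pairs[name] = value
    return pairs


def init_request(query: bytes, detected_charset: str) -> dict:
    """Model of Solution #1: decode and register everything before the script runs."""
    return {
        name.decode(detected_charset, errors="replace"):
            value.decode(detected_charset, errors="replace")
        for name, value in parse_raw(query).items()
    }


# The client sent ISO-8859-1 bytes, but detection guessed UTF-8.
GET = init_request(b"name=caf%E9", detected_charset="utf-8")
print(repr(GET["name"]))  # the last character has become U+FFFD
```

By the time the script runs, the only way to recover the original value is to go back to the raw request.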
Solution #2: add (true) JIT support for GET/POST/COOKIE/...
Instead of doing all the processing during the init phase, it will be
done on demand, at runtime, when an input variable is requested.
- Init
  - don't parse the request but simply store it for later processing
- Runtime
  - when an input variable is fetched:
    - the encoding is defined using unicode.request_encoding
    - filter/decode/register the complete array (post, get, ...)
The way JIT works has to be changed. It has to process the data
at runtime instead of registering it at compile time. This is the only
way to be sure that the user has set the input encoding correctly
(or has had the opportunity to set it).
The main advantage of this solution is the absence of magic for
the user. The encoding detection can be checked and/or set in time
by the user before the input processing; it is safe and flexible.
I would also suggest adding a function, filter_input_encoding($type), to
define the encoding type at runtime instead of using ini_set (which is
often disabled).
There are no real technical disadvantages, but it requires more work and
changes in the engine. These changes will also bring some
performance improvements (if (0) $t = $_ENV['foo']; will not trigger
JIT).
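
The runtime flow of Solution #2 can be modelled roughly as follows. This is a Python sketch under stated assumptions, not the proposed engine patch: the class `LazyInput`, its `set_encoding` method (standing in for the suggested filter_input_encoding()) and the `decode_runs` counter are all hypothetical names introduced for illustration.

```python
from urllib.parse import unquote_to_bytes


class LazyInput:
    """Model of Solution #2: keep the raw request, decode on first access."""

    def __init__(self, raw_query: bytes, default_encoding: str = "utf-8"):
        self._raw = raw_query               # Init: stored, not processed
        self._encoding = default_encoding   # stands in for unicode.request_encoding
        self._decoded = None                # filled by the first fetch
        self.decode_runs = 0                # counts how often decoding happened

    def set_encoding(self, encoding: str) -> None:
        """Hypothetical runtime setter, in the spirit of filter_input_encoding()."""
        if self._decoded is not None:
            raise RuntimeError("input has already been decoded")
        self._encoding = encoding

    def _decode_all(self) -> dict:
        self.decode_runs += 1
        values = {}
        for field in self._raw.split(b"&"):
            if not field:
                continue
            name, _, value = field.partition(b"=")
            name = unquote_to_bytes(name.replace(b"+", b" "))
            value = unquote_to_bytes(value.replace(b"+", b" "))
            values[name.decode(self._encoding)] = value.decode(self._encoding)
        return values

    def __getitem__(self, name: str) -> str:
        if self._decoded is None:           # Runtime: first fetch decodes the whole array
            self._decoded = self._decode_all()
        return self._decoded[name]


GET = LazyInput(b"name=caf%E9&city=K%F6ln")
GET.set_encoding("iso-8859-1")   # the script fixes the encoding before any fetch
print(GET["name"], GET["city"])  # both fetches share a single decoding pass
```

If the script never touches the input, `_decode_all` never runs, which is the performance point made about `if (0) $t = $_ENV['foo'];`.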
--
I would like to hear your ideas, opinions and comments, especially
about the possible changes in the engine. Feel free to ask for more
details if my explanations are unclear :)
Regards,
--Pierre
> The main disadvantages are:
> - the lack of flexibility: the encoding must be set before the script gets
>   control, using vhost config or htaccess
> - a bad encoding detection will force the user to manually
>   parse the raw request (when available).
Also:
- no way to issue an error if conversion fails except by setting a
  flag that has to be retrieved with a function
- much harder to get to the charset if it's at the end of the request
> - Init
>   - don't parse the request but simply store it for later processing
We can still parse the data, we just can't decode it. Parsing would
populate arrays (internal or otherwise) with the binary data that can
later be decoded and filtered in JIT fashion.
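
To illustrate the distinction between parsing and decoding, here is a small Python sketch (a model only; `parse_binary` is a hypothetical helper, not an engine function). Parsing splits the request and undoes the URL escaping but leaves every name and value as raw bytes, so the choice of character encoding can still be made later.

```python
from urllib.parse import unquote_to_bytes


def parse_binary(raw: bytes) -> dict:
    """Parse a query string into name/value pairs without choosing an encoding."""
    fields = {}
    for field in raw.split(b"&"):
        if not field:
            continue
        name, _, value = field.partition(b"=")
        name = unquote_to_bytes(name.replace(b"+", b" "))
        value = unquote_to_bytes(value.replace(b"+", b" "))
        fields[name] = value
    return fields


# Init phase: structure is known, content is still binary.
raw_get = parse_binary(b"q=na%C3%AFve+text&page=2")

# Later, in JIT fashion, once the encoding has been settled:
decoded = {k.decode("utf-8"): v.decode("utf-8") for k, v in raw_get.items()}
print(decoded["q"])
```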
> - Runtime
>   - when an input variable is fetched:
>     - the encoding is defined using unicode.request_encoding
Or via charset, or provided by the user, etc.
> The main advantage of this solution is the absence of magic for
> the user. The encoding detection can be checked and/or set in time
> by the user before the input processing; it is safe and flexible.
And we can issue errors in a consistent fashion.
> There are no real technical disadvantages, but it requires more work and
> changes in the engine. These changes will also bring some
> performance improvements (if (0) $t = $_ENV['foo']; will not trigger
> JIT).
I guess we need to know how hard it would be to implement runtime JIT
for GET/POST/COOKIE registration.
-Andrei
>> The main disadvantages are:
>> - the lack of flexibility: the encoding must be set before the script gets
>>   control, using vhost config or htaccess
>> - a bad encoding detection will force the user to manually
>>   parse the raw request (when available).
> Also:
> - no way to issue an error if conversion fails except by setting a
>   flag that has to be retrieved with a function
> - much harder to get to the charset if it's at the end of the request
>> - Init
>>   - don't parse the request but simply store it for later processing
> We can still parse the data, we just can't decode it. Parsing would
> populate arrays (internal or otherwise) with the binary data that can
> later be decoded and filtered in JIT fashion.
I like the lazy approach, if the data is not used, we do nothing.
>> There are no real technical disadvantages, but it requires more work and
>> changes in the engine. These changes will also bring some
>> performance improvements (if (0) $t = $_ENV['foo']; will not trigger
>> JIT).
> I guess we need to know how hard it would be to implement runtime JIT
> for GET/POST/COOKIE registration.
The infrastructure exists already. Adding get/post/cookie to the JIT
auto globals is not difficult; the work is in changing how JIT operates.
We need to move it to runtime (fetch time), which is not hard but will need
good tests :)
--Pierre
#2 feels more right to me, fwiw...
In my limited experience, there are so many funky things out there
sending wacko charset headers that don't match reality that not
letting the application developer set up the conversion will just
end up as a nightmare for developers.
--
Some people have a "gift" link here.
Know what I want?
I want you to buy a CD from some starving artist.
http://cdbaby.com/browse/from/lynch
Yeah, I get a buck. So?
I think #2 is better than #1.
The current implementation of mbstring is based on a solution similar
to #1. It is simple and stable, but #2 is more flexible.
Rui
On Thu, 14 Dec 2006 21:59:44 +0100
Pierre pierre.php@gmail.com wrote:
> [...]
--
Rui Hirokawa <rui_hirokawa@ybb.ne.jp>
The main issue, as I already discussed with Andrei (sorry, our
discussions are stealth since I see him almost every day even though I
try hard to avoid him) is how we handle encoding errors if we jit at
runtime and process the entire array at that time. I agree that this is
architecturally the right approach, but if someone injects some bogus
GET data, for example, even though the app doesn't even try to access
it, it is going to be encoded when the app tries to get at the first GET
arg and at that point there would be an error if that extra GET data was
bogus.
We obviously don't want it to be possible to arbitrarily create errors
like that, but at the same time it needs to be possible for the
application to discover encoding errors. So we probably need to make
the error handling pretty smart. For example, treat an error in encoding
the actual entry they are trying to access as more serious than an error
in encoding another element that just happened to be encoded at that point.
Then later, if they try to access a previously encoded element that
had an error, throw the more serious error at that point. Or something
along those lines.
I suppose we could also jit right down to the single element level and
not actually do the entire array on the first access to that GPC array.
-Rasmus
Pursuant to an IRC discussion with Rasmus:
It seems to me that in order to do any sort of error differentiation
we need to have variable-level JIT decoding/filtering. It needs to
be smart though, because we want to issue errors only on the first
access to the variable. One way to approach this would be to
decode/filter the $_POST['foo'] value when it's accessed and then replace
$_POST['foo'] with this filtered result so that the next access
gets the value directly, without invoking the JIT mechanism.
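
A rough Python model of this variable-level approach, under stated assumptions: the class `JitArray` and its `warnings` list are hypothetical, and a failed decode is represented here by `None`. Each element is decoded on its first access and the result replaces the raw entry, so a later access neither decodes again nor reports the error again, and elements the script never touches cannot raise anything.

```python
class JitArray:
    """Decode each element on first access, then serve the cached result."""

    def __init__(self, raw: dict, encoding: str = "utf-8"):
        self._raw = dict(raw)    # name -> undecoded bytes
        self._values = {}        # name -> decoded str, or None after a failure
        self._encoding = encoding
        self.warnings = []       # one entry per element that failed to decode

    def __getitem__(self, name: str):
        if name in self._values:         # already processed: no JIT work, no new error
            return self._values[name]
        raw = self._raw.pop(name)        # KeyError for names that were never sent
        try:
            value = raw.decode(self._encoding)
        except UnicodeDecodeError as exc:
            self.warnings.append(f"{name}: {exc.reason}")
            value = None
        self._values[name] = value       # replace the raw entry with the result
        return value


post = JitArray({"foo": b"ok", "bar": b"\xff\xfe", "unused": b"\xff"})
print(post["foo"])         # decodes cleanly, nothing reported
print(post["bar"])         # fails once, reported on this first access
print(post["bar"])         # served from the cache, not reported again
print(len(post.warnings))  # the injected "unused" value never caused an error
```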
-Andrei
Hello,
> Pursuant to an IRC discussion with Rasmus:
> It seems to me that in order to do any sort of error differentiation
> we need to have variable-level JIT decoding/filtering. It needs to
> be smart though, because we want to issue errors only on the first
> access to the variable. One way to approach this would be to
> decode/filter the $_POST['foo'] value when it's accessed and then replace
> $_POST['foo'] with this filtered result so that the next access
> gets the value directly, without invoking the JIT mechanism.
I'm not sure it is worth the effort given the possible problems with
things like foreach. Is it possible to add such hooks, for example to
catch an array element access (on an auto global)?
This solution looks nice but I'm unsure about its feasibility or
complexity (over-designed?).
My initial thought was to decode the GPC arrays (env and server can use
this rule as well) on the first access, no matter whether the access is
for one index ($_GET['a']) or for the complete array ($a = $_POST).
The "stop" Unicode error mode will be used during the decoding phase (see
README.UNICODE for the error mode explanation). If an error occurred,
it will be stored and can be fetched using an extra
function, like:
array = input_decoding_error($type);
where $type is one of the GPC filter constants and the returned value
is an array with the input name/error as key/value pairs.
This approach will keep the JIT system simple while having enough
flexibility. If an error occurred, it is easy to see which variable was
affected. One does not even need to check it until the
input decoding process is done. It will also work nicely with ext/filter: if a
validation failed due to the decoding, the error can be fetched using
this function.
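
The whole-array variant with a stored error list could look roughly like this. It is a Python sketch, not the proposed implementation: the `Request` class, the `fetch` method and the `INPUT_GET` placeholder value are assumptions made for illustration, and only the name `input_decoding_error` comes from the proposal above. A value that fails to decode is simply left unregistered in this model.

```python
INPUT_GET = "get"  # placeholder for one of the GPC filter constants


class Request:
    """Decode a whole input array on first access and remember the failures."""

    def __init__(self, raw_inputs: dict, encoding: str = "utf-8"):
        self._raw = raw_inputs   # input type -> {name: undecoded bytes}
        self._decoded = {}       # input type -> {name: str}
        self._errors = {}        # input type -> {name: error text}
        self._encoding = encoding

    def _decode(self, input_type: str) -> None:
        values, errors = {}, {}
        for name, raw in self._raw[input_type].items():
            try:
                values[name] = raw.decode(self._encoding)
            except UnicodeDecodeError as exc:
                errors[name] = exc.reason    # stored, not raised
        self._decoded[input_type] = values
        self._errors[input_type] = errors

    def fetch(self, input_type: str, name: str):
        if input_type not in self._decoded:  # first access decodes the whole array
            self._decode(input_type)
        return self._decoded[input_type].get(name)

    def input_decoding_error(self, input_type: str) -> dict:
        """Return name/error pairs for the given input type."""
        return dict(self._errors.get(input_type, {}))


req = Request({INPUT_GET: {"a": b"1", "junk": b"\xff"}})
print(req.fetch(INPUT_GET, "a"))                  # no notice or warning is emitted
print(sorted(req.input_decoding_error(INPUT_GET)))  # names of the affected inputs
```

The script decides when, and whether, to look at the errors, which is the point of keeping them out of the normal notice/warning channel.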
How does it sound?
I would also like to hear ideas from other people, since they wrote the
JIT part or know the limitations of the engine better (for the element
access hook): Zeev, Andi, Dmitry? :)
--Pierre
Here are a couple of notes about both ideas.
> One does not even need to check it until the
> input decoding process is done. It will also work nicely with ext/filter: if a
> validation failed due to the decoding, the error can be fetched using
> this function.
Or with ext/filter, we can even make it available directly:
$myvalues = input_filter_array(INPUT_GET, $mydefinition, $decoding_errors);
Another question I have about the per-element JIT is about the error
reporting. I think we need the extra function anyway; I cannot imagine
having notices/warnings while checking user input (when I use the
language correctly but the input is invalid).
--Pierre
> I suppose we could also jit right down to the single element level and
> not actually do the entire array on the first access to that GPC
> array.
That sounds nifty to this naive reader...
Paranoid Question:
I suppose it's conceivable that somehow, some way, somebody could
abuse this to "pass on" corrupt data from web app to web app, since it
never gets accessed by the JIT decoder... Is that opening up a hole
that PHP needs to worry about?...
Also, I suspect some users would want an up-front "validate all"
function to force the JIT at the top of a web app, especially in
development where it's easier to debug if you know that your inputs
actually match what you think they are.
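
Such an up-front check might be sketched as below. This is a hypothetical Python helper (`validate_all` is not a function anyone in the thread proposed by name); it tries to decode every raw value immediately and returns the failures, which also addresses the worry about undecoded data slipping through unnoticed.

```python
def validate_all(raw: dict, encoding: str = "utf-8") -> dict:
    """Try to decode every raw input now and return name -> error description."""
    errors = {}
    for name, value in raw.items():
        try:
            value.decode(encoding)
        except UnicodeDecodeError as exc:
            errors[name] = f"{exc.reason} at byte {exc.start}"
    return errors


problems = validate_all({"id": b"42", "note": b"ok \xff"})
print(sorted(problems))  # names of the inputs that are not valid in the chosen encoding
```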
The second solution has a lot of advantages. And I think it is better even
if it will be a little bit slower.
I think it can be implemented using overloaded arrays.
Thanks. Dmitry.
-----Original Message-----
From: Pierre [mailto:pierre.php@gmail.com]
Sent: Friday, December 15, 2006 12:00 AM
To: PHP internals
Cc: Zeev Suraski; Andi Gutmans; Dmitry Stogov; Rasmus
Lerdorf; Ilia Alshanetsky
Subject: [PHP-DEV] php6: input encoding, filter and making
JIT really JIT
Hello,
> The second solution has a lot of advantages. And I think it is better even
> if it will be a little bit slower.
> I think it can be implemented using overloaded arrays.
I thought about using them, but as far as I can tell there are a couple
of issues with array overloading, like 2d access or []. Another
problem is that is_array($_POST) will fail if it is not a real array, and many
apps rely on is_array (even though GPC arrays are always created, afaict,
or?).
I will go with solution #2 using JIT and a function. The ext/filter
changes will follow shortly after.
Thanks all for your feedback.
--Pierre
Hey everyone,
As you can see, the date on the quoted message was over 3 weeks ago. I've
been in touch with Pierre on and off since then and the last I heard
(about a week ago) was that he was making progress. However, currently
I don't know what the status is and I have not been able to get any
reply from him. I understand that he might be busy or away, but I
believe that we need to finish this crucial piece of functionality as
soon as possible so that a preview release can be made. Hopefully,
Pierre will see this in the next day or so and let us know what's up,
but if he doesn't, we need someone else to step in and do the work on
this (it involves some engine work most likely). So, please, let's see
how we should proceed.
-Andrei
> I thought about using them, but as far as I can tell there are a couple
> of issues with array overloading, like 2d access or []. Another
> problem is that is_array($_POST) will fail if it is not a real array, and many
> apps rely on is_array (even though GPC arrays are always created, afaict,
> or?).
> I will go with solution #2 using JIT and a function. The ext/filter
> changes will follow shortly after.
> Thanks all for your feedback.
> --Pierre
Hello Andrei,
> Hey everyone,
> As you can see, the date on the quoted message was over 3 weeks ago. I've
> been in touch with Pierre on and off since then and the last I heard
> (about a week ago)
Andrei, can you please cool down and consider the full history? The
discussions began right before Christmas time (12/18). We settled on a
decision one day before the season holidays.
> was that he was making progress. However, currently
> I don't know what the status is and I have not been able to get any
> reply from him.
> I understand that he might be busy or away, but I
> believe that we need to finish this crucial piece of functionality as
> soon as possible so that a preview release can be made.
This piece is crucial for PHP6 just like PDO and a couple of other
things (Are you going to send a mail like that for Wez and everyone
who does not bring a patch in time for you, or am I the only one who
deserves such a thing?). But is PHP6 the top priority right now? No. And
sorry, I do not consider the preview release (or whatever the name is)
as a "crucial" release either.
I do my best to work on this change during my (free) time. It achieves
nothing to push me or to ask me every single day during my holidays, or to
mail me, expect an answer one day later and start to complain in
all possible channels about my lack of reply.
> Hopefully, Pierre will see this in the next day or so and let us know what's up,
> but if he doesn't, we need someone else to step in and do the work on
> this (it involves some engine work most likely). So, please, let's see
> how we should proceed.
You may stop pushing so much for a preview release. It is
counterproductive and does not bring any additional motivation or time.
Now excuse me, it is 2:19 in the morning, I just spent an hour on a PHP
bug in a stable and existing branch, I need some rest.
--Pierre
Hi Pierre,
> Andrei, can you please cool down and consider the full history? The
> discussions began right before Christmas time (12/18). We settled on a
> decision one day before the season holidays.
I beg to differ, the discussions finished on 12/18.
> This piece is crucial for PHP6 just like PDO and a couple of other
> things (Are you going to send a mail like that for Wez and everyone
> who does not bring a patch in time for you, or am I the only one who
> deserves such a thing?).
If you ask Wez and Marcus, they'll tell you I've been bugging them
about the progress on PDO and SPL as well. However, these two
extensions are not an integral part of PHP like the request decoding
is, and thus, I can actually make a preview release with PDO and SPL
only partially complete.
> But is PHP6 the top priority right now? No. And
> sorry, I do not consider the preview release (or whatever the name is)
> as a "crucial" release either.
Please do not misquote me. I said it was a "crucial piece of
functionality" for the release.
> I do my best to work on this change during my (free) time. It achieves
> nothing to push me or to ask me every single day during my holidays, or to
> mail me, expect an answer one day later and start to complain in
> all possible channels about my lack of reply.
Pierre, I've been asking you on IRC privately about this for days.
You did say you were busy, but when you volunteered to work on this
and said that you would deliver this piece of code, you made a
commitment. I very much appreciate everyone's contributions and
realize that most of us work on PHP in spare time, but just because
this is open source, it doesn't mean that there aren't
responsibilities. And yes, I will keep pushing and prodding and being
annoying in general, because I took on the responsibility of bringing
about Unicode support in PHP when I started on it almost 2 years ago
and I intend to see it finished. Honestly, and I say this without any
vanity or pride, if I wasn't as annoying and persistent, I doubt very
much that we'd be where we are now with this whole project. Do you
remember a period of about 5 months earlier this year when the
development slowed down to a crawl? It's because I was busy with
other things and did not dedicate enough attention to this. I am very
glad we made a lot of progress lately, but I really don't think we
should act like we have all the time in the world. Deadlines exist
for a reason, even in open source projects, and, I don't know about
everyone else, but I am definitely bummed that we missed it, even
though it was known months in advance. So, please, just tell me, are
you still interested in working on this functionality and on the
Unicode project in general? If so, please let us know what the status
is and when you intend to deliver the patch. If not, thank you very
much for your contributions so far. I hope that your other PHP
related pursuits bring you more happiness.
> You may stop pushing so much for a preview release. It is
> counterproductive and does not bring any additional motivation or time.
See above.
> Now excuse me, it is 2:19 in the morning, I just spent an hour on a PHP
> bug in a stable and existing branch, I need some rest.
Thank you for working on the bug.
Best,
-Andrei