htmlspecialchars(), htmlentities(), html_entity_decode() and
get_html_translation_table() all take an encoding parameter that used to
default to ISO-8859-1. We changed the default in PHP 5.4 to UTF-8. This
is a much more sensible default and, in the case of the encoding
functions, more secure, as it prevents invalid UTF-8 from getting through.
If you use 8859-1 as the default but your app is actually in UTF-8 or,
worse, some encoding that isn't low-ASCII compatible, then
htmlspecialchars()/htmlentities() aren't doing what you think they are
and you have a glaring security hole in your app.
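A minimal sketch of the difference described above, assuming PHP 5.4 or later and that neither ENT_SUBSTITUTE nor ENT_IGNORE is passed. The sample string and variable names are made up for illustration:

```php
<?php
// "\xE9" is "é" in ISO-8859-1, but a lone 0xE9 byte followed by a space
// is not a valid UTF-8 sequence.
$latin1 = "caf\xE9 <b>";

// Treated as ISO-8859-1, every byte is a valid character, so the string
// passes through with only the markup characters escaped.
$asLatin1 = htmlspecialchars($latin1, ENT_QUOTES, 'ISO-8859-1');

// Treated as UTF-8, the invalid sequence is rejected: without
// ENT_SUBSTITUTE or ENT_IGNORE the function returns an empty string.
$asUtf8 = htmlspecialchars($latin1, ENT_QUOTES, 'UTF-8');

var_dump($asLatin1 === "caf\xE9 &lt;b&gt;");
var_dump($asUtf8 === '');
```

The empty result is the "fails safe" behaviour: malformed input is dropped rather than passed to the browser.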
However, people are understandably lazy and don't want to think about
this stuff. They don't want to explicitly provide their input encoding
to these calls. We provided a solution to this and a way to write
portable apps and that was to pass in an empty string "" as the
encoding. If we saw this we would set the input encoding to match the
output encoding specified by the "default_charset" ini setting. We
couldn't just default to this default_charset because input and output
encodings may very well be different and we would risk making existing
apps insecure. For example an app using BIG5/CJK for its output encoding
might very well be pulling data from 8859/UTF-8 data sources and if we
invisibly switched htmlspecialchars/entities to match their output
encoding we would have problems. Invisibly switching them from 8859-1 to
UTF-8 could still be problematic, but at least it fails safe in that
it doesn't let invalid UTF-8 through and encodes low-ascii the same way
it did before.
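The empty-string convention described above might look like this in application code. This is only a sketch: it assumes default_charset reflects the encoding the app actually uses, and the sample value is invented:

```php
<?php
// Assumption: default_charset is what the application sends to the browser.
ini_set('default_charset', 'UTF-8');

// Valid UTF-8 "Zoë" followed by markup that must be escaped.
$name = "Zo\xC3\xAB <script>";

// An empty-string encoding asks PHP to pick the charset from its
// configuration instead of hard-coding one at every call site.
$portable = htmlspecialchars($name, ENT_QUOTES, '');

// The explicit form names the input encoding directly.
$explicit = htmlspecialchars($name, ENT_QUOTES, 'UTF-8');

var_dump($portable === $explicit);
```

The explicit form stays the safer choice whenever the input encoding is known and differs from the output encoding.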
The problem is that there is a lot of legacy code out there that doesn't
explicitly set the encoding on those calls and it is a lot of work to go
through and specify it on each call. I still personally prefer to have
people be explicit here, but I think it is slowing 5.4 adoption (see bug
61354).
In PHP 6 we tried to introduce separate input, script and output
encoding settings. Currently in 5.4 we don't have that, but we have
those 3 separately for mbstring and for iconv:
iconv.input_encoding
iconv.internal_encoding
iconv.output_encoding
mbstring.http_input
mbstring.internal_encoding
mbstring.http_output
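To see what a given installation currently has for these six directives, something like the following could be used. It relies only on ini_get(), which returns false for a directive that is not registered (for example when the extension is not loaded); the variable names are illustrative:

```php
<?php
// The six per-extension encoding directives listed above.
$directives = array(
    'iconv.input_encoding',
    'iconv.internal_encoding',
    'iconv.output_encoding',
    'mbstring.http_input',
    'mbstring.internal_encoding',
    'mbstring.http_output',
);

// Collect the current value of each one; false means "not registered".
$current = array();
foreach ($directives as $name) {
    $current[$name] = ini_get($name);
}

print_r($current);
```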
Ideally we should be getting rid of the per-feature encoding settings
and have a single set of them that we refer to when we need them. This
is one of these places where we really need a default input encoding
setting. We could have it check mbstring.http_input, but there is a
wrinkle here that it has a fancy "auto" setting which we don't really
want in this case. So we could set it to iconv.input_encoding, but that
seems rather random and unintuitive.
So do we create a new default_input_encoding ini directive mid-stream in
5.4 for this? Of course with the longer-term in mind that this will be
part of a unified set of encoding settings in 5.5 and beyond.
-Rasmus
Personally, I think you should have just two encodings: page_encoding
and internal_encoding. The former is for form input and page output
(could be latin-1, for instance), and internal_encoding is the internal
representation (default to utf-8 - you can deal with all of, say,
latin-1, as well as unicode entities). Input and output, on the web at
least, are almost always going to match.
--
Andrew Faulds
http://ajf.me/
No, we need 3. The internal/script encoding doesn't have to be the same
as the input encoding. It isn't common in the Western world, but
elsewhere people do write their scripts in their local encoding which
may very well be different from their input and/or output encodings.
-Rasmus
Oh, you mean script encoding, form input/page output encoding and
internal representation?
Because I don't see a need for differing default input (i.e. file/form
input) and default output (i.e. page/file output) encodings.
--
Andrew Faulds
http://ajf.me/
So do we create a new default_input_encoding ini directive mid-stream in
5.4 for this? Of course with the longer-term in mind that this will be
part of a unified set of encoding settings in 5.5 and beyond.
Yes! This is a fantastic idea.
Adam
Hi,
I'm +1 for having internal/input/output/script encoding settings at the
PHP or Zend level.
If the default is the problem, we should make default_charset default to
UTF-8 and use it as the default for the internal/input/output/script
encodings and for functions that are affected by encoding.
When the XSS advisory was released in Feb. 2000, it stated that the
encoding MUST be specified in the HTTP response header. Setting
default_charset is best practice from a security perspective anyway.
If we use default_charset as the default encoding, the transition to 5.4
might be easier.
Regards,
--
Yasuo Ohgaki
yohgaki@ohgaki.net
2012/8/24 Rasmus Lerdorf rasmus@lerdorf.com:
Hi!
In PHP 6 we tried to introduce separate input, script and output
encoding settings. Currently in 5.4 we don't have that, but we have
those 3 separately for mbstring and for iconv:
iconv.input_encoding
iconv.internal_encoding
iconv.output_encoding
mbstring.http_input
mbstring.internal_encoding
mbstring.http_output
Ideally we should be getting rid of the per-feature encoding settings
and have a single set of them that we refer to when we need them.
I agree, having a unified set of encodings would be a good thing. However,
I have a feeling most people won't really understand what these
three do, and would never bother to set them. In my experience, people
don't even bother to set the PHP timezone, even though PHP complains each
time a date function is called. So these will be left at their defaults in
99.999% of cases.
So do we create a new default_input_encoding ini directive mid-stream in
5.4 for this? Of course with the longer-term in mind that this will be
part of a unified set of encoding settings in 5.5 and beyond.
What happens to these 6 directives? Will we now have 9 directives for
setting the encoding? This reminds me of: http://xkcd.com/927/. Having
yet more settings is not really a solution to the problem of too many
different settings. So unless we deprecate all others in 5.5 and have
people use only generic ones it's not very useful. If we do deprecate
them, we need some kind of migration path - i.e. if you set
iconv.input_encoding what actually happens? If you set
default_input_encoding will it also set mbstring.http_input - or will it
affect mbstring without actually setting it?
I guess we'd need a good detailed RFC on this :)
--
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227
Hi,
2012/8/27 Stas Malyshev smalyshev@sugarcrm.com:
Hi!
In PHP 6 we tried to introduce separate input, script and output
encoding settings. Currently in 5.4 we don't have that, but we have
those 3 separately for mbstring and for iconv:
iconv.input_encoding
iconv.internal_encoding
iconv.output_encoding
mbstring.http_input
mbstring.internal_encoding
mbstring.http_output
Ideally we should be getting rid of the per-feature encoding settings
and have a single set of them that we refer to when we need them.
I agree, having unified set of encodings would be a good thing. However,
I have a feeling most of the people won't really understand what these
three do, and would never bother to set them. From my experience, people
don't even bother to set PHP timezone, even though PHP complains each
time date function is accessed. So these will be left as default in
99.999% of cases.
I agree. Other than in applications made by CJK-native developers, I
rarely see them set.
So do we create a new default_input_encoding ini directive mid-stream in
5.4 for this? Of course with the longer-term in mind that this will be
part of a unified set of encoding settings in 5.5 and beyond.
What happens to these 6 directives? Will we now have 9 directives for
setting the encoding? This reminds me of: http://xkcd.com/927/. Having
yet more settings is not really a solution to the problem of too many
different settings. So unless we deprecate all others in 5.5 and have
people use only generic ones it's not very useful. If we do deprecate
them, we need some kind of migration path - i.e. if you set
iconv.input_encoding what actually happens? If you set
default_input_encoding will it also set mbstring.http_input - or will it
affect mbstring without actually setting it?
I guess we'd need a good detailed RFC on this :)
If I write a patch for it, I'll modify iconv.*/mbstring.* to use php.* (or zend.*).
When default_charset is set and the other settings are null, use it as
the default for everything, including htmlentities(), mb_*(), etc.
default_charset will be the single encoding setting if the user uses a
single encoding for the application.
How to deal with iconv.*/mbstring.*:
master: remove iconv.*/mbstring.*
5.4: iconv.*/mbstring.* remain for compatibility and are used if they are set.
We could remove iconv.*/mbstring.* for 5.4. It's a big change for CJK
users, but they will be okay with it. Almost all users are using a single
encoding for their application anyway.
I think removing iconv.*/mbstring.* for both master and 5.4 would be nicer.
Any opinions?
Regards,
--
Yasuo Ohgaki
yohgaki@ohgaki.net
We can't remove them in 5.4. We can add new ones without breaking
anything and we can make mbstring/iconv/html* use those if they are set
and then mark the mbstring/iconv settings as deprecated in master.
-Rasmus
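The lookup order being discussed could be sketched roughly as follows. Everything here is hypothetical: resolve_encoding() is not a PHP function, default_input_encoding is only the directive name floated in this thread, and the final fall-back to default_charset follows Yasuo's suggestion rather than any agreed design:

```php
<?php
/**
 * Hypothetical resolver, NOT a PHP API. It illustrates one possible
 * precedence: explicit argument, then the new unified directive, then
 * the legacy per-extension directive, then default_charset.
 *
 * @param array       $ini      directive name => value, e.g. read via ini_get()
 * @param string      $unified  name of the proposed unified directive
 * @param string      $legacy   name of the extension-specific directive
 * @param string|null $explicit encoding passed by the caller, if any
 * @return string
 */
function resolve_encoding(array $ini, $unified, $legacy, $explicit = null)
{
    // 1. An explicit argument always wins.
    if ($explicit !== null && $explicit !== '') {
        return $explicit;
    }
    // 2. The new directive if set, otherwise the legacy one.
    foreach (array($unified, $legacy) as $name) {
        if (!empty($ini[$name])) {
            return $ini[$name];
        }
    }
    // 3. Fall back to default_charset, and finally to UTF-8.
    return !empty($ini['default_charset']) ? $ini['default_charset'] : 'UTF-8';
}

// Example: the legacy iconv setting is only used when the new one is absent.
$ini = array('iconv.input_encoding' => 'ISO-8859-1', 'default_charset' => 'UTF-8');
echo resolve_encoding($ini, 'default_input_encoding', 'iconv.input_encoding'), "\n";
```

Passing the settings in as an array keeps the sketch independent of which directives a given PHP build actually registers.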
Hi,
I've created an RFC page so that this discussion won't be forgotten.
https://wiki.php.net/rfc/default_encoding
Please edit the RFC page if needed.
Regards,
--
Yasuo Ohgaki
yohgaki@ohgaki.net