default charset confusion

13 years ago by Laruence

I caused this situation myself by not explicitly differentiating between
the default charset for the internal htmlspecialchars() and
htmlentities() functions and the output charset ini directive
default_charset.

The idea behind the default_charset ini directive was to act as the
charset that gets specified in the HTTP Content-type header if you do
not explicitly send your own Content-type header with the header()
function. This has been muddied a bit by the fact that
htmlspecialchars/htmlentities can take it into account when it is trying
to choose which encoding to use when handling data passed to it. This
isn't done by default since it actually makes little sense. It is only
done if you pass an empty string as the encoding argument. If you don't
pass anything at all the default is UTF-8 in 5.4. In 5.3 this was
ISO-8859-1.

And here is where the confusion comes in. We, myself included, have told
people that they can get the 5.3 behaviour back by setting the
default_charset ini directive to iso-8859-1. But, this is only true if
they are forcing htmlspecialchars/htmlentities to check that setting
with an empty string as the encoding arg. Most apps just do
htmlspecialchars($str) and nothing else. Plus, it is really not a good
idea to tie the internal encoding of data being passed to these
functions to the output charset. You should be able to change the output
charset without worrying about your runtime encoding at that level.

What this effectively means is that we are asking people to go through
their code and add an explicit charset to all htmlspecialchars() and
htmlentities() calls. I think this will be a hurdle for 5.4 adoption.

What we really need is what we added in PHP 6. A runtime encoding ini
setting that is distinct from the output charset which we can use here.
That would allow people to fix all their legacy code to a specific
runtime encoding with a single ini setting instead of changing thousands
of lines of code. I propose that we add such a directive to 5.4.1 to
ease migration.

+1, especially for non-UTF-8 applications.

thanks

See https://bugs.php.net/61354 for the first signs of grumbling about
this one. As more people migrate I have a feeling this will end up being
the most difficult part of the migration.

-Rasmus

--

--
Laruence Xinchen Hui
http://www.laruence.com/

13 years ago by Adam Jon Richardson

What we really need is what we added in PHP 6. A runtime encoding ini
setting that is distinct from the output charset which we can use here.
That would allow people to fix all their legacy code to a specific
runtime encoding with a single ini setting instead of changing thousands
of lines of code. I propose that we add such a directive to 5.4.1 to
ease migration.

This seems like a very reasonable way of dealing with this issue.

Adam

13 years ago by Stas Malyshev

Hi!

What we really need is what we added in PHP 6. A runtime encoding ini
setting that is distinct from the output charset which we can use here.
That would allow people to fix all their legacy code to a specific
runtime encoding with a single ini setting instead of changing thousands
of lines of code. I propose that we add such a directive to 5.4.1 to
ease migration.

One more charset INI setting? I'm not sure I like this. We have tons of
INIs already, and adding a new one each time we change something makes
both writing applications and configuring servers harder.
But as the manual says, ISO-8859-1 and UTF-8 are the same for
htmlspecialchars() - is it wrong? If yes, what exactly is the difference
between old and new behavior? I tried to read #61354 but could make
little sense out of it; it lacks an expected result and I have a hard time
understanding what the problem is there. Could you explain?

--
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227

13 years ago by Laruence

Hi!

What we really need is what we added in PHP 6. A runtime encoding ini
setting that is distinct from the output charset which we can use here.
That would allow people to fix all their legacy code to a specific
runtime encoding with a single ini setting instead of changing thousands
of lines of code. I propose that we add such a directive to 5.4.1 to
ease migration.

One more charset INI setting? I'm not sure I like this. We have tons of INIs
already, and adding a new one each time we change something makes both
writing applications and configuring servers harder.
But as the manual says, ISO-8859-1 and UTF-8 are the same for
htmlspecialchars() - is it wrong? If yes, what exactly is the different
between old and new behavior? I tried to read #61354 but could make little
sense out of it, it lacks expected result and I have hard time understanding
what is the problem there. Could you explain?

Hi:
if the string passed to htmlspecialchars() is not in the charset that
htmlspecialchars() expects (the default is UTF-8, and the only way around
that is to specify the third argument), an empty string will be returned
without any notice or warning ;)

thanks

--
Laruence Xinchen Hui
http://www.laruence.com/

13 years ago by Laruence

Hi!

What we really need is what we added in PHP 6. A runtime encoding ini
setting that is distinct from the output charset which we can use here.
That would allow people to fix all their legacy code to a specific
runtime encoding with a single ini setting instead of changing thousands
of lines of code. I propose that we add such a directive to 5.4.1 to
ease migration.

One more charset INI setting? I'm not sure I like this. We have tons of INIs
already, and adding a new one each time we change something makes both
writing applications and configuring servers harder.

If we will definitely add a run_time_charset in the future, then I
think it's okay to add it now. :)

thanks

But as the manual says, ISO-8859-1 and UTF-8 are the same for
htmlspecialchars() - is it wrong? If yes, what exactly is the different
between old and new behavior? I tried to read #61354 but could make little
sense out of it, it lacks expected result and I have hard time understanding
what is the problem there. Could you explain?

--
Laruence Xinchen Hui
http://www.laruence.com/

13 years ago by Rasmus Lerdorf

Hi!

What we really need is what we added in PHP 6. A runtime encoding ini
setting that is distinct from the output charset which we can use here.
That would allow people to fix all their legacy code to a specific
runtime encoding with a single ini setting instead of changing thousands
of lines of code. I propose that we add such a directive to 5.4.1 to
ease migration.

One more charset INI setting? I'm not sure I like this. We have tons of
INIs already, and adding a new one each time we change something makes
both writing applications and configuring servers harder.
But as the manual says, ISO-8859-1 and UTF-8 are the same for
htmlspecialchars() - is it wrong? If yes, what exactly is the different
between old and new behavior? I tried to read #61354 but could make
little sense out of it, it lacks expected result and I have hard time
understanding what is the problem there. Could you explain?

Yes, it is a bit hard to understand from the bug report because
bugs.php.net is all utf-8, but we are talking about non utf-8 apps here.

This script should illustrate it: ( https://gist.github.com/2020502 )

$gb2312 = iconv('UTF-8','GB2312','我是测试');
$string = $string = "<pre>$gb2312</pre>";
echo htmlspecialchars($string);

If you run that in PHP 5.3 you get:

The garbage-like chars there - if you don't see them, see
https://gist.github.com/2020442 - is the expected output. In PHP 5.4 the
output is nothing. The function recognizes that this is not valid UTF-8
and dumps the entire string.

Ignoring 5.4 for a second, if you in 5.3 do this:

echo htmlspecialchars($string);
echo htmlspecialchars($string, NULL, "ISO-8859-1");
echo htmlspecialchars($string, NULL, "UTF-8");

You will see that the first two output the escaped string with the
GB2312 bytes intact within it and the UTF-8 call returns false because
it correctly recognizes that GB2312 is not UTF-8. We don't have any such
check for 8859-1, so yes, saying UTF-8 and 8859-1 are the same for
htmlspecialchars() is wrong for PHP 5.3 as well as for 5.4.

And as expected, under 5.4 because the default is now the UTF-8
behaviour only the second echo gives a result.
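
For readers who want to reproduce the comparison without iconv, here is a minimal sketch. It assumes the two bytes CE D2 stand in for GB2312-encoded text (any pair of high-bit bytes that is not valid UTF-8 should behave the same way), and it passes the flags and encoding explicitly so the result does not depend on which defaults a given PHP version picks. The PHP manual describes the failure case as an empty string, which is what the check below looks for.

```php
<?php
// Two high-bit bytes standing in for GB2312 text (assumed sample data).
// Read as UTF-8 they are invalid: 0xCE must be followed by a continuation byte.
$gb2312 = "\xCE\xD2";
$string = "<pre>$gb2312</pre>";

// ISO-8859-1 is single-byte, so nothing is validated and the bytes pass through.
$latin1 = htmlspecialchars($string, ENT_COMPAT, 'ISO-8859-1');

// UTF-8 input is validated; an invalid sequence makes the whole result empty.
$utf8 = htmlspecialchars($string, ENT_COMPAT, 'UTF-8');

var_dump($latin1 === "&lt;pre&gt;\xCE\xD2&lt;/pre&gt;"); // bool(true)
var_dump($utf8 === '');                                   // bool(true)
```

The point of the sketch is only the contrast between the two explicit encodings; what an argument-less call does depends on the PHP version in use.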

-Rasmus

13 years ago by Rasmus Lerdorf

$string = $string = "<pre>$gb2312</pre>";

Sorry typo there obviously. Just one $string

-Rasmus

13 years ago by Stas Malyshev

Hi!

Ignoring 5.4 for a second, if you in 5.3 do this:

echo htmlspecialchars($string);
echo htmlspecialchars($string, NULL, "ISO-8859-1");
echo htmlspecialchars($string, NULL, "UTF-8");

You will see that the first two output the escaped string with the
GB2312 bytes intact within it and the UTF-8 calls returns false because
it correctly recognizes that GB2312 is not UTF-8. We don't have any such
check for 8859-1, so yes, saying UTF-8 and 8859-1 are the same for
htmlspecialchars() is wrong for PHP 5.3 as well as for 5.4.

So the difference is that ISO8859-1 does not validate but UTF-8 validates?
I'm not sure what GB2312 encoding does but isn't it dangerous to do
htmlspecialchars() with wrong encoding? Wouldn't htmlentities() also
produce wrong result when used with wrong encoding?

--
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227

13 years ago by Adam Jon Richardson

On Mon, Mar 12, 2012 at 3:52 AM, Stas Malyshev <smalyshev@sugarcrm.com> wrote:

Hi!

Ignoring 5.4 for a second, if you in 5.3 do this:

echo htmlspecialchars($string);
echo htmlspecialchars($string, NULL, "ISO-8859-1");
echo htmlspecialchars($string, NULL, "UTF-8");

You will see that the first two output the escaped string with the
GB2312 bytes intact within it and the UTF-8 calls returns false because
it correctly recognizes that GB2312 is not UTF-8. We don't have any such
check for 8859-1, so yes, saying UTF-8 and 8859-1 are the same for
htmlspecialchars() is wrong for PHP 5.3 as well as for 5.4.

So the difference is that ISO8859-1 does not validate but UTF-8 validates?
I'm not sure what GB2312 encoding does but isn't it dangerous to do
htmlspecialchars() with wrong encoding? Wouldn't htmlentities() also
produce wrong result when used with wrong encoding?

The EUC-CN encoding appears to ensure compatibility with ascii by avoiding
the ascii range for each of its two bytes, so it seems that
htmlspecialchars should work OK:

http://en.wikipedia.org/wiki/GB_2312#EUC-CN
http://php.net/manual/en/mbstring.supported-encodings.php
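
The ASCII-compatibility argument can be checked with a short sketch. The byte string below is an assumed sample of four two-byte EUC-CN characters written as raw bytes; the check simply confirms that none of its bytes falls in the ASCII range or matches a character that htmlspecialchars() would escape.

```php
<?php
// Assumed sample: four two-byte EUC-CN (GB2312) characters as raw bytes.
$gb2312 = "\xCE\xD2\xCA\xC7\xB2\xE2\xCA\xD4";

// Every byte of a two-byte EUC-CN character has the high bit set,
// so none of them can collide with the ASCII characters HTML cares about.
$allHighBit = true;
foreach (str_split($gb2312) as $byte) {
    if (ord($byte) < 0x80) {
        $allHighBit = false;
    }
}

var_dump($allHighBit);                          // bool(true)
var_dump(strpbrk($gb2312, "<>&\"'") === false); // bool(true)
```

This only illustrates the sample bytes; it is not a general validator for EUC-CN input.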

Adam

13 years ago by Rasmus Lerdorf

Hi!

Ignoring 5.4 for a second, if you in 5.3 do this:

echo htmlspecialchars($string);
echo htmlspecialchars($string, NULL, "ISO-8859-1");
echo htmlspecialchars($string, NULL, "UTF-8");

You will see that the first two output the escaped string with the
GB2312 bytes intact within it and the UTF-8 calls returns false because
it correctly recognizes that GB2312 is not UTF-8. We don't have any such
check for 8859-1, so yes, saying UTF-8 and 8859-1 are the same for
htmlspecialchars() is wrong for PHP 5.3 as well as for 5.4.

So the difference is that ISO8859-1 does not validate but UTF-8 validates?
I'm not sure what GB2312 encoding does but isn't it dangerous to do
htmlspecialchars() with wrong encoding? Wouldn't htmlentities() also
produce wrong result when used with wrong encoding?

Not sure you can validate 8859-1 since it isn't multibyte, can you? Is
there any byte that is explicitly forbidden in 8859-1?
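
To make the asymmetry concrete, here is a small sketch of a structural UTF-8 check. looks_like_utf8() is a hypothetical helper, not a PHP function; it relies on PCRE's /u modifier, which refuses subjects that are not well-formed UTF-8. A single-byte charset gives such a check nothing structural to test, since any byte sequence has the right shape.

```php
<?php
// Hypothetical helper: with the /u modifier PCRE rejects malformed UTF-8,
// so preg_match() returns false instead of 1 for such a subject.
function looks_like_utf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

var_dump(looks_like_utf8("caf\xC3\xA9")); // UTF-8 bytes for "café": bool(true)
var_dump(looks_like_utf8("caf\xE9"));     // ISO-8859-1 bytes for "café": bool(false)
```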

And yes, it may very well be dangerous to use the wrong charset and now
that we have better support for GB2312 and other asian charsets in the
entities functions in 5.4 it is even more prudent to choose the right
one so we should provide some way to help people get it right short of
changing every call.

Gustavo suggested we could use the multibyte encoding setting.
Unfortunately only zend.script_encoding is available and I think
internal_encoding is closer to what we need here, but that is only
available as mbstring.internal_encoding.

-Rasmus

13 years ago by Yasuo Ohgaki

Hi

I think the following PHP 5.4.0 NEWS entry is misleading.

. Changed default value of "default_charset" php.ini option from ISO-8859-1 to
UTF-8. (Rasmus)

I thought default_charset became UTF-8, so I was expecting the
following HTTP header.

content-type text/html; charset=UTF-8

However, I got an empty charset (missing 'charset=UTF-8').
So I looked at the source and found this line in SAPI.h:

#define SAPI_DEFAULT_CHARSET ""

Shouldn't the empty string be "UTF-8"?

BTW, an empty charset in the HTTP header does not mean the default will
be ISO-8859-1; it lets the browser guess which encoding is used.
Guessing the encoding may cause XSS under certain conditions.
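
One way to avoid leaving the browser to guess is to send the charset explicitly, sketched here under the assumption that the application knows its own output encoding. content_type_header() is a hypothetical helper introduced only for illustration; header() and headers_sent() are the standard PHP functions.

```php
<?php
// Hypothetical helper that builds the header line; the name is illustrative only.
function content_type_header(string $charset): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// header() only works before any output has been sent, hence the guard.
if (!headers_sent()) {
    header(content_type_header('UTF-8'));
}
```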

Anyway, I was curious so I've checked ext/standard/html.c and found

/* {{{ entity_charset determine_charset
 * returns the charset identifier based on current locale or a hint.
 * defaults to UTF-8 */
static enum entity_charset determine_charset(char *charset_hint TSRMLS_DC)
{
    int i;
    enum entity_charset charset = cs_utf_8;
    int len = 0;
    const zend_encoding *zenc;

    /* Default is now UTF-8 */
    if (charset_hint == NULL)
        return cs_utf_8;

There are 2 problems.

1. php.ini's default_charset should be UTF-8.
2. determine_charset() should not blindly default to UTF-8 when there
   is no hint.

The old htmlentities/htmlspecialchars actually determined the charset from
default_charset/mbstring.internal_encoding/etc. I think the old behavior
is better than the current one.

How about making determine_charset() behave like 5.3 and setting
SAPI_DEFAULT_CHARSET to "UTF-8"?

Then PHP will behave as NEWS says: the htmlentities/htmlspecialchars
default encoding becomes 'UTF-8', and users will have control over the
default htmlentities/htmlspecialchars encoding.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

13 years ago by Yasuo Ohgaki

Hi,

I think the motivation for

   /* Default is now UTF-8 */
   if (charset_hint == NULL)
           return cs_utf_8;

is better performance, and I think that is a good thing.
An alternative to my suggestion is to introduce a new php.ini entry, as
Rasmus mentioned.

The name may be "default_html_escape_encoding"?

We should document this behavior very well, since it affects all
non-UTF-8 web sites.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

13 years ago by Laruence

Hi,

I think motivation of

/* Default is now UTF-8 */
if (charset_hint == NULL)
return cs_utf_8;

is for better performance and I think it's good for better performance.
Alternative of my suggestion is introduce new php.ini entry as Rusmus
mentioned.

The name may be "default_html_escape_encoding"?

Hi:
For the sake of succinctness, I think run_time_encoding is better.

We should also separate determine_output_charset and
determine_run_time_charset (there is only one determine_charset now).

thanks

We should document this behavior very well, since it affects all of
non UTF-8 web sites.

Regards,

--
Laruence Xinchen Hui
http://www.laruence.com/

13 years ago by Rasmus Lerdorf

Hi

I think following PHP 5.4.0 NEWS entry is misleading.

. Changed default value of "default_charset" php.ini option from ISO-8859-1 to
UTF-8. (Rasmus)

Yes, I have fixed that now.

I thought default_charset became UTF-8, so I was expecting
following HTTP header.

content-type text/html; charset=UTF-8

However, I got empty charset (missing 'charset=UTF-8').
So I looked up to source and found the line in SAPI.h

293 #define SAPI_DEFAULT_CHARSET ""

Empty string should be "UTF-8", isn't it?

No, we can't force an output charset on people since it would end up
breaking a lot of sites.

php.ini's default_charset should be UTF-8.

determine_charset() should not blindly default to UTF-8 when there
are no hint.

Old htmlentities/htmlspecialchars actually determines charset from
default_charset/mbstring.internal_encoding/etc. I think old behavior
is better than now.

How about make determine_charset() behaves like 5.3 and set the
SAPI_DEFAULT_CHARSET to "UTF-8"?

PHP 5.3's determine_charset behaves exactly like 5.4's. In 5.3 we have:

if (charset_hint == NULL)
        return cs_8859_1;

and in 5.4 we have:

if (charset_hint == NULL)
        return cs_utf_8;

So there is no difference in their guessing when there is no hint, the
only difference is that in 5.4 we choose utf8 and in 5.3 we choose
8859-1 in that case.

-Rasmus

13 years ago by Michael Stowe

I think the ini directive, while adding another to the list, may be the
most unobtrusive method to address this issue, at least for developers.

I definitely agree with Rasmus that this could be one of the bigger
headaches in transitioning to 5.4 (for non-UTF8 sites) and unless we can
come up with a better solution, I say let's move forward with it for 5.4.1.

Mike

--

"My command is this: Love each other as I
have loved you." John 15:12

13 years ago by Yasuo Ohgaki

2012/3/13 Rasmus Lerdorf rasmus@lerdorf.com:

I thought default_charset became UTF-8, so I was expecting
following HTTP header.

content-type text/html; charset=UTF-8

However, I got empty charset (missing 'charset=UTF-8').
So I looked up to source and found the line in SAPI.h

293 #define SAPI_DEFAULT_CHARSET ""

Empty string should be "UTF-8", isn't it?

No, we can't force an output charset on people since it would end up
breaking a lot of sites.

Right, so maybe for the next major release? 5.5.0?

As the first XSS advisory in 2000 states, explicitly setting the character
encoding will prevent certain XSS. Recent browsers have much better encoding
handling, but setting the encoding explicitly is still better for security.

PHP 5.3's determine_charset behaves exactly like 5.4's. In 5.3 we have:

if (charset_hint == NULL)
return cs_8859_1;

and in 5.4 we have:

if (charset_hint == NULL)
return cs_utf_8;

So there is no difference in their guessing when there is no hint, the
only difference is that in 5.4 we choose utf8 and in 5.3 we choose
8859-1 in that case.

I got this with 5.3
<?php
echo htmlentities('<日本語UTF-8>',ENT_QUOTES);
echo htmlentities('<日本語UTF-8>',ENT_QUOTES, 'UTF-8');

<æ�¥æ�¬èª�UTF8
><日本語UTF-8>

So people migrating from 5.3 to 5.4 should not have problems.
Migration older than 5.3 to 5.4 will be problematic.

I always set all parameters for htmlentities/htmlspecialchars, therefore
I haven't noticed this was changed from 5.3. They may be migrating from
5.2 or older. (RHEL5 uses 5.1)

Since PHP does not have a default multibyte module, it may be good to have

input_encoding
internal_encoding
output_encoding

php.ini settings and make multibyte modules use them when they are set.
Alternatively, just make mbstring a default extension.

It is a rather big change for a released version, but it is a simple, easy change.

Regards,

--
Yasuo Ohgaki

13 years ago by Rasmus Lerdorf

I always set all parameters for htmlentities/htmlspecialchars, therefore
I haven't noticed this was changed from 5.3. They may be migrating from
5.2 or older. (RHEL5 uses 5.1)

No, like I showed, moving from 5.3 to 5.4 breaks because the new default
UTF-8 encoding validates the input and 8859-1 in 5.3 does not. So for
charsets that are actually safe for the low-ascii chars that are
significant to html htmlspecialchars() now returns false in 5.4 because
their chars fail the UTF8 validity check. For people who explicitly set
all the parameters nothing has changed, of course.

-Rasmus

13 years ago by Christian Schneider

Am 13.03.2012, 02:34 Uhr, schrieb Rasmus Lerdorf rasmus@lerdorf.com:

I always set all parameters for htmlentities/htmlspecialchars, therefore
I haven't noticed this was changed from 5.3. They may be migrating from
5.2 or older. (RHEL5 uses 5.1)

No, like I showed, moving from 5.3 to 5.4 breaks because the new default
UTF-8 encoding validates the input and 8859-1 in 5.3 does not. So for
charsets that are actually safe for the low-ascii chars that are
significant to html htmlspecialchars() now returns false in 5.4 because
their chars fail the UTF8 validity check. For people who explicitly set
all the parameters nothing has changed, of course.

I second that. It causes us big PITA because we're still using 8859-1
(shame on us) and it is made even worse because the encoding parameter is
after the (optional) flags parameter which now has to be given too.
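
A short sketch of the call shape being described: because the encoding is the third positional parameter, the flags have to be written out as well. The sample input and the choice of ENT_COMPAT are assumptions made for illustration; an application should pass whichever flags its existing calls relied on.

```php
<?php
// ISO-8859-1 bytes for "café <b>" (assumed sample input).
$str = "caf\xE9 <b>";

// The encoding is the third parameter, so the second (flags) must be spelled out.
// ENT_COMPAT is an assumption about which flags the legacy code relied on.
$escaped = htmlspecialchars($str, ENT_COMPAT, 'ISO-8859-1');

var_dump($escaped === "caf\xE9 &lt;b&gt;"); // bool(true)
```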

The sane version from my naive point of view would be to honor
default_charset if nothing is given. That's what I expected when I read
the migration guide.

Chris

13 years ago by jpauli

2012/3/13 Rasmus Lerdorf rasmus@lerdorf.com:

I thought default_charset became UTF-8, so I was expecting
following HTTP header.

content-type text/html; charset=UTF-8

However, I got empty charset (missing 'charset=UTF-8').
So I looked up to source and found the line in SAPI.h

293 #define SAPI_DEFAULT_CHARSET ""

Empty string should be "UTF-8", isn't it?

No, we can't force an output charset on people since it would end up
breaking a lot of sites.

Right, so may be for the next major release? 5.5.0?

As the first XSS advisory in 2000 states, explicitly setting char coding will
prevent certain XSS. Recent browsers have much better encoding handling,
but setting encoding explicitly is better for security still.

PHP 5.3's determine_charset behaves exactly like 5.4's. In 5.3 we have:

if (charset_hint == NULL)
return cs_8859_1;

and in 5.4 we have:

if (charset_hint == NULL)
return cs_utf_8;

So there is no difference in their guessing when there is no hint, the
only difference is that in 5.4 we choose utf8 and in 5.3 we choose
8859-1 in that case.

I got this with 5.3
<?php
echo htmlentities('<日本語UTF-8>',ENT_QUOTES);
echo htmlentities('<日本語UTF-8>',ENT_QUOTES, 'UTF-8');

<æ�¥æ�¬èª�UTF8
><日本語UTF-8>

So people migrating from 5.3 to 5.4 should not have problems.
Migration older than 5.3 to 5.4 will be problematic.

I always set all parameters for htmlentities/htmlspecialchars, therefore
I haven't noticed this was changed from 5.3. They may be migrating from
5.2 or older. (RHEL5 uses 5.1)

Since PHP does not have default multibyte module, it may be good for having

input_encoding
internal_encoding
output_encoding

I would then propose to make mbstring compile time mandatory.

I'm against yet another global ini setting. I find the current ini settings
confusing enough without adding one more that would moreover mirror
mbstring's (and add more and more confusion).
Why not make ext/mbstring mandatory at compile time, for all future PHP
versions, like preg or spl are?

We need multibyte handling anyway. ZendEngine takes advantage of
mbstring for internal encoding as well, so I have probably missed something
about why it is still possible to use --disable-mbstring (or not add
--enable-mbstring) when compiling. Does it have a huge performance impact?

Thank you :)

Julien.P

13 years ago by Ferenc Kovacs

I would then propose to make mbstring compile time mandatory.

I'm against yet another global ini setting, I find the actual ini settings
confusing enough to add one more that would moreover reflect mbstring one's
(and add more and more confusion).
Why not turn ext/mbstring mandatory at compile time, for all future PHP
versions, like preg or spl are ?

We do need multibyte handling either. ZendEngine takes advantage of
mbstring for internal encoding as well, so I probably missed something as
why it is still possible to --disable-mbstring (or not add
--enable-mbstring) when compiling ? Has it a huge performance impact ?

Thank you :)

Julien.P

see
http://www.mail-archive.com/internals@lists.php.net/msg48452.html
http://lxr.php.net/opengrok/xref/PHP_5_4/UPGRADING#91
and
http://www.mail-archive.com/internals@lists.php.net/msg53863.html

basically the mbstring code in the ZE is only used if you
enable zend.multibyte, which is disabled by default, so it isn't mandatory
to have ext/mbstring for the default build/setup.
as you can see from the last link, I would support having ext/mbstring
builtin and always enabled, but I would like to hear from more people about
the pros and cons.

--
Ferenc Kovács
@Tyr43l - http://tyrael.hu

13 years ago by Michael Stowe

Correct me if I'm wrong, but I believe Zend Multibyte is now enabled by
default in PHP 5.4.

Mike

--

"My command is this: Love each other as I
have loved you." John 15:12

13 years ago by Ferenc Kovacs

Correct me if I'm wrong, but I believe Zend Multibyte is now enabled by
default in PHP 5.4.

Mike

http://lxr.php.net/opengrok/xref/PHP_5_4/UPGRADING#91
http://lxr.php.net/opengrok/xref/PHP_5_4/Zend/zend.c#108
http://lxr.php.net/opengrok/xref/PHP_5_4/php.ini-development#358
http://lxr.php.net/opengrok/xref/PHP_5_4/php.ini-production#358

we just moved the switch from compilation time to runtime, so the code is
there, if you want to enable it, you don't have to recompile php but only
have to change an ini setting, but it isn't turned on by default.
AFAIK
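
A quick way to see what a particular build offers at runtime, as a minimal sketch. It only reads state: extension_loaded() reports whether the optional mbstring extension is available, and ini_get() returns false when a directive is unknown to the build, otherwise its current value as a string.

```php
<?php
// Whether the optional mbstring extension was compiled in or loaded.
$hasMbstring = extension_loaded('mbstring');

// zend.multibyte is read as an ini value; false means the directive is
// unknown to this build, otherwise the value comes back as a string.
$zendMultibyte = ini_get('zend.multibyte');

echo 'mbstring loaded: ', $hasMbstring ? 'yes' : 'no', PHP_EOL;
echo 'zend.multibyte: ', var_export($zendMultibyte, true), PHP_EOL;
```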

Ferenc Kovács
@Tyr43l - http://tyrael.hu

13 years ago by Gustavo Lopes

I would then propose to make mbstring compile time mandatory.

I'm completely against these kind of lazy solutions. Yes, let's add strong
coupling (already starting to smell) to one of the largest extensions and
make it compile time mandatory because it simplifies the implementation of
a dubiously useful feature like Zend multibyte. Remember PHP is sometimes
used in environments with limited memory/disk space.

Also mbstring takes a long time to build (relatively speaking). Just that
would be a strong argument against making it mandatory, at least for
people like me that compile PHP with --disable-all very frequently.

I'm against yet another global ini setting, I find the actual ini
settings confusing enough to add one more that would moreover reflect
mbstring one's (and add more and more confusion).
Why not turn ext/mbstring mandatory at compile time, for all future PHP
versions, like preg or spl are ?

We do need multibyte handling either. ZendEngine takes advantage of
mbstring for internal encoding as well, so I probably missed something as
why it is still possible to --disable-mbstring (or not add
--enable-mbstring) when compiling ? Has it a huge performance impact ?

mbstring hooks to basically all phases of PHP process/request
startup/shutdown. Some efforts were made to mitigate the impact of this in
5.4 (see e.g. r301068), but at least some impact is inevitable. Of course,
if you start enabling certain features of mbstring (zend multibyte hooks,
translation of input variables, function overload) then it starts to be
significant. However, there are other more compelling reasons not to make
it required (see above).

--
Gustavo Lopes

13 years ago by jpauli

On Wed, Mar 14, 2012 at 3:37 PM, Gustavo Lopes <glopes@nebm.ist.utl.pt> wrote:

I would then propose to make mbstring compile time mandatory.

I'm completely against these kind of lazy solutions. Yes, let's add strong
coupling (already starting to smell) to one of the largest extensions and
make it compile time mandatory because it simplifies the implementation of
a dubiously useful feature like Zend multibyte. Remember PHP is sometimes
used in environments with limited memory/disk space.

Also mbstring takes a long time to build (relatively speaking). Just that
would be a strong argument against making it mandatory, at least for people
like me that compile PHP with --disable-all very frequently.

I'm against yet another global ini setting, I find the actual ini
settings confusing enough to add one more that would moreover reflect
mbstring one's (and add more and more confusion).
Why not turn ext/mbstring mandatory at compile time, for all future PHP
versions, like preg or spl are ?

We do need multibyte handling either. ZendEngine takes advantage of
mbstring for internal encoding as well, so I probably missed something as
why it is still possible to --disable-mbstring (or not add
--enable-mbstring) when compiling ? Has it a huge performance impact ?

mbstring hooks to basically all phases of PHP process/request
startup/shutdown. Some efforts were made to mitigate the impact of this in
5.4 (see e.g. r301068), but at least some impact is inevitable. Of course,
if you start enabling certain features of mbstring (zend multibyte hooks,
translation of input variables, function overload) then it starts to be
significant. However, there are other more compelling reasons not to make
it required (see above).

--
Gustavo Lopes

That makes sense to me :-)

But we should think about complexity in the final choice.
Adding something like "internal_encoding" to php.ini will confuse
people, at least if we don't clearly explain to them what the setting is for.
The name is nearly the same as mbstring's.

I recently opened a doc bug about multibyte handling in 5.4 (#61373), as
the documentation is really light on that point.

Julien.P

13 years ago by Stas Malyshev — view source

Hi!

And yes, it may very well be dangerous to use the wrong charset and now
that we have better support for GB2312 and other asian charsets in the
entities functions in 5.4 it is even more prudent to choose the right
one so we should provide some way to help people get it right short of
changing every call.

I'm not sure "changing every call" is such a big problem - it's one grep
and one replace, can be done in one line of sed/awk/perl/php probably.
But a bigger issue here is that people insist on using wrong charsets
and expect the language to have some magical external defaults that work
for exactly their use case, instead of doing what they should have been
doing all along - putting the charset right there in the argument.
We need to get people off this mindset fast, since it is not a good one.
Having tons of hidden defaults that modify the behavior of functions called
with the same arguments in hundreds of different ways is a coding and
maintenance nightmare. Now if I write `htmlspecialchars()` I can never be
sure if it works right and uses UTF-8 - what if somebody messed with the
INI setting because of some other broken library that required that to work?
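
As an illustration of "putting the charset right there in the argument", a minimal sketch might look like the following. The helper name escape_html() is hypothetical, and the choice of ENT_QUOTES and UTF-8 is an assumption about the application, not something the thread prescribes.

```php
<?php
// Hypothetical helper: flags and charset are spelled out, so the result
// does not depend on php.ini or on the PHP version's built-in defaults.
function escape_html(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

echo escape_html('<a href="x">Tom & Jerry</a>'), "\n";
```

Routing every call through one helper like this also turns a later charset change into a one-line edit rather than a grep-and-replace.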

Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227

13 years ago by Rasmus Lerdorf — view source

Hi!

And yes, it may very well be dangerous to use the wrong charset and now
that we have better support for GB2312 and other asian charsets in the
entities functions in 5.4 it is even more prudent to choose the right
one so we should provide some way to help people get it right short of
changing every call.

I'm not sure "changing every call" is such a big problem - it's one grep
and one replace, can be done in one line of sed/awk/perl/php probably.
But a bigger issue here is that people insist on using wrong charsets
and expect the language to have some magical external defaults that work
for exactly their use case, instead of doing what they should have been
doing all along - putting the charset right there in the argument.
We need to get people off this mindset fast, since it is not a good one.
Having tons of hidden defaults that modify the behavior of functions called
with the same arguments in hundreds of different ways is a coding and
maintenance nightmare. Now if I write htmlspecialchars() I can never be
sure if it works right and uses UTF-8 - what if somebody messed with the
INI setting because of some other broken library that required that to
work?

But you can't necessarily hardcode the encoding if you are writing
portable code. That's a bit like hardcoding a timezone. In order to
write portable code you need to give people the ability to localize it.

-Rasmus

13 years ago by Stas Malyshev — view source

Hi!

But you can't necessarily hardcode the encoding if you are writing
portable code. That's a bit like hardcoding a timezone. In order to
write portable code you need to give people the ability to localize it.

No, it's not like timezone at all. I have to support all timezones in a
global app, but I don't have to internally support every encoding on
Earth - having everything internally in UTF-8 works quite well, and a
lot of applications do exactly that - they have everything internally in
UTF-8 and only may convert when importing or exporting the data. I don't
see anything in using UTF-8 throughout the app/library that makes it
non-portable. However, if we allow changing the defaults in
`htmlspecialchars()` etc., that essentially makes having defaults useless,
as I'd have to explicitly specify UTF-8 each time - otherwise it's a
gamble what encoding I'd actually get.
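
The "UTF-8 internally, convert only at the boundary" approach described above could be sketched roughly as follows. This assumes the legacy input is ISO-8859-1 and that the iconv extension is loaded; import_latin1() is a hypothetical name.

```php
<?php
// Assumption: legacy input arrives as ISO-8859-1 and ext/iconv is available.
function import_latin1(string $bytes): string
{
    $utf8 = iconv('ISO-8859-1', 'UTF-8', $bytes);
    if ($utf8 === false) {
        throw new RuntimeException('Could not convert input to UTF-8');
    }
    return $utf8;
}

$legacy   = "caf\xE9 <b>";            // "café <b>" encoded as ISO-8859-1
$internal = import_latin1($legacy);   // UTF-8 from here on

// Escaping can now always name UTF-8 explicitly.
echo htmlspecialchars($internal, ENT_QUOTES, 'UTF-8'), "\n";
```

With the conversion done once on import, every later htmlspecialchars() call can pass the same encoding argument.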

Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227

13 years ago by Rasmus Lerdorf — view source

Hi!

But you can't necessarily hardcode the encoding if you are writing
portable code. That's a bit like hardcoding a timezone. In order to
write portable code you need to give people the ability to localize it.

No, it's not like timezone at all. I have to support all timezones in a
global app, but I don't have to internally support every encoding on
Earth - having everything internally in UTF-8 works quite well, and a
lot of applications do exactly that - they have everything internally in
UTF-8 and only may convert when importing or exporting the data. I don't
see anything in using UTF-8 throughout the app/library that makes it
non-portable. However, if we allow changing the defaults in
htmlspecialchars() etc., that essentially makes having defaults useless,
as I'd have to explicitly specify UTF-8 each time - otherwise it's a
gamble what encoding I'd actually get.

If everything was UTF-8 we wouldn't have any of these issues.
Unfortunately that isn't the case. The question is what to do with apps
that need to deal with non UTF-8 data. Are we going to provide any help
to them beyond just telling them to convert everything to UTF-8?

We took steps in 5.4 to improve htmlspecialchars to understand more
encodings and we have the concept of script_encoding and
internal_encoding that is used both in the engine and in mbstring.
Currently internal_encoding isn't checked by htmlspecialchars. If you
pass it '' it checks script_encoding and default_charset which is a bit
odd since neither directly relate to the encoding of the internal data
you are feeding to it. So maybe a way to tackle this is to use the
mbstring internal encoding when it is set as the htmlspecialchars
default when it is called without an encoding arg.
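
To make the suggestion concrete, here is a userland approximation of it. This is only a sketch of the idea, not how PHP itself behaves: the function name is hypothetical, and falling back to UTF-8 when mbstring is not loaded is an assumption of this example.

```php
<?php
// Userland approximation of the proposal, not PHP's built-in behaviour.
function escape_with_internal_encoding(string $text): string
{
    // Use mbstring's internal encoding when the extension is loaded;
    // otherwise fall back to UTF-8 (an assumption of this sketch).
    $encoding = function_exists('mb_internal_encoding')
        ? mb_internal_encoding()
        : 'UTF-8';

    return htmlspecialchars($text, ENT_QUOTES, $encoding);
}

echo escape_with_internal_encoding('1 < 2'), "\n";
```

Note that htmlspecialchars() only recognizes a fixed list of charsets, so an mbstring encoding outside that list would need extra handling.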

-Rasmus

13 years ago by Pierre Joye — view source

hi Rasmus,

If everything was UTF-8 we wouldn't have any of these issues.
Unfortunately that isn't the case. The question is what to do with apps
that need to deal with non UTF-8 data. Are we going to provide any help
to them beyond just telling them to convert everything to UTF-8?

That's not really an acceptable solution, obviously.

We took steps in 5.4 to improve htmlspecialchars to understand more
encodings and we have the concept of script_encoding and
internal_encoding that is used both in the engine and in mbstring.

Currently internal_encoding isn't checked by htmlspecialchars. If you
pass it '' it checks script_encoding and default_charset which is a bit
odd since neither directly relate to the encoding of the internal data
you are feeding to it. So maybe a way to tackle this is to use the
mbstring internal encoding when it is set as the htmlspecialchars
default when it is called without an encoding arg.

That's why I would prefer to use an existing setting and clearly
document it instead of creating a new ini setting with a totally
different impact than the existing ones. Not sure which one would fit
best tho'.

Reading these last two paragraphs gave me a headache and I did not
know anymore which encoding we were talking about ;-)

Cheers,

Pierre

@pierrejoye | http://blog.thepimp.net | http://www.libgd.org

13 years ago by Daniel Convissor — view source

Hi Folks:

This topic appears to have been quietly tabled. I didn't notice a
decision here or a commit.

So maybe a way to tackle this is to use the
mbstring internal encoding when it is set as the htmlspecialchars
default when it is called without an encoding arg.

This seems like the clearest indicator of the programmer's intent.

Thanks,

--Dan

--
T H E A N A L Y S I S A N D S O L U T I O N S C O M P A N Y
data intensive web and database programming
http://www.AnalysisAndSolutions.com/
4015 7th Ave #4, Brooklyn NY 11232 v: 718-854-0335 f: 718-854-0409

13 years ago by keisial@gmail.com — view source

Hi!

But you can't necessarily hardcode the encoding if you are writing
portable code. That's a bit like hardcoding a timezone. In order to
write portable code you need to give people the ability to localize it.

No, it's not like timezone at all. I have to support all timezones in
a global app, but I don't have to internally support every encoding on
Earth - having everything internally in UTF-8 works quite well, and a
lot of applications do exactly that - they have everything internally
in UTF-8 and only may convert when importing or exporting the data. I
don't see anything in using UTF-8 throughout the app/library that
makes it non-portable. However, if we allow changing the defaults in
htmlspecialchars() etc., that essentially makes having defaults useless,
as I'd have to explicitly specify UTF-8 each time - otherwise it's a
gamble what encoding I'd actually get.

If you are a framework developer, and really want to shield against a
bad php.ini setting, you could ini_set() to your preferred charset at the
beginning of the request.

13 years ago by Stas Malyshev — view source

Hi!

If you are a framework developer, and really want to shield against a
bad php.ini setting, you could ini_set() to your preferred charset at the
beginning of the request.

That's assuming "the request" is completely processed by your framework
and you never call any outside code and any outside code never calls you -
otherwise your messing with the INI setting may very well break that code,
or that code's messing with INI settings may very well break yours.

--
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227

13 years ago by keisial@gmail.com — view source

Hi!

If you are a framework developer, and really want to shield against a
bad php.ini setting, you could ini_set() to your preferred charset at the
beginning of the request.

That's assuming "the request" is completely processed by your framework
and you never call any outside code and any outside code never calls
you - otherwise your messing with the INI setting may very well break that
code, or that code's messing with INI settings may very well break yours.

Sure. That's a setting to be kept the same for the whole request unless you
like trouble.
If you need to call a library function which uses a different html
charset convention you could do so through a wrapper, which sets and
restores the setting.
Still, that API is likely wrong: a library function written by someone
completely unrelated to the main application shouldn't be echoing
anything through the output. And if it's not generating the html, the
htmlspecialchars is better done from the return at the calling
application (probably after converting the internal charset).
Such interfaces may be well served by switching the setting many times.
I was only advocating the usage of ini_set() once in the request,
for the case of a server with two applications having different needs
(equivalent to configuring it on .user.ini or .htaccess).
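
The "wrapper which sets and restores the setting" mentioned above might look something like this. It is a sketch under assumptions: with_default_charset() is a hypothetical name, and default_charset is used because it is the directive this thread is about.

```php
<?php
// Hypothetical helper: run $callback with default_charset temporarily changed.
function with_default_charset(string $charset, callable $callback)
{
    $previous = ini_set('default_charset', $charset);
    try {
        return $callback();
    } finally {
        // ini_set() returns false on failure; only restore a real old value.
        if ($previous !== false) {
            ini_set('default_charset', $previous);
        }
    }
}

$before = ini_get('default_charset');
$inside = with_default_charset('ISO-8859-1', function () {
    // Stand-in for a library call that relies on the ini setting.
    return ini_get('default_charset');
});

echo $inside, "\n";
```

The finally block restores the old value even if the callback throws, but anything the callback itself calls still sees the changed global.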

13 years ago by Stas Malyshev — view source

Hi!

Still, that API is likely wrong: a library function written by someone
completely unrelated to the main application shouldn't be echoing
anything through the output. And if it's not generating the html, the
htmlspecialchars is better done from the return at the calling
application (probably after converting the internal charset).

Again, you are making a huge amount of assumptions about how ALL the
applications must work, which means you are wrong in 99.(9)% of cases,
because there's infinitely many applications which don't work exactly
like yours does, and we have no idea how they work.

The main point is that having global state (and yet worse, changeable
global state) significantly influence how basic functions are working is
dangerous. It's like keeping everything in globals and instead of
passing parameters between functions just change some globals and expect
functions to pick it up.

Such interfaces may be well served by switching the setting many times.

That's exactly what I am trying to avoid, and you are just illustrating
why this proposal is dangerous - because that's exactly what is going to
happen in the code, instead of passing proper arguments to
htmlspecialchars people will start changing INI settings left and right,
and then nobody would know what a `htmlspecialchars()` call actually does
without tracking all the INI changes along the way.

Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227

13 years ago by keisial@gmail.com — view source

Hi!

Still, that API is likely wrong: a library function written by someone
completely unrelated to the main application shouldn't be echoing
anything through the output. And if it's not generating the html, the
htmlspecialchars is better done from the return at the calling
application (probably after converting the internal charset).

Again, you are making a huge amount of assumptions about how ALL the
applications must work, which means you are wrong in 99.(9)% of cases,
because there's infinitely many applications which don't work exactly
like yours does, and we have no idea how they work.

No. I'm saying how I consider they should work, saying that an API doing
otherwise is likely* wrong (aka. has a bad design), very much as I'd
consider insane a company policy stating "PHP function arguments shall
be named $a, $b, $c...".
That's obviously my opinion, but I think most applications will conform
to that, just as most apps will use more descriptive argument names than
"$c"**.

* There might be some very very special application where it turns out
to be an appropriate design, but that would be the exception.
** Even though there are 26!/(26-n)! ways to name the arguments
of an n-ary function that badly.

The main point is that having global state (and yet worse, changeable
global state) significantly influence how basic functions are working
is dangerous. It's like keeping everything in globals and instead of
passing parameters between functions just change some globals and
expect functions to pick it up.

I agree with you, in the general case. Yet, I consider the html charset
to be a global state. And passing the global variables as parameters on
each function call would be nearly as bad as passing parameters as globals.
I just took the opposite position on parse_str(), while being fully
aware of that.

Such interfaces may be well served by switching the setting many times.

That's exactly what I am trying to avoid, and you are just
illustrating why this proposal is dangerous - because that's exactly
what is going to happen in the code, instead of passing proper
arguments to htmlspecialchars people will start changing INI settings
left and right, and then nobody would know what htmlspecialchars()
call actually does without tracking all the INI changes along the way.

That's assuming people would need to use different output charsets,
which I don't consider to be the case. How many people are now using the
third htmlspecialchars() parameter?
What makes you think that they would need to change the default global
several times per request?

13 years ago by Richard Lynch — view source

But you can't necessarily hardcode the encoding if you are writing
portable code. That's a bit like hardcoding a timezone. In order to
write portable code you need to give people the ability to localize
it.

If you wanted it portable, wouldn't you need to have a variable there,
so it can survive the ISO-8859-1 to UTF-8 change, and to allow people
to change it despite whatever non-standard setting might happen to be
in somebody else's php.ini?

I mean, sure, it's nice if it "just works" for the folks who want to
install and have it localized for their own charset hard-coded in
php.ini, but if it's a multi-national website, you have to pass in
a variable there, which seems the more portable option to this naive
reader.

Having it default to whatever happens to be in php.ini only solves the
use case of people who only want to serve up their content in their
own charset.

I'd have to agree with Stas that everybody should start passing in a
variable there, that can be set somewhere in a config, or, perhaps,
would DEFAULT to, errrr...

You can't default to a function call.

ANOTHER magic constant like INI_CHARSET ???

That's probably a bad idea...
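
For what it's worth, a parameter default cannot be a function call, but it can be a userland constant, so "a variable set somewhere in a config" can be sketched without a new magic constant in the language. APP_CHARSET and e() below are hypothetical names chosen for this example.

```php
<?php
// Hypothetical application-level setting, e.g. defined once in a config file.
const APP_CHARSET = 'UTF-8';

// A default value cannot be a function call, but it can be a constant.
function e(string $text, string $charset = APP_CHARSET): string
{
    return htmlspecialchars($text, ENT_QUOTES, $charset);
}

echo e('a < b'), "\n";                      // uses APP_CHARSET
echo e("caf\xE9 < b", 'ISO-8859-1'), "\n";  // per-call override for legacy data
```

The application decides the charset in one place, and php.ini is not consulted at all.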

--
brain cancer update:
http://richardlynch.blogspot.com/search/label/brain%20tumor
Donate:
https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=FS9NLTNEEKWBE

13 years ago by Tomas Kuliavas — view source

2012.03.13 16:38 Richard Lynch wrote:

I'd have to agree with Stas that everybody should start passing in a
variable there, that can be set somewhere in a config, or, perhaps,
would DEFAULT to, errrr...

You do realize that the suggestions in this thread, and the original bug
reporter, failed to choose the correct values for migrating the original
function call to PHP 5.4 compatible syntax?

htmlspecialchars without arguments does not default to ENT_QUOTES or NULL.

Failure to choose the proper value for the second argument will lead to a
different exploit or to data corruption.
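
To make the point about the second argument concrete: in the PHP 5.x releases discussed here, calling htmlspecialchars() with one argument used the ENT_COMPAT flag, so a migration that adds an explicit charset has to choose the flags deliberately. The sketch below only contrasts the two flag values; the sample string is made up.

```php
<?php
$input = "It's <b>\"quoted\"</b>";

// ENT_COMPAT escapes double quotes only; single quotes pass through.
$compat = htmlspecialchars($input, ENT_COMPAT, 'ISO-8859-1');

// ENT_QUOTES escapes both kinds of quote, so the output differs.
$quotes = htmlspecialchars($input, ENT_QUOTES, 'ISO-8859-1');

echo $compat, "\n", $quotes, "\n";
```

Which of the two is right depends on where the escaped value ends up, for example inside a single-quoted HTML attribute.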

You can't default to a function call.

Changing the default in the function was a bad idea.

Ignoring bug reports about f....ed up documentation and closing them with
bogus explanations might not be a bad idea, but it really helps in
alienating your developer base.

--
Tomas

13 years ago by Richard Lynch — view source

What we really need is what we added in PHP 6. A runtime encoding ini
setting that is distinct from the output charset which we can use
here.

The usual argument against another php.ini setting, other than "too
many already" is the difficulty it presents to write portable code
libraries.

I'm not smart enough to predict how such a setting (regardless of its
name) would help or hinder a library of code that doesn't want another
conditional in a zillion places.

But you folks are that smart. :-)

And I haven't seen any discussion regarding this sub-issue.

So, how would that help / hinder authors of generic library code to be
distributed in the wild?

Forgive me if the answer is so blindingly obvious I should already
know it... :-)

--
brain cancer update:
http://richardlynch.blogspot.com/search/label/brain%20tumor
Donate:
https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=FS9NLTNEEKWBE

we just moved the switch from compilation time to runtime, so the code is there, if you want to enable it, you don't have to recompile php but only have to change an ini setting, but it isn't turned on by default. AFAIK

Cheers,