[RFC] Multibyte char handling

11 years ago by Yasuo Ohgaki

Hi all,

addslashes() can be vulnerable to character-encoding-based attacks.
We need to decide which countermeasure to adopt.
This is the RFC for this issue.

https://wiki.php.net/multibyte_char_handling

Please comment.
Thank you.

I copied that line from the "Array Of" RFC and the URL was wrong.
The correct URL is

https://wiki.php.net/rfc/multibyte_char_handling

Sorry for the confusion.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

11 years ago by mails@thomasbley.de

Yasuo Ohgaki wrote on 16.01.2014 01:12:

Hi all,

addslashes() could be vulnerable via char encoding based attacks.
It is needed to decide what counter measure we adopt.
This is RFC for this issue.

https://wiki.php.net/multibyte_char_handling

Please comment.
Thank you.

I've copied line from "Array Of" RFC and URL was wrong.
Correct URL is

https://wiki.php.net/rfc/multibyte_char_handling

Sorry for the confusion.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

Hello Yasuo,

What about mb_trim()?
E.g. in UTF-8: C2 A0 (no-break space), E2 80 82 (en space), E2 80 83 (em space), E2 80 AF (narrow no-break space).

I currently have lots of untrimmed data in a database, since PHP's trim() and MySQL's TRIM() can't handle these characters.
There are workarounds like trim($str, chr(0xC2).chr(0xA0)); but they are not really nice to code.
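
To illustrate the workaround and why it is fragile, here is a minimal sketch. The helper name utf8_trim_sketch() is made up for this example (it is not an existing or proposed PHP function), and it assumes the input is valid UTF-8. It relies only on preg_replace() with the /u modifier, so it does not need mbstring.

```php
<?php
// Byte-based workaround from the message above: the second argument of trim()
// is a list of single bytes, so this strips any leading/trailing 0xC2 or 0xA0.
$nbsp = chr(0xC2) . chr(0xA0);
$ok = trim($nbsp . 'abc' . $nbsp, $nbsp);

// Pitfall: "à" is C3 A0 in UTF-8, so its last byte is stripped as well,
// leaving an invalid lone 0xC3 byte.
$broken = trim("\xC3\xA0", $nbsp);

// Hypothetical helper (the name is an assumption): strips ASCII whitespace
// plus every Unicode separator (\p{Z}), which covers U+00A0, U+2002, U+2003
// and U+202F.
function utf8_trim_sketch(string $str): string
{
    $result = preg_replace('/\A[\s\p{Z}]+|[\s\p{Z}]+\z/u', '', $str);
    // preg_replace() returns null on invalid UTF-8; fall back to the input.
    return $result ?? $str;
}

echo bin2hex($ok), "\n";      // 616263
echo bin2hex($broken), "\n";  // c3
echo utf8_trim_sketch("\xC2\xA0 abc \xE2\x80\xAF"), "\n"; // abc
```

Working on whole code points rather than on a list of bytes is what keeps the "à" case intact.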

Regards
Thomas

11 years ago by Yasuo Ohgaki

what about mb_trim?
e.g. UTF-8: C2 A0, e2 80 82, e2 80 83, e2 80 af

I currently have lots of untrimmed data in a database since php-trim() and
mysql-trim() can't handle these characters.
There are workarounds like trim($str, chr(0xC2).chr(0xA0)); but they are
not really nice to code.

We need a few more basic functions like trim().
This is a different issue, but I may work on it.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

11 years ago by Stas Malyshev

Hi!

We need few more basic functions like trim().
This is different issue, but I may work on it.

It feels like we will end up creating clones of pretty much every string
function there is. Not sure it is the best approach...

Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227

11 years ago by Yasuo Ohgaki

Hi Stas,

On Thu, Jan 16, 2014 at 4:22 PM, Stas Malyshev smalyshev@sugarcrm.com wrote:

We need few more basic functions like trim().
This is different issue, but I may work on it.

It feels like we will end up with creating clones of pretty every string
function there is. Not sure it is the best approach...

We could learn from Python 2.x and 3.x.
Some Python users who are serious about multilingual support
complain about 3.x. They insist on the 2.x behavior and even want to
discontinue 3.x.

I don't think mbstring is the optimal multibyte string library either.
We may keep the current structure until we have a decent multibyte
string library that could live long enough.

Once we have it, we could use a compatibility switch for the
basic string functions to change their multibyte awareness.
If we are going to choose this way, it may be better to have
byte_len() and other byte_*() functions now for an easier transition.

I'm not sure which is best, though: having mb_*(), having byte_*(), or
specifying a binary encoding for the standard string functions.
That's outside the scope of this RFC.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

11 years ago by Derick Rethans

Hi!

We need few more basic functions like trim().
This is different issue, but I may work on it.

It feels like we will end up with creating clones of pretty every string
function there is. Not sure it is the best approach...

No, the best approach is to have "proper" unicode support in PHP, a la
"PHP 6". Then at least the duplication happens for us devs and not
users.

cheers,
Derick

11 years ago by Yasuo Ohgaki

Hi Derick,

We need few more basic functions like trim().
This is different issue, but I may work on it.

It feels like we will end up with creating clones of pretty every string
function there is. Not sure it is the best approach...

No, the best approach is to have "proper" unicode support in PHP, a la
"PHP 6". Then at least the duplication happens for us devs and not
users.

I agree that we need to discuss the encoding issue as a whole.
This RFC is to eliminate a known vulnerability from the current implementation.
Let's discuss this issue later.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

11 years ago by Nikita Popov

Hi all,

addslashes() could be vulnerable via char encoding based attacks.
It is needed to decide what counter measure we adopt.
This is RFC for this issue.

https://wiki.php.net/multibyte_char_handling

Please comment.
Thank you.

Please do not add encoding parameters to our existing string functions -
we have an mb extension and mb functionality should go there. Don't mix the
things, it will only lead to a lot of confusion. Right now it's obvious
which functions handle encoding how, no need to break that.

Nikita

11 years ago by Yasuo Ohgaki

Hi Nikita,

Hi all,

addslashes() could be vulnerable via char encoding based attacks.
It is needed to decide what counter measure we adopt.
This is RFC for this issue.

https://wiki.php.net/multibyte_char_handling

Please comment.
Thank you.

Please do not add encoding parameters to our existing string functions -
we have an mb extension and mb functionality should go there. Don't mix the
things, it will only lead to a lot of confusion. Right now it's obvious
which functions handle encoding how, no need to break that.

This discussion is going in circles.

At first, I proposed a locale-based solution using php_mblen().
This approach does not require an additional encoding parameter,
since the encoding is specified by the locale.

However, some people (on the security ML) don't like that solution
because it is locale based. It may have unwanted side effects.
Locale is unreliable, and most users just don't care about it.

Therefore, I proposed this approach, which introduces an encoding
parameter just like htmlspecialchars()/htmlentities().

An encoding parameter (or some other way to specify the encoding) is
mandatory for security-related string functions. We should provide some
way to specify the encoding.

Would you prefer the locale-based approach for now?

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

11 years ago by Nikita Popov

Hi Nikita,

On Thu, Jan 16, 2014 at 9:18 PM, Nikita Popov nikita.ppv@gmail.com wrote:

On Thu, Jan 16, 2014 at 12:50 AM, Yasuo Ohgaki yohgaki@ohgaki.net wrote:

Hi all,

addslashes() could be vulnerable via char encoding based attacks.
It is needed to decide what counter measure we adopt.
This is RFC for this issue.

https://wiki.php.net/multibyte_char_handling

Please comment.
Thank you.

Please do not add encoding parameters to our existing string functions -
we have an mb extension and mb functionality should go there. Don't mix
the things, it will only lead to a lot of confusion. Right now it's obvious
which functions handle encoding how, no need to break that.

This discussion circulate discussion.

At first, I proposed locale based solution using php_mblen().
This approach does not require additional encoding parameter
since encoding is specified by locale.

However, some people don't like the solution (in security ML)
because it is locale based solution. It may have unwanted side
effects. Locale is unreliable and most user just don't care about it.

Therefore, I proposed this approach that introduce encoding
parameter just like htmlspecialchars()/htmlentities().

Encoding parameter (or some way to specify encoding) for security
related string function is mandatory. We should provide some way
to specify encoding.

Do you like locale based approach for now?

No, I don't want a locale-based approach. I want the string functions to
stay as is. Multibyte variants of the functions can be added to the
multibyte extension.

Nikita

11 years ago by Yasuo Ohgaki

Hi Nikita,

No, I don't want a locale-based approach. I want the string functions to
stay as is. Multibyte variants of the functions can be added to the
multibyte extension.

Creating mb_*() functions would not solve the security issues of
multibyte char handling, since the multibyte-aware functions are
an optional feature.

However, it may work if PHP compiles mbstring by default and
discourages use of addslashes()/var_export()/stripslashes()
in favor of the mb_*() variants.

This could be a voting option.
Did I understand your opinion correctly?

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

11 years ago by Pierre Joye

Hi Nikita,

No, I don't want a locale-based approach. I want the string functions to
stay as is. Multibyte variants of the functions can be added to the
multibyte extension.

Creating mb_*() function would not solve security issues of
multibyte char handling since multibyte aware functions are
optional feature.

We never supported nor claimed that these functions are multibyte
safe. However, I fully understand that we should solve this
problem, in one way or another.

However, it may work if PHP compiles mbstring by default and
discourage use of addslashes()/var_export()/stripslashes()
in favor of mb_*() variants.

I do not think we should discourage the use of these functions, but
rather clearly document that users should rely on the mb_* APIs whenever
multibyte support is required.

I join the others in opposing any optional arguments in the existing
APIs, for a couple of reasons:

- it does not solve anything, as people still have to update their
  code, and they won't unless maybe they read the doc/changelog
- it is really not a clean solution
- we already have many duplicate functions in mb, it has worked well
  so far and we can add the ones discussed here

The last question was about relying on locale. This is absolutely not
a solution. Locale has been proven to be totally unreliable, buggy and
unsafe, let alone the total lack of real POSIX locale support on
Windows.

For anything related to locale, formats or encoding, we should rely on
intl (ICU) and not on the system's locale. This is the only way to be
portable, safe and up to date.

Cheers,
Pierre

11 years ago by Yasuo Ohgaki

Hi Pierre,

Hi Nikita,

On Fri, Jan 17, 2014 at 7:38 AM, Nikita Popov nikita.ppv@gmail.com
wrote:

No, I don't want a locale-based approach. I want the string functions to
stay as is. Multibyte variants of the functions can be added to the
multibyte extension.

Creating mb_*() function would not solve security issues of
multibyte char handling since multibyte aware functions are
optional feature.

We never supported nor claimed that these functions are multi bytes
safe. However I actually fully understand that we should solve this
problem, in one way or another.

However, it may work if PHP compiles mbstring by default and
discourage use of addslashes()/var_export()/stripslashes()
in favor of mb_*() variants.

I do not think we should discourage the use of these functions but
clearly document to rely on mb_* APIs as long as multi bytes support
is required.

I join other about not making any optional arguments in the existing
APIs, for a couple of reasons:

it does not solve anything as people still have to update their
code, and they won't unless maybe if they read the doc/changelog

It is really not a clean solution

we already have many duplicate functions in mb, it has worked well
so far and we can add the ones discussed here

I'll leave existing ext/standard functions alone.

The last question was about relying on locale. This is absolutely not

a solution. Locale has been proven to be totally unreliable, buggy and
unsafe. Let alone the total lack of real posix locale support on
Windows.

mb_escape_shell_arg()/mb_escape_shell_cmd() need a locale-based
solution, since there is no good way to detect the terminal encoding. I'll
let the mb versions override this behavior when an encoding is specified
explicitly.

On UNIXes, UTF-8 is a popular terminal encoding, but there will be
systems using other encodings such as EUC, or even SJIS or BIG5.
Windows uses a different terminal encoding depending on the locale,
so it's much more complex.

This is the reason why I would use locale. However, this implementation
is debatable.

We could say "Users should explicitly specify the terminal encoding
themselves". In fact, I prefer this, even though I am about to implement
mb_escape_shell_*() using locale for automatic encoding detection.

It may be better to raise at least an E_NOTICE if the encoding parameter
is omitted for mb_escape_shell_*().

For anything related to locale, formats or encoding, we should rely on

intl (ICU) and not on systems's locale. This is the only way to be
portable, safe and updated.

I agree.
I would also like to propose

https://wiki.php.net/rfc/altmbstring - ICU version of mbstring

for a future release. Most of the work has been done by Moriyoshi. We may
try to switch to it now, but I suppose there is not enough time for 5.6.

It's supposed to work mostly the same as the current mbstring. It may be
better to make the current mbstring optional at compile time in favor of
the ICU implementation.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

11 years ago by Yasuo Ohgaki

Hi all,

I also would like to propose

https://wiki.php.net/rfc/altmbstring - ICU version of mbstring

for future release. Most work has done by Moriyoshi. We may try to
switch to it now, but I suppose there is not enough time for 5.6.

It's supposed to work the same as current mbstring mostly. It may be
better mbstring compile as optional in favor of ICU implementation.

I was going to propose this RFC for a future PHP release. However, if we
are going to make this a standard, it may be better to have the
alternative mbstring implementation now, i.e. for 5.6.

It could be implemented so that users choose whether to compile mbstring or
mbstring-ng. mbstring-ng can be marked as EXPERIMENTAL until its implementation
is finished. Distributors may use the conflict option of their packages to
provide both.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

11 years ago by Pierre Joye

Hi Pierre,

On Thu, Jan 16, 2014 at 11:47 PM, Yasuo Ohgaki yohgaki@ohgaki.net
wrote:

Hi Nikita,

On Fri, Jan 17, 2014 at 7:38 AM, Nikita Popov nikita.ppv@gmail.com
wrote:

No, I don't want a locale-based approach. I want the string functions
to
stay as is. Multibyte variants of the functions can be added to the
multibyte extension.

Creating mb_*() function would not solve security issues of
multibyte char handling since multibyte aware functions are
optional feature.

We never supported nor claimed that these functions are multi bytes
safe. However I actually fully understand that we should solve this
problem, in one way or another.

However, it may work if PHP compiles mbstring by default and
discourage use of addslashes()/var_export()/stripslashes()
in favor of mb_*() variants.

I do not think we should discourage the use of these functions but
clearly document to rely on mb_* APIs as long as multi bytes support
is required.

I join other about not making any optional arguments in the existing
APIs, for a couple of reasons:

it does not solve anything as people still have to update their
code, and they won't unless maybe if they read the doc/changelog

It is really not a clean solution

we already have many duplicate functions in mb, it has worked well
so far and we can add the ones discussed here

I'll leave existing ext/standard functions alone.

:)

The last question was about relying on locale. This is absolutely not
a solution. Locale has been proven to be totally unreliable, buggy and
unsafe. Let alone the total lack of real posix locale support on
Windows.

mb_escape_shell_arg()/mb_escape_shell_cmd() need locale based
solution, since there aren't good way to detect terminal encoding. I'll
make mb version explicitly overrides this behavior by explicitly
specifying
encoding.

Sounds good

On UNIXes, UTF-8 encoding is popular terminal encoding, but there
would be systems using other encoding such as EUC, or even SJIS, BIG5.

Right, and as far as I remember UTF-8 does not suffer from this problem.

Windows uses different encoding for terminal encoding according to
locale,
so it's much more complex.

Let me provide a function to detect it, but we need something to normalize
the names. Do we have such a thing in mbstring?

This is the reason why I would use locale. However, this implementation
is debatable.

Yes :)

We could say "Users should explicitly specify terminal encoding
by themselves". In fact, I prefer this even if I am about to implement
mb_escape_shell_*() using locale for automatic encoding detection.

It may be better to raise E_NOTICE at least if encoding parameter is
omitted for mb_escape_shell_*().

Notice sounds good too.

For anything related to locale, formats or encoding, we should rely on
intl (ICU) and not on systems's locale. This is the only way to be
portable, safe and updated.

I agree.
I also would like to propose

https://wiki.php.net/rfc/altmbstring - ICU version of mbstring

Oh, very nice.

for future release. Most work has done by Moriyoshi. We may try to
switch to it now, but I suppose there is not enough time for 5.6.

What's the status? We still have some time :)

Cheers,
Pierre

11 years ago by Yasuo Ohgaki

Hi Pierre,

On UNIXes, UTF-8 encoding is popular terminal encoding, but there
would be systems using other encoding such as EUC, or even SJIS, BIG5.

Right, and as far as I remember UTF-8 does not suffer from this problem.

UTF-8 does not have this issue if the terminal handles the encoding correctly.
I think almost all terminals handle UTF-8 properly; otherwise it would be
considered a security hole :)

Windows uses different encoding for terminal encoding according to
locale,
so it's much more complex.

Let me provide a function to detect it, but we need something to normalize
the names. Do we have such thing in mbstring?

Yes. mbstring has an ID for each supported encoding, and there is a
normalize function to set the encoding ID.

This is the reason why I would use locale. However, this implementation
is debatable.

Yes :)

We need to decide what to do :)

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

11 years ago by Mike

This discussion circulate discussion.

I'm also in favor of putting those things in ext/mbstring.

At first, I proposed locale based solution using php_mblen().
This approach does not require additional encoding parameter
since encoding is specified by locale.

Meh, but would be okay for me.

However, some people don't like the solution (in security ML)
because it is locale based solution. It may have unwanted side
effects. Locale is unreliable and most user just don't care about it.

Therefore, I proposed this approach that introduce encoding
parameter just like htmlspecialchars()/htmlentities().

Encoding parameter (or some way to specify encoding) for security
related string function is mandatory. We should provide some way
to specify encoding.

Many of us do not have access to that mailing list, so could you shed
some light on the actual issue?

--
Regards,
Mike

11 years ago by Yasuo Ohgaki

Hi Mike,

This discussion circulate discussion.

I'm also in favor of putting those things in ext/mbstring.

I'll make this a vote option.

At first, I proposed locale based solution using php_mblen().
This approach does not require additional encoding parameter
since encoding is specified by locale.

Meh, but would be okay for me.

It's a feasible solution for older versions.
I would like to remove the locale-based code in future releases, though.

Functions that use php_mblen() could be modified to use mbstring
when PHP is built with mbstring. These functions may use internal_encoding.

Use of internal_encoding requires user code modification in some cases.
For instance, the Japanese Windows command line uses Shift_JIS as its
terminal encoding, while many users use UTF-8 for their scripts. Those users
have to add code that changes internal_encoding around calls to, e.g.,
escapeshellarg()/escapeshellcmd().
They could use a simple wrapper for escapeshellarg()/escapeshellcmd(), though.

Although users have to modify their code a little, fgetcsv() and the like
would be more usable, because internal_encoding is more reliable than locale.

It may be better to add mb versions of these functions and deprecate the
originals, as with addslashes(), if we are not going to modify these functions.

However, some people don't like the solution (in security ML)
because it is locale based solution. It may have unwanted side
effects. Locale is unreliable and most user just don't care about it.

Therefore, I proposed this approach that introduce encoding
parameter just like htmlspecialchars()/htmlentities().

Encoding parameter (or some way to specify encoding) for security
related string function is mandatory. We should provide some way
to specify encoding.

Many of us do not have access to that mailing list, so could you shed
some light on the actual issue?

There are 2 classes of security issue in php_addslashes().

The first is PHP script execution.
Suppose this is a script that saves an app config script.

<?php
$v = addslashes('表') . addslashes("'; exec('rm -rf /'); die();");
file_put_contents('myconfig.php', '<?php $config=\'' . $v);
?>

Then it is read as a PHP script.

<?php
include 'myconfig.php';
// other code follows
?>

If '表' is SJIS, the char code is 0x955c (0x5c = \). Since addslashes() is
not multibyte aware, it escapes the char as 0x95, 0x5c, 0x5c. This makes it
possible to break out of the string quoting and write attack code.

The contents of myconfig.php become

<?php $config= '表'; exec('rm -rf /'); die()

with SJIS, BIG5 and other similar encodings.
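
The byte-level effect described above can be reproduced without any multibyte extension. This is only a sketch: the Shift_JIS bytes of '表' are written out as "\x95\x5c" so that the example does not depend on the encoding of the source file.

```php
<?php
// '表' in Shift_JIS is the two bytes 0x95 0x5c; the trail byte 0x5c is also
// the ASCII code of the backslash.
$sjis = "\x95\x5c";

// addslashes() works byte by byte, so it "escapes" the trail byte.
$escaped = addslashes($sjis);
echo bin2hex($escaped), "\n"; // 955c5c

// A Shift_JIS-aware reader pairs 0x95 with the first 0x5c and is left with
// one stray backslash that was never part of the input.
$stray = substr($escaped, strlen($sjis));
echo bin2hex($stray), "\n"; // 5c
```

That stray backslash is what ends up neutralising the escape in front of the next quote.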

var_export() can be attacked for the same reason and by the same method.
This attack method is well known to attackers in the East Asia region,
but it's not limited to East Asia.

The second is rather obvious. It's a DoS.
Since the Zend engine raises a compile error for invalid encoding, data generated
by addslashes()/var_export() could stop script execution.

For instance,
http://lxr.php.net/xref/PHP_5_5/Zend/zend_language_scanner.l#507

Although it would be rare in real code, php_stripslashes() could be
problematic too.
Since some chars in SJIS-like encodings contain a special byte, it could be
used to remove an escape character and cause problems. If the stripped string
is evaluated as a PHP script, it may cause issues like those of php_addslashes().
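
A minimal sketch of the stripslashes() side, under the same assumption that '表' is given as its Shift_JIS bytes "\x95\x5c": the trail byte is treated as an escape character and removed, so the lead byte is left to pair with whatever byte follows.

```php
<?php
// Shift_JIS '表' (0x95 0x5c) followed by plain ASCII text.
$sjis = "\x95\x5c" . 'abc';

// stripslashes() sees 0x5c as a backslash and drops it.
$stripped = stripslashes($sjis);
echo bin2hex($stripped), "\n"; // 95616263

// The lead byte 0x95 now sits directly in front of 'a' (0x61), which lies in
// the Shift_JIS trail-byte range, so a Shift_JIS-aware reader would consume
// 0x95 0x61 as one character and the text after it shifts by one byte.
```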

Anyway, AFAIK there isn't a feasible solution that could satisfy everyone.
Many users don't have to update their code, but some users may have to
modify their code according to their usage.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

11 years ago by Julien Pauli

[Sorry, I can't answer inline as there are many answers to give to many users]

Like everybody, I'm absolutely against adding an "encoding" parameter
to ext/standard functions or relying on the unreliable system locale.
Like Nikita says, every multibyte function should go to ext/mbstring,
and nowhere else. Please, do not turn PHP into something even more
dirty than it is nowadays :-p

I'm +1 to embed and activate mbstring by default in future PHP releases.
However, this has already been discussed (from what I remember) and I
don't remember why we ended with a "no" as the final word. Could someone
refresh our memory about this?

I'm not in favor of magic things. Magically replacing PHP string functions by
their mb_ implementations is a really bad idea. We should keep the INI
parameter about this alive though (mbstring.func_overload), so that
people that explicitly want to activate such magic can do it if
they want to.

I'm +1 also to start a "serious" (hum) discussion about multibyte/PHP6.
We've been saying this for years: we need native support of Unicode
in PHP :-p The actual problem we are facing once again shows this.

Julien.P

11 years ago by Yasuo Ohgaki

Hi all,

Like everybody, I' absolutely against adding an "encoding" parameter
to ext/standard functions or relying on unreliable system locale.
Like Nikita says, every multibyte function should go to ext/mbstring ,
and nowhere else, please , do not turn PHP into something even more
dirty as it is nowadays :-p

I'm +1 to embed and activate mbstring by default in future PHP releases.
However, this has already been discussed (from what I remember) and I
dont remember why we ended with a "no" end-word, could we be refreshed
about this ?

I'm not in favor of magic things. Magically replacing PHP strings by
mb_ implementation is a really bad idea. We should keep the INI
parameter about this alive though (mbstring.func_overload), so that
people that explicitely want to activate such a magic can do it if
they want to.

I'm +1 also to start a "serious" (hum) discussion about multibyte/PHP6
, we've been saying this for years : we need native support of unicode
in PHP :-p The actual problem we are facing once again shows this.

It seems this is the majority opinion, excluding INI usage.
I updated the RFC to reflect this.

https://wiki.php.net/rfc/multibyte_char_handling

- Compile mbstring by default from 5.6
- Add mb_*() functions for 5.3 and up
- Keep ext/standard functions as they are now

Open Issue

- Use of INI for overriding single-byte string functions with mbstring
  functions.

I would like to consolidate the code location for 5.6 and up, because of the
history of mbstring functions remaining insecure. E.g. when the
parse_str()/mail() security issue was fixed, mb_parse_str()/mb_send_mail()
were not fixed.

However, refactoring isn't mandatory. We could have 2 different code paths
that are mostly the same. We may postpone refactoring until PHP6. Please
don't forget to update the mbstring code whenever anyone updates ext/standard.

If there are opinions/open issues/questions, please reply.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

11 years ago by Yasuo Ohgaki

Hi all,

Keep ext/standard function as it is now

The SPL file object's fgetcsv method uses php_fgetcsv(), which uses
php_mblen().
Should I add an mb_fget_csv method to it using #if HAVE_MBSTRING?

I made the new mb function names in the RFC comply with the coding standard, e.g.
mb_add_slashes().
Any comments on this?
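
For discussion, here is a rough userland sketch of what an encoding-aware escape has to do differently. Everything in it is an assumption for illustration: the function name sjis_add_slashes_sketch() is made up, it handles Shift_JIS only, it does not validate trail bytes, and it is not the implementation proposed in the RFC.

```php
<?php
// Hypothetical sketch: escape like addslashes(), but copy Shift_JIS
// double-byte characters through untouched so a 0x5c trail byte is never
// mistaken for a backslash.
function sjis_add_slashes_sketch(string $s): string
{
    $out = '';
    $len = strlen($s);
    for ($i = 0; $i < $len; $i++) {
        $b = ord($s[$i]);
        // Shift_JIS lead bytes are 0x81-0x9F and 0xE0-0xFC.
        $isLead = ($b >= 0x81 && $b <= 0x9F) || ($b >= 0xE0 && $b <= 0xFC);
        if ($isLead && $i + 1 < $len) {
            // Copy lead and trail byte verbatim (trail byte is not validated).
            $out .= $s[$i] . $s[$i + 1];
            $i++;
            continue;
        }
        // Single-byte character: the normal byte-wise escaping is fine here.
        $out .= addslashes($s[$i]);
    }
    return $out;
}

$input = "\x95\x5c" . "'";                              // '表' followed by a quote
echo bin2hex(addslashes($input)), "\n";                 // 955c5c5c27
echo bin2hex(sjis_add_slashes_sketch($input)), "\n";    // 955c5c27
```

In the second output the quote is preceded by exactly one escaping backslash after the intact character, which is what a Shift_JIS-aware consumer expects.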

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

11 years ago by Julien Pauli

Hi all,

Like everybody, I' absolutely against adding an "encoding" parameter
to ext/standard functions or relying on unreliable system locale.
Like Nikita says, every multibyte function should go to ext/mbstring ,
and nowhere else, please , do not turn PHP into something even more
dirty as it is nowadays :-p

I'm +1 to embed and activate mbstring by default in future PHP releases.
However, this has already been discussed (from what I remember) and I
dont remember why we ended with a "no" end-word, could we be refreshed
about this ?

I'm not in favor of magic things. Magically replacing PHP strings by
mb_ implementation is a really bad idea. We should keep the INI
parameter about this alive though (mbstring.func_overload), so that
people that explicitely want to activate such a magic can do it if
they want to.

I'm +1 also to start a "serious" (hum) discussion about multibyte/PHP6
, we've been saying this for years : we need native support of unicode
in PHP :-p The actual problem we are facing once again shows this.

It seem this is the majority excluding INI usage.
I updated the RFC to reflect this.

https://wiki.php.net/rfc/multibyte_char_handling

Compile mbstring by default from 5.6

Add mb_*() functions for 5.3 and up

Keep ext/standard function as it is now

Open Issue

Use of INI for overriding single byte string functions by mbstring
functions.

Why is that an issue?
We just leave it as-is, or?

I would like to consolidate code location 5.6 and up because of the history
of
mbstring function remain insecure. e.g. When parse_str()/mail() security
issue
was fixed, mb_parse_str()/mb_send_mail() didn't fixed.

However, refactoring isn't mandatory. We could have 2 different codes that
are
mostly the same. We may postpone refactoring until PHP6. Please don't forget
to update mbstring code when anyone update ext/standard.

If there are opinions/open issues/questions, please reply.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

Julien

11 years ago by Yasuo Ohgaki

Hi Julien,

Hi all,

Like everybody, I' absolutely against adding an "encoding" parameter
to ext/standard functions or relying on unreliable system locale.
Like Nikita says, every multibyte function should go to ext/mbstring ,
and nowhere else, please , do not turn PHP into something even more
dirty as it is nowadays :-p

I'm +1 to embed and activate mbstring by default in future PHP releases.
However, this has already been discussed (from what I remember) and I
dont remember why we ended with a "no" end-word, could we be refreshed
about this ?

I'm not in favor of magic things. Magically replacing PHP strings by
mb_ implementation is a really bad idea. We should keep the INI
parameter about this alive though (mbstring.func_overload), so that
people that explicitely want to activate such a magic can do it if
they want to.

I'm +1 also to start a "serious" (hum) discussion about multibyte/PHP6
, we've been saying this for years : we need native support of unicode
in PHP :-p The actual problem we are facing once again shows this.

It seem this is the majority excluding INI usage.
I updated the RFC to reflect this.

https://wiki.php.net/rfc/multibyte_char_handling

Compile mbstring by default from 5.6

Add mb_*() functions for 5.3 and up

Keep ext/standard function as it is now

Open Issue

Use of INI for overriding single byte string functions by mbstring
functions.

Why is that an issue ?
We just leave it as-is , or ?

Some users are annoyed by sloppy multilingual implementations that use
this option. There is a feature request from a user who wants to remove
the mbstring.func_overload INI option.

https://bugs.php.net/bug.php?id=65785

We may extend or drop this feature. I'm neutral on this.

Regards.

--
Yasuo Ohgaki
yohgaki@ohgaki.net

11 years ago by Yasuo Ohgaki

Hi all,

Some users are annoyed by sloppy multilingual implementations using
this option. There is feature request from user who want to remove
mbstring.func_overload INI option.

https://bugs.php.net/bug.php?id=65785

We may extend or drop this feature. I'm neutral for this.

I added this to the RFC.

https://wiki.php.net/rfc/altmbstring

is also added as a related RFC. I don't intend to propose/implement it
anytime soon.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

11 years ago by Julien Pauli

Hi Julien,

Hi all,

Like everybody, I' absolutely against adding an "encoding" parameter
to ext/standard functions or relying on unreliable system locale.
Like Nikita says, every multibyte function should go to ext/mbstring ,
and nowhere else, please , do not turn PHP into something even more
dirty as it is nowadays :-p

I'm +1 to embed and activate mbstring by default in future PHP
releases.
However, this has already been discussed (from what I remember) and I
dont remember why we ended with a "no" end-word, could we be refreshed
about this ?

I'm not in favor of magic things. Magically replacing PHP strings by
mb_ implementation is a really bad idea. We should keep the INI
parameter about this alive though (mbstring.func_overload), so that
people that explicitely want to activate such a magic can do it if
they want to.

I'm +1 also to start a "serious" (hum) discussion about multibyte/PHP6
, we've been saying this for years : we need native support of unicode
in PHP :-p The actual problem we are facing once again shows this.

It seem this is the majority excluding INI usage.
I updated the RFC to reflect this.

https://wiki.php.net/rfc/multibyte_char_handling

Compile mbstring by default from 5.6

Add mb_*() functions for 5.3 and up

Keep ext/standard function as it is now

Open Issue

Use of INI for overriding single byte string functions by mbstring
functions.

Why is that an issue?
We just leave it as-is, or?

Some users are annoyed by sloppy multilingual implementations using
this option. There is a feature request from a user who wants to remove
the mbstring.func_overload INI option.

https://bugs.php.net/bug.php?id=65785

We may extend or drop this feature. I'm neutral on this.

Yep, I admit that lots of people misconfigure it, and that makes it a PITA
for developers, who always have to test its value.

Julien.P

11 years ago by Yasuo Ohgaki — view source

Hi Julien,

Why is that an issue?

We just leave it as-is, or?

Some users are annoyed by sloppy multilingual implementations using
this option. There is a feature request from a user who wants to remove
the mbstring.func_overload INI option.

https://bugs.php.net/bug.php?id=65785

We may extend or drop this feature. I'm neutral on this.

Since we had better concentrate on fixing the security issue, I've
made it an open issue for future releases.

I'll just add the new mb functions to the mbstring.func_overload INI option.

If func_overload is supported, we still need php_mblen() for command
line output, since detecting the locale is not a good idea and using
internal_encoding opens a new vulnerability.

Functions that are better off using the locale are

mb_escape_shell_arg()
mb_escape_shell_cmd()

These functions may override the locale via an encoding parameter.

Since fgetcsv() uses the locale now, we may do the same for fgetcsv() as well.

The RFC is updated.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

11 years ago by Lester Caine — view source

Yasuo Ohgaki wrote:

addslashes() could be vulnerable to char-encoding-based attacks.
We need to decide which countermeasure to adopt.
This is the RFC for this issue.

https://wiki.php.net/multibyte_char_handling

Please comment.

Multibyte characters are still a contentious area, and the current compromise of
supporting multibyte content while being essentially 'single byte' for the
programming structure has been a solution adopted in a few projects. Firebird is
once again debating the same point that they and PHP last discussed 10 years
ago; it was too difficult, so PHP6 floundered and Firebird remained essentially
single-byte strings in the metadata.

10 years on, isn't it time to re-open the debate on making the core unicode, since
32 bit processors are more likely to be the norm these days? Certainly if
everything internal is UTF8, then all of the encoding problems are moved to the
client interface?

(p.s. the RFC needs a little work via the spell checker, and the link above is wrong)

--
Lester Caine - G8HFL

Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk

11 years ago by Yasuo Ohgaki — view source

Hi Lester,

Multibyte characters are still a contentious area, and the current
compromise of supporting multibyte content while being essentially 'single
byte' for the programming structure has been a solution adopted in a few
projects. Firebird is once again debating the same point that they and PHP
last discussed 10 years ago; it was too difficult, so PHP6 floundered and
Firebird remained essentially single-byte strings in the metadata.

Making a product that only works with single-byte chars is completely OK.

The issue is that there is no proper function/method/feature that escapes PHP
strings with multibyte chars correctly. PHP needs to provide an API that
handles data properly/safely.
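
As a rough illustration of the escaping problem (a Python sketch, not code
from the RFC or from PHP itself): the hypothetical bytewise_escape() below
mimics a byte-oriented escaper in the spirit of addslashes(), and the
hypothetical encoding_aware_escape() decodes the input strictly before
escaping. It assumes Shift_JIS, where 0x95 0x5C encodes the character 表, so
a stray lead byte 0x95 can absorb the backslash added in front of a quote.

```python
# Escapes applied per character: quote, double quote, backslash, NUL.
ESCAPES = {"'": "\\'", '"': '\\"', "\\": "\\\\", "\0": "\\0"}


def bytewise_escape(data: bytes) -> bytes:
    """Hypothetical byte-oriented escaper: it ignores the character encoding."""
    out = bytearray()
    for b in data:
        if b == 0x00:
            out += b"\\0"
        else:
            if b in (0x27, 0x22, 0x5C):  # ', ", backslash
                out.append(0x5C)
            out.append(b)
    return bytes(out)


def encoding_aware_escape(data: bytes, encoding: str) -> bytes:
    """Hypothetical encoding-aware escaper: rejects malformed input first."""
    text = data.decode(encoding)  # strict mode raises UnicodeDecodeError
    escaped = "".join(ESCAPES.get(ch, ch) for ch in text)
    return escaped.encode(encoding)


if __name__ == "__main__":
    payload = b"\x95'"  # dangling Shift_JIS lead byte followed by a quote
    escaped = bytewise_escape(payload)
    # A Shift_JIS consumer reads 0x95 0x5C as one character, so the quote
    # that follows is no longer preceded by a backslash.
    print(escaped, "->", escaped.decode("shift_jis"))
    try:
        encoding_aware_escape(payload, "shift_jis")
    except UnicodeDecodeError as exc:
        print("rejected malformed input:", exc.reason)
```

The design point is only that escaping has to know the encoding and refuse
invalid byte sequences; the actual function names and behaviour are whatever
the RFC ends up specifying.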

It's awful that reading var_export()ed data could execute arbitrary PHP
script and/or terminate script execution, isn't it? It cannot be ignored.

10 years on, isn't it time to re-open the debate on making the core unicode,
since 32 bit processors are more likely to be the norm these days?
Certainly if everything internal is UTF8, then all of the encoding problems
are moved to the client interface?

I'm not proposing a transition like Python 2.x to 3.x. The RFC proposes
features required for proper/safe coding. Anyway, it seems Python's approach
is not working well. We could learn from it.

The server side should never expect clients to send proper data, therefore
proper encoding handling is mandatory on the server side. Adoption of UTF-8
makes things easier, but there are ways to exploit UTF-8 encoding as well. For
example, recent Chrome may display a blank page when it receives malformed
chars, and this could be used for a DoS attack. Mixing systems that validate
encoding with systems that do not could also be vulnerable to DoS. The new mb
functions handle encoding properly, not only for SJIS-like encodings but for
any encoding supported by mbstring.
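
A minimal sketch of what "validating encoding on the server side" can mean,
written in Python for illustration only (is_valid_utf8() is a hypothetical
helper, not part of the RFC): input is accepted only if it decodes strictly
as UTF-8, which rejects truncated sequences and overlong forms.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True only if data is well-formed UTF-8 (strict decoding)."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    samples = {
        "well-formed": "caf\u00e9".encode("utf-8"),
        "truncated sequence": b"\xe3\x81",
        "overlong slash": b"\xc0\xaf",
    }
    for label, raw in samples.items():
        print(label, is_valid_utf8(raw))
```

Rejecting malformed input at the boundary keeps a validating system from
passing bytes on to a component that interprets them differently.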

Did I make a typo? My Chrome did not report any spelling errors. I would
appreciate it if you could point them out.

I sent the correct URL right after the first mail, but that did not help. I
would not check the second mail in a long thread either :(

https://wiki.php.net/rfc/multibyte_char_handling

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

11 years ago by Lester Caine — view source

Yasuo Ohgaki wrote:

I'm not proposing a transition like Python 2.x to 3.x. The RFC proposes
features required for proper/safe coding. Anyway, it seems Python's approach is
not working well. We could learn from it.

I am more than convinced that it is a perfect example of how not to do it, but
there surely is no reason to make such a mess of 'PHP6' over PHP5? I have to be
able to program in Python for some of the tools, and they have no plans to move
to 3.x any time soon :)

--
Lester Caine - G8HFL

Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk

11 years ago by Yasuo Ohgaki — view source

Hi all,

addslashes() could be vulnerable to char-encoding-based attacks.
We need to decide which countermeasure to adopt.
This is the RFC for this issue.

https://wiki.php.net/multibyte_char_handling

Please comment.
Thank you.

I've revised the RFC a little and integrated it with the following RFC.

Alternative implementation of mbstring using ICU
https://wiki.php.net/rfc/altmbstring

The RFC has been changed to include mbstring-ng as a default compiled module.
It would be better license-wise, IMHO.

If there are no additional comments, I would like to start the vote.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net
