[RFC] Locale-independent case conversion

3 years ago by Tim Starling — view source

unread

Please consider my RFC for locale-independent case conversion.

https://wiki.php.net/rfc/strtolower-ascii
https://github.com/php/php-src/pull/7506

The RFC and associated PR ended up going some way beyond the original
scope, because for consistency, it's best if everything has the same
concept of case folding. I saw this as an opportunity to clean up a
common kind of locale-dependence in PHP which was previously inconsistent.

So not only will strtolower() and strtoupper() become
locale-independent, converting only ASCII, but also stristr, stripos,
strripos, lcfirst, ucfirst, ucwords, str_ireplace, the array sorting
functions with SORT_FLAG_CASE, and array_change_key_case.

Also, I changed a number of internal functions to use ASCII case
folding, giving rise to a range of effects in callers throughout the
core tree. The effects are all documented in the RFC.

I am proposing that locale-sensitive case conversion be provided with
the new names ctype_tolower() and ctype_toupper(). Those names might
seem odd at first glance, but they are wrappers for functions in
ctype.h and work in a very similar way to the rest of the ctype extension.

-- Tim Starling

3 years ago by Pierre Joye — view source

unread

Hi Tim,

Please consider my RFC for locale-independent case conversion.

https://wiki.php.net/rfc/strtolower-ascii

very good one, thanks :)

https://github.com/php/php-src/pull/7506

The RFC and associated PR ended up going some way beyond the original
scope, because for consistency, it's best if everything has the same
concept of case folding. I saw this as an opportunity to clean up a
common kind of locale-dependence in PHP which was previously inconsistent.

Good patch, I am commenting here as other or more discussions may be
better here.

I wonder if either JIT could be used for the intrinsics support,
adding neon, sse[2-4.2] or avc256/512 (the latter would basically
allow most common strings to be converted in one go.

If not, maybe split implementation however runtime cpu support would
be better. Many distributions built with SSE2 flags but are actually
ran on much recent CPUs, same for ARM (ie. graviton/neon f.e.).
Thoughts?

Maybe Dmitry has a better idea :)

Best,

Pierre

@pierrejoye | http://www.libgd.org

3 years ago by Tim Starling — view source

unread

I wonder if either JIT could be used for the intrinsics support,
adding neon, sse[2-4.2] or avc256/512 (the latter would basically
allow most common strings to be converted in one go.

If not, maybe split implementation however runtime cpu support would
be better. Many distributions built with SSE2 flags but are actually
ran on much recent CPUs, same for ARM (ie. graviton/neon f.e.).
Thoughts?

SSE2 is already scary-fast. I benchmarked the SSE2 tolower code on my
laptop: it was chewing through strings at 18 GiB/s, i.e. 50ps per
byte. This was a benchmark written in C -- you would have a lot of
trouble making a PHP tight loop in which SSE2 case conversion is the
slow part.

-- Tim Starling

3 years ago by Nikita Popov — view source

unread

On Thu, Sep 23, 2021 at 8:32 AM Tim Starling tstarling@wikimedia.org
wrote:

Please consider my RFC for locale-independent case conversion.

https://wiki.php.net/rfc/strtolower-ascii
https://github.com/php/php-src/pull/7506

The RFC and associated PR ended up going some way beyond the original
scope, because for consistency, it's best if everything has the same
concept of case folding. I saw this as an opportunity to clean up a
common kind of locale-dependence in PHP which was previously inconsistent.

So not only will strtolower() and strtoupper() become
locale-independent, converting only ASCII, but also stristr, stripos,
strripos, lcfirst, ucfirst, ucwords, str_ireplace, the array sorting
functions with SORT_FLAG_CASE, and array_change_key_case.

Also, I changed a number of internal functions to use ASCII case
folding, giving rise to a range of effects in callers throughout the
core tree. The effects are all documented in the RFC.

I am proposing that locale-sensitive case conversion be provided with
the new names ctype_tolower() and ctype_toupper(). Those names might
seem odd at first glance, but they are wrappers for functions in
ctype.h and work in a very similar way to the rest of the ctype extension.

Hi Tim,

Thanks for creating this proposal, it looks great!

I think this is a very beneficial change, and the amount of incorrect
locale-dependent calls we had just in php-src further convinced me of this:
We're generally aware of the problem, and we still made this mistake. Many
times.

The only open question I have is regarding the ctype_* functions. One might
argue that these functions should be locale-independent as well. Certainly,
whenever I have used ctype_digit() I only intended it to match [0-9]. It
seems like some people try to use ctype_alpha() in a locale-sensitive way (
https://stackoverflow.com/questions/19929965/php-setlocale-not-working-for-ctype-alpha-check)
and then fail because it doesn't support UTF-8.

Regards,
Nikita

PS: Regarding escapeshellarg(), are you aware of the array command support
for proc_open() that was added in PHP 7.4? That does away the need to
escape arguments.

3 years ago by Bob Weinand — view source

unread

Am 04.10.2021 um 12:08 schrieb Nikita Popov nikita.ppv@gmail.com:

On Thu, Sep 23, 2021 at 8:32 AM Tim Starling <tstarling@wikimedia.org mailto:tstarling@wikimedia.org>
wrote:

Please consider my RFC for locale-independent case conversion.

https://wiki.php.net/rfc/strtolower-ascii
https://github.com/php/php-src/pull/7506

The RFC and associated PR ended up going some way beyond the original
scope, because for consistency, it's best if everything has the same
concept of case folding. I saw this as an opportunity to clean up a
common kind of locale-dependence in PHP which was previously inconsistent.

So not only will strtolower() and strtoupper() become
locale-independent, converting only ASCII, but also stristr, stripos,
strripos, lcfirst, ucfirst, ucwords, str_ireplace, the array sorting
functions with SORT_FLAG_CASE, and array_change_key_case.

Also, I changed a number of internal functions to use ASCII case
folding, giving rise to a range of effects in callers throughout the
core tree. The effects are all documented in the RFC.

I am proposing that locale-sensitive case conversion be provided with
the new names ctype_tolower() and ctype_toupper(). Those names might
seem odd at first glance, but they are wrappers for functions in
ctype.h and work in a very similar way to the rest of the ctype extension.

Hi Tim,

Thanks for creating this proposal, it looks great!

I think this is a very beneficial change, and the amount of incorrect
locale-dependent calls we had just in php-src further convinced me of this:
We're generally aware of the problem, and we still made this mistake. Many
times.

I definitely agree that it's good to make these functions locale-insentive.

The only open question I have is regarding the ctype_* functions. One might
argue that these functions should be locale-independent as well. Certainly,
whenever I have used ctype_digit() I only intended it to match [0-9]. It
seems like some people try to use ctype_alpha() in a locale-sensitive way (
https://stackoverflow.com/questions/19929965/php-setlocale-not-working-for-ctype-alpha-check https://stackoverflow.com/questions/19929965/php-setlocale-not-working-for-ctype-alpha-check)
and then fail because it doesn't support UTF-8.

On that topic, do we also want to add mb_ucfrist, mb_lcfirst and mb_ucwords?
Then you have also proper local independent functions handling these use cases.

This is only tangentially related to the current RFC, so feel free to ignore this.

Bob

3 years ago by Tim Starling — view source

unread

Hi Tim,

Thanks for creating this proposal, it looks great!

I think this is a very beneficial change, and the amount of
incorrect locale-dependent calls we had just in php-src further
convinced me of this: We're generally aware of the problem, and we
still made this mistake. Many times.

The only open question I have is regarding the ctype_* functions.
One might argue that these functions should be locale-independent as
well. Certainly, whenever I have used ctype_digit() I only intended
it to match [0-9]. It seems like some people try to use
ctype_alpha() in a locale-sensitive way
(https://stackoverflow.com/questions/19929965/php-setlocale-not-working-for-ctype-alpha-check
https://stackoverflow.com/questions/19929965/php-setlocale-not-working-for-ctype-alpha-check)
and then fail because it doesn't support UTF-8.

OK, I removed ctype_tolower() and ctype_toupper() from the RFC and the
PR since they would be incompatible with a move towards a
locale-independent ctype extension.

The non-controversial parts of the PR were split and merged, so I
rebased the PR and updated the RFC accordingly.

Do you think the RFC is ready for voting now?

PS: Regarding escapeshellarg(), are you aware of the array command
support for proc_open() that was added in PHP 7.4? That does away
the need to escape arguments.

It doesn't really help us. I recently wrote a new shell command
execution system for MediaWiki called Shellbox. As part of that
project, I reviewed how shell execution is used in the MediaWiki
ecosystem. There are a lot of callers which are using shell features,
for example redirecting inputs or outputs, or constructing pipelines.
I didn't really want to break them all or reimplement those features
without the shell. And we have security and containerization wrappers
which depend on construction of a shell command string. So we need to
be able to construct shell command strings safely.

After studying locale sensitivity for this RFC, I decided to get rid
of escapeshellarg() from MediaWiki. Instead we are doing our own shell
escaping:

https://gerrit.wikimedia.org/r/c/mediawiki/libs/Shellbox/+/722548

I also made MediaWiki use a fixed locale, instead of being configurable.

-- Tim Starling

3 years ago by Nicolas Grekas — view source

unread

Le lun. 11 oct. 2021 à 03:33, Tim Starling tstarling@wikimedia.org a
écrit :

Hi Tim,

Thanks for creating this proposal, it looks great!

I think this is a very beneficial change, and the amount of
incorrect locale-dependent calls we had just in php-src further
convinced me of this: We're generally aware of the problem, and we
still made this mistake. Many times.

The only open question I have is regarding the ctype_* functions.
One might argue that these functions should be locale-independent as
well. Certainly, whenever I have used ctype_digit() I only intended
it to match [0-9]. It seems like some people try to use
ctype_alpha() in a locale-sensitive way
(
https://stackoverflow.com/questions/19929965/php-setlocale-not-working-for-ctype-alpha-check
<
https://stackoverflow.com/questions/19929965/php-setlocale-not-working-for-ctype-alpha-check
)
and then fail because it doesn't support UTF-8.

OK, I removed ctype_tolower() and ctype_toupper() from the RFC and the
PR since they would be incompatible with a move towards a
locale-independent ctype extension.

The non-controversial parts of the PR were split and merged, so I
rebased the PR and updated the RFC accordingly.

Do you think the RFC is ready for voting now?

PS: Regarding escapeshellarg(), are you aware of the array command
support for proc_open() that was added in PHP 7.4? That does away
the need to escape arguments.

It doesn't really help us. I recently wrote a new shell command
execution system for MediaWiki called Shellbox. As part of that
project, I reviewed how shell execution is used in the MediaWiki
ecosystem. There are a lot of callers which are using shell features,
for example redirecting inputs or outputs, or constructing pipelines.
I didn't really want to break them all or reimplement those features
without the shell. And we have security and containerization wrappers
which depend on construction of a shell command string. So we need to
be able to construct shell command strings safely.

After studying locale sensitivity for this RFC, I decided to get rid
of escapeshellarg() from MediaWiki. Instead we are doing our own shell
escaping:

https://gerrit.wikimedia.org/r/c/mediawiki/libs/Shellbox/+/722548

I also made MediaWiki use a fixed locale, instead of being configurable.

Hi Tim,

thanks for the RFC and for the above pointers, I'm going to have a look at
Symfony Process to follow your lead!

About the RFC, I just have one note:

I didn't include strnatcasecmp() and natcasesort() in this RFC, because
they also use isdigit() and isspace(), and because they are intended for
natural language processing. They could be migrated in future.

Despite their name, I never used natcase functions for natural language
processing. I use them eg to sort lists of files in a directory, to account
for numbers mainly. But that's not what I would call natural language
processing. I'm not aware of anyone using them for that actually. I'm
wondering if it's a good idea to postpone migrating them to an hypothetical
future as to me, the whole reasoning of the RFC applies to them.

Regards,
Nicolas

3 years ago by Hans Henrik Bergan — view source

unread

@Nicolas i hope mediawiki doesn't run on Windows, because that
escapeshellarg-replacement you did is not valid for windows.

the code
<?php
function e(string $str):string{
return "'" . str_replace( "'", "'\''", $str ) . "'";
}
$arg = "foo && whoami && echo ";
$cmd = "echo ".e($arg);
echo $cmd;
?>
prints:
echo 'foo && whoami && echo '

and when i run that in bash i get:

C:\Users\hansh>echo 'foo && whoami && echo '
'foo
laptop-1plmku02\hansh
'

whoami was executed even though it ran through mediawiki's escapeshellarg
replacement

PS creating a proper escapeshellarg for windows is actually difficult, see
https://docs.microsoft.com/en-gb/archive/blogs/twistylittlepassagesallalike/everyone-quotes-command-line-arguments-the-wrong-way
(and even php's upstream escapeshellarg() doesn't get it quite right on
Windows, corrupting more data than it strictly needs to corrupt, but i
don't recall the specifics)

On Mon, 11 Oct 2021 at 09:38, Nicolas Grekas nicolas.grekas+php@gmail.com
wrote:

Le lun. 11 oct. 2021 à 03:33, Tim Starling tstarling@wikimedia.org a
écrit :

Hi Tim,

Thanks for creating this proposal, it looks great!

I think this is a very beneficial change, and the amount of
incorrect locale-dependent calls we had just in php-src further
convinced me of this: We're generally aware of the problem, and we
still made this mistake. Many times.

The only open question I have is regarding the ctype_* functions.
One might argue that these functions should be locale-independent as
well. Certainly, whenever I have used ctype_digit() I only intended
it to match [0-9]. It seems like some people try to use
ctype_alpha() in a locale-sensitive way
(

https://stackoverflow.com/questions/19929965/php-setlocale-not-working-for-ctype-alpha-check

<

https://stackoverflow.com/questions/19929965/php-setlocale-not-working-for-ctype-alpha-check

)
and then fail because it doesn't support UTF-8.

OK, I removed ctype_tolower() and ctype_toupper() from the RFC and the
PR since they would be incompatible with a move towards a
locale-independent ctype extension.

The non-controversial parts of the PR were split and merged, so I
rebased the PR and updated the RFC accordingly.

Do you think the RFC is ready for voting now?

PS: Regarding escapeshellarg(), are you aware of the array command
support for proc_open() that was added in PHP 7.4? That does away
the need to escape arguments.

It doesn't really help us. I recently wrote a new shell command
execution system for MediaWiki called Shellbox. As part of that
project, I reviewed how shell execution is used in the MediaWiki
ecosystem. There are a lot of callers which are using shell features,
for example redirecting inputs or outputs, or constructing pipelines.
I didn't really want to break them all or reimplement those features
without the shell. And we have security and containerization wrappers
which depend on construction of a shell command string. So we need to
be able to construct shell command strings safely.

After studying locale sensitivity for this RFC, I decided to get rid
of escapeshellarg() from MediaWiki. Instead we are doing our own shell
escaping:

https://gerrit.wikimedia.org/r/c/mediawiki/libs/Shellbox/+/722548

I also made MediaWiki use a fixed locale, instead of being configurable.

Hi Tim,

thanks for the RFC and for the above pointers, I'm going to have a look at
Symfony Process to follow your lead!

About the RFC, I just have one note:

I didn't include strnatcasecmp() and natcasesort() in this RFC, because
they also use isdigit() and isspace(), and because they are intended for
natural language processing. They could be migrated in future.

Despite their name, I never used natcase functions for natural language
processing. I use them eg to sort lists of files in a directory, to account
for numbers mainly. But that's not what I would call natural language
processing. I'm not aware of anyone using them for that actually. I'm
wondering if it's a good idea to postpone migrating them to an hypothetical
future as to me, the whole reasoning of the RFC applies to them.

Regards,
Nicolas

3 years ago by Tim Starling — view source

unread

@Nicolas i hope mediawiki doesn't run on Windows, because that
escapeshellarg-replacement you did is not valid for windows.

Maybe you missed the Windows-specific code on lines 113-146.

https://gerrit.wikimedia.org/g/mediawiki/libs/Shellbox/+/refs/changes/48/722548/5/src/Shellbox.php#113

-- Tim Starling

3 years ago by Hans Henrik Bergan — view source

unread

dang, yeah completely missed that, sorry

@Nicolas i hope mediawiki doesn't run on Windows, because that
escapeshellarg-replacement you did is not valid for windows.

Maybe you missed the Windows-specific code on lines 113-146.

<
https://gerrit.wikimedia.org/g/mediawiki/libs/Shellbox/+/refs/changes/48/722548/5/src/Shellbox.php#113

-- Tim Starling