Make strtolower/strtoupper just do ASCII

4 years ago by Tim Starling

I would like to know if a patch to make strtolower and strtoupper do
plain ASCII case conversion would be accepted, or if an RFC should be
created.

The situation with case conversion is inconsistent.

The following functions do ASCII case conversion: strcasecmp,
strncasecmp, substr_compare.

The following functions do locale-dependent case conversion:
strtolower, strtoupper, str_ireplace, stristr, stripos, strripos,
strnatcasecmp, ucfirst, ucwords, lcfirst.

I would make them all do ASCII case conversion.

Developers need ASCII case conversion, because it is used internally
by PHP for things like class name comparison, and because it is a
specified algorithm in HTML 5 and related standards.

The existing options for ASCII case conversion are:

Never call setlocale(). But this breaks non-ASCII characters in
escapeshellarg() and can't be guaranteed in a library.

Call setlocale(LC_ALL, "C.UTF-8"). But this is non-portable and also
can't be guaranteed in a library.

Use strtr(). But this is ugly and slow.

If mbstring has a way to do it, I can't find it. I tested
mb_strtolower($s, '8bit') and mb_strtolower($s,'ascii').

Note that locale-dependent case conversion is almost never a useful
feature. Strings are passed through tolower() one byte at a time, to
be interpreted with some legacy 8-bit character set. So the result
will typically be mojibake even if the correct locale is selected.

strtolower() mangles UTF-8 strings in many locales, such as fr-FR. I
made a full list at https://phabricator.wikimedia.org/T291234. The
UTF-8 locales mostly work, except for the Turkish ones, which mangle
ASCII strings.
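
The mangling described above can be reproduced without installing any locale. The following is a minimal sketch, not the libc or php-src code: latin1_tolower is a hypothetical stand-in for what tolower() might do under an ISO-8859-1 locale (real tables may differ in detail), and bytewise_lower is an illustrative name for the byte-at-a-time loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for tolower() under an ISO-8859-1 locale.
 * Maps A-Z and the Latin-1 capitals 0xC0-0xDE (except 0xD7, the
 * multiplication sign) to their lowercase forms at +0x20. */
unsigned char latin1_tolower(unsigned char c)
{
    if (c >= 'A' && c <= 'Z') {
        return (unsigned char)(c + 0x20);
    }
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7) {
        return (unsigned char)(c + 0x20);
    }
    return c;
}

/* Applies the single-byte rule to every byte in place, with no
 * knowledge of multi-byte sequences. */
void bytewise_lower(unsigned char *s, size_t len)
{
    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        s[i] = latin1_tolower(s[i]);
    }
}
```

Under these assumptions, passing the UTF-8 bytes C3 89 (É) through bytewise_lower gives E3 89. That is not the UTF-8 encoding of é (C3 A9), and since E3 opens a three-byte sequence, the two bytes are not valid UTF-8 by themselves.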

At https://bugs.php.net/bug.php?id=67815 , Nikita Popov wrote: "My
general recommendation is to avoid locales and locale-dependent
functions, as locales are a fundamentally broken concept." I agree
with that. I think PHP should migrate away from locale dependence.
When PHP was young, it was convenient to use the C library, but we've
progressed well past that point now.

-- Tim Starling

4 years ago by Tim Starling

> I would like to know if a patch to make strtolower and strtoupper do
> plain ASCII case conversion would be accepted, or if an RFC should be
> created.

In case it's unclear, I mean that strtolower() should do 8-bit clean
conversion of letters in the 0-127 range, equivalent to

    strtr( $val,
        'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
        'abcdefghijklmnopqrstuvwxyz' );

-- Tim Starling
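
For readers following the C side of the proposal, the strtr() call above corresponds to a byte-wise mapping that never consults the current locale. A minimal sketch, assuming a hypothetical helper name ascii_strtolower (this is not the php-src implementation):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not taken from php-src: lowercases A-Z in place
 * and leaves every other byte, including 0x80-0xFF, untouched,
 * whatever locale is active. */
void ascii_strtolower(unsigned char *s, size_t len)
{
    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (s[i] >= 'A' && s[i] <= 'Z') {
            s[i] = (unsigned char)(s[i] + ('a' - 'A'));
        }
    }
}
```

Because bytes outside A-Z pass through unchanged, a UTF-8 string keeps its multi-byte sequences intact; only its ASCII letters are lowercased.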

4 years ago by Nikita Popov

We've been slowly moving away from locale-dependent functionality. Since
PHP 8 we no longer inherit any locales from the environment and have made
float to string conversion locale-independent.

I would very much support making strtolower() and friends a simple ASCII
case conversion operation. mb_strtolower() etc already offer full
Unicode-compliant case conversions that work correctly with multi-byte
encodings. The locale-sensitivity of strtolower() only works with legacy
single-byte encodings and as such is of questionable usefulness even in
cases where it is not actively harmful.

That said, I do think this change requires an RFC.

Regards,
Nikita

4 years ago by Kamil Tekiela

+1 from me. I wasn't even aware that these functions are locale-dependent
until recently. I see an added benefit that we could add them to the
optimizer once they are no longer locale-dependent.
What would happen to users who really need the locale-dependent functions?
Do we offer some workarounds?

4 years ago by Tim Starling

> +1 from me. I wasn't even aware that these functions are
> locale-dependent until recently. I see an added benefit that we could
> add them to the optimizer once they are no longer locale-dependent.
> What would happen to users who really need the locale-dependent
> functions? Do we offer some workarounds?

We could add a global mode, although that would prevent constant
propagation, if that's what you mean by adding them to the optimizer.
Or we could add variant functions like locale_strtolower() and
locale_strtoupper(). But I think I would want to hear from someone who
uses locale-dependence so I can understand what their needs are. I
guess the RFC will sort that out.

-- Tim Starling

4 years ago by Pierre Joye

Hi Tim,

hope you are well :)

> We could add a global mode, although that would prevent constant
> propagation, if that's what you mean by adding them to the optimizer.
> Or we could add variant functions like locale_strtolower() and
> locale_strtoupper(). But I think I would want to hear from someone who
> uses locale-dependence so I can understand what their needs are. I
> guess the RFC will sort that out.

may I suggest a function rather than an ini setting?

it has the advantage of being contextual, and it allows users to enable or
disable it before calling some library API they may not be able to (or don't
want to) patch.

str_use_locale(bool), for example?

at some point it could default to false, and later on be removed.

best,
Pierre

4 years ago by Nikita Popov

On Fri, Sep 17, 2021 at 12:07 PM Tim Starling tstarling@wikimedia.org
wrote:

>> +1 from me. I wasn't even aware that these functions are
>> locale-dependent until recently. I see an added benefit that we could
>> add them to the optimizer once they are no longer locale-dependent.
>> What would happen to users who really need the locale-dependent
>> functions? Do we offer some workarounds?
>
> We could add a global mode, although that would prevent constant
> propagation, if that's what you mean by adding them to the optimizer.
> Or we could add variant functions like locale_strtolower() and
> locale_strtoupper(). But I think I would want to hear from someone who
> uses locale-dependence so I can understand what their needs are. I
> guess the RFC will sort that out.

I would expect that in nearly all cases the replacement would be one of
these:

If you were using a UTF-8 locale (which is likely), then just keep
using strtolower(). Without having checked all the details here, I think
strtolower() under UTF-8 locales already effectively behaves like ASCII
lowercase, because it skips continuation bytes.

If you were using some other charset, then using mb_strtolower() with
that charset should work. So if you were using de_DE.ISO8859-1, then using
mb_strtolower() with "ISO8859-1" encoding would be the replacement.

As a matter of general policy, it is unlikely that we will accept an option
(whether that be an ini option or something else) to control this behavior.
We can make the change or not make it, but not both ;)

Regards,
Nikita

4 years ago by Tim Starling

> We've been slowly moving away from locale-dependent functionality.
> Since PHP 8 we no longer inherit any locales from the environment and
> have made float to string conversion locale-independent.
>
> I would very much support making strtolower() and friends a simple
> ASCII case conversion operation. mb_strtolower() etc already offer
> full Unicode-compliant case conversions that work correctly with
> multi-byte encodings. The locale-sensitivity of strtolower() only
> works with legacy single-byte encodings and as such is of questionable
> usefulness even in cases where it is not actively harmful.
>
> That said, I do think this change requires an RFC.

Thanks Nikita. I'll write the code and then make an RFC.

-- Tim Starling

4 years ago by Christian Schneider

On 17.09.2021 at 10:43, Nikita Popov nikita.ppv@gmail.com wrote:

> The locale-sensitivity of strtolower() only works with legacy
> single-byte encodings and as such is of questionable usefulness even in
> cases where it is not actively harmful.
>
> That said, I do think this change requires an RFC.

I agree that this is a big enough BC break to require an RFC. I'd also recommend a transition phase in which strtolower() emits a deprecation warning when it is used with a locale where it did something useful, so that users can migrate away from strtolower() in those cases.

Chris

4 years ago by tyson andre

Hi Tim Starling,

I think it's a good idea (but it would still require an RFC).
As you said, the way it acts on bytes rather than codepoints seems almost always incorrect outside a narrow range of cases
(the exceptions being rare charsets such as https://en.wikipedia.org/wiki/ISO/IEC_8859-1).

The behavior of strtolower is inconvenient for common uses such as:

filesystem paths, where strtolower('I') isn't 'i' in tr_TR
username validation, where it may be possible to create a new account whose name is treated as the same case-insensitive string in some locales but not in others
etc.
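
The username case can be sketched in C. This is an illustration under stated assumptions, not libc or php-src code: turkish_lower is a hypothetical stand-in for tolower() under a Turkish ISO-8859-9 locale, where capital I is assumed to lowercase to dotless i (byte 0xFD), and equal_folded is an illustrative helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* ASCII-only fold: only A-Z change. */
unsigned char ascii_lower(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + 0x20) : c;
}

/* Hypothetical stand-in for tolower() under a Turkish ISO-8859-9
 * locale: capital I becomes dotless i (0xFD). All other bytes are
 * treated as plain ASCII here, which is a simplification. */
unsigned char turkish_lower(unsigned char c)
{
    return (c == 'I') ? (unsigned char)0xFD : ascii_lower(c);
}

/* Returns 1 if a and b are equal after folding every byte with fold(),
 * 0 otherwise. */
int equal_folded(const char *a, const char *b,
                 unsigned char (*fold)(unsigned char))
{
    assert(a != NULL && b != NULL && fold != NULL);
    size_t n = strlen(a);
    if (n != strlen(b)) {
        return 0;
    }
    for (size_t i = 0; i < n; i++) {
        if (fold((unsigned char)a[i]) != fold((unsigned char)b[i])) {
            return 0;
        }
    }
    return 1;
}
```

With the ASCII fold, "ADMIN" and "admin" compare equal. With the simulated Turkish fold they do not, so whether two account names collide would depend on the locale in effect at the time of the check.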

When implementing this, note that Zend/Optimizer/sccp.c already has compile-time evaluation optimizations for functions such as str_contains.
After removing locale dependence, the same optimizations could safely be added for the functions that become locale-independent as a result of your change.

This would allow eliminating more dead code, and would make code that calls those functions on constant arguments faster, because the resulting strings are cached in opcache.

The function zend_string_tolower can safely be used to efficiently convert strings to lowercase in a locale-independent way.
(zend_string_toupper hasn't been needed yet, since php-src's internals have no use case for it so far, but it could be added in such a PR.)

    || zend_string_equals_literal(name, "str_contains")
    || zend_string_equals_literal(name, "str_ends_with")
    || zend_string_equals_literal(name, "str_replace")
    || zend_string_equals_literal(name, "str_split")
    || zend_string_equals_literal(name, "str_starts_with")

Thanks,
Tyson

4 years ago by Tim Starling

> When implementing this, note that Zend/Optimizer/sccp.c already has compile-time evaluation optimizations for functions such as str_contains.
> After removing locale dependence, the same optimizations could safely be added for the functions that become locale-independent as a result of your change.
>
> This would allow eliminating more dead code, and would make code that calls those functions on constant arguments faster, because the resulting strings are cached in opcache.

Thanks, I will do that.

> The function zend_string_tolower can safely be used to efficiently convert strings to lowercase in a locale-independent way.
> (zend_string_toupper hasn't been needed yet, since php-src's internals have no use case for it so far, but it could be added in such a PR.)

I uploaded my work so far and made a PR. It already has
zend_string_toupper.

-- Tim Starling

4 years ago by Tim Starling

> When implementing this, note that Zend/Optimizer/sccp.c already has compile-time evaluation optimizations for functions such as str_contains.
> After removing locale dependence, the same optimizations could safely be added for the functions that become locale-independent as a result of your change.
>
> This would allow eliminating more dead code, and would make code that calls those functions on constant arguments faster, because the resulting strings are cached in opcache.

I couldn't make this work. Even after setting
opcache.optimization_level to 0x7FFFFFBF (pass 6 will not run unless
pass 7 is disabled), zend_dfa_optimize_op_array() is called with
call_map=NULL, so ct_eval_func_call() is never entered. I'll leave
this change for someone who is able to test it (or for someone braver
than me).

-- Tim Starling