[VOTE] Locale-independent case conversion

3 years ago by Nicolas Grekas

On Thu, 25 Nov 2021 at 06:05, Tim Starling tstarling@wikimedia.org wrote:

Voting is now open for my RFC on locale-independent case conversion.

https://wiki.php.net/rfc/strtolower-ascii

Voting will close in two weeks, on 2021-12-09.

Hi Tim,

I voted yes because I want to see this happen but I raised a point in
https://externals.io/message/116141#116259 and didn't get an answer:

Despite their name, I never used natcase functions for natural language
processing. I use them e.g. to sort lists of files in a directory, mainly
to account for numbers. But that's not what I would call natural language
processing. I'm not aware of anyone using them for that, actually. I'm
wondering if it's a good idea to postpone migrating them to a hypothetical
future since, to me, the whole reasoning of the RFC applies to them.

I wish the strnatcasecmp() and natcasesort() functions, and also the
SORT_NATURAL flag, were covered by this RFC.
Is that possible? Would it make sense?

Nicolas
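
The file-listing use case described above can be sketched as follows. This is a minimal illustration, not code from the RFC; the file names are invented, and it only contrasts byte-wise sort() with natcasesort().

```php
<?php
// Hypothetical directory listing; the names are illustrative only.
$files = ['img12.png', 'IMG10.png', 'img2.png', 'img1.png'];

// Byte-wise comparison: uppercase sorts before lowercase, and "12"
// sorts before "2" because the digits are compared one by one.
$plain = $files;
sort($plain);

// Case-insensitive natural order: runs of digits are compared as numbers.
// natcasesort() preserves keys, so reindex for a clean list.
$natural = $files;
natcasesort($natural);
$natural = array_values($natural);

print_r($plain);   // IMG10.png, img1.png, img12.png, img2.png
print_r($natural); // img1.png, img2.png, IMG10.png, img12.png
```

Nothing in this example depends on a human language, which is the point being made: the ordering is about digits and ASCII case, not about locale-specific text.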

3 years ago by Tim Starling

I voted yes because I want to see this happen but I raised a point in
https://externals.io/message/116141#116259 and didn't get an answer:

Despite their name, I never used natcase functions for natural language
processing. I use them e.g. to sort lists of files in a directory, mainly
to account for numbers. But that's not what I would call natural language
processing. I'm not aware of anyone using them for that, actually. I'm
wondering if it's a good idea to postpone migrating them to a hypothetical
future since, to me, the whole reasoning of the RFC applies to them.

I wish the strnatcasecmp() and natcasesort() functions, and also the
SORT_NATURAL flag, were covered by this RFC.
Is that possible? Would it make sense?

I'm not going to migrate those functions at this time. It's just a
project scope decision.

-- Tim Starling

3 years ago by Nicolas Grekas

On Thu, 25 Nov 2021 at 10:47, Tim Starling tstarling@wikimedia.org wrote:

I voted yes because I want to see this happen but I raised a point in
https://externals.io/message/116141#116259 and didn't get an answer:

Despite their name, I never used natcase functions for natural language
processing. I use them e.g. to sort lists of files in a directory, mainly
to account for numbers. But that's not what I would call natural language
processing. I'm not aware of anyone using them for that, actually. I'm
wondering if it's a good idea to postpone migrating them to a hypothetical
future since, to me, the whole reasoning of the RFC applies to them.

I wish the strnatcasecmp() and natcasesort() functions, and also the
SORT_NATURAL flag, were covered by this RFC.
Is that possible? Would it make sense?

I'm not going to migrate those functions at this time. It's just a project
scope decision.

Why?

The RFC says:

because they also use isdigit() and isspace(),

Does that mean "too much work needed"? I would totally understand that, of
course, but I hope someone can go this last mile.

and because they are intended for natural language processing

I definitely do not agree with this argument and, to me, it should be
removed from the RFC, as it might cause confusion in the future.

Nicolas

3 years ago by Christoph M. Becker

On Thu, 25 Nov 2021 at 10:47, Tim Starling tstarling@wikimedia.org wrote:

and because they are intended for natural language processing

I definitely do not agree with this argument and, to me, it should be
removed from the RFC, as it might cause confusion in the future.

Yeah, the PHP manual says[1]:

| This function implements a comparison algorithm that orders
| alphanumeric strings in the way a human being would, this is described
| as a "natural ordering".

[1] https://www.php.net/manual/en/function.strnatcmp.php

--
Christoph M. Becker

3 years ago by Nicolas Grekas

On Thu, 25 Nov 2021 at 11:34, Christoph M. Becker cmbecker69@gmx.de wrote:

On Thu, 25 Nov 2021 at 10:47, Tim Starling tstarling@wikimedia.org wrote:

and because they are intended for natural language processing

I definitely do not agree with this argument and, to me, it should be
removed from the RFC, as it might cause confusion in the future.

Yeah, the PHP manual says[1]:

| This function implements a comparison algorithm that orders
| alphanumeric strings in the way a human being would, this is described
| as a "natural ordering".

[1] https://www.php.net/manual/en/function.strnatcmp.php

Yep, yet "natural language processing" means processing sentences we write
as humans, e.g. processing this very message. The natcase sorting functions
are not meant for that. They're useful to sort items in a list, typically
file names, in a way that makes sense to humans. This is very different
from "natural language processing". Having "natsort" vary by locale makes
no more sense than having "sort()" vary by locale. That's my point. The
argument is no reason to hold back locale-insensitivity for these
functions, and I think the RFC shouldn't make that argument.

Nicolas

3 years ago by Tim Starling

The RFC says:

because they also use isdigit() and isspace(),

Does that mean "too much work needed"? I would totally understand that, of
course, but I hope someone can go this last mile.

Yes.

and because they are intended for natural language processing

I definitely do not agree with this argument and, to me, it should be
removed from the RFC, as it might cause confusion in the future.

Done.

-- Tim Starling

3 years ago by Nicolas Grekas

On Thu, 25 Nov 2021 at 12:23, Tim Starling tstarling@wikimedia.org wrote:

The RFC says:

because they also use isdigit() and isspace(),

Does that mean "too much work needed"? I would totally understand that, of
course, but I hope someone can go this last mile.

Yes.

and because they are intended for natural language processing

I definitely do not agree with this argument and, to me, it should be
removed from the RFC, as it might cause confusion in the future.

Done.

Great, thanks!

3 years ago by come@chilliet.eu

On Thursday, 25 November 2021, 06:05:37 CET, Tim Starling wrote:

Voting is now open for my RFC on locale-independent case conversion.

https://wiki.php.net/rfc/strtolower-ascii

Hello,

The RFC is missing information about alternatives:
Do all of these functions have an mbstring version?
Are those locale-dependent, or do they have an option for it?

To reuse the example from the RFC, if I want to convert a UTF string to uppercase using Turkish rules and get dotted capital I, what should I use?

Côme

3 years ago by Hans Henrik Bergan

By the way, why doesn't this code produce a dotted capital I on 3v4l?
https://3v4l.org/D1WG1#v7.4.26
It gets ["res_hex"]=> string(2) "49"

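<?php
// Try to switch to a Turkish locale, then uppercase "i".
setlocale(LC_ALL, "Turkish");
$str = "i";
$res = strtoupper($str);
// Dump the input and the result, both as text and as hex bytes.
var_dump([
    "str" => $str,
    "str_hex" => bin2hex($str),
    "res" => $res,
    "res_hex" => bin2hex($res),
]);
?>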

On Thursday, 25 November 2021, 06:05:37 CET, Tim Starling wrote:

Voting is now open for my RFC on locale-independent case conversion.

https://wiki.php.net/rfc/strtolower-ascii

Hello,

The RFC is missing information about alternatives:
Do all of these functions have an mbstring version?
Are those locale-dependent, or do they have an option for it?

To reuse the example from the RFC, if I want to convert a UTF string to
uppercase using Turkish rules and get dotted capital I, what should I use?

Côme

3 years ago by Dusk

By the way, why doesn't this code produce a dotted capital I on 3v4l?
https://3v4l.org/D1WG1#v7.4.26
It gets ["res_hex"]=> string(2) "49"

<?php
setlocale(LC_ALL, "Turkish");

Because "Turkish" isn't a locale. "tr_TR" is.

https://3v4l.org/GD91W#v7.4.26

Notice that the output doesn't show up correctly, as it is not UTF-8. (Which is part of the problem addressed by this RFC.)
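
A minimal sketch of how to detect this kind of silent failure: setlocale() returns false when the requested locale is not available, so the return value is worth checking. The candidate list below is an assumption; locale names are platform-specific, and which ones are installed depends on the system.

```php
<?php
// Candidate names are assumptions; availability depends on the system.
$candidates = ['tr_TR', 'tr_TR.UTF-8'];

// setlocale() accepts an array and tries each name in turn.
// It returns the name that was set, or false if none could be set.
$selected = setlocale(LC_CTYPE, $candidates);

if ($selected === false) {
    echo "No Turkish locale available; case conversion rules are unchanged.\n";
} else {
    echo "Using locale: $selected\n";
    // Print the bytes rather than the text, since the result may be
    // in a single-byte encoding and not valid UTF-8.
    echo bin2hex(strtoupper('i')), "\n";
}
```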

3 years ago by Tim Starling

Hello,

The RFC is missing information about alternatives:
Do all of these functions have an mbstring version?

The following functions have an mbstring version: strtolower,
strtoupper, stristr, stripos, strripos.

mb_convert_case() provides functionality equivalent to lcfirst,
ucfirst and ucwords.

There is no mbstring version of str_ireplace, that is
https://bugs.php.net/bug.php?id=75225

There is no mbstring equivalent for the array sorting functions with
SORT_FLAG_CASE, but there is Collator::asort() in intl.

Are those locale-dependent, or do they have an option for it?

The mbstring functions are locale-independent.

Unfortunately there do not seem to be PHP wrappers for the family of
case conversion functions in ICU's ustring.h. There is
IntlChar::tolower() and IntlChar::toupper(), but they provide
locale-independent case conversion, equivalent to mbstring. It's not
ideal to change the case of a string character by character, since
some languages have multi-character mappings. ICU calls this
context-sensitive case conversion.
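
As a rough illustration of the alternatives listed above, here is a sketch using mbstring, with an optional IntlChar comparison. It assumes the mbstring extension and PHP 7.3 or later (for multi-character case mappings); example_mb_ucfirst() is a hypothetical helper written for this example, not an mbstring function.

```php
<?php
// Whole-string conversion can expand one character into two
// ("ß" becomes "SS" on PHP 7.3+).
$upper = mb_strtoupper('straße', 'UTF-8');

// Title-casing every word, as a stand-in for ucwords().
$title = mb_convert_case('hello world', MB_CASE_TITLE, 'UTF-8');

// Case-insensitive search; the offset is counted in characters, not bytes.
$pos = mb_stripos('Čeština ŠKODA', 'škoda', 0, 'UTF-8');

// Hypothetical helper: uppercase only the first character, like ucfirst().
function example_mb_ucfirst(string $s): string
{
    return mb_strtoupper(mb_substr($s, 0, 1, 'UTF-8'), 'UTF-8')
        . mb_substr($s, 1, null, 'UTF-8');
}

var_dump($upper, $title, $pos, example_mb_ucfirst('élan'));

// Per-character conversion (intl extension) maps one code point to one
// code point, so it cannot expand "ß" and returns it unchanged.
if (class_exists('IntlChar')) {
    var_dump(IntlChar::toupper('ß'));
}
```

The difference between the first and last calls is the multi-character mapping issue mentioned above: converting a string as a whole can change its length, while converting it code point by code point cannot.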

Considering the lack of wide character support or context-sensitive
case conversion in the existing strtoupper/strtolower, I would
consider this missing functionality rather than functionality which I
am removing.

To reuse the example from the RFC, if I want to convert a UTF string to uppercase using Turkish rules and get dotted capital I, what should I use?

For case-insensitive comparison you can use Collator. But for display
you just have to do it yourself. For the Turkish Wikipedia and other
Turkic language websites we are currently using str_replace().

-- Tim Starling
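
A minimal sketch of the "do it yourself" approach with str_replace(), assuming UTF-8 input and the mbstring extension. The helper names are hypothetical, and this is not the code used on the Turkish Wikipedia; it only shows the idea of handling the dotted and dotless i before a generic conversion.

```php
<?php
// Hypothetical helper: uppercase using Turkish i rules.
function example_turkish_upper(string $s): string
{
    // Map "i" to dotted capital "İ" (U+0130) first; the generic
    // conversion then turns dotless "ı" (U+0131) into plain "I".
    $s = str_replace('i', 'İ', $s);
    return mb_strtoupper($s, 'UTF-8');
}

// Hypothetical helper: lowercase using Turkish i rules.
function example_turkish_lower(string $s): string
{
    // Handle both capitals up front: "I" becomes dotless "ı" and
    // "İ" becomes "i", so the generic conversion never sees them.
    $s = str_replace(['I', 'İ'], ['ı', 'i'], $s);
    return mb_strtolower($s, 'UTF-8');
}

echo example_turkish_upper('istanbul ılık'), "\n"; // İSTANBUL ILIK
echo example_turkish_lower('ISPARTA İZMİR'), "\n"; // ısparta izmir
```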

3 years ago by Paul Crovella

To reuse the example from the RFC, if I want to convert a UTF string to uppercase using Turkish rules and get dotted capital I, what should I use?

For case-insensitive comparison you can use Collator. But for display
you just have to do it yourself. For the Turkish Wikipedia and other
Turkic language websites we are currently using str_replace().

Any particular reason not to use transliterators? https://3v4l.org/I038T

3 years ago by Tim Starling

To reuse the example from the RFC, if I want to convert a UTF string to uppercase using Turkish rules and get dotted capital I, what should I use?

For case-insensitive comparison you can use Collator. But for display
you just have to do it yourself. For the Turkish Wikipedia and other
Turkic language websites we are currently using str_replace().

Any particular reason not to use transliterators? https://3v4l.org/I038T

Thanks, I missed that.

You would need to do your own mapping from language code to
transliterator name, since it only has converters for az/tr, el, lt
and "Any", with no fallbacks. For example if you did
Transliterator::create("en-Upper")->transliterate('a') you would get a
fatal error.

Presumably if I submitted a PR adding wrappers for u_strToUpper()
etc., it would not be rejected on the basis that we already have
transliterators.

-- Tim Starling
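
The language-to-transliterator mapping described above could be sketched like this. It assumes the intl extension with default error settings, under which Transliterator::create() returns null for an unknown ID; example_upper_for() is a hypothetical helper, and falling back to "Any-Upper" is one possible choice rather than an established convention.

```php
<?php
// Hypothetical helper: use a language-specific uppercase transliterator
// when one exists, otherwise fall back to the generic "Any-Upper".
function example_upper_for(string $text, string $lang): string
{
    // create() returns null when no transliterator has the given ID,
    // which is what makes a direct method call on the result fatal.
    $transliterator = Transliterator::create("$lang-Upper")
        ?? Transliterator::create('Any-Upper');

    if ($transliterator === null) {
        throw new RuntimeException('No uppercase transliterator available');
    }

    $result = $transliterator->transliterate($text);
    if ($result === false) {
        throw new RuntimeException('Transliteration failed');
    }
    return $result;
}

echo example_upper_for('istanbul', 'tr'), "\n"; // Turkish rules: dotted İ
echo example_upper_for('istanbul', 'en'), "\n"; // generic rules: plain I
```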

3 years ago by Dan Ackroyd

Voting is now open for my RFC on locale-independent case conversion.

It seems popular, and likely to pass, but I voted no because the "Backward
Incompatible Changes" section is missing, which makes it hard to
evaluate the impact.

cheers
Dan
Ack