Voting is now open for my RFC on locale-independent case conversion.
https://wiki.php.net/rfc/strtolower-ascii
Voting will close in two weeks, on 2021-12-09.
-- Tim Starling
On Thu, 25 Nov 2021 at 06:05, Tim Starling tstarling@wikimedia.org wrote:
Voting is now open for my RFC on locale-independent case conversion.
https://wiki.php.net/rfc/strtolower-ascii
Voting will close in two weeks, on 2021-12-09.
Hi Tim,
I voted yes because I want to see this happen but I raised a point in
https://externals.io/message/116141#116259 and didn't get an answer:
Despite their name, I never used natcase functions for natural language
processing. I use them e.g. to sort lists of files in a directory, to account
for numbers mainly. But that's not what I would call natural language
processing. I'm not aware of anyone using them for that actually. I'm
wondering if it's a good idea to postpone migrating them to a hypothetical
future as to me, the whole reasoning of the RFC applies to them.
I wish the strnatcasecmp() and natcasesort() functions, but also the
SORT_NATURAL flag, were also covered by this RFC.
Is that possible? Would it make sense?
Nicolas
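To make the use case described above concrete, here is a minimal sketch of sorting file names with the natcase functions. The file names are invented for illustration; only natcasesort(), strnatcasecmp() and SORT_NATURAL come from the message.

```php
<?php
// Hypothetical directory listing with mixed case and multi-digit numbers.
$files = ['img12.png', 'IMG10.png', 'img2.png', 'Img1.png'];

// natcasesort() sorts in place, ignores case and compares digit runs as
// numbers. It preserves keys, so reindex before comparing or printing.
$sorted = $files;
natcasesort($sorted);
$sorted = array_values($sorted);

// The same ordering through a comparison callback...
$viaCmp = $files;
usort($viaCmp, 'strnatcasecmp');

// ...and through the sort flags.
$viaFlags = $files;
sort($viaFlags, SORT_NATURAL | SORT_FLAG_CASE);

print_r($sorted);
```

With these inputs all three calls should produce Img1.png, img2.png, IMG10.png, img12.png. Nothing in this ordering involves natural language, which is the point being made.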
I voted yes because I want to see this happen but I raised a point
in https://externals.io/message/116141#116259 and didn't get an answer:
Despite their name, I never used natcase functions for natural language
processing. I use them e.g. to sort lists of files in a directory, to account
for numbers mainly. But that's not what I would call natural language
processing. I'm not aware of anyone using them for that actually. I'm
wondering if it's a good idea to postpone migrating them to a hypothetical
future as to me, the whole reasoning of the RFC applies to them.
I wish the strnatcasecmp() and natcasesort() functions, but also the
SORT_NATURAL flag, were also covered by this RFC.
Is that possible? Would it make sense?
I'm not going to migrate those functions at this time. It's just a
project scope decision.
-- Tim Starling
On Thu, 25 Nov 2021 at 10:47, Tim Starling tstarling@wikimedia.org wrote:
I voted yes because I want to see this happen but I raised a point in
https://externals.io/message/116141#116259 and didn't get an answer:
Despite their name, I never used natcase functions for natural language
processing. I use them e.g. to sort lists of files in a directory, to account
for numbers mainly. But that's not what I would call natural language
processing. I'm not aware of anyone using them for that actually. I'm
wondering if it's a good idea to postpone migrating them to a hypothetical
future as to me, the whole reasoning of the RFC applies to them.
I wish the strnatcasecmp() and natcasesort() functions, but also the
SORT_NATURAL flag, were also covered by this RFC.
Is that possible? Would it make sense?
I'm not going to migrate those functions at this time. It's just a project
scope decision.
Why?
The RFC says:
because they also use isdigit() and isspace(),
Does that mean "too much work needed"? I would totally understand that of
course but I hope someone could do these last miles.
and because they are intended for natural language processing
I definitely do not agree with this argument and it should be removed from
the RFC to me as it might add confusion in the future.
Nicolas
On Thu, 25 Nov 2021 at 10:47, Tim Starling tstarling@wikimedia.org wrote:
and because they are intended for natural language processing
I definitely do not agree with this argument and it should be removed from
the RFC to me as it might add confusion in the future.
Yeah, the PHP manual says[1]:
| This function implements a comparison algorithm that orders
| alphanumeric strings in the way a human being would, this is described
| as a "natural ordering".
[1] https://www.php.net/manual/en/function.strnatcmp.php
--
Christoph M. Becker
On Thu, 25 Nov 2021 at 11:34, Christoph M. Becker cmbecker69@gmx.de wrote:
On Thu, 25 Nov 2021 at 10:47, Tim Starling tstarling@wikimedia.org wrote:
and because they are intended for natural language processing
I definitely do not agree with this argument and it should be removed from
the RFC to me as it might add confusion in the future.
Yeah, the PHP manual says[1]:
| This function implements a comparison algorithm that orders
| alphanumeric strings in the way a human being would, this is described
| as a "natural ordering".
Yep, yet "natural language processing" means processing sentences we write
as humans, e.g. processing this very message. natcase sorting functions are
not meant for that. They're useful to sort items in a list - typically file
names - in a way that makes sense to humans. This is very different from
"natural language processing". Having "natsort" vary by locale doesn't make
more sense than having "sort()" vary by locale. That's my point. The
argument doesn't stand against implementing locale-insensitivity for these
functions and I think the RFC shouldn't make it (the argument).
Nicolas
The RFC says:
because they also use isdigit() and isspace(),
Does that mean "too much work needed"? I would totally understand
that of course but I hope someone could do these last miles.
Yes.
and because they are intended for natural language processing
I definitely do not agree with this argument and it should be
removed from the RFC to me as it might add confusion in the future.
Done.
-- Tim Starling
On Thu, 25 Nov 2021 at 12:23, Tim Starling tstarling@wikimedia.org wrote:
The RFC says:
because they also use isdigit() and isspace(),
Does that mean "too much work needed"? I would totally understand that of
course but I hope someone could do these last miles.
Yes.
and because they are intended for natural language processing
I definitely do not agree with this argument and it should be removed from
the RFC to me as it might add confusion in the future.
Done.
Great, thanks!
On Thursday, 25 November 2021, 06:05:37 CET, Tim Starling wrote:
Voting is now open for my RFC on locale-independent case conversion.
Hello,
The RFC is missing information about alternatives:
Do all of these functions have an mbstring version?
Are those locale-dependent, or do they have an option for it?
To reuse the example from the RFC, if I want to convert a UTF string to uppercase using Turkish rules and get dotted capital I, what should I use?
Côme
btw why is this code not getting dotted capital i on 3v4l?
https://3v4l.org/D1WG1#v7.4.26
it gets ["res_hex"]=> string(2) "49"
<?php
setlocale(LC_ALL, "Turkish");
$str = "i";
$res = strtoupper($str);
var_dump([
    "str" => $str,
    "str_hex" => bin2hex($str),
    "res" => $res,
    "res_hex" => bin2hex($res),
]);
?>
btw why is this code not getting dotted capital i on 3v4l?
https://3v4l.org/D1WG1#v7.4.26
it gets ["res_hex"]=> string(2) "49"
<?php
setlocale(LC_ALL, "Turkish");
Because "Turkish" isn't a locale. "tr_TR" is.
https://3v4l.org/GD91W#v7.4.26
Notice that the output doesn't show up correctly, as it is not UTF-8. (Which is part of the problem addressed by this RFC.)
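The silent failure above can be caught by checking the return value of setlocale(), which is false when none of the requested locales could be set. The following is a minimal sketch; the helper name and the list of candidate locale names are assumptions, and which names exist depends entirely on the host system.

```php
<?php
/**
 * Try a list of candidate locale names for LC_CTYPE and return the one
 * that was applied, or null if none is installed.
 * (Hypothetical helper, not part of PHP.)
 */
function trySetCtypeLocale(array $candidates): ?string
{
    // setlocale() accepts an array of names and returns the first one
    // that could be applied, or false if none could.
    $applied = setlocale(LC_CTYPE, $candidates);
    return $applied === false ? null : $applied;
}

// Candidate names are assumptions; adjust them to the target system.
$applied = trySetCtypeLocale(['tr_TR', 'tr_TR.ISO-8859-9', 'tr_TR.UTF-8']);

if ($applied === null) {
    echo "No Turkish locale is installed on this system\n";
} else {
    echo "Using locale: $applied\n";
    // The bytes printed here depend on the locale's encoding and on
    // the PHP version in use.
    echo bin2hex(strtoupper('i')), "\n";
}
```

No expected output is given on purpose, since it varies with the installed locales.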
Hello,
The RFC is missing information about alternatives:
Do all of these functions have an mbstring version?
The following functions have an mbstring version: strtolower,
strtoupper, stristr, stripos, strripos.
mb_convert_case() provides functionality equivalent to lcfirst,
ucfirst and ucwords.
There is no mbstring version of str_ireplace; that is
https://bugs.php.net/bug.php?id=75225
There is no mbstring equivalent for the array sorting functions with
SORT_FLAG_CASE, but there is Collator::asort() in intl.
Are those locale-dependent, or do they have an option for it?
The mbstring functions are locale-independent.
Unfortunately there do not seem to be PHP wrappers for the family of
case conversion functions in ICU's ustring.h. There is
IntlChar::tolower() and IntlChar::toupper(), but they provide
locale-independent case conversion, equivalent to mbstring. It's not
ideal to change the case of a string character by character, since
some languages have multi-character mappings. ICU calls this
context-sensitive case conversion.
Considering the lack of wide character support or context-sensitive
case conversion in the existing strtoupper/strtolower, I would
consider this missing functionality rather than functionality which I
am removing.
To reuse the example from the RFC, if I want to convert a UTF string to uppercase using Turkish rules and get dotted capital I, what should I use?
For case-insensitive comparison you can use Collator. But for display
you just have to do it yourself. For the Turkish Wikipedia and other
Turkic language websites we are currently using str_replace().
-- Tim Starling
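The str_replace() approach mentioned above could look roughly like this. It is only a sketch and not MediaWiki's actual code: the function name turkishUpper() is hypothetical, and it assumes UTF-8 input, a UTF-8 source file and the mbstring extension.

```php
<?php
/**
 * Uppercase a UTF-8 string using the Turkish i/ı rules.
 * (Hypothetical helper for illustration.)
 */
function turkishUpper(string $text): string
{
    // Handle the two letters whose Turkish mapping differs from the default:
    // dotted i becomes İ (U+0130), dotless ı (U+0131) becomes I.
    $text = str_replace(['i', 'ı'], ['İ', 'I'], $text);

    // Everything else goes through the locale-independent mbstring mapping.
    return mb_strtoupper($text, 'UTF-8');
}

echo turkishUpper('istanbul'), "\n";            // İSTANBUL
echo mb_strtoupper('istanbul', 'UTF-8'), "\n";  // ISTANBUL
```

The byte-level str_replace() is safe here because the byte for ASCII "i" never occurs inside a multi-byte UTF-8 sequence.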
To reuse the example from the RFC, if I want to convert a UTF string to uppercase using Turkish rules and get dotted capital I, what should I use?
For case-insensitive comparison you can use Collator. But for display
you just have to do it yourself. For the Turkish Wikipedia and other
Turkic language websites we are currently using str_replace().
Any particular reason not to use transliterators? https://3v4l.org/I038T
Thanks, I missed that.
You would need to do your own mapping from language code to
transliterator name, since it only has converters for az/tr, el, lt
and "Any", with no fallbacks. For example if you did
Transliterator::create("en-Upper")->transliterate('a') you would get a
fatal error.
Presumably if I submitted a PR adding wrappers for u_strToUpper()
etc., it would not be rejected on the basis that we already have
transliterators.
-- Tim Starling
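The mapping from language code to transliterator name described above might be sketched as follows. The function name is hypothetical, the transliterator IDs are assumed from the languages listed in the message, and the intl extension is required. Checking the result of Transliterator::create() for null avoids the fatal error mentioned for "en-Upper".

```php
<?php
/**
 * Uppercase $text with a language-specific ICU transliterator where one is
 * assumed to exist, falling back to "Any-Upper" otherwise.
 * (Hypothetical helper; the ID list is an assumption.)
 */
function upperForLanguage(string $lang, string $text): string
{
    $special = [
        'az' => 'az-Upper',
        'tr' => 'tr-Upper',
        'el' => 'el-Upper',
        'lt' => 'lt-Upper',
    ];
    $id = $special[$lang] ?? 'Any-Upper';

    // create() returns null for an unknown ID, so check before using it.
    $translit = Transliterator::create($id);
    if ($translit === null) {
        throw new RuntimeException("No transliterator named $id");
    }

    $result = $translit->transliterate($text);
    if ($result === false) {
        throw new RuntimeException("Transliteration with $id failed");
    }
    return $result;
}

echo upperForLanguage('tr', 'istanbul'), "\n";
echo upperForLanguage('en', 'istanbul'), "\n";
```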
Voting is now open for my RFC on locale-independent case conversion.
It seems popular, and likely to pass, but I voted no as the "Backward
Incompatible Changes" section is missing which makes it hard to
evaluate the impact.
cheers
Dan
Ack