Hi, Internals
I have updated the RFC below.
- https://wiki.php.net/rfc/grapheme_add_locale_for_case_insensitive
The pull request is here:
- https://github.com/php/php-src/pull/18792
The changes are:
- Add a strength parameter to the grapheme_* functions
- This affects characters from all over the world, e.g. Ideographic Variation Sequences (IVS)
- Use the Collator object's constant values
The $locale parameter does not change anything, because I could not find any way to do that.
Maybe I overlooked something, so please point it out to me.
Regards
Yuya
--
Yuya Hamada (tekimen)
> Hi, Internals
> I have updated the RFC below.
> - https://wiki.php.net/rfc/grapheme_add_locale_for_case_insensitive
> The pull request is here:
> - https://github.com/php/php-src/pull/18792
> The changes are:
> - Add a strength parameter to the grapheme_* functions
> - This affects characters from all over the world, e.g. Ideographic Variation Sequences (IVS)
> - Use the Collator object's constant values
These settings are indeed important for these functions, but I can't get
around the fact that it makes these APIs really cluttered and
complicated — something that many functions in the grapheme_ / intl
extension already suffer from.
Is this API really the best way?
> The $locale parameter does not change anything, because I could not find any way to do that.
It seems that I came to a similar conclusion, but locales are much more
complicated than just languageCode_regionCode (for example, see
https://github.com/derickr/php-text/blob/main/tests/text-contains.phpt#L25)
You also don't really need a strength argument, as you can 'encode' that
in the locale name, like: 'nb_NO-u-ks-primary' (I know, it's rather ugly
and the list of options is vast):
https://www.unicode.org/reports/tr35/tr35-collation.html#Common_Settings
cheers,
Derick
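To illustrate the point about encoding the strength in the locale name, here is a minimal sketch using the existing intl Collator class. It does not use the grapheme_* signature proposed in the RFC. The variable names are made up for this example, and the locale-keyword form assumes an ICU version that accepts collation settings in the locale identifier.

```php
<?php
// Sketch only: existing intl Collator API, not the RFC's proposed parameters.

// Strength set explicitly through the API.
$viaApi = new Collator('nb_NO');
$viaApi->setStrength(Collator::PRIMARY);

// Strength encoded in the locale identifier via the "ks" keyword.
// Assumption: the underlying ICU accepts this form.
$viaLocale = new Collator('nb_NO-u-ks-primary');

// Default strength (tertiary) for comparison.
$default = new Collator('nb_NO');

// At primary strength, case differences are ignored.
var_dump($viaApi->compare('a', 'A'));         // int(0)

// At the default strength, case is significant.
var_dump($default->compare('a', 'A') !== 0);  // bool(true)

// Inspect what the locale keyword produced; compare with Collator::PRIMARY.
var_dump($viaLocale->getStrength());
```

If the locale keyword is honoured, the last call should report the same strength as the explicitly configured collator, which is why a separate strength argument may be redundant.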
On Mon, Jul 14, 2025 at 19:22, Derick Rethans derick@php.net wrote:
Hi, Derick
Thank you very much for your response.
> Is this API really the best way?
I reconsidered the function signature based on what you said.
> It seems that I came to a similar conclusion, but locales are much more
> complicated than just languageCode_regionCode (for example, see
> https://github.com/derickr/php-text/blob/main/tests/text-contains.phpt#L25)
> You also don't really need a strength argument, as you can 'encode' that
> in the locale name, like: 'nb_NO-u-ks-primary' (I know, it's rather ugly
> and the list of options is vast):
> https://www.unicode.org/reports/tr35/tr35-collation.html#Common_Settings
Indeed, since the strength can be specified in the locale,
I think it is better to specify it there rather than
as a separate strength parameter.
For example, the grapheme_* functions can then detect differences in IVS:
$ sapi/cli/php -r 'var_dump(grapheme_levenshtein("\u{908A}", "\u{908A}\u{E0101}", locale: "ja_JP-u-ks-identic"));'
int(1)
$ sapi/cli/php -r 'var_dump(grapheme_levenshtein("\u{908A}", "\u{908A}\u{E0101}"));'
int(0)
$ sapi/cli/php -r 'var_dump(grapheme_strpos("\u{908A}", "\u{908A}\u{E0101}"));'
int(0)
$ sapi/cli/php -r 'var_dump(grapheme_strpos("\u{908A}", "\u{908A}\u{E0101}", locale: "ja_JP-u-ks-identic"));'
bool(false)
Since ideographic characters also have identities (e.g., names), we
would like to make IVS compatible with them.
However, it should be simple, so we should compromise somewhere.
Regards
Yuya
--
Yuya Hamada (tekimen)
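The IVS behaviour shown in the commands above can be approximated with APIs that already exist in intl. The following is a rough sketch, not the RFC implementation: it uses Collator::setStrength() in place of the proposed locale argument, and the variable names are chosen only for illustration.

```php
<?php
// Sketch only: existing intl APIs, not the RFC's proposed locale parameter.

$base = "\u{908A}";           // base ideograph
$ivs  = "\u{908A}\u{E0101}";  // same ideograph followed by a variation selector

// The variation selector extends the base character, so the sequence
// is still a single grapheme cluster.
var_dump(grapheme_strlen($ivs));                 // int(1)

$collator = new Collator('ja_JP');

// At the default strength the variation selector is ignored.
var_dump($collator->compare($base, $ivs));       // int(0)

// At identical strength the code point difference becomes visible.
$collator->setStrength(Collator::IDENTICAL);
var_dump($collator->compare($base, $ivs) !== 0); // bool(true)
```

This mirrors the int(0) versus int(1) results above: whether the two strings count as equal depends only on the collation strength in effect.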
On Tue, Jul 15, 2025 at 16:05, youkidearitai youkidearitai@gmail.com wrote:
Hi, Internals
I have revised this RFC.
https://wiki.php.net/rfc/grapheme_add_locale_for_case_insensitive
I believe I have done my best to address the complexity of Unicode.
I would like to move to the "Voting" phase.
If there are no objections, I would like to start voting this week.
Regards
Yuya
--
Yuya Hamada (tekimen)