[RFC] grapheme cluster for str_split, grapheme_str_split function

1 year ago by youkidearitai

Hello, Internals

I created a wiki page for the grapheme_str_split function.
Please see:
https://wiki.php.net/rfc/grapheme_str_split

I would like to move it to the "Under Discussion" section.

Best Regards
Yuya

--

Yuya Hamada (tekimen)

1 year ago by youkidearitai

Hello, Internals

If there are no further comments, I would like to move to the "Voting" phase.
I will start the vote tomorrow (the 26th).

Thank you
Yuya

--

Yuya Hamada (tekimen)

https://tekitoh-memdhoi.info

https://github.com/youkidearitai

1 year ago by Ayesh Karunaratne

I think it makes sense to add this function, and the PR worked well
too; it correctly split individual graphemes for all complex emojis,
ZWJs, and those Cthulhu texts, and everything else I threw at it.

Good luck for the RFC vote today, hope it passes 🤞.

1 year ago by David CARLIER

I second this, I think it is a good addition which makes a lot of sense.

Cheers.

1 year ago by youkidearitai

Hi, Internals

grapheme_str_split is now in the "Voting" phase.
Voting ends on 10th April at 00:00 GMT.

Regards
Yuya

--

Yuya Hamada (tekimen)

1 year ago by Casper Langemeijer

I'd like to address an issue I have with this RFC.

I'm not sure it solves a problem by itself. If I understand all of this correctly, this only does what can already be accomplished with preg_match_all('/\X/u', ...). The result of this function is, in my opinion, not very useful by itself. I've done some searching on various code platforms, where I mostly find the use case of counting the number of graphemes. I've used it to implement a strrev() that works correctly on multibyte strings.
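
To make the comparison concrete, here is a minimal sketch of the regex approach mentioned above, assuming valid UTF-8 input. The helper names split_graphemes() and reverse_graphemes() are hypothetical and chosen only for illustration; grapheme_str_split() is the function proposed by the RFC, so the sketch calls it only where it exists.

```php
<?php
// Split a UTF-8 string into grapheme clusters using PCRE's \X escape.
function split_graphemes(string $text): array
{
    if (preg_match_all('/\X/u', $text, $matches) === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $matches[0];
}

// Reverse cluster by cluster so combining marks stay attached to their base.
function reverse_graphemes(string $text): string
{
    return implode('', array_reverse(split_graphemes($text)));
}

$word = "e\u{0301}a"; // "e" + U+0301 COMBINING ACUTE ACCENT, then "a"

var_dump(count(split_graphemes($word)));              // int(2)
var_dump(reverse_graphemes($word) === "ae\u{0301}");  // bool(true)

// Only available where the RFC's implementation (intl extension) is present.
if (function_exists('grapheme_str_split')) {
    var_dump(grapheme_str_split($word) === split_graphemes($word));
}
```

A byte-wise strrev() on the same input would separate the accent's bytes from the "e" and produce invalid UTF-8, which is the problem the cluster-wise reversal avoids.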

I'm very sad that mbstring works on code points instead of graphemes, and I would very much like to see something happening in that area, but I think expanding a simple string to an array of as many elements, to give developers a tool to do this in PHP-space, is not good enough. Especially since it can already be achieved with a regexp that already works.

In my opinion: this adds nothing, and tells the PHP developer that it is OK to do count(grapheme_str_split()) for a more accurate mb_strlen().

I would like to see a family of functions that can do multibyte str_split(), strrev(), substr(). Ideally as a bugfix in the mb_* functions, because wanting to know the length of a string in code points is a weird edge case. No developer wants to know that. mb_strlen() should have returned the number of graphemes from the start.

1 year ago by Derick Rethans

I'd like to address an issue I have with this RFC.

Please don't top reply.

I would like to see a family of functions that can do multibyte str_split(), strrev(), substr(). Ideally as a bugfix in the mb_* functions, because wanting to know the length of a string in code points is a weird edge case. No developer wants to know that. mb_strlen() should have returned the number of graphemes from the start.

Many of these already exist, such as grapheme_substr. We can't simply change the behaviour of the already existing functions due to BC reasons.

The intl extension is also built on ICU, an actual Unicode text processing library.

The grapheme_str_split function, as well as the other intl extension functions, is what should really replace mbstring.
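
As a rough illustration of the difference between the two families, assuming the intl and mbstring extensions are both loaded, the sketch below takes the first "character" of a string with mb_substr() and with grapheme_substr(). The helper name first_unit() is hypothetical.

```php
<?php
// Take the first unit of a UTF-8 string with each function family.
function first_unit(string $text): array
{
    return [
        // mb_substr counts code points: the combining accent is left behind.
        'code_point' => mb_substr($text, 0, 1, 'UTF-8'),
        // grapheme_substr counts grapheme clusters: the accent stays attached.
        'grapheme'   => grapheme_substr($text, 0, 1),
    ];
}

if (extension_loaded('intl') && extension_loaded('mbstring')) {
    $text = "e\u{0301}t"; // "e" + U+0301 COMBINING ACUTE ACCENT, then "t"
    $first = first_unit($text);

    var_dump($first['code_point'] === "e");         // bool(true)
    var_dump($first['grapheme'] === "e\u{0301}");   // bool(true)
}
```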

cheers
Derick

1 year ago by Casper Langemeijer

Many of these already exist, such as grapheme_substr. We can't simply change the behaviour of the already existing functions due to BC reasons.

Wow. I feel very stupid. I feel I should have known about grapheme_*, but I didn't. Oh my, the manual says since PHP 5.3, no less. From what I've seen being used around, I'm far from the only one though. In an attempt to justify my own stupidity I searched for its use, and it's bad.

Searching on github with language:PHP:
mb_strlen 84k files, grapheme_strlen 680

A large share of the first 100 of these files are stubs/polyfills/phpstan metadata. I've seen no framework except Symfony (but others might be further down in the search results).

The grapheme_str_split function, as well as other intl extension functions is what should replace mbstring really.

YES!

I'm sorry to have wasted your time. If you need someone to help for the grapheme_ marketing team, let me know.

1 year ago by youkidearitai

Hi, Casper

I think the mbstring functions are still useful, because they remain
valid as a bridge to non-Unicode character encodings.
We think it makes sense for mbstring to count in Unicode code point units.

Therefore, I think it makes sense to keep the mbstring functions and the
grapheme functions separate.

Regards
Yuya

--

Yuya Hamada (tekimen)

1 year ago by Rowan Tommins [IMSoP]

If you need someone to help for the grapheme_ marketing team, let me know.

I think a big part of the problem is that very few people dig into the
complexities of text encoding, and so don't know that a "grapheme" is
what they're looking for.

Unicode documentation is, generally, very careful with its terminology -
distinguishing between "code points", "code units", "graphemes",
"grapheme clusters", "glyphs", etc. Pretty much everyone else just says
"character", and assumes that everyone knows what they mean.

As a case in point, looking at the PHP manual pages for strlen,
mb_strlen, and grapheme_strlen:

Short summary:

strlen — Get string length
mb_strlen — Get string length
grapheme_strlen — Get string length in grapheme units

Description:

strlen — Returns the length of the given string.
mb_strlen — Gets the length of a string.
grapheme_strlen — Get string length in grapheme units (not bytes or characters)

The first two don't actually say what units they're measuring in. Maybe
it's millimetres? ;)

The last one uses the term "grapheme" without explaining what it means,
and makes a contrast with "characters", which is confusing, as one of
the definitions in the Unicode glossary
[https://unicode.org/glossary/#grapheme] is:

What a user thinks of as a character.

The mb_strlen documentation has a bit more explanation in its Return
Values section:

Returns the number of characters in string string having character
encoding encoding. A multi-byte character is counted as 1.

For Unicode in particular, this is a poor description; it is completely
missing the term "code point", which is what it actually counts.
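
A small sketch of the three units these functions count, assuming UTF-8 input and that the mbstring and intl extensions are loaded. The helper name length_in_units() is hypothetical.

```php
<?php
// Measure the same UTF-8 string in three different units.
function length_in_units(string $text): array
{
    return [
        'bytes'       => strlen($text),             // bytes (UTF-8 code units)
        'code_points' => mb_strlen($text, 'UTF-8'), // Unicode code points
        'graphemes'   => grapheme_strlen($text),    // grapheme clusters
    ];
}

if (extension_loaded('mbstring') && extension_loaded('intl')) {
    // "e" followed by U+0301 COMBINING ACUTE ACCENT is displayed as one "é".
    print_r(length_in_units("e\u{0301}"));
    // bytes => 3, code_points => 2, graphemes => 1
}
```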

That's probably because ext/mbstring wasn't written with Unicode in
mind, it was "developed to handle Japanese characters", back in 2001;
and it still does support several pre-Unicode "multi-byte encodings".
For a bit of nostalgia:
http://web.archive.org/web/20010605075550/http://www.php.net/manual/en/ref.mbstring.php

So... if you want to help make people more aware of the grapheme_*
functions, one place to start would be editing the documentation for the
various string, mbstring, and grapheme functions to use consistent
terminology, and sign-post each other more clearly.
http://doc.php.net/tutorial/

Regards,

--
Rowan Tommins
[IMSoP]

1 year ago by Derick Rethans

If you need someone to help for the grapheme_ marketing team, let me know.

I think a big part of the problem is that very few people dig into the complexities of text encoding, and so don't know that a "grapheme" is what they're looking for.

Unicode documentation is, generally, very careful with its terminology - distinguishing between "code points", "code units", "graphemes", "grapheme clusters", "glyphs", etc. Pretty much everyone else just says "character", and assumes that everyone knows what they mean.

That's why I have been working on https://wiki.php.net/rfc/unicode_text_processing

It takes all (or most) of the terminology out of it.

It's time to resurrect it.

cheers
Derick

1 year ago by Casper Langemeijer

So... if you want to help make people more aware of the grapheme_*
functions, one place to start would be editing the documentation for the
various string, mbstring, and grapheme functions to use consistent
terminology, and sign-post each other more clearly.
http://doc.php.net/tutorial/

Yes, I agree. I've also edited documentation before, in the SVN days, and I already planned to read up on how this works nowadays.

Also, I'm planning an outline for a conference talk on the subject. I've educated people on Unicode-related subjects before, and think I have a few very good stories that can give unsuspecting developers insight into this.

I love the analogy that most Europeans understand. For the city of Cologne, there are two equally valid ways to write its German name: Köln and Koeln. (The latter is used when hindered by technical limitations, or maybe in informal conversation.) Every German can extra_e_decode() and extra_e_encode(). Same for Straße and Strasse.

Ligatures in fonts make it harder though; sometimes they intentionally obfuscate what's happening in the Unicode layer. You might know this from special programming fonts with glyphs for ===, <> and such.

Some Dutch fonts have a special ligature that combines ij, which was in the Dutch alphabet when I was a kid; 'y' was not. Unicode U+0132 and U+0133 describe this symbol, but I've never seen them used. Fonts combining ij into one visual entity are more common.
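
A brief sketch of how the ij case looks at the Unicode layer, assuming the intl extension is loaded. The helper name describe_ij() is hypothetical. It shows that a font ligature does not change the text: the two-letter spelling is still two graphemes, while U+0133 is a single code point that compatibility normalization maps back to the two letters.

```php
<?php
// Compare the dedicated ligature code point with the two-letter spelling.
function describe_ij(): array
{
    $ligature = "\u{0133}"; // LATIN SMALL LIGATURE IJ, a single code point
    $letters  = "ij";       // two letters that a font may draw as one glyph

    return [
        'ligature_graphemes' => grapheme_strlen($ligature),
        'letters_graphemes'  => grapheme_strlen($letters),
        // NFKC (compatibility) normalization maps the ligature to "i" + "j".
        'nfkc_equal' => Normalizer::normalize($ligature, Normalizer::FORM_KC) === $letters,
    ];
}

if (extension_loaded('intl')) {
    print_r(describe_ij());
    // ligature_graphemes => 1, letters_graphemes => 2, nfkc_equal => 1 (true)
}
```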

I imagine most languages and cultures have these kinds of edge cases.