Hi, Internals
I noticed that UAX #29 places no limit on the number of code points in a grapheme cluster.
https://www.unicode.org/reports/tr29/
An ICU issue confirmed that Unicode defines no such limit.
https://unicode-org.atlassian.net/browse/ICU-23302
This means a single grapheme cluster can contain an arbitrarily large number of code points.
That can crash a program, because computer resources are limited.
For example, the following command writes about 200 MB to emoji_bomb.txt, all of it a single grapheme cluster:
php -d memory_limit=600M -r 'echo(mb_trim(str_repeat("\u{200d}\u{1f468}\u{200d}\u{1f466}\u{200d}\u{1f466}", 10000000), "\u{200d}"));' > emoji_bomb.txt
(PLEASE BE CAREFUL WHEN OPENING emoji_bomb.txt, BECAUSE IT MAY CAUSE A CRASH.)
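As a rough check of the size claim, the arithmetic can be worked through in PHP without generating the file. This is only a sketch: the variable names are illustrative, and it assumes the mbstring extension is loaded.

```php
<?php
// One repetition: 3 x ZWJ (U+200D, 3 bytes each in UTF-8)
// plus 3 x emoji (U+1F468 / U+1F466, 4 bytes each in UTF-8).
$unit = "\u{200d}\u{1f468}\u{200d}\u{1f466}\u{200d}\u{1f466}";
$repeat = 10000000;

$bytesPerUnit = strlen($unit);                  // 21
$codepointsPerUnit = mb_strlen($unit, 'UTF-8'); // 6

// mb_trim() with "\u{200d}" removes only the leading ZWJ here,
// because the string ends with an emoji, not a ZWJ.
$totalBytes = $bytesPerUnit * $repeat - 3;
$totalCodepoints = $codepointsPerUnit * $repeat - 1;

printf(
    "%d bytes (%.1f MiB), %d code points\n",
    $totalBytes,
    $totalBytes / 1048576,
    $totalCodepoints
);
// prints "209999997 bytes (200.3 MiB), 59999999 code points"
```

So the file is roughly 200 MiB and holds close to 60 million code points, and under the emoji ZWJ rule in UAX #29 all of them belong to one grapheme cluster.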
So I think we (php-src, at the programming-language level) need a new function that enforces a limit of our own.
My idea is below:
grapheme_limit_codepoints(string $str, int $max_codepoints = 32): bool
I don't have a strong opinion on 32 as the default for $max_codepoints.
However, 32 code points should be enough for a grapheme cluster, because
the longest one in a human language is probably Hakṣhmalawarayaṁ (ཧྐྵྨླྺྼྻྂ), at
9 code points.
If more code points per grapheme cluster are needed,
userland can increase $max_codepoints.
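To make the proposed semantics concrete, here is a minimal userland sketch. It assumes the function returns true only when every grapheme cluster in the string has at most $max_codepoints code points. The name userland_grapheme_limit_codepoints is hypothetical, chosen so it does not clash with the proposed built-in, and the behavior for empty strings is a guess. It needs the intl and mbstring extensions. Unlike a native implementation, it copies each cluster into a new string, so it does not by itself protect against very large input.

```php
<?php
// Hypothetical userland approximation of the proposed function.
// Not the php-src implementation; semantics are assumed.
function userland_grapheme_limit_codepoints(string $str, int $maxCodepoints = 32): bool
{
    // ICU's character break iterator splits text into grapheme clusters.
    $it = IntlBreakIterator::createCharacterInstance();
    $it->setText($str);

    // getPartsIterator() yields each cluster as a substring.
    foreach ($it->getPartsIterator() as $cluster) {
        if (mb_strlen($cluster, 'UTF-8') > $maxCodepoints) {
            return false; // this cluster exceeds the limit
        }
    }
    return true; // no cluster exceeded the limit (also true for "")
}

// "e" followed by 40 combining acute accents: one cluster, 41 code points.
$stacked = "e" . str_repeat("\u{0301}", 40);

var_dump(userland_grapheme_limit_codepoints("abc"));        // bool(true)
var_dump(userland_grapheme_limit_codepoints($stacked));     // bool(false)
var_dump(userland_grapheme_limit_codepoints($stacked, 64)); // bool(true)
```

The last call shows the escape hatch described above: a caller who really needs longer clusters raises the limit explicitly.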
Please also see my slides on Speaker Deck.
https://speakerdeck.com/youkidearitai/limit-of-code-point-for-grapheme-cluster
What do you think about this idea?
Regards
Yuya
--
Yuya Hamada (tekimen)
Hi Yuya,
I think this is a good idea. While spec compliance is generally desirable,
DoS via unbounded grapheme clusters is a real threat, and it's reasonable
for a language-level implementation to impose practical limits that the
Unicode spec itself doesn't define. This kind of gap between a
general-purpose spec and a concrete implementation is not unusual.
The default of 32 code points sounds sensible given that natural language
grapheme clusters top out well below that.
One minor note: it might help to clarify the intended behavior of
grapheme_limit_codepoints a bit more — for instance, whether it is meant
as a validation check (returning false when a cluster exceeds the limit) or
something else.
Regards,
Kentaro Takeda
On Mon, Feb 23, 2026 at 20:28, youkidearitai <youkidearitai@gmail.com> wrote:
On Tue, Feb 24, 2026 at 11:38, Kentaro Takeda <takeda@youmind.jp> wrote:
Hi, Kentaro
Thank you very much for your feedback.
> One minor note: it might help to clarify the intended behavior of
> grapheme_limit_codepoints a bit more: for instance, whether it is meant as a
> validation check (returning false when a cluster exceeds the limit) or something else.
Okay, here is an example.
// $_POST['text'] holds some user-supplied string.
// Reject input in which any grapheme cluster has too many code points.
if (grapheme_limit_codepoints($_POST['text'], 32) !== true) {
    throw new InvalidArgumentException("Found invalid input or too many code points in a grapheme cluster");
}
// Validate the length in grapheme clusters.
if (grapheme_strlen($_POST['text']) > 100) {
    throw new InvalidArgumentException("Input is longer than 100 graphemes");
}
// do anything...
The intention is to count graphemes correctly while avoiding DoS.
I also want to address the problem raised in
https://github.com/symfony/symfony/pull/13527 for the grapheme_strlen
function.
Feel free to comment further.
Regards
Yuya.
--
Yuya Hamada (tekimen)
On Tue, Feb 24, 2026 at 16:21, youkidearitai <youkidearitai@gmail.com> wrote:
Hi, Internals
I created a PoC and RFC.
https://github.com/php/php-src/pull/21311
https://wiki.php.net/rfc/grapheme_limit_codepoints
I asked Unicode to add a limit on the number of code points per
grapheme cluster to UAX #29.
Unicode may adopt my suggestion if it makes sense, but I
don't know what will happen.
In any case, I think it makes sense to limit the code points per grapheme cluster on the PHP side.
Feel free to comment.
Regards
Yuya
--
Yuya Hamada (tekimen)
On Sat, Feb 28, 2026 at 0:59, youkidearitai <youkidearitai@gmail.com> wrote:
Hi, Internals
I reported this topic to Unicode and received the reply below:
Thank you for your feedback and your interest in Unicode.
Your feedback will be reviewed by one of Unicode’s working groups.
If appropriate, it may be posted to the PRI feedback page or be made part of a list of general feedback that will be considered for the next quarterly UTC meeting.
My understanding is that, if appropriate, it will go to the PRI feedback page (https://www.unicode.org/review/) or to the UTC.
I'm going to wait and see.
Regards
Yuya
--
Yuya Hamada (tekimen)