Newsgroups: php.internals
Date: Tue, 24 Feb 2026 10:57:00 +0900
Subject: Re: [PHP-DEV][DISCUSSION] Limit of code point for grapheme cluster in programming languages
To: php internals
From: takeda@youmind.jp (Kentaro Takeda)

Hi Yuya,

I think this is a good idea. While spec compliance is generally desirable, DoS via unbounded grapheme clusters is a real threat, and it's reasonable for a language-level implementation to impose practical limits that the Unicode spec itself doesn't define. This kind of gap between a general-purpose spec and a concrete implementation is not unusual.

The default of 32 code points sounds sensible given that natural language grapheme clusters top out well below that.

One minor note: it might help to clarify the intended behavior of `grapheme_limit_codepoints` a bit more, for instance whether it is meant as a validation check (returning false when a cluster exceeds the limit) or something else.
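
To make the question concrete, here is a minimal userland sketch of the validation-check reading. The name `grapheme_limit_codepoints_sketch` and its semantics (true only when every cluster is within the limit, false on invalid UTF-8) are assumptions for illustration, not the proposed implementation. It also materializes every cluster in memory, so it only shows the intended result and is not itself safe against very large inputs.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical userland sketch of the proposed grapheme_limit_codepoints().
 * Assumed semantics: return true only if every grapheme cluster in $str
 * has at most $max_codepoints code points.
 */
function grapheme_limit_codepoints_sketch(string $str, int $max_codepoints = 32): bool
{
    // \X matches one extended grapheme cluster; the u flag enables UTF-8 mode.
    $count = preg_match_all('/\X/u', $str, $matches);
    if ($count === false) {
        // Invalid UTF-8 or a PCRE error: treat it as failing validation.
        return false;
    }

    foreach ($matches[0] as $cluster) {
        // With the s and u flags, "." matches exactly one code point.
        if (preg_match_all('/./su', $cluster) > $max_codepoints) {
            return false;
        }
    }

    return true;
}

// One base letter followed by 40 combining acute accents:
// 41 code points that form a single grapheme cluster.
$stacked = 'e' . str_repeat("\u{0301}", 40);

var_dump(grapheme_limit_codepoints_sketch('hello'));      // bool(true)
var_dump(grapheme_limit_codepoints_sketch($stacked));     // bool(false)
var_dump(grapheme_limit_codepoints_sketch($stacked, 64)); // bool(true)
```

A native implementation would presumably scan the string incrementally and stop at the first cluster that exceeds the limit, rather than collecting all clusters first.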

Regards,
Kentaro Takeda


On Mon, Feb 23, 2026 at 20:28, youkidearitai <youkidearitai@gmail.com> wrote:
Hi, Internals

I noticed that UAX #29 does not limit the number of code points in a grapheme cluster.
https://www.unicode.org/reports/tr29/

An ICU issue confirms that Unicode defines no such limit.
https://unicode-org.atlassian.net/browse/ICU-23302

This means a single grapheme cluster can contain an arbitrarily large number of code points,
which can crash a program because computer resources are limited.

For example, this command writes about 200MB to emoji_bomb.txt, yet the output is a single grapheme cluster:
```
php -r 'echo(mb_trim(str_repeat("\u{200d}\u{1f468}\u{200d}\u{1f466}\u{200d}\u{1f466}", 10000000), "\u{200d}"));' -d memory_limit=600M > emoji_bomb.txt
```
(PLEASE BE CAREFUL WHEN OPENING emoji_bomb.txt: IT MAY CRASH THE PROGRAM YOU OPEN IT WITH.)
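
The same construction can be inspected safely at a small scale. The sketch below is an illustration, not part of the original command: it repeats the sequence 3 times instead of 10,000,000 and uses `substr()` in place of `mb_trim()` to drop the leading ZWJ, so it does not need PHP 8.4. The variable names are made up for the example. On a PCRE2 build that implements the emoji ZWJ rule from UAX #29, the cluster count should come out as 1.

```php
<?php
declare(strict_types=1);

$zwj  = "\u{200d}";
// The repeated unit from the command above: ZWJ, man, ZWJ, boy, ZWJ, boy.
$unit = $zwj . "\u{1f468}" . $zwj . "\u{1f466}" . $zwj . "\u{1f466}";

// Three repetitions, with the leading ZWJ removed so the string starts with an emoji.
$small = substr(str_repeat($unit, 3), strlen($zwj));

$bytes      = strlen($small);                    // UTF-8 byte length
$codepoints = preg_match_all('/./su', $small);   // "." matches one code point
$clusters   = preg_match_all('/\X/u', $small);   // \X matches one grapheme cluster

printf("%d bytes, %d code points, %d grapheme cluster(s)\n", $bytes, $codepoints, $clusters);
```

Each repetition is 21 bytes in UTF-8 (three 3-byte ZWJs and three 4-byte emoji), which is where the roughly 200MB for 10,000,000 repetitions comes from.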

So I think we (php-src, at the programming language level) need a new
function that enforces a custom limit.
My idea is below:

```
grapheme_limit_codepoints(string $str, int $max_codepoints = 32): bool
```

I don't have a strong opinion on 32 as the default for $max_codepoints.
However, 32 code points should be enough for a grapheme cluster, because
the longest one in a human language is probably Hakṣhmalawarayaṁ (ཧ), at
9 code points.

If more code points are needed in a grapheme cluster,
userland code can increase $max_codepoints.

Please also see my slides on Speaker Deck.
https://speakerdeck.com/youkidearitai/limit-of-code-point-for-grapheme-cluster

What do you think about this idea?

Regards
Yuya

--
---------------------------
Yuya Hamada (tekimen)
- https://tekitoh-memdhoi.info
- https://github.com/youkidearitai
-----------------------------