From: youkidearitai@gmail.com (youkidearitai)
Date: Sat, 28 Feb 2026 00:59:00 +0900
Subject: Re: [PHP-DEV][DISCUSSION] Limit of code point for grapheme cluster in programming languages
To: php internals
Newsgroups: php.internals

On Tue, Feb 24, 2026 at 16:21, youkidearitai wrote:
>
> On Tue, Feb 24, 2026 at 11:38, Kentaro Takeda wrote:
> >
> > Hi Yuya,
> >
> > I think this is a good idea.
> > While spec compliance is generally desirable, DoS via unbounded grapheme
> > clusters is a real threat, and it's reasonable for a language-level
> > implementation to impose practical limits that the Unicode spec itself
> > doesn't define. This kind of gap between a general-purpose spec and a
> > concrete implementation is not unusual.
> >
> > The default of 32 code points sounds sensible given that natural language
> > grapheme clusters top out well below that.
> >
> > One minor note: it might help to clarify the intended behavior of
> > `grapheme_limit_codepoints` a bit more, for instance whether it is meant
> > as a validation check (returning false when a cluster exceeds the limit)
> > or something else.
> >
> > Regards,
> > Kentaro Takeda
> >
> >
> > On Mon, Feb 23, 2026 at 20:28, youkidearitai wrote:
> >>
> >> Hi, Internals
> >>
> >> I noticed that UAX #29 does not limit the number of code points in a
> >> grapheme cluster.
> >> https://www.unicode.org/reports/tr29/
> >>
> >> An ICU issue confirmed that Unicode defines no such limit.
> >> https://unicode-org.atlassian.net/browse/ICU-23302
> >>
> >> This means a single grapheme cluster can contain arbitrarily many code
> >> points, which can crash a program because computer resources are limited.
> >>
> >> For example, this command writes about 200 MB to emoji_bomb.txt, yet the
> >> content is a single grapheme cluster:
> >> ```
> >> php -r 'echo(mb_trim(str_repeat("\u{200d}\u{1f468}\u{200d}\u{1f466}\u{200d}\u{1f466}", 10000000), "\u{200d}"));' -d memory_limit=600M > emoji_bomb.txt
> >> ```
> >> (PLEASE BE CAREFUL WHEN OPENING emoji_bomb.txt, BECAUSE IT MAY CAUSE A CRASH.)
> >>
> >> So I think we (php-src, at the programming-language level) need a new
> >> function that enforces a custom limit.
> >> My idea is below:
> >>
> >> ```
> >> grapheme_limit_codepoints(string $str, int $max_codepoints = 32): bool
> >> ```
> >>
> >> I don't have a strong opinion on whether $max_codepoints should be 32.
> >> However, 32 code points should be enough for a grapheme cluster, because
> >> the longest cluster in a human language is probably
> >> Hakṣhmalawarayaṁ(ཧ), at 9 code points.
> >>
> >> If more code points per grapheme cluster are needed,
> >> userland can increase $max_codepoints.
> >>
> >> Please also see my slides on Speaker Deck.
> >> https://speakerdeck.com/youkidearitai/limit-of-code-point-for-grapheme-cluster
> >>
> >> What do you think about this idea?
> >>
> >> Regards
> >> Yuya
> >>
> >> --
> >> ---------------------------
> >> Yuya Hamada (tekimen)
> >> - https://tekitoh-memdhoi.info
> >> - https://github.com/youkidearitai
> >> -----------------------------
>
> Hi, Kentaro
>
> Thank you very much for your feedback.
>
> > One minor note: it might help to clarify the intended behavior of
> > `grapheme_limit_codepoints` a bit more, for instance whether it is meant
> > as a validation check (returning false when a cluster exceeds the limit)
> > or something else.
>
> Okay, here is an example.
>
> ```
> // $_POST['text'] holds some string.
> // Reject input with too many code points in a single grapheme cluster.
> if (grapheme_limit_codepoints($_POST['text'], 32) !== true) {
>     throw new InvalidArgumentException("Found invalid input or too many code points in a grapheme cluster");
> }
>
> // Validate the length in grapheme clusters.
> if (grapheme_strlen($_POST['text']) > 100) {
>     throw new InvalidArgumentException("Input is longer than 100 graphemes");
> }
>
> // do anything...
> ```
> The intention is to count graphemes correctly while avoiding DoS.
> I also want to address the problem discussed in
> https://github.com/symfony/symfony/pull/13527 for the grapheme_strlen
> function.
>
> Feel free to comment further.
> Regards
> Yuya.
>
> --
> ---------------------------
> Yuya Hamada (tekimen)
> - https://tekitoh-memdhoi.info
> - https://github.com/youkidearitai
> -----------------------------

Hi, Internals

I created a PoC and an RFC.
https://github.com/php/php-src/pull/21311
https://wiki.php.net/rfc/grapheme_limit_codepoints

I have also asked Unicode whether UAX #29 could add a limit on the number of
code points in a grapheme cluster. Unicode may adopt my suggestion if it
makes sense to them, but I don't know what will happen.

Either way, I think it makes sense for PHP to limit the number of code points
per grapheme cluster on its side.

Feel free to comment.

Regards
Yuya

-- 
---------------------------
Yuya Hamada (tekimen)
- https://tekitoh-memdhoi.info
- https://github.com/youkidearitai
-----------------------------
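
For illustration, the validation check discussed in this thread can be
approximated in userland. This is only a minimal sketch, not the
implementation in the PoC or the RFC. The name
grapheme_limit_codepoints_sketch is hypothetical, the intl and mbstring
extensions are assumed to be loaded, and treating invalid UTF-8 as a failure
is an assumption. It still has to walk every cluster in full, so it shows the
intended result of the check rather than how a native version would bound
resource use.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical userland approximation of the proposed check.
 * Returns true when every grapheme cluster in $str has at most
 * $maxCodepoints code points, and false otherwise.
 */
function grapheme_limit_codepoints_sketch(string $str, int $maxCodepoints = 32): bool
{
    // Assumption: invalid UTF-8 counts as a validation failure.
    if (!mb_check_encoding($str, 'UTF-8')) {
        return false;
    }

    // The character-instance break iterator splits on grapheme cluster boundaries.
    $iterator = IntlBreakIterator::createCharacterInstance();
    if ($iterator === null) {
        return false;
    }
    $iterator->setText($str);

    foreach ($iterator->getPartsIterator() as $cluster) {
        // mb_strlen counts code points, not bytes or graphemes.
        if (mb_strlen($cluster, 'UTF-8') > $maxCodepoints) {
            return false;
        }
    }

    return true;
}

// Usage: a base letter followed by 40 combining acute accents forms one
// grapheme cluster made of 41 code points.
$stacked = 'e' . str_repeat("\u{0301}", 40);
var_dump(grapheme_limit_codepoints_sketch('hello'));      // ordinary text
var_dump(grapheme_limit_codepoints_sketch($stacked));     // checked against the default of 32
var_dump(grapheme_limit_codepoints_sketch($stacked, 64)); // checked against a raised limit
```

The combining-mark input is used here instead of the emoji sequence from the
thread because it is small and its segmentation should not depend on the ICU
version.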