Newsgroups: php.internals
From: imsop.php@rwec.co.uk ("Rowan Tommins [IMSoP]")
Date: Wed, 27 Mar 2024 23:18:21 +0000
Subject: Re: [PHP-DEV][RFC] grapheme cluster for str_split, grapheme_str_split function
To: internals@lists.php.net
Message-ID: <4952eb36-1c45-4b6f-ae62-fa641e391246@rwec.co.uk>
In-Reply-To: <216d4a09-7921-48f6-b892-8d9605d367ab@app.fastmail.com>
References: <141e31f3-b7cf-4bd1-9bac-c9ec078767ed@app.fastmail.com> <216d4a09-7921-48f6-b892-8d9605d367ab@app.fastmail.com>

On 26/03/2024 21:14, Casper Langemeijer wrote:
> If you need someone to help for the grapheme_ marketing team, let me know.

I think a big part of the problem is that very few people dig into the complexities of text encoding, and so don't know that a "grapheme" is what they're looking for.

Unicode documentation is, generally, very careful with its terminology - distinguishing between "code points", "code units", "graphemes", "grapheme clusters", "glyphs", etc. Pretty much everyone else just says "character", and assumes that everyone knows what they mean.

As a case in point, looking at the PHP manual pages for strlen, mb_strlen, and grapheme_strlen:

Short summary:
- strlen — Get string length
- mb_strlen — Get string length
- grapheme_strlen — Get string length in grapheme units

Description:
- Returns the length of the given string.
- Gets the length of a string.
- Get string length in grapheme units (not bytes or characters)

The first two don't actually say what units they're measuring in. Maybe it's millimetres? ;)

The last one uses the term "grapheme" without explaining what it means, and makes a contrast with "characters", which is confusing, as one of the definitions in the Unicode glossary [https://unicode.org/glossary/#grapheme] is:

> What a user thinks of as a character.

The mb_strlen documentation has a bit more explanation in its Return Values section:

> Returns the number of characters in string string having character encoding encoding. A multi-byte character is counted as 1.

For Unicode in particular, this is a poor description; it is completely missing the term "code point", which is what it actually counts.

That's probably because ext/mbstring wasn't written with Unicode in mind: it was "developed to handle Japanese characters", back in 2001, and it still supports several pre-Unicode "multi-byte encodings". For a bit of nostalgia: http://web.archive.org/web/20010605075550/http://www.php.net/manual/en/ref.mbstring.php

So... if you want to help make people more aware of the grapheme_* functions, one place to start would be editing the documentation for the various string, mbstring, and grapheme functions to use consistent terminology, and sign-post each other more clearly. http://doc.php.net/tutorial/

Regards,

-- 
Rowan Tommins [IMSoP]
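
To make the difference between the three units concrete, here is a minimal sketch. It is an illustration only and assumes that the mbstring and intl extensions are loaded and that the string is UTF-8; the sample string and variable names are made up for the example.

```php
<?php
// Assumption: ext/mbstring and ext/intl are available, and the text is UTF-8.
// "e" followed by U+0301 COMBINING ACUTE ACCENT, which renders as a single "é".
$text = "e\u{0301}";

$bytes      = strlen($text);             // bytes in the UTF-8 encoding
$codePoints = mb_strlen($text, 'UTF-8'); // Unicode code points
$graphemes  = grapheme_strlen($text);    // grapheme clusters ("what a user thinks of as a character")

printf("bytes=%d code points=%d graphemes=%d\n", $bytes, $codePoints, $graphemes);
// prints "bytes=3 code points=2 graphemes=1"
```

The same visible "character" gives three different lengths, which is why documentation that only says "string length" or "characters" leaves the unit ambiguous.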