Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:122799
List-Id: internals.lists.php.net
MIME-Version: 1.0
Message-ID: <1dfd354a-898c-4721-818c-0284191354a5@app.fastmail.com>
In-Reply-To: <4952eb36-1c45-4b6f-ae62-fa641e391246@rwec.co.uk>
References: <141e31f3-b7cf-4bd1-9bac-c9ec078767ed@app.fastmail.com> <216d4a09-7921-48f6-b892-8d9605d367ab@app.fastmail.com>
 <4952eb36-1c45-4b6f-ae62-fa641e391246@rwec.co.uk>
Date: Thu, 28 Mar 2024 22:25:11 +0100
To: "Levi Morrison"
Subject: Re: [PHP-DEV][RFC] grapheme cluster for str_split, grapheme_str_split function
Content-Type: multipart/alternative; boundary=5ba9b3f1d27d4b798c20d37e91168ef1
From: langemeijer@php.net ("Casper Langemeijer")

--5ba9b3f1d27d4b798c20d37e91168ef1
Content-Type: text/plain;charset=utf-8
Content-Transfer-Encoding: 8bit

> So... if you want to help make people more aware of the grapheme_*
> functions, one place to start would be editing the documentation for the
> various string, mbstring, and grapheme functions to use consistent
> terminology, and sign-post each other more clearly.
> http://doc.php.net/tutorial/

Yes, I agree. I have also edited the documentation before, in the svn days. I already planned to read up on how this works nowadays.

I'm also planning an outline for a conference talk on the subject. I've educated people on Unicode-related subjects before, and I think I have a few very good stories that can give unsuspecting developers insight into this.

I love the analogy that most Europeans understand. For the city of Cologne, there are two equally valid ways to write its German name: Köln and Koeln. (The latter is used when hindered by technical limitations, or maybe in informal conversation.) Every German can extra_e_decode() and extra_e_encode(). Same for Straße and Strasse.

Ligatures in fonts make it harder, though: sometimes they intentionally obfuscate what's happening in the Unicode layer. You might know this from special programming fonts with glyphs for ===, <> and such.

Some Dutch fonts have a special ligature that combines ij, which was in the Dutch alphabet when I was a kid; 'y' was not. Unicode U+0132 and U+0133 describe this symbol, but I've never seen them used. Fonts combining ij into one visual entity are more common.

I imagine most languages and cultures have these kinds of edge cases.

--5ba9b3f1d27d4b798c20d37e91168ef1--