From: danielhaber@gmail.com (Daniel Haber)
To: internals@lists.php.net
Date: Mon, 12 Aug 2024 12:50:38 +0300
Subject: Re: [PHP-DEV][Discussion] Should All String Functions Become Multi-Byte Safe?
Message-ID: <20240812095038.D97AC1A00BD@qa.php.net>

On 8/12/2024 9:53 AM, Rowan Tommins [IMSoP] wrote:
>
> On 11 August 2024 16:50:52 BST, Nick Lockheart wrote:
>> It seems that if everything on the Internet is multi-byte encoded now,
>> then all of the PHP string functions should be multi-byte safe.
>
> The phrase "multibyte safe" may have made sense about 30 years ago, when it was thought that a "universal character set" could just be a "wide ASCII", encoding a straightforward list of characters, just more of them.
>
> Modern Unicode is so much more than that, because the world's writing systems don't all work the same way. Should strlen() measure bytes, code points, or graphemes? Should strtoupper() accept a locale, so it can handle cases like Turkish "dotless i" where "I" is not the uppercase of "i"? And so on, and so on.
>
> I've seen plenty of languages boast that they are "Unicode aware" but few actually engaging with the question of what that actually means. Often they equate "character" with "code point" and stop there, which leads to results that are just as useless to most of the world as if they'd equated it with "byte".
>
> Regards,
> Rowan Tommins
> [IMSoP]

Feels appropriate to link to this:

"The Absolute Minimum Every Software Developer Must Know About Unicode in 2023 (Still No Excuses!)"
https://tonsky.me/blog/unicode/
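
To make the bytes / code points / graphemes question concrete, here is a rough sketch. It is in Python rather than PHP, purely so it is self-contained; the nearest PHP counterparts would be strlen(), mb_strlen() and grapheme_strlen() from ext/intl. The helper names `measure` and `count_graphemes_approx` are made up for illustration, and the grapheme count is only an approximation (it skips combining marks; it is not real UAX #29 segmentation, so emoji ZWJ sequences and the like would be miscounted).

```python
import unicodedata


def count_graphemes_approx(text: str) -> int:
    """Approximate grapheme count: code points that are not combining marks.

    Assumption: good enough for simple accented Latin text only. Proper
    segmentation needs the UAX #29 rules (e.g. via ICU).
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


def measure(text: str) -> dict:
    """Return three different 'lengths' of the same string."""
    return {
        "bytes": len(text.encode("utf-8")),   # what a byte-oriented strlen sees
        "code_points": len(text),             # what "character == code point" sees
        "graphemes_approx": count_graphemes_approx(text),
    }


# The same visible character, written two ways.
decomposed = "e\u0301"                                  # 'e' + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)     # single code point U+00E9

print(measure(decomposed))  # {'bytes': 3, 'code_points': 2, 'graphemes_approx': 1}
print(measure(composed))    # {'bytes': 2, 'code_points': 1, 'graphemes_approx': 1}

# Case mapping without a locale: the default rules are not the Turkish ones.
print("i".upper() == "I")              # True; Turkish would expect U+0130 (dotted capital I)
print("\u0130".lower() == "i\u0307")   # True; lowercasing yields two code points
print("\u0131".upper() == "I")         # True; dotless i uppercases to plain I
```

One visible "é" gives three different answers depending on what is counted, and the case examples show why a locale-free strtoupper() cannot be right for Turkish text.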