Newsgroups: php.internals
From: larry@garfieldtech.com ("Larry Garfield")
Date: Sun, 11 Aug 2024 11:58:13 -0500
To: "php internals"
Subject: Re: [PHP-DEV][Discussion] Should All String Functions Become Multi-Byte Safe?
Message-ID: <410a8188-06bf-439f-bdab-c47e73d1db70@app.fastmail.com>
In-Reply-To: <8a60a5d76bf3bbdda821160c6141b45914a33b98.camel@ageofdream.com>
References: <8a60a5d76bf3bbdda821160c6141b45914a33b98.camel@ageofdream.com>
List-Id: internals.lists.php.net

On Sun, Aug 11, 2024, at 10:50 AM, Nick Lockheart wrote:
> HTML 5 was adopted in 2014, over ten years ago. HTML 5 only supports
> the UTF-8 multi-byte character encoding.
>
> It seems like there's still a lot of string functions that assume that
> a character is a single byte, and these may actually work as expected
> when dealing with Latin characters, but may fail unexpectedly if a
> sequence is more than one byte.
>
> Are there any use cases for PHP where **single-byte** characters are
> the norm?
>
> It seems that if everything on the Internet is multi-byte encoded now,
> then all of the PHP string functions should be multi-byte safe.
>
>
> The WHATWG Encoding Standard:
>
> https://encoding.spec.whatwg.org/
>
> Also, according to Mozilla, "[The meta charset] attribute declares the
> document's character encoding.
> If the attribute is present, its value
> must be an ASCII case-insensitive match for the string "utf-8", because
> UTF-8 is the only valid encoding for HTML5 documents."
>
> https://developer.mozilla.org/en-US/docs/Web/HTML/Element/meta#charset

Some background and history, for those not familiar...

After PHP 5.2, there was a huge effort to move PHP to using Unicode internally. It was to be released as PHP 6. Unfortunately, it ran into a whole host of problems, among them:

1. It tried to use UTF-16 internally, as there were good libraries for it, but it was much, much slower than was acceptable.
2. It required rewriting basically everything.
3. Trying to support two string variants at the same time (because binary strings are still very useful) in almost the same syntax turned out to be, um, kinda hard.

After a number of years of work, it was eventually concluded that it was a dead end. So the non-Unicode-related bits of what would have been PHP 6 got renamed to PHP 5.3 and released to much fanfare, kicking off the PHP Renaissance Era.

When PHP 5.6+1 was released, there was a vote to decide if it should be called 6 or 7. 7 won, mainly on the grounds that a number of very stupid book publishers had released "PHP 6" books in anticipation of PHP 6's release that were now completely useless and misleading. So we skipped 6 entirely, and PHP 6-compatibility is a running joke among those who have been around a while.

Fortunately, the vast majority of single-byte strings are ASCII, and ASCII is, by design, a strict subset of UTF-8, so in practice the lack of native UTF-8 strings rarely causes an issue.

Trying to introduce Unicode strings to the language now as a native type would... probably break just as much if not more. If anything it's probably harder today than it was in 2008, because both the engine and the body of existing code that must not break have grown considerably.
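
To make the ASCII point above concrete, here is a minimal sketch of where the byte-oriented functions agree with the multi-byte ones and where they diverge. It assumes the mbstring extension is loaded; the sample strings and variable names are made up for illustration and are not from the RFC or the original question.

```php
<?php
// Assumption: the mbstring extension is available.
$ascii = "cafe";       // pure ASCII: one byte per character
$utf8  = "caf\u{E9}";  // "cafe" with an acute e: that letter is two bytes in UTF-8

// On ASCII input, the byte-oriented and multi-byte functions agree.
var_dump(strlen($ascii) === mb_strlen($ascii, 'UTF-8')); // bool(true)

// Once a multi-byte sequence appears, strlen() counts bytes, not characters.
var_dump(strlen($utf8));             // int(5)
var_dump(mb_strlen($utf8, 'UTF-8')); // int(4)

// strrev() reverses bytes, so it splits the two-byte sequence
// and the result is no longer valid UTF-8.
var_dump(mb_check_encoding(strrev($utf8), 'UTF-8')); // bool(false)
```

Functions that only look for ASCII delimiters (explode() on a comma, say) keep working on UTF-8 input, since no byte of a multi-byte UTF-8 sequence falls in the ASCII range. It is mostly counting, slicing, reversing, and case-folding that go wrong.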
A much better approach would be something like this RFC from Derick a few years ago:

https://wiki.php.net/rfc/unicode_text_processing

If you need something today, then Symfony has a user-space approximation of it:

https://symfony.com/doc/current/string.html

--Larry Garfield