Newsgroups: php.internals
Xref: news.php.net php.internals:124858
Message-ID: <fe3aaa5d-109e-46be-afd4-93b10e66f8f1@sandfox.me>
Date: Sun, 11 Aug 2024 19:38:25 +0300
Subject: Re: [PHP-DEV][Discussion] Should All String Functions Become Multi-Byte Safe?
To: Nick Lockheart <lists@ageofdream.com>, internals@lists.php.net
References: <8a60a5d76bf3bbdda821160c6141b45914a33b98.camel@ageofdream.com>
In-Reply-To: <8a60a5d76bf3bbdda821160c6141b45914a33b98.camel@ageofdream.com>
From: sandfox@sandfox.me (Anton Smirnov)

Hi Nick,

As a developer who often deals with binary data (like bencode, ipv6 addresses and my own hacks for multibyte arithmetic) I would prefer that functions and syntaxes that allow me to work with bytes keep working with bytes, not characters or code points.

So the closest solution would be separate binary/text strings, but then we have PHP6 all over again. Maybe this time it might work in some form, who knows.

On 8/11/24 18:50, Nick Lockheart wrote:
>
> HTML 5 was adopted in 2014, over ten years ago. HTML 5 only supports
> the UTF-8 multi-byte character encoding.
>
> It seems like there's still a lot of string functions that assume that
> a character is a single byte, and these may actually work as expected
> when dealing with Latin characters, but may fail unexpectedly if a
> sequence is more than one byte.
>
> Are there any use cases for PHP where **single-byte** characters are
> the norm?
>
> It seems that if everything on the Internet is multi-byte encoded now,
> then all of the PHP string functions should be multi-byte safe.
>
>
> The WHATWG Encoding Standard:
>
> https://encoding.spec.whatwg.org/
>
> Also, according to Mozilla, "[The meta charset] attribute declares the
> document's character encoding. If the attribute is present, its value
> must be an ASCII case-insensitive match for the string "utf-8", because
> UTF-8 is the only valid encoding for HTML5 documents."
>
> https://developer.mozilla.org/en-US/docs/Web/HTML/Element/meta#charset