Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:124858
X-Original-To: internals@lists.php.net
Delivered-To: internals@lists.php.net
Message-ID: <fe3aaa5d-109e-46be-afd4-93b10e66f8f1@sandfox.me>
Date: Sun, 11 Aug 2024 19:38:25 +0300
Precedence: bulk
list-help: <mailto:internals+help@lists.php.net>
list-unsubscribe: <mailto:internals+unsubscribe@lists.php.net>
list-post: <mailto:internals@lists.php.net>
List-Id: internals.lists.php.net
x-ms-reactions: disallow
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PHP-DEV][Discussion] Should All String Functions Become
 Multi-Byte Safe?
To: Nick Lockheart <lists@ageofdream.com>, internals@lists.php.net
References: <8a60a5d76bf3bbdda821160c6141b45914a33b98.camel@ageofdream.com>
Content-Language: en-US
In-Reply-To: <8a60a5d76bf3bbdda821160c6141b45914a33b98.camel@ageofdream.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Authenticated-Id: sandfox@sandfox.me
From: sandfox@sandfox.me (Anton Smirnov)

Hi Nick,

As a developer who often deals with binary data (bencode, IPv6 
addresses, and my own hacks for multibyte arithmetic), I would prefer 
that the functions and syntax that let me work with bytes keep working 
with bytes, not characters or code points. So the closest solution would 
be separate binary and text string types, but then we have PHP 6 all 
over again. Maybe this time it could work in some form, who knows.
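
To make the byte/character distinction concrete, here is a minimal 
sketch. It assumes PHP 7 or later with the mbstring extension loaded 
and IPv6 support compiled in; the variable names and the sample address 
are only illustrative.

```php
<?php
// Text: "é" (U+00E9) is one code point but two bytes in UTF-8.
$text = "h\u{00E9}llo";

var_dump(strlen($text));              // int(6): strlen() counts bytes
var_dump(mb_strlen($text, 'UTF-8'));  // int(5): mb_strlen() counts code points
var_dump(bin2hex($text[1]));          // string(2) "c3": offsets address single bytes

// Binary: inet_pton() packs an IPv6 address into a 16-byte string.
// There are no "characters" here, so only byte semantics make sense.
$packed = inet_pton('2001:db8::1');

var_dump(strlen($packed));            // int(16)
var_dump(ord($packed[15]));           // int(1): value of the last byte
```

Code that parses packed addresses or bencode relies on strlen() and 
string offsets meaning bytes, which is the behaviour I would like to 
keep.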

On 8/11/24 18:50, Nick Lockheart wrote:
> 
> HTML 5 was adopted in 2014, over ten years ago. HTML 5 only supports
> the UTF-8 multi-byte character encoding.
> 
> It seems like there's still a lot of string functions that assume that
> a character is a single byte, and these may actually work as expected
> when dealing with Latin characters, but may fail unexpectedly if a
> sequence is more than one byte.
> 
> Are there any use cases for PHP where **single-byte** characters are
> the norm?
> 
> It seems that if everything on the Internet is multi-byte encoded now,
> then all of the PHP string functions should be multi-byte safe.
> 
> 
> The WHATWG Encoding Standard:
> 
> https://encoding.spec.whatwg.org/
> 
> Also, according to Mozilla, "[The meta charset] attribute declares the
> document's character encoding. If the attribute is present, its value
> must be an ASCII case-insensitive match for the string "utf-8", because
> UTF-8 is the only valid encoding for HTML5 documents."
> 
> https://developer.mozilla.org/en-US/docs/Web/HTML/Element/meta#charset