Newsgroups: php.internals
Xref: news.php.net php.internals:113650
Message-ID: <3a4d89fc-c5f8-4720-b2e0-f6f3c28684f9@www.fastmail.com>
Date: Sun, 21 Mar 2021 11:51:25 -0500
To: "php internals"
Subject: Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?
From: larry@garfieldtech.com ("Larry Garfield")

On Sun, Mar 21, 2021, at 9:18 AM, Rowan Tommins wrote:
> Hi all,
>
> The functions utf8_encode and utf8_decode are historical oddities, which
> almost certainly would not be accepted if proposed today:
>
> * Their names do not describe their functionality, which is to convert
> to/from one specific single-byte encoding. This leads to a common
> confusion that they can be used to "fix" UTF-8 encoding problems, which
> they generally make worse.
> * That single-byte encoding is ISO 8859-1, not its common cousins
> Windows-1252 or ISO 8859-15.
> This means, for instance, that they do not
> handle the Euro sign: utf8_decode('€') returns '?' (i.e. unmappable)
> not "\x80" (Windows-1252) or "\xA4" (8859-15)
>
> On the other hand, they are commonly used, both correctly and
> incorrectly, so removing them is not easy.
>
> A previous proposal to remove them [1] resulted in Andrea making two
> significant improvements: moving them from ext/xml to ext/standard [2]
> and rewriting the documentation to explain them properly [3]. My genuine
> thanks for that.
>
> However, it hasn't stopped people misunderstanding them, and quite
> reasonably: you shouldn't need to look up every function you use in the
> manual, to make sure it actually does what its name suggests.
>
>
> I can see three ways forward:
>
> A) Raise a deprecation notice in 8.1, and remove in 9.0. Do not provide
> a specific replacement, but recommend people look at iconv() or
> mb_convert_encoding(). There is precedent for this, such as
> convert_cyr_string(), but it may frustrate those who are using the
> functions correctly.
>
> B) Introduce new names, such as utf8_to_iso_8859_1 and
> iso_8859_1_to_utf8; immediately make those the primary names in the
> manual, with utf8_encode / utf8_decode as aliases. Raise deprecation
> notices for the old names, either immediately or in some future release.
> This gives a smoother upgrade path, but commits us to having these
> functions as outliers in our standard library.
>
> C) Leave them alone forever. Treat it as the user's fault if they mess
> things up by misunderstanding them.
>
>
> I am happy to put together an RFC for either A or B, if it has a chance
> of reaching consensus. I would really like to avoid option C.
>
>
> [1] https://externals.io/message/95166
> [2] https://github.com/php/php-src/pull/2160
> [3] https://github.com/php/doc-en/commit/838941f6cce51f3beda16012eb497b26295a8238
>
> Regards,

I lost several days of my life to exactly this problem, many years ago. I am still triggered by it.

I am mostly OK with option A, but with a big caveat:

The root problem here is "You keep using that function. I do not think it means what you think it means."

As Rowan notes, what people actually *want* most of the time is "I got this string from a user and have NFI what its encoding is, but my system needs UTF-8, so gimmie this string in UTF-8." And they use utf8_encode(), which then fails *sometimes* in exciting and mysterious ways, because that's not what it is.

Removing utf8_encode() may keep people from misusing it, but that doesn't mean the problem space they were trying to solve goes away. If anything, people who still don't realize that it's the wrong solution will get angry that we're taking away a "useful" tool and replacing it with "meh, go look at library X," which is admittedly a pretty rude answer.

If we're removing a bad answer to the problem, we should also replace it with a good answer.

Someone will, I'm sure, pop in at this point and declare "if you don't know the character encoding you're receiving, then you're doing it wrong and are already lost and we can't help you." While that may be technically correct, it's also an entirely useless answer because strings received over HTTP very frequently do not tell you what their encoding is, or they lie about what their encoding is. (The header may say it's ISO8859, or UTF8, or whatever, but someone copy-pasted from MS Word into a text box and now it's Windows-1252 within a wrapper that says ISO8859 but is mostly UTF8 except for the Windows-1252 part. Like, that's literally the problem I lost several days to.)
"Your own fault" is not even an acc= urate answer at that point. So if we're going to take away people's broken hammer, we need to be ver= y clear about what hammer to use instead. The initial answer is probably "here's how to use a series of mb_string = functions together to produce a reasonably good guess-my-encoding-and-co= nvert-to-utf8 routine" documentation. Which... may exist, but if it doe= s I've never found it. So at bare minimum the encode_utf8() documentati= on needs to include a "use this code snippet instead" description, and n= ot just link to the mbstring extension. Glancing through the mbstring d= ocs right now, it looks like it's not already a single function call, bu= t some combination of several, and has some global flags that get set (v= ia mb_detect_order()), I think. It's not as easy to use as utf8_encode(= ), even if utf8_encode() is wrong. That suggests we may want to try and= simplify the mbstring API, or internalize some function that handles th= e most common case in a way that doesn't rely on global flags. So, let's make that easier to use, so that we can change "this function = is wrong, we're taking it away from you" to "this function is wrong, her= e's a way better alternative that you can use instead (while we quietly = take the wrong one away from you while you're distracted by the new shin= y)." I don't know the mbstring API well enough to say what that alternative i= deally looks like, but if we can answer that it would make killing off t= he old functions much more palatable. --Larry Garfield