From: wrossmann@gmail.com (Wade Rossmann)
Date: Tue, 21 Dec 2021 15:20:45 -0800
Subject: Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?
To: Larry Garfield
Cc: php internals

On Sun, Mar 21, 2021 at 9:52 AM Larry Garfield wrote:

> On Sun, Mar 21, 2021, at 9:18 AM, Rowan Tommins wrote:
> > Hi all,
> >
> > The functions utf8_encode and utf8_decode are historical oddities, which
> > almost certainly would not be accepted if proposed today:
> >
> > * Their names do not describe their functionality, which is to convert
> > to/from one specific single-byte encoding. This leads to a common
> > confusion that they can be used to "fix" UTF-8 encoding problems, which
> > they generally make worse.
> > * That single-byte encoding is ISO 8859-1, not its common cousins
> > Windows-1252 or ISO 8859-15. This means, for instance, that they do not
> > handle the Euro sign: utf8_decode('€') returns '?' (i.e. unmappable)
> > not "\x80" (Windows-1252) or "\xA4" (8859-15).
> >
> > On the other hand, they are commonly used, both correctly and
> > incorrectly, so removing them is not easy.
> >
> > A previous proposal to remove them [1] resulted in Andrea making two
> > significant improvements: moving them from ext/xml to ext/standard [2]
> > and rewriting the documentation to explain them properly [3]. My genuine
> > thanks for that.
> >
> > However, it hasn't stopped people misunderstanding them, and quite
> > reasonably: you shouldn't need to look up every function you use in the
> > manual, to make sure it actually does what its name suggests.
> >
> >
> > I can see three ways forward:
> >
> > A) Raise a deprecation notice in 8.1, and remove in 9.0. Do not provide
> > a specific replacement, but recommend people look at iconv() or
> > mb_convert_encoding(). There is precedent for this, such as
> > convert_cyr_string(), but it may frustrate those who are using the
> > functions correctly.
> >
> > B) Introduce new names, such as utf8_to_iso_8859_1 and
> > iso_8859_1_to_utf8; immediately make those the primary names in the
> > manual, with utf8_encode / utf8_decode as aliases. Raise deprecation
> > notices for the old names, either immediately or in some future release.
> > This gives a smoother upgrade path, but commits us to having these
> > functions as outliers in our standard library.
> >
> > C) Leave them alone forever. Treat it as the user's fault if they mess
> > things up by misunderstanding them.
> >
> >
> > I am happy to put together an RFC for either A or B, if it has a chance
> > of reaching consensus. I would really like to avoid option C.
> >
> > [1] https://externals.io/message/95166
> > [2] https://github.com/php/php-src/pull/2160
> > [3] https://github.com/php/doc-en/commit/838941f6cce51f3beda16012eb497b26295a8238
> >
> > Regards,
>
> I lost several days of my life to exactly this problem, many years ago. I
> am still triggered by it.
>
> I am mostly OK with option A, but with a big caveat:
>
> The root problem here is "You keep using that function. I do not think it
> means what you think it means."
>
> As Rowan notes, what people actually *want* most of the time is "I got
> this string from a user and have NFI what its encoding is, but my system
> needs UTF-8, so gimmie this string in UTF-8." And they use utf8_encode(),
> which then fails *sometimes* in exciting and mysterious ways, because
> that's not what it is.
>
> Removing utf8_encode() may keep people from misusing it, but that doesn't
> mean the problem space they were trying to solve goes away. If anything,
> people who still don't realize that it's the wrong solution will get angry
> that we're taking away a "useful" tool and replacing it with "meh, go look
> at library X," which is admittedly a pretty rude answer.
>
> If we're removing a bad answer to the problem, we should also replace it
> with a good answer.
>
> Someone will, I'm sure, pop in at this point and declare "if you don't
> know the character encoding you're receiving, then you're doing it wrong
> and are already lost and we can't help you." While that may be technically
> correct, it's also an entirely useless answer because strings received over
> HTTP very frequently do not tell you what their encoding is, or they lie
> about what their encoding is. (The header may say it's ISO8859, or UTF8,
> or whatever, but someone copy-pasted from MS Word into a text box and now
> it's Windows-1252 within a wrapper that says ISO8859 but is mostly UTF8
> except for the Windows-1252 part. Like, that's literally the problem I
> lost several days to.) "Your own fault" is not even an accurate answer at
> that point.
>
> So if we're going to take away people's broken hammer, we need to be very
> clear about what hammer to use instead.
>
> The initial answer is probably "here's how to use a series of mb_string
> functions together to produce a reasonably good
> guess-my-encoding-and-convert-to-utf8 routine" documentation. Which... may
> exist, but if it does I've never found it. So at bare minimum the
> utf8_encode() documentation needs to include a "use this code snippet
> instead" description, and not just link to the mbstring extension.
> Glancing through the mbstring docs right now, it looks like it's not
> already a single function call, but some combination of several, and has
> some global flags that get set (via mb_detect_order()), I think. It's not
> as easy to use as utf8_encode(), even if utf8_encode() is wrong. That
> suggests we may want to try and simplify the mbstring API, or internalize
> some function that handles the most common case in a way that doesn't rely
> on global flags.
>
> So, let's make that easier to use, so that we can change "this function is
> wrong, we're taking it away from you" to "this function is wrong, here's a
> way better alternative that you can use instead (while we quietly take the
> wrong one away from you while you're distracted by the new shiny)."
>
> I don't know the mbstring API well enough to say what that alternative
> ideally looks like, but if we can answer that it would make killing off the
> old functions much more palatable.
>
> --Larry Garfield
>
> --
> PHP Internals - PHP Runtime Development Mailing List
> To unsubscribe, visit: https://www.php.net/unsub.php
>

As an encoding nerd and perennial complainer regarding these functions, I
would like nothing more than to see them immediately disappear, but I do
recognize the BC-breaking potential for something like that.
However, I do have a suggestion that I've not seen mentioned yet that
should at least address some of the misconceptions that people get from
the current functions.

I would suggest adding optional source/destination encoding parameters to
the functions, e.g.:

    utf8_encode(string $string, string $source_encoding = "ISO-8859-1")
    utf8_decode(string $string, string $destination_encoding = "ISO-8859-1")

and, if you'll forgive the hand-waving due to my unfamiliarity with PHP
internals, they could simply be passed through to an underlying
mb_convert_encoding() call, e.g.:

    mb_convert_encoding($string, 'UTF-8', $source_encoding)
    mb_convert_encoding($string, $destination_encoding, 'UTF-8')

This would preserve BC while also making the function header and
documentation much more descriptive of what the function actually does,
allowing more flexible use of the functions, and potentially driving
people to use the mb_* functions instead.

This could also be used as a gradual pathway to deprecating the functions,
where, for example, a deprecation notice could be raised when the function
is called without the source/destination encoding explicitly given.

I know that there is also some resistance to the idea of requiring
mbstring as it is an optional extension, as well as resistance to bringing
mbstring into core due to design and/or history. This could be worked
around by [once again, apologies for the hand-waving] only requiring
mbstring for conversions involving an encoding other than ISO-8859-1 and
falling back to the existing implementation otherwise.
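
To make the suggestion above a little more concrete, here is a rough userland sketch of the behaviour I have in mind. Everything in it is an assumption for illustration: the names utf8_encode_from(), utf8_decode_to() and convert_via_mbstring() are hypothetical and not existing or proposed PHP functions, the ISO-8859-1 paths are hand-written stand-ins for "the existing implementation", and encoding aliases such as "latin1" are not recognised. The only real API it relies on is mb_convert_encoding($string, $to_encoding, $from_encoding).

```php
<?php
// Hypothetical sketch only: none of these function names exist in PHP.

// Wrapper so that mbstring is required only when this path is reached.
function convert_via_mbstring(string $string, string $to, string $from): string
{
    if (!function_exists('mb_convert_encoding')) {
        throw new RuntimeException("mbstring is required to convert $from to $to");
    }
    $converted = mb_convert_encoding($string, $to, $from);
    if ($converted === false) {
        throw new RuntimeException("conversion from $from to $to failed");
    }
    return $converted;
}

function utf8_encode_from(string $string, string $source_encoding = 'ISO-8859-1'): string
{
    if (strcasecmp($source_encoding, 'ISO-8859-1') !== 0) {
        return convert_via_mbstring($string, 'UTF-8', $source_encoding);
    }
    // ISO-8859-1 needs no mbstring: byte 0xNN is simply code point U+00NN.
    $out = '';
    $len = strlen($string);
    for ($i = 0; $i < $len; $i++) {
        $byte = ord($string[$i]);
        $out .= $byte < 0x80
            ? $string[$i]
            : chr(0xC0 | ($byte >> 6)) . chr(0x80 | ($byte & 0x3F));
    }
    return $out;
}

function utf8_decode_to(string $string, string $destination_encoding = 'ISO-8859-1'): string
{
    if (strcasecmp($destination_encoding, 'ISO-8859-1') !== 0) {
        return convert_via_mbstring($string, $destination_encoding, 'UTF-8');
    }
    $out = '';
    $len = strlen($string);
    for ($i = 0; $i < $len;) {
        $byte = ord($string[$i]);
        if ($byte < 0x80) {                 // ASCII passes through unchanged
            $out .= $string[$i++];
            continue;
        }
        $isTwoByte = ($byte & 0xE0) === 0xC0
            && $i + 1 < $len
            && (ord($string[$i + 1]) & 0xC0) === 0x80;
        if ($isTwoByte) {                   // U+0080..U+07FF
            $cp = (($byte & 0x1F) << 6) | (ord($string[$i + 1]) & 0x3F);
            $out .= ($cp >= 0x80 && $cp <= 0xFF) ? chr($cp) : '?';
            $i += 2;
            continue;
        }
        // Longer or malformed sequence: consume it and emit a single '?'.
        // (Simplified; the real utf8_decode() may treat malformed input differently.)
        $i++;
        while ($i < $len && (ord($string[$i]) & 0xC0) === 0x80) {
            $i++;
        }
        $out .= '?';
    }
    return $out;
}
```

With this sketch, utf8_decode_to("\xE2\x82\xAC") still returns '?' for the Euro sign, exactly as today, while utf8_decode_to("\xE2\x82\xAC", 'ISO-8859-15') goes through mbstring and returns "\xA4". Keeping the default path free of mbstring is the point of the design: existing callers see no change, and only callers who name another encoding take on the extension dependency.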