Newsgroups: php.internals
Xref: news.php.net php.internals:116711
Date: Tue, 21 Dec 2021 16:31:00 -0800
From: kris.craig@gmail.com (Kris Craig)
To: Wade Rossmann
Cc: Larry Garfield, php internals
Subject: Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?
References: <3a4d89fc-c5f8-4720-b2e0-f6f3c28684f9@www.fastmail.com>
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="0000000000005f6c7805d3b13dff"

--0000000000005f6c7805d3b13dff
Content-Type: text/plain; charset="UTF-8"

On Tue, Dec 21, 2021 at 3:21 PM Wade Rossmann wrote:

> On Sun, Mar 21, 2021 at 9:52 AM Larry Garfield wrote:
>
> > On Sun, Mar 21, 2021, at 9:18 AM, Rowan Tommins wrote:
> > > Hi all,
> > >
> > > The functions utf8_encode and utf8_decode are historical oddities,
> > > which almost certainly would not be accepted if proposed today:
> > >
> > > * Their names do not describe their functionality, which is to
> > >   convert to/from one specific single-byte encoding.
> > >   This leads to a common confusion that they can be used to "fix"
> > >   UTF-8 encoding problems, which they generally make worse.
> > > * That single-byte encoding is ISO 8859-1, not its common cousins
> > >   Windows-1252 or ISO 8859-15. This means, for instance, that they
> > >   do not handle the Euro sign: utf8_decode('€') returns '?' (i.e.
> > >   unmappable), not "\x80" (Windows-1252) or "\xA4" (8859-15).
> > >
> > > On the other hand, they are commonly used, both correctly and
> > > incorrectly, so removing them is not easy.
> > >
> > > A previous proposal to remove them [1] resulted in Andrea making two
> > > significant improvements: moving them from ext/xml to ext/standard
> > > [2] and rewriting the documentation to explain them properly [3]. My
> > > genuine thanks for that.
> > >
> > > However, it hasn't stopped people misunderstanding them, and quite
> > > reasonably: you shouldn't need to look up every function you use in
> > > the manual, to make sure it actually does what its name suggests.
> > >
> > >
> > > I can see three ways forward:
> > >
> > > A) Raise a deprecation notice in 8.1, and remove in 9.0. Do not
> > >    provide a specific replacement, but recommend people look at
> > >    iconv() or mb_convert_encoding(). There is precedent for this,
> > >    such as convert_cyr_string(), but it may frustrate those who are
> > >    using the functions correctly.
> > >
> > > B) Introduce new names, such as utf8_to_iso_8859_1 and
> > >    iso_8859_1_to_utf8; immediately make those the primary names in
> > >    the manual, with utf8_encode / utf8_decode as aliases. Raise
> > >    deprecation notices for the old names, either immediately or in
> > >    some future release. This gives a smoother upgrade path, but
> > >    commits us to having these functions as outliers in our standard
> > >    library.
> > >
> > > C) Leave them alone forever. Treat it as the user's fault if they
> > >    mess things up by misunderstanding them.
> > >
> > >
> > > I am happy to put together an RFC for either A or B, if it has a
> > > chance of reaching consensus. I would really like to avoid option C.
> > >
> > >
> > > [1] https://externals.io/message/95166
> > > [2] https://github.com/php/php-src/pull/2160
> > > [3] https://github.com/php/doc-en/commit/838941f6cce51f3beda16012eb497b26295a8238
> > >
> > > Regards,
> >
> > I lost several days of my life to exactly this problem, many years
> > ago. I am still triggered by it.
> >
> > I am mostly OK with option A, but with a big caveat:
> >
> > The root problem here is "You keep using that function. I do not
> > think it means what you think it means."
> >
> > As Rowan notes, what people actually *want* most of the time is "I got
> > this string from a user and have NFI what its encoding is, but my
> > system needs UTF-8, so gimmie this string in UTF-8." And they use
> > utf8_encode(), which then fails *sometimes* in exciting and mysterious
> > ways, because that's not what it is.
> >
> > Removing utf8_encode() may keep people from misusing it, but that
> > doesn't mean the problem space they were trying to solve goes away.
> > If anything, people who still don't realize that it's the wrong
> > solution will get angry that we're taking away a "useful" tool and
> > replacing it with "meh, go look at library X," which is admittedly a
> > pretty rude answer.
> >
> > If we're removing a bad answer to the problem, we should also replace
> > it with a good answer.
> >
> > Someone will, I'm sure, pop in at this point and declare "if you don't
> > know the character encoding you're receiving, then you're doing it
> > wrong and are already lost and we can't help you." While that may be
> > technically correct, it's also an entirely useless answer because
> > strings received over HTTP very frequently do not tell you what their
> > encoding is, or they lie about what their encoding is.
> > (The header may say it's ISO8859, or UTF8, or whatever, but someone
> > copy-pasted from MS Word into a text box and now it's Windows-1252
> > within a wrapper that says ISO8859 but is mostly UTF8 except for the
> > Windows-1252 part. Like, that's literally the problem I lost several
> > days to.) "Your own fault" is not even an accurate answer at that
> > point.
> >
> > So if we're going to take away people's broken hammer, we need to be
> > very clear about what hammer to use instead.
> >
> > The initial answer is probably "here's how to use a series of
> > mb_string functions together to produce a reasonably good
> > guess-my-encoding-and-convert-to-utf8 routine" documentation. Which...
> > may exist, but if it does I've never found it. So at bare minimum the
> > utf8_encode() documentation needs to include a "use this code snippet
> > instead" description, and not just link to the mbstring extension.
> > Glancing through the mbstring docs right now, it looks like it's not
> > already a single function call, but some combination of several, and
> > has some global flags that get set (via mb_detect_order()), I think.
> > It's not as easy to use as utf8_encode(), even if utf8_encode() is
> > wrong. That suggests we may want to try and simplify the mbstring
> > API, or internalize some function that handles the most common case
> > in a way that doesn't rely on global flags.
> >
> > So, let's make that easier to use, so that we can change "this
> > function is wrong, we're taking it away from you" to "this function is
> > wrong, here's a way better alternative that you can use instead (while
> > we quietly take the wrong one away from you while you're distracted by
> > the new shiny)."
> >
> > I don't know the mbstring API well enough to say what that alternative
> > ideally looks like, but if we can answer that it would make killing
> > off the old functions much more palatable.
> >
> > --Larry Garfield
> >
> > --
> > PHP Internals - PHP Runtime Development Mailing List
> > To unsubscribe, visit: https://www.php.net/unsub.php
> >
>
> As an encoding nerd and perennial complainer regarding these functions I
> would like nothing more than to see them immediately disappear, but I do
> recognize the BC-breaking potential for something like that. However, I
> do have a suggestion that I've not seen mentioned yet that should at
> least address some of the misconceptions that people get from the
> current functions.
>
> I would suggest adding optional source/destination encoding parameters
> to the functions, eg:
>
>     utf8_encode(string $string, string $source_encoding = "ISO-8859-1")
>     utf8_decode(string $string, string $destination_encoding = "ISO-8859-1")
>
> and, if you'll forgive the hand-waving due to my unfamiliarity with PHP
> internals, they could simply be passed through to an underlying
> mb_convert_encoding() call. Eg:
>
>     mb_convert_encoding($string, 'UTF-8', $source_encoding)
>     mb_convert_encoding($string, $destination_encoding, 'UTF-8')
>
> This would preserve BC while also making the function header and
> documentation much more descriptive of what the function actually does,
> allow more flexible use of the functions, and potentially drive people
> to use the mb_* functions instead. This could also be used as a gradual
> pathway to deprecating the functions, where, for example, a deprecation
> notice could be raised when the function is called without the
> source/destination encoding explicitly given.
>
> I know that there is also some resistance to the idea of requiring
> mbstring as it is an optional extension, as well as resistance to
> bringing mbstring into core due to design and/or history. This could be
> worked around by [once again, apology for handwaving] only requiring
> mbstring for conversions involving an encoding other than ISO-8859-1 and
> falling back to the existing implementation otherwise.
>

Now might be a good time to make this into an RFC.  :)

--Kris

--0000000000005f6c7805d3b13dff--
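
For illustration only, the optional-encoding idea quoted above could be sketched in userland PHP roughly as follows. The names utf8_encode_from() and utf8_decode_to() are hypothetical, chosen so they do not clash with the built-ins, and preferring mbstring whenever it is loaded is an assumption of this sketch rather than part of the proposal. It is a rough sketch of the pass-through, not how php-src would implement it.

```php
<?php

/**
 * Hypothetical stand-in for the proposed
 * utf8_encode(string $string, string $source_encoding = "ISO-8859-1").
 */
function utf8_encode_from(string $string, string $source_encoding = 'ISO-8859-1'): string
{
    if (function_exists('mb_convert_encoding')) {
        // Pass-through described in the thread: convert *to* UTF-8.
        return mb_convert_encoding($string, 'UTF-8', $source_encoding);
    }
    if (strcasecmp($source_encoding, 'ISO-8859-1') === 0) {
        // No mbstring: fall back to the existing Latin-1-only built-in
        // (it raises a deprecation notice on PHP 8.2 and later).
        return utf8_encode($string);
    }
    throw new RuntimeException("mbstring is required to convert from {$source_encoding}");
}

/**
 * Hypothetical stand-in for the proposed
 * utf8_decode(string $string, string $destination_encoding = "ISO-8859-1").
 */
function utf8_decode_to(string $string, string $destination_encoding = 'ISO-8859-1'): string
{
    if (function_exists('mb_convert_encoding')) {
        // Pass-through described in the thread: convert *from* UTF-8.
        return mb_convert_encoding($string, $destination_encoding, 'UTF-8');
    }
    if (strcasecmp($destination_encoding, 'ISO-8859-1') === 0) {
        // Same fallback as above, in the other direction.
        return utf8_decode($string);
    }
    throw new RuntimeException("mbstring is required to convert to {$destination_encoding}");
}

// Example calls; the Euro cases need mbstring, the default ones do not.
echo bin2hex(utf8_encode_from("\xE9")), "\n";                 // Latin-1 "é" to UTF-8
if (function_exists('mb_convert_encoding')) {
    echo bin2hex(utf8_encode_from("\x80", 'Windows-1252')), "\n";       // Euro sign to UTF-8
    echo bin2hex(utf8_decode_to("\xE2\x82\xAC", 'ISO-8859-15')), "\n";  // Euro sign to 8859-15
}
```

Calling either function without the second argument keeps the current ISO 8859-1 behaviour, which is the BC-preserving part of the suggestion; naming an encoding explicitly is what covers the Euro-sign cases from the original post that the built-ins cannot handle.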