Newsgroups: php.internals
To: PHP Internals
Date: Mon, 22 Mar 2021 10:24:00 +0000
Subject: Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?
From: rowan.collins@gmail.com (Rowan Tommins)

On 22/03/2021 01:15, Sara Golemon wrote:
> My preference is for a deprecation notice (but not necessarily removal
> ever -- We can argue that part a little).

I'm strongly against any concept of "indefinite deprecation".
I consider any deprecation notice a commitment to remove the feature in
the future, even if a specific timeline for that removal is not given.
If we want to have a separate status of "will be kept indefinitely, but
you shouldn't use it", then we need a separate E_DISCOURAGED, or some
boilerplate in the manual which doesn't use the word "deprecated".

> As for details, I don't love iso_8859_1_to_utf8(), but we can use the
> common alias for iso-8859-1 known as latin1 and call the new
> functions: utf8_from_latin1() and utf8_to_latin1() with the caveat
> that the later will throw a ValueError for codepoints which are out of
> range (one of the more problematic issues with utf8_decode()). That
> makes this not just a simple rename for clarity, but what I'd consider
> a bug-fix for an unfortunately unfixable function.

While I can see the temptation here, I'm not sure who the target
audience for the new function would be:

* People who just want to replace calls to utf8_decode won't want to go
  through every call and make it exception safe.
* People who want to write a polyfill couldn't use it, because they
  wouldn't be able to recover the remainder of the string after an error
  is thrown.
* People who want transcoding without any optional extensions will be
  disappointed to find only this one encoding supported.

You'd effectively be adding a completely new core function just for
those people who work with Latin1 text, and are confident that it's not
Windows-1252 in disguise.

It's tempting to make any C1 control characters an error as well -
although technically valid in Latin1, these are very rarely used, and
it's much more likely that any bytes in that range are intended as
characters in Windows-1252. But that would feel very odd without having
a corresponding utf8_from_windows1252 function to use instead, at which
point we're into designing a whole new conversion library.
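
To make the trade-offs above concrete, here is a rough userland sketch
of a throwing conversion and of a C1 check. It is not part of the
proposal: the names strict_utf8_to_latin1() and has_c1_bytes() are
hypothetical, and it assumes PHP 8.0+ with ext/mbstring loaded, which
the proposed core functions would not need.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical strict conversion (illustration only): throws instead of
 * substituting when the input cannot be represented in ISO-8859-1.
 */
function strict_utf8_to_latin1(string $utf8): string
{
    if (!mb_check_encoding($utf8, 'UTF-8')) {
        throw new ValueError('Input is not valid UTF-8');
    }
    // With the /u modifier, PCRE matches whole code points, so this
    // finds any code point above U+00FF.
    if (preg_match('/[^\x{00}-\x{FF}]/u', $utf8) === 1) {
        throw new ValueError('Input contains code points outside Latin1');
    }
    return mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
}

/**
 * Hypothetical check for bytes 0x80-0x9F in a single-byte string. These
 * are C1 controls in Latin1 but printable characters in Windows-1252.
 */
function has_c1_bytes(string $singleByte): bool
{
    // No /u modifier here: the pattern is matched byte by byte.
    return preg_match('/[\x80-\x9F]/', $singleByte) === 1;
}

// Representable input converts byte-for-byte.
$ok = strict_utf8_to_latin1("caf\u{E9}");

// Out-of-range input throws, and the caller gets no partial result back,
// which is the problem for anyone trying to polyfill utf8_decode().
try {
    $partial = strict_utf8_to_latin1("price: 5\u{20AC}");
} catch (ValueError $e) {
    $partial = null;
}

// Bytes 0x93 and 0x94 are curly quotes in Windows-1252, C1 in Latin1.
$suspicious = has_c1_bytes("\x93Hi\x94");
```

Every existing call site would need a try/catch like the one above, and
a positive C1 check still leaves the caller without a built-in
Windows-1252 conversion to fall back on.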

And of course, once you've got that UTF-8 string, you can't do much with
it, because PHP's native string functions are all byte-based, so you've
basically got to re-invent large chunks of ext/mbstring...

Regards,

-- 
Rowan Tommins
[IMSoP]