Newsgroups: php.internals
From: nikita.ppv@gmail.com (Nikita Popov)
Date: Fri, 17 Sep 2021 10:43:18 +0200
Subject: Re: [PHP-DEV] Make strtolower/strtoupper just do ASCII
To: Tim Starling
Cc: "internals@lists.php.net"

On Fri, Sep 17, 2021 at 4:59 AM Tim Starling wrote:

> I would like to know if a patch to make strtolower and strtoupper do
> plain ASCII case conversion would be accepted, or if an RFC should be
> created.
>
> The situation with case conversion is inconsistent.
>
> The following functions do ASCII case conversion: strcasecmp,
> strncasecmp, substr_compare.
>
> The following functions do locale-dependent case conversion:
> strtolower, strtoupper, str_ireplace, stristr, stripos, strripos,
> strnatcasecmp, ucfirst, ucwords, lcfirst.
>
> I would make them all do ASCII case conversion.
>
> Developers need ASCII case conversion, because it is used internally
> by PHP for things like class name comparison, and because it is a
> specified algorithm in HTML 5 and related standards.
>
> The existing options for ASCII case conversion are:
>
> * Never call setlocale(). But this breaks non-ASCII characters in
> escapeshellarg() and can't be guaranteed in a library.
>
> * Call setlocale(LC_ALL, "C.UTF-8"). But this is non-portable and also
> can't be guaranteed in a library.
>
> * Use strtr(). But this is ugly and slow.
>
> If mbstring has a way to do it, I can't find it. I tested
> mb_strtolower($s, '8bit') and mb_strtolower($s,'ascii').
>
> Note that locale-dependent case conversion is almost never a useful
> feature. Strings are passed through tolower() one byte at a time, to
> be interpreted with some legacy 8-bit character set. So the result
> will typically be mojibake even if the correct locale is selected.
>
> strtolower() mangles UTF-8 strings in many locales, such as fr-FR. I
> made a full list at . The
> UTF-8 locales mostly work, except for the Turkish ones, which mangle
> ASCII strings.
>
> At https://bugs.php.net/bug.php?id=67815 , Nikita Popov wrote: "My
> general recommendation is to avoid locales and locale-dependent
> functions, as locales are a fundamentally broken concept." I agree
> with that. I think PHP should migrate away from locale dependence.
> When PHP was young, it was convenient to use the C library, but we've
> progressed well past that point now.
>
> -- Tim Starling
>

We've been slowly moving away from locale-dependent functionality. Since PHP 8 we no longer inherit any locales from the environment and have made float to string conversion locale-independent.

I would very much support making strtolower() and friends a simple ASCII case conversion operation. mb_strtolower() etc already offer full Unicode-compliant case conversions that work correctly with multi-byte encodings. The locale-sensitivity of strtolower() only works with legacy single-byte encodings and as such is of questionable usefulness even in cases where it is not actively harmful.

That said, I do think this change requires an RFC.

Regards,
Nikita
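
For illustration, the strtr() option mentioned in the quoted message could be sketched roughly as follows. This is a minimal sketch, not the proposed patch: the helper names ascii_strtolower() and ascii_strtoupper() are hypothetical and do not exist in PHP. It relies only on the three-argument form of strtr(), which maps bytes one-to-one and does not consult the current locale.

```php
<?php
// Hypothetical helpers for illustration only; not part of PHP core
// and not taken from the patch discussed in this thread.

/** Lowercase the bytes A-Z only; every other byte is left untouched. */
function ascii_strtolower(string $s): string
{
    // Three-argument strtr() translates byte-for-byte, independent of setlocale().
    return strtr($s, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz');
}

/** Uppercase the bytes a-z only; every other byte is left untouched. */
function ascii_strtoupper(string $s): string
{
    return strtr($s, 'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ');
}

// Bytes >= 0x80 (for example the UTF-8 encoding of "É", "\xC3\x89")
// pass through unchanged, so multi-byte sequences are never split or mangled.
echo ascii_strtolower("Content-TYPE"), "\n";
echo ascii_strtoupper("title"), "\n";
```

Because the mapping is fixed, "I" and "i" always convert to each other regardless of any setlocale() call, which is the property that the Turkish locales break for strtolower().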