Newsgroups: php.internals
To: "internals@lists.php.net"
Date: Fri, 17 Sep 2021 12:58:43 +1000
Subject: Make strtolower/strtoupper just do ASCII
From: tstarling@wikimedia.org (Tim Starling)

I would like to know if a patch to make strtolower and strtoupper do plain ASCII case conversion would be accepted, or if an RFC should be created.

The situation with case conversion is inconsistent. The following functions do ASCII case conversion: strcasecmp, strncasecmp, substr_compare. The following functions do locale-dependent case conversion: strtolower, strtoupper, str_ireplace, stristr, stripos, strripos, strnatcasecmp, ucfirst, ucwords, lcfirst. I would make them all do ASCII case conversion.
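
To illustrate what "plain ASCII case conversion" means here, the following is a minimal userland sketch, not the proposed patch. The function names ascii_strtolower() and ascii_strtoupper() are hypothetical and are not part of PHP. The sketch relies only on the three-argument form of strtr(), which translates byte by byte, so only the 26 letters A-Z and a-z are affected and the result does not depend on setlocale().

```php
<?php
// Hypothetical helpers for illustration only; these names do not exist in PHP.
// Only the bytes A-Z / a-z are mapped. Every other byte, including bytes
// >= 0x80 that make up UTF-8 sequences, is returned unchanged.

function ascii_strtolower(string $s): string
{
    return strtr($s, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz');
}

function ascii_strtoupper(string $s): string
{
    return strtr($s, 'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ');
}

// The UTF-8 bytes of "É" are outside A-Z, so they pass through untouched.
echo ascii_strtolower("HTML5 Élan"), "\n"; // prints: html5 Élan
```

Because no byte outside the ASCII letters is ever rewritten, a UTF-8 string cannot be turned into an invalid byte sequence by this kind of conversion, whatever locale is active.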

Developers need ASCII case conversion, because it is used internally by PHP for things like class name comparison, and because it is a specified algorithm in HTML 5 and related standards.

The existing options for ASCII case conversion are:

* Never call setlocale(). But this breaks non-ASCII characters in escapeshellarg() and can't be guaranteed in a library.
* Call setlocale(LC_ALL, "C.UTF-8"). But this is non-portable and also can't be guaranteed in a library.
* Use strtr(). But this is ugly and slow.

If mbstring has a way to do it, I can't find it. I tested mb_strtolower($s, '8bit') and mb_strtolower($s, 'ascii').

Note that locale-dependent case conversion is almost never a useful feature. Strings are passed through tolower() one byte at a time, to be interpreted with some legacy 8-bit character set. So the result will typically be mojibake even if the correct locale is selected.

strtolower() mangles UTF-8 strings in many locales, such as fr-FR. I made a full list at . The UTF-8 locales mostly work, except for the Turkish ones, which mangle ASCII strings.

At https://bugs.php.net/bug.php?id=67815 , Nikita Popov wrote: "My general recommendation is to avoid locales and locale-dependent functions, as locales are a fundamentally broken concept."

I agree with that. I think PHP should migrate away from locale dependence. When PHP was young, it was convenient to use the C library, but we've progressed well past that point now.

-- 
Tim Starling