Newsgroups: php.internals
Xref: news.php.net php.internals:116087
Date: Sat, 18 Sep 2021 14:33:31 +0000
To: Tim Starling, "internals@lists.php.net"
Subject: Re: [PHP-DEV] Make
 strtolower/strtoupper just do ASCII
From: tysonandre775@hotmail.com (tyson andre)

Hi Tim Starling,

> I would like to know if a patch to make strtolower and strtoupper do
> plain ASCII case conversion would be accepted, or if an RFC should be
> created.
>
> The situation with case conversion is inconsistent.
>
> The following functions do ASCII case conversion: strcasecmp,
> strncasecmp, substr_compare.
>
> The following functions do locale-dependent case conversion:
> strtolower, strtoupper, str_ireplace, stristr, stripos, strripos,
> strnatcasecmp, ucfirst, ucwords, lcfirst.
>
> I would make them all do ASCII case conversion.
>
> Developers need ASCII case conversion, because it is used internally
> by PHP for things like class name comparison, and because it is a
> specified algorithm in HTML 5 and related standards.
>
> The existing options for ASCII case conversion are:
>
> * Never call setlocale(). But this breaks non-ASCII characters in
>   escapeshellarg() and can't be guaranteed in a library.
>
> * Call setlocale(LC_ALL, "C.UTF-8"). But this is non-portable and also
>   can't be guaranteed in a library.
>
> * Use strtr(). But this is ugly and slow.
>
> If mbstring has a way to do it, I can't find it. I tested
> mb_strtolower($s, '8bit') and mb_strtolower($s,'ascii').
>
> Note that locale-dependent case conversion is almost never a useful
> feature. Strings are passed through tolower() one byte at a time, to
> be interpreted with some legacy 8-bit character set. So the result
> will typically be mojibake even if the correct locale is selected.
>
> strtolower() mangles UTF-8 strings in many locales, such as fr-FR. I
> made a full list at . The
> UTF-8 locales mostly work, except for the Turkish ones, which mangle
> ASCII strings.
>
> At https://bugs.php.net/bug.php?id=67815 , Nikita Popov wrote: "My
> general recommendation is to avoid locales and locale-dependent
> functions, as locales are a fundamentally broken concept." I agree
> with that. I think PHP should migrate away from locale dependence.
> When PHP was young, it was convenient to use the C library, but we've
> progressed well past that point now.

I think it's a good idea (but it would still require an RFC).
As you said, acting on bytes rather than codepoints seems like it's almost always incorrect outside a narrow range
(except for single-byte charsets such as https://en.wikipedia.org/wiki/ISO/IEC_8859-1).

The behavior of strtolower is inconvenient for common uses in
- filesystem paths, where strtolower('I') isn't 'i' in tr_TR
- username validation, where two account names could be treated as the same case-insensitive string in some locales but not in others
- etc.

When implementing this, note that Zend/Optimizer/sccp.c has a list of functions, such as str_contains, that it can optimize when they are called on constant arguments.
After removing locale dependence, those optimizations could safely be added for the functions that become locale-independent as a result of your change.
- This would allow eliminating more dead code, and make code calling those functions (on constant arguments) faster by caching the resulting strings in opcache.

The function `zend_string_tolower` can safely be used to efficiently convert strings to lowercase in a locale-independent way.
(zend_string_toupper hasn't been needed yet because php-src's internals don't have any use cases for it, but it could be added in such a PR.)

```
    || zend_string_equals_literal(name, "str_contains")
    || zend_string_equals_literal(name, "str_ends_with")
    || zend_string_equals_literal(name, "str_replace")
    || zend_string_equals_literal(name, "str_split")
    || zend_string_equals_literal(name, "str_starts_with")
```

Thanks,
Tyson
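
For illustration, the byte-wise ASCII case conversion discussed above can be sketched in standalone C roughly as follows. This is only a sketch under stated assumptions: the helper names `ascii_tolower_byte`, `ascii_strtolower` and `ascii_equals_ci` are hypothetical and are not php-src APIs, and the code does not claim to mirror how `zend_string_tolower` is implemented. The point is just that only the bytes 'A'..'Z' are mapped, so the result cannot change with setlocale(), and multi-byte UTF-8 sequences pass through untouched.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a php-src API: lowercase a single byte using
 * ASCII rules only. Unlike tolower(), the result never depends on
 * setlocale(); bytes outside 'A'..'Z' are returned unchanged. */
unsigned char ascii_tolower_byte(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + ('a' - 'A')) : c;
}

/* Hypothetical helper: lowercase `len` bytes of `s` in place. */
void ascii_strtolower(char *s, size_t len)
{
    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        s[i] = (char)ascii_tolower_byte((unsigned char)s[i]);
    }
}

/* Hypothetical helper: compare two buffers of `len` bytes each, ignoring
 * ASCII case only. Returns 1 if they match, 0 otherwise. */
int ascii_equals_ci(const char *a, const char *b, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (ascii_tolower_byte((unsigned char)a[i]) !=
            ascii_tolower_byte((unsigned char)b[i])) {
            return 0;
        }
    }
    return 1;
}
```

With these helpers, `ascii_strtolower` turns "FILE.TXT" into "file.txt" under any locale (including tr_TR), and the two bytes of a UTF-8 encoded 'É' (0xC3 0x89) are left as they are instead of being mangled.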