Newsgroups: php.internals
From: nicolas.grekas+php@gmail.com (Nicolas Grekas)
To: Tim Starling
Cc: PHP internals
Date: Mon, 11 Oct 2021 09:38:25 +0200
Subject: Re: [PHP-DEV] [RFC] Locale-independent case conversion
In-Reply-To: <88b5171e-48b3-0176-47de-ee1499832b57@wikimedia.org>

On Mon, 11 Oct 2021 at 03:33, Tim Starling wrote:

> On 4/10/21 9:08 pm, Nikita Popov wrote:
> >
> > Hi Tim,
> >
> > Thanks for creating this proposal, it looks great!
> >
> > I think this is a very beneficial change, and the amount of
> > incorrect locale-dependent calls we had just in php-src further
> > convinced me of this: We're generally aware of the problem, and we
> > still made this mistake. Many times.
> >
> > The only open question I have is regarding the ctype_* functions.
> > One might argue that these functions should be locale-independent as
> > well.
> > Certainly, whenever I have used ctype_digit() I only intended
> > it to match [0-9]. It seems like some people try to use
> > ctype_alpha() in a locale-sensitive way
> > (https://stackoverflow.com/questions/19929965/php-setlocale-not-working-for-ctype-alpha-check)
> > and then fail because it doesn't support UTF-8.
>
> OK, I removed ctype_tolower() and ctype_toupper() from the RFC and the
> PR since they would be incompatible with a move towards a
> locale-independent ctype extension.
>
> The non-controversial parts of the PR were split and merged, so I
> rebased the PR and updated the RFC accordingly.
>
> Do you think the RFC is ready for voting now?
>
> > PS: Regarding escapeshellarg(), are you aware of the array command
> > support for proc_open() that was added in PHP 7.4? That does away
> > the need to escape arguments.
>
> It doesn't really help us. I recently wrote a new shell command
> execution system for MediaWiki called Shellbox. As part of that
> project, I reviewed how shell execution is used in the MediaWiki
> ecosystem. There are a lot of callers which are using shell features,
> for example redirecting inputs or outputs, or constructing pipelines.
> I didn't really want to break them all or reimplement those features
> without the shell. And we have security and containerization wrappers
> which depend on construction of a shell command string. So we need to
> be able to construct shell command strings safely.
>
> After studying locale sensitivity for this RFC, I decided to get rid
> of escapeshellarg() from MediaWiki. Instead we are doing our own shell
> escaping:
>
> https://gerrit.wikimedia.org/r/c/mediawiki/libs/Shellbox/+/722548
>
> I also made MediaWiki use a fixed locale, instead of being configurable.

Hi Tim,

thanks for the RFC and for the above pointers, I'm going to have a look at
Symfony Process to follow your lead!

About the RFC, I just have one note:

> I didn't include strnatcasecmp() and natcasesort() in this RFC, because
> they also use isdigit() and isspace(), and because they are intended for
> natural language processing. They could be migrated in future.

Despite their name, I have never used the *natcase* functions for natural
language processing. I use them, e.g., to sort lists of files in a directory,
mainly so that numbers in the names are ordered correctly. That is not what I
would call natural language processing, and I'm not aware of anyone who
actually uses them for that.

I'm wondering whether it's a good idea to postpone migrating them to a
hypothetical future, since to me the whole reasoning of the RFC applies to
them as well.

Regards,
Nicolas