Newsgroups: php.internals Path: news.php.net Xref: news.php.net php.internals:115337 Return-Path: Delivered-To: mailing list internals@lists.php.net Received: (qmail 38952 invoked from network); 7 Jul 2021 09:01:23 -0000 Received: from unknown (HELO php-smtp4.php.net) (45.112.84.5) by pb1.pair.com with SMTP; 7 Jul 2021 09:01:23 -0000 Received: from php-smtp4.php.net (localhost [127.0.0.1]) by php-smtp4.php.net (Postfix) with ESMTP id 3B9A21804C8 for ; Wed, 7 Jul 2021 02:23:21 -0700 (PDT) X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on php-smtp4.php.net X-Spam-Level: X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,HTML_MESSAGE, RCVD_IN_DNSWL_NONE,RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_PASS autolearn=no autolearn_force=no version=3.4.2 X-Spam-Virus: No X-Envelope-From: Received: from mail-lf1-f46.google.com (mail-lf1-f46.google.com [209.85.167.46]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange ECDHE (P-256) server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by php-smtp4.php.net (Postfix) with ESMTPS for ; Wed, 7 Jul 2021 02:23:20 -0700 (PDT) Received: by mail-lf1-f46.google.com with SMTP id p21so2857075lfj.13 for ; Wed, 07 Jul 2021 02:23:20 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=VaYJR1NcGbeyFOw9LuJPQrUiz1dPSUxWk56k7uOnMGE=; b=NXD5SXpljW91ke9jnHsor3RlvML0nOa5iyJYBW5zV/eTCm6vnM5x+U/08mYtfJKa2m MhgaIW7EShAX/P4dj8KJFrxXgS8S7YWpU6btP+E5neOgrBMV+MMhhZF17QRJd7INVCpD 8A/PcfdGdpnd4IwEdN3Drz6Qg/9Ncfm5Vexm7/j2EDb3sjSGrfSfKdKZLukSvHxOQL5J Xjf7DqxSCapRRMjv/QYijJ0O05GhtpXNy5PxP61yit2vprcU9oOuAsWFrd8izmpLv36j lTJwCDfhRHpkyqfde4Gk2+bWIcEvabt1vmsFhvylKCZwx2Ml9M7E/muxz+fQ3i6Hpc28 RXdQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=VaYJR1NcGbeyFOw9LuJPQrUiz1dPSUxWk56k7uOnMGE=; b=DtJETvRHHXf00GPtPosalo+VQ7FKwog/CLmGSV2P3GIgv6nHhFpzy1oTTaSq9J16GH pmr8njqelLlW1ffCXQASFyiLIjfo3sKQ9FkBfRMB/Uc53q8mEGr9V4uHKkgFX6z4eof8 Di2P6fEBDbOVSR8HLhwl7Gz8atGDlKyfO9MJdt0Gvkj/ynFJjXYWonoqZhsYib67Qf4k uWK1sfGgP6uP1iecu0Lx1fGwaMZHqPtYeTx5kDqqqJ/IV8qc7vEhei3DC3yWa6ro6Xdc vqdg/a0pAZGNDzmoj/somy2na1p+FULsX3eFwK4X4nP7+nuse67VIrY0/iTt2GzEjbFT TGPg== X-Gm-Message-State: AOAM5315hdPqqhEuGAUVGIWoLI2/PAgmasZV4m+Y3COqnMddos5fbYWY km9KL3VO3ccN2F86Pb/+27uXIxLQkcxJTt2OCIL78OublFUXhg== X-Google-Smtp-Source: ABdhPJxFZQ4W+n5ylLNn9sWzzMC8jAihyr3jm5bQBaWtJxIpFBhjDrkB3iXSRwVyOFJmbs1YJzMITz9/5uKnYbShSvE= X-Received: by 2002:a19:7d05:: with SMTP id y5mr18183505lfc.159.1625649798645; Wed, 07 Jul 2021 02:23:18 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: Date: Wed, 7 Jul 2021 11:23:02 +0200 Message-ID: To: Craig Francis Cc: PHP internals Content-Type: multipart/alternative; boundary="000000000000c70b6705c685164a" Subject: Re: [PHP-DEV] [RFC] [VOTE] is_literal From: nikita.ppv@gmail.com (Nikita Popov) --000000000000c70b6705c685164a Content-Type: text/plain; charset="UTF-8" On Mon, Jul 5, 2021 at 8:15 PM Craig Francis wrote: > Hi Internals, > > I have opened voting on https://wiki.php.net/rfc/is_literal for the > is-literal function. > > The vote closes 2021-07-19 > > The proposal is to add the function is_literal(), a simple way to identify > if a string was written by a developer, removing the risk of a variable > containing an Injection Vulnerability. > > This implementation is for literal strings ONLY (after discussion over > allowing integers) and, thanks to the amazing work of Joe Watkins, now > works fully with compiler optimisations, interned strings etc. > The new implementation, while being very predictable, also comes at a cost. The RFC doesn't really explain how this works, so let me explain it here (based on a cursory look through the patch): Strings have a new flag that indicates whether they are "literal". The problem is that PHP performs string interning, which means that certain strings, mostly those seen during compilation of PHP code, are deduplicated. If you write "foo" two times in a piece of PHP code, there will only be one interned string "foo". Now, what happens if there is both a literal string "foo" and a non-literal string "foo" and we want to intern both? I believe what the original implementation did is to just assume that all interned strings are literal strings (this is an educated guess, I don't know where to find the old implementation). This makes things technically very simple, and is a pretty sensible assumption in practice. The reason is that interned strings are almost always derived from literal strings in the program. The main exception to that are a) single character strings, where a lot of code will prefer fetching an interned string rather than allocating a new one and b) the result of optimizations in conjunction with opcache, which will render strings like $a=1; $b="Foo".$a; interned as a result of constant propagation. What the new implementation does is to actually allow two separate interned strings for "foo"(literal) and "foo"(not literal). Part of the hashtable implementation needs to be duplicated, because it now needs to distinguish strings that are nominally equal. The complexity involved here doesn't look terrible, but I'm unsure about the performance and memory usage impact. Where per-request strings are concerned, I expect duplication to be low, because most per-request interned strings are going to be literal. However, I don't think this holds for permanent interned strings, which will be predominantly (or entirely?) non-literal. At least it doesn't look to me like names of internal functions/classes/etc are considered literal. I think this may be problematic, in that now the permanent non-literal interned string in the function/class tables will be different from the per-request literal interned string used by user code to reference it. This means that all of these are going to be duplicated, and will no longer benefit from efficient identity comparison, and will have to be compared based on string contents instead. I'm not sure what the actual impact on performance would be. The RFC indicates that the impact is minor, but I believe those measurements were made with the original version of the RFC, which did not try to distinguish literal and non-literal interned strings. Overall, this is a no vote from my side. While I was not entirely convinced by the RFC, I was willing to accept the original approach based on sheer simplicity -- tracking the "probably literal" flag didn't really cost us anything in terms of either technical complexity or performance. With the new implementation this isn't the case anymore. I could be convinced that the technical implications here are unproblematic based on further analysis, but the current RFC doesn't really discuss this point at all. Regards, Nikita --000000000000c70b6705c685164a--