Newsgroups: php.internals Path: news.php.net Xref: news.php.net php.internals:115339 Return-Path: Delivered-To: mailing list internals@lists.php.net Received: (qmail 43508 invoked from network); 7 Jul 2021 09:28:31 -0000 Received: from unknown (HELO php-smtp4.php.net) (45.112.84.5) by pb1.pair.com with SMTP; 7 Jul 2021 09:28:31 -0000 Received: from php-smtp4.php.net (localhost [127.0.0.1]) by php-smtp4.php.net (Postfix) with ESMTP id 97A361804C8 for ; Wed, 7 Jul 2021 02:50:29 -0700 (PDT) X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on php-smtp4.php.net X-Spam-Level: X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,HTML_MESSAGE, RCVD_IN_DNSWL_NONE,RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_PASS autolearn=no autolearn_force=no version=3.4.2 X-Spam-Virus: No X-Envelope-From: Received: from mail-oi1-f171.google.com (mail-oi1-f171.google.com [209.85.167.171]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange ECDHE (P-256) server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by php-smtp4.php.net (Postfix) with ESMTPS for ; Wed, 7 Jul 2021 02:50:29 -0700 (PDT) Received: by mail-oi1-f171.google.com with SMTP id z3so1863257oib.9 for ; Wed, 07 Jul 2021 02:50:29 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc; bh=KW4O+q0jA1ulG1ZJk/BCXVA5PHaJ0ny9adqUBJhAFV4=; b=RdJFtIU8S6Oo/tEAgqGPzXXkMIcZsPOpgUv1fs6oMgkt2+a5ZQF33YnGJoE7BRm8j8 JKEvm8tZY01S8fvYO+20RVAI8lQU52aFIwKXdnzIP1J7BQukCrTHJliIPFpZGAhH6NGb Yewv3+WVTxhbTRtK+C5dBiT+xenptwf8cXOAXcpk6fn2JPyLUKbVIQQTSg1sr5UBeUB7 yCUiyNYDE3hFwgo3s/tfM/62HdHfeaXq6zzLWNI1+jE26QKzrhiH69HC3uL2ofgLUov+ +1dLu5ML8dqtI79H5kmFBLwPtFCeSA7/ZuXbwDRFOWUM6dqujM4hlW2ODlWPdJv06Ts/ ZNwg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc; bh=KW4O+q0jA1ulG1ZJk/BCXVA5PHaJ0ny9adqUBJhAFV4=; b=jCSjtLPnpg36Xeko2WhcxRcCnnpZu9UEygVMmJxUyWDfLTv3N87+mxCKGn8FAe91Mx zLP580+qqXleAm/DHqWblg2AjL4Pb9Pvx1Ms/6fSh9w+amYIE9fLua5MQ1KL71F+/2DL gBaUvXDuJYyR/D+hoDJE4Vi8Ygpu5+p0jPQaryOUrbhu58AMBMKA7u5v1TkN5ct0pssV +cG+VagLhil8oqWEY4CZH6QegWfckzOyAStHgKZAMJoPYhUz+D0cskKWjZVVXe0JzLCL ATwHlxlWBUwBC6Rl4PrnD0/rPog0TqmU/JVXgsQB2ZzW5J+9zRvSqp5gFuTf30St2lwA 5jcA== X-Gm-Message-State: AOAM530qna7b/DjPDvC/dyLRjboMtun+4PxdSBvV7Ps3TS5pA/BsX/Wf HEchVF6EeOOeEN6W7ZU2wtoAJeEFkI4KTWpKbtU= X-Google-Smtp-Source: ABdhPJz4nMHaryUttm8Ej2OKj48H3KuyjThypM+noVycUKYhPBFQZYeF07BHmk0XMXliOMPRBiNTs/e9Ji5xM3i2mDs= X-Received: by 2002:aca:b484:: with SMTP id d126mr4251518oif.80.1625651428744; Wed, 07 Jul 2021 02:50:28 -0700 (PDT) MIME-Version: 1.0 Received: by 2002:a05:6839:16cd:0:0:0:0 with HTTP; Wed, 7 Jul 2021 02:50:28 -0700 (PDT) In-Reply-To: References: Date: Wed, 7 Jul 2021 11:50:28 +0200 Message-ID: To: Nikita Popov Cc: Craig Francis , PHP internals Content-Type: multipart/alternative; boundary="000000000000f0610005c6857740" Subject: Re: [PHP-DEV] [RFC] [VOTE] is_literal From: krakjoe@gmail.com (Joe Watkins) --000000000000f0610005c6857740 Content-Type: text/plain; charset="UTF-8" The perf numbers are for current implementation. Predictability seems more important than simplicity, considering what the feature is for. The only way to make the simpler implementation predictable in the same kind of way was found to be unacceptable during discussion. Cheers Joe On Wednesday, 7 July 2021, Nikita Popov wrote: > On Mon, Jul 5, 2021 at 8:15 PM Craig Francis > wrote: > > > Hi Internals, > > > > I have opened voting on https://wiki.php.net/rfc/is_literal for the > > is-literal function. > > > > The vote closes 2021-07-19 > > > > The proposal is to add the function is_literal(), a simple way to > identify > > if a string was written by a developer, removing the risk of a variable > > containing an Injection Vulnerability. > > > > This implementation is for literal strings ONLY (after discussion over > > allowing integers) and, thanks to the amazing work of Joe Watkins, now > > works fully with compiler optimisations, interned strings etc. > > > > The new implementation, while being very predictable, also comes at a cost. > The RFC doesn't really explain how this works, so let me explain it here > (based on a cursory look through the patch): > > Strings have a new flag that indicates whether they are "literal". The > problem is that PHP performs string interning, which means that certain > strings, mostly those seen during compilation of PHP code, are > deduplicated. If you write "foo" two times in a piece of PHP code, there > will only be one interned string "foo". Now, what happens if there is both > a literal string "foo" and a non-literal string "foo" and we want to intern > both? > > I believe what the original implementation did is to just assume that all > interned strings are literal strings (this is an educated guess, I don't > know where to find the old implementation). This makes things technically > very simple, and is a pretty sensible assumption in practice. The reason is > that interned strings are almost always derived from literal strings in the > program. The main exception to that are a) single character strings, where > a lot of code will prefer fetching an interned string rather than > allocating a new one and b) the result of optimizations in conjunction with > opcache, which will render strings like $a=1; $b="Foo".$a; interned as a > result of constant propagation. > > What the new implementation does is to actually allow two separate interned > strings for "foo"(literal) and "foo"(not literal). Part of the hashtable > implementation needs to be duplicated, because it now needs to distinguish > strings that are nominally equal. > > The complexity involved here doesn't look terrible, but I'm unsure about > the performance and memory usage impact. Where per-request strings are > concerned, I expect duplication to be low, because most per-request > interned strings are going to be literal. However, I don't think this holds > for permanent interned strings, which will be predominantly (or entirely?) > non-literal. At least it doesn't look to me like names of internal > functions/classes/etc are considered literal. I think this may be > problematic, in that now the permanent non-literal interned string in the > function/class tables will be different from the per-request literal > interned string used by user code to reference it. This means that all of > these are going to be duplicated, and will no longer benefit from efficient > identity comparison, and will have to be compared based on string contents > instead. > > I'm not sure what the actual impact on performance would be. The RFC > indicates that the impact is minor, but I believe those measurements were > made with the original version of the RFC, which did not try to distinguish > literal and non-literal interned strings. > > Overall, this is a no vote from my side. While I was not entirely convinced > by the RFC, I was willing to accept the original approach based on sheer > simplicity -- tracking the "probably literal" flag didn't really cost us > anything in terms of either technical complexity or performance. With the > new implementation this isn't the case anymore. I could be convinced that > the technical implications here are unproblematic based on further > analysis, but the current RFC doesn't really discuss this point at all. > > Regards, > Nikita > --000000000000f0610005c6857740--