Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:115337
MIME-Version: 1.0
References: <CAFv4g+HRHrvCOTgpbVfCXj06De1kL+1+RwnGgq6tcyLNAtLfQA@mail.gmail.com>
In-Reply-To: <CAFv4g+HRHrvCOTgpbVfCXj06De1kL+1+RwnGgq6tcyLNAtLfQA@mail.gmail.com>
Date: Wed, 7 Jul 2021 11:23:02 +0200
Message-ID: <CAF+90c8DbBajg7yukYkyXeJp2RU3JngLcr_zJdP=tOw9ebb6ag@mail.gmail.com>
To: Craig Francis <craig@craigfrancis.co.uk>
Cc: PHP internals <internals@lists.php.net>
Content-Type: multipart/alternative; boundary="000000000000c70b6705c685164a"
Subject: Re: [PHP-DEV] [RFC] [VOTE] is_literal
From: nikita.ppv@gmail.com (Nikita Popov)

--000000000000c70b6705c685164a
Content-Type: text/plain; charset="UTF-8"

On Mon, Jul 5, 2021 at 8:15 PM Craig Francis <craig@craigfrancis.co.uk>
wrote:

> Hi Internals,
>
> I have opened voting on https://wiki.php.net/rfc/is_literal for the
> is-literal function.
>
> The vote closes 2021-07-19
>
> The proposal is to add the function is_literal(), a simple way to identify
> if a string was written by a developer, removing the risk of a variable
> containing an Injection Vulnerability.
>
> This implementation is for literal strings ONLY (after discussion over
> allowing integers) and, thanks to the amazing work of Joe Watkins, now
> works fully with compiler optimisations, interned strings etc.
>

The new implementation, while being very predictable, also comes at a cost.
The RFC doesn't really explain how this works, so let me explain it here
(based on a cursory look through the patch):

Strings have a new flag that indicates whether they are "literal". The
problem is that PHP performs string interning, which means that certain
strings, mostly those seen during compilation of PHP code, are
deduplicated. If you write "foo" two times in a piece of PHP code, there
will only be one interned string "foo". Now, what happens if there is both
a literal string "foo" and a non-literal string "foo" and we want to intern
both?

I believe what the original implementation did is to just assume that all
interned strings are literal strings (this is an educated guess, I don't
know where to find the old implementation). This makes things technically
very simple, and is a pretty sensible assumption in practice. The reason is
that interned strings are almost always derived from literal strings in the
program. The main exception to that are a) single character strings, where
a lot of code will prefer fetching an interned string rather than
allocating a new one and b) the result of optimizations in conjunction with
opcache, which will render strings like $a=1; $b="Foo".$a; interned as a
result of constant propagation.

What the new implementation does is to actually allow two separate interned
strings for "foo"(literal) and "foo"(not literal). Part of the hashtable
implementation needs to be duplicated, because it now needs to distinguish
strings that are nominally equal.

The complexity involved here doesn't look terrible, but I'm unsure about
the performance and memory usage impact. Where per-request strings are
concerned, I expect duplication to be low, because most per-request
interned strings are going to be literal. However, I don't think this holds
for permanent interned strings, which will be predominantly (or entirely?)
non-literal. At least it doesn't look to me like names of internal
functions/classes/etc are considered literal. I think this may be
problematic, in that now the permanent non-literal interned string in the
function/class tables will be different from the per-request literal
interned string used by user code to reference it. This means that all of
these are going to be duplicated, and will no longer benefit from efficient
identity comparison, and will have to be compared based on string contents
instead.

I'm not sure what the actual impact on performance would be. The RFC
indicates that the impact is minor, but I believe those measurements were
made with the original version of the RFC, which did not try to distinguish
literal and non-literal interned strings.

Overall, this is a no vote from my side. While I was not entirely convinced
by the RFC, I was willing to accept the original approach based on sheer
simplicity -- tracking the "probably literal" flag didn't really cost us
anything in terms of either technical complexity or performance. With the
new implementation this isn't the case anymore. I could be convinced that
the technical implications here are unproblematic based on further
analysis, but the current RFC doesn't really discuss this point at all.

Regards,
Nikita

--000000000000c70b6705c685164a--