Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:88637
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Received-SPF: pass (pb1.pair.com: domain craigfrancis.co.uk designates 209.85.212.177 as permitted sender)
Content-Type: text/plain; charset=us-ascii
Mime-Version: 1.0 (Mac OS X Mail 7.3 \(1878.6\))
In-Reply-To: <CAAyV7nGnqXbc0wM6aCrjMZXEW6J8tpnerqUHiruVThTOCTWaAg@mail.gmail.com>
Date: Fri, 2 Oct 2015 10:42:02 +0100
Cc: Arvids Godjuks <arvids.godjuks@gmail.com>,
 Christopher Owen <christopher.owen@live.com>,
 Kalle Sommer Nielsen <kalle@php.net>,
 Internals <internals@lists.php.net>,
 wietse@porcupine.org,
 Xinchen Hui <laruence@php.net>
Content-Transfer-Encoding: quoted-printable
Message-ID: <B7ACE3A5-1040-4AA4-8BF2-94DAE8914677@craigfrancis.co.uk>
References: <BLU437-SMTP762BF68D5F013E39BF6AC6E95D0@phx.gbl> <CAJW__o3mCdAdm=jaLpLCSpWC4ekmc98CszcUKNr98qu5JgZBeQ@mail.gmail.com> <BLU437-SMTP66DB73C07E0C570EC13C4CE95D0@phx.gbl> <EBCBA455-F469-4D2A-B21A-B6EA3CE4FAEB@craigfrancis.co.uk> <CAL0xaBEsc4eB9eYdkOr-ksfqEtniBX7TCQF3BT+3DWzGhKvv7A@mail.gmail.com> <CAAyV7nGnqXbc0wM6aCrjMZXEW6J8tpnerqUHiruVThTOCTWaAg@mail.gmail.com>
To: Anthony Ferrara <ircmaxell@gmail.com>
Subject: Re: [PHP-DEV] taint
From: craig@craigfrancis.co.uk (Craig Francis)

While skim reading emails (just got back from holiday), I wanted to =
add...


On 15 Sep 2015, at 17:23, Anthony Ferrara <ircmaxell@gmail.com> wrote:

> All,
>=20
> On Tue, Sep 15, 2015 at 11:15 AM, Arvids Godjuks
> <arvids.godjuks@gmail.com> wrote:
>> I fully support your effort to get this into the PHP to be part of =
core
>> extensions, or at least one of those that keep up with the language
>> releases.
>> This is a very good tool to have, and you can actually run it in =
production
>> to catch things that may slipped the stating (things happen). And =
it's
>> invaluable during during testing.
>=20
> I'm against any form of tainting for a few reasons.
>=20
> First, it's only really useful as a first-order heuristic. A failure
> does not indicate a security vulnerability, and a non-failure does not
> indicate the absence of one. An example of a false-positive would be
> if (preg_match('(^[a-z0-9]+$)', $input)) echo $input;. Taint-based
> systems would flag that as a possible XSS, where it's clearly not in
> any context.


Personally I would argue that if the `echo $input` was for printing a =
variable in a HTML construct, so it MUST go though one of the HTML =
encoding functions, because that regexp could be changed later.

e.g. an inexperienced developer being told that usernames on a website =
can now support angle brackets, so "just edit the regexp for the =
username validation".

This is keeping in mind that the regexp is now usually kept separate =
from the view code (e.g. MVC style)



> An example of a false-negative would be the following:
>=20
>    <a href=3D<?php echo htmlspecialchars($input); ?>>Bar</a>. If the
> input is "foo onclick=3Ddosomethingbad();", an XSS is still executed.


True, but without htmlspecialchars, it will ALWAYS be bad... so while a =
tainting system won't be able to pick up all problems, it can at least =
identify more mistakes than it not existing (percentage wise, well that =
will depend on the programmer).

Maybe you could think of it like having a second pair of eyes looking =
over your shoulder, politely asking you if you forgot something (a bit =
like how pair programming helps improve code quality).



>=20
> Another example of a false negative would be:
>=20
>    $query =3D "SELECT * FROM foo WHERE bar =3D " .
> mysqli_real_escape_string($c, $input) . ";";
>=20
> A false positive is potentially OK. A false negative is not.
>=20


All of the examples have been about using pg_escape_identifier(), =
because that adds the quote marks... mysqli_real_escape_string by itself =
would not be enough to say that it safe for an SQL string :-)



> Second, it will encourage the improper pattern of "sanitize" functions
> to de-taint input. These are functions which call
> `mysqli_real_escape_string(htmlspecialchars(whatever(etc($input))))`.
> No matter what the context the variables are used in. This doesn't
> improve security (and in many cases can demonstrably harm it). It just
> encourages people to work around it. And that ignores that there's an
> `untaint()` function that they can call on any variable to "silence
> the errors".


I don't think a sanitize() function would be implemented though, as it =
would make a mess of variables being passed though more than once... =
resulting in things like double html encoding (e.g. &amp;amp;), and =
these stand out much better than the current sanitize() functions which =
do things like silently stripping "bad" characters like double quotes =
and less than signs.

If anything, a proper tainting system would help us move away from these =
horrible functions (says he having had to remove a couple before now).



> Third, it ignores context. This is related to the first two, but I
> think is a separate concern. An example from the taint RFC
> (https://wiki.php.net/rfc/taint) is the shell-execution. If the
> variable is used in the context of command, one escape function is
> needed. If it's used as an argument, another is needed. Detecting
> which is not something that's trivial for a language-level taint
> function. And calling the wrong function defeats the purpose. So if we
> can't detect the context, we can't reliably say whether it's safe or
> not.


Not sure I understand, if anything this gives the strings a context?

Rather than every variable being a simple string (number, etc), they =
start off being plain text (or a string defined as a constant in PHP), =
and their escaping is checked in the context of the function that was =
called (e.g. exec needing escapeshellarg).



> In the hands of an experienced engineer/security researcher, taint
> systems can be a really handy tool to help find possible issues. In
> the hands of a junior engineer, it can actually make things worse
> because it encourages them to just "silence the errors".


While I originally hoped that it would be enabled by default, I think =
for now it would have to be off by default, and probably only used by =
"experienced engineer/security researchers".

At the moment, static analysis seems to be the closest alternative I can =
find (and they are pretty awful).



> Considering that it's not a reliable tool, and is only a heuristic, I
> would suggest that this be implemented as an extension, and not in the
> core of a programming language.


Maybe, but I think with some discussions, having it part of the core =
language would help get all the other extensions to use it (if needed), =
and help improve viability.

See it like having Notices present, but off by default (which they are).

The other positive is that distros such as Debian / RedHat will then =
ship with it by default (disabled), rather than having to ask/hope that =
they will provide a version with it (keeping in mind that they provide =
security updates without needing to re-compile).


> With that said, if there are engine hooks necessary to implement this
> feature better as an extension, I'm all for adding those hooks. I'm
> just against adding something to the language that won't actually be
> able to accomplish what it promises.
>=20
> Escaping is always context-sensitive, in every case. Tainting as has
> been proposed in the past doesn't (and literally can't in the general
> case) know the context. Therefore it can't reliably know the correct
> way to escape.


I think it can, but I agree that there are some edge cases that it won't =
be able to catch, such as:

   $url =3D $_GET['u']; // ?u=3Djavascript:alert(document.cookies);

   echo '<a href=3D"' . htmlentities($url) .'">Cookies</a>';


> Consider the case of addslashes() and magic_quotes_gpc. The reason
> that they are insecure is that they don't know the context that the
> variable is used in. This leads to character set issues, as well as
> cases where backslashes enter contexts they don't escape (like HTML
> output). We learned years ago that context-insensitive escaping
> doesn't work and is literally bad. Let's not make the same mistakes
> again (even if it's for the correct reasons this time).


I am repeating myself now, but yes, magic_quotes_gpc did not know about =
the context those variables were going into... tainting on the other =
hand is checking the variables, which are going into a known context, =
have been escaped (with a couple of possible issues, but they can be =
addressed though other methods).

Craig