A while ago I posted a proposal to add support for tainted variables
to PHP, to alert programmers at run-time when they make the common
mistake of using uncleansed input with include, echo, system, open,
etc. A copy of the original post can be found on-line, for example,
at http://marc.info/?l=php-internals&m=116621380305497&w=2
After working on this over the summer in bursts, I now have a first
implementation that adds taint support to the core engine and to
a selection of built-ins and extensions. Not surprisingly, the
initial plan needed to be updated in the light of experience. The
good news is that performance is better than I hoped it would be.
A quick example
To give an idea of the functionality, consider the following program
with an obvious HTML injection bug:
<?php
$username = $_GET['username'];
echo "Welcome back, $username\n";
?>
With default .ini settings, this program does exactly what the
programmer wrote: it echos the contents of the username request
attribute, including all the malicious HTML code that an attacker
may have supplied along with it.
When I change one .ini setting:
taint_error_level = `E_WARNING`
the program produces the same output, but it also produces a warning:
Warning: echo(): Argument contains data that is not converted
with `htmlspecialchars()` or `htmlentities()` in /path/to/script
on line 3
Changing E_WARNING
into E_ERROR
causes execution to terminate, as
one would expect. These and other settings can appear in php.ini,
on the PHP command line, or they can be made with ini_set()
calls
in the application itself.
Current status
My plan is to release something in the next few days, and to go
through a several iterations while adjusting the course based on
the feedback I receive.
The performance is good: the overhead for "make test" is within the
measurement error of 1-2% (when comparing user-mode CPU time). I
do realize that a portion of the total "make test" time is spent
in non-PHP processing, but with the current numbers it doesn't
really matter if taint overhead is 1% or 2%. If someone has a
better "macro" benchmark, then of course I'm interested.
Although the code is still green, its present form already shows
potential to do things like labeling privacy sensitive data (social
security or credit card numbers) and preventing such data from
flying out the window via HTML web pages.
Multiple flavors of taint
As many know, a PHP program should use htmlspecialchars()
or
htmlentities()
before rendering data as HTML output. And before
rendering data in other "output" like contexts, a program should
use the proper conversion function, too: escapeshellcmd()
etc.
for shell commands, mysqli_real_escape_string()
for mysqli queries.
These conversions are not only needed for security reasons, they
are also required for robustness; shell or SQL commands may fail
unexpectedly when given a legitimate name such as O'Reilly.
What I have implemented encourages programmers to do exactly that:
using the RIGHT conversion function. To achieve this goal, the code
implements multiple flavors of taint.
In my initial post I proposed black-and-white taint, but I changed
my mind for human interface reasons. To help programmers eliminate
security holes from their code, they need to be told how to fix the
problem. With black-and-white taint, the system just didn't know
if the programmer had chosen the right conversion function. The
tool was too simplistic for PHP's complex environment.
Low-level implementation
Taint support is implemented with some of the unused bits in the
zval data structure. The zval is the PHP equivalent of a memory
cell. Besides a type (string, number etc.) and value, each zval has
a reference count and a flag that says whether the zval is a reference
to yet another zval that contains the actual value.
Right now I am using seven bits, but there is room for more: 32-bit
UNIX compilers such as GCC add 16 bits of padding to the current
zval data structure, and this amount isn't going to be smaller on
64-bit architectures. If I really have to squeeze the taint bits
in-between the existing bits, the taint support performance hit
goes up to perhaps 2% in the macro benchmarks that I mentioned
above (but again, this number is barely above the 1-2% measurement
error, so don't take it too literally).
The preliminary configuration user interface is rather low-level,
somewhat like MS-DOS file permissions :-( This is good enough for
testing and debugging the taint support itself, but I would not
want to have wires hanging out of the machine like this forever.
The raw bits will need to be encapsulated so that applications can
work with meaningful names and abstractions.
To give an idea of what the nuts and bolts look like, this is the
preliminary list of bits, or should I say: binary properties,
together with the parameters that control their handling:
-
TC_HTML:
- By default, data with this bit is not allowed in HTML output
(parameter: taint_checks_html). This requirement is not
enforced with the default setting of taint_error_level = 0. - The
htmlspecialchars()
andhtmlentities()
functions produce
output without this bit. - By default, this bit is set on all data from the web (parameter:
taint_marks_egpcs), from DBMS (parameter: taint_marks_dbms)
or from elsewhere (parameter: taint_marks_other).
- By default, data with this bit is not allowed in HTML output
-
TC_SHELL:
- By default, data with this bit is not allowed in shell commands
(parameter: taint_checks_shell). This requirement is not
enforced with the default setting of taint_error_level = 0. - The
escapeshellarg()
andescapeshellcmd()
functions produce
output without this bit. - By default, this bit is set on all data from the web (parameter:
taint_marks_egpcs), from DBMS (parameter: taint_marks_dbms)
or elsewhere (parameter: taint_marks_other).
- By default, data with this bit is not allowed in shell commands
-
TC_MYSQL:
- By default, data with this bit is not allowed in mysql_query()
(parameter: taint_checks_mysql). This requirement is not
enforced with the default setting of taint_error_level = 0. - The mysql_real_escape_string() function produces output without
this bit. - By default, this bit is set on all data from the web (parameter:
taint_marks_egpcs), from DBMS (parameter: taint_marks_dbms) or
elsewhere (parameter: taint_marks_other).
- By default, data with this bit is not allowed in mysql_query()
-
TC_MYSQLI:
- Like TC_MYSQL, except this one is for the
mysqli_query()
API.
The TC_MYSQLI bit is removed withmysqli_real_escape_string()
.
- Like TC_MYSQL, except this one is for the
-
TC_SELF:
- By default, data with this bit cannot be used for internal
control operations such as eval, include, as file name argument,
as network destination, or in other contexts where someone could
take away control from the application (parameter:
taint_checks_self). This requirement disappears with
taint_error_level = 0. - I haven't yet implemented a dedicated conversion function;
to shut up the warnings, this data needs to be marked as
"safe" with a low-level untaint($foo, TC_SELF) call. - By default, this bit is set on all data from the web (parameter:
taint_marks_egpcs).
- By default, data with this bit cannot be used for internal
-
TC_USER1, TC_USER2: These are labels that an application can set
on specific data. For example, it could set these bits when credit
card or social security numbers come out of a database. The
taint_checks_html policy for HTML output (see above) would then
be configured to disallow data with not only with the TC_HTML
property, but also with TC_USER1 or TC_USER2. Obviously some
polished user interface would need to be built on top of this to
make application-defined attributes usable.
Taint propagation policy
Before implementing the above policies, the first order of business
was adding taint propagation to the PHP core: for each operator,
including type conversion, a decision had to made how to propagate
taint from source operands to results.
The general taint propagation rules are:
-
Pure arithmetic and pure string operations propagate all the taint
bits from their operands to their results. The rules become more
complicated with operators whose operands have different types. -
Conversions from string to number or to boolean remove all but a
few taint bits (by default, only the TC_SELF bit stays).This prevents silly warnings about having to use
htmlspecialchars()
or mysql_real_escape_string() when rendering non-string data in
SQL/HTML/shell context, while still protecting the application
against some control hijacking attacks. -
Conversions from number or boolean to string preserve all the
taint bits. -
Comparison operators currently ignore taint bits.
Most of this taint propagation is done, but there are a few minor
issues that still need to be resolved.
-
Something needs to be done when functions like
parse_str()
are
given tainted data: the question is how to represent the taintedness
of the resulting hash table lookup keys. These strings could be
harmful when used in file names, in database names, or in other
critical information. -
Something needs to be done about null strings; it would be silly
to insist that data composed from a null string be converted with
htmlspecialchars()
etc. On the other hand, a null string does
change the syntactical structure of information, so we have to
be careful.
While adding taint propagation I found that a lot of PHP source
code fails to use the official macros when initializing a zval. In
these cases I added another line of code to initialize the taint
bits by hand. Also, more internal documentation (other than empty
man page skeletons) could have reduced development time significantly.
Major loose ends
I already mentioned the loose wires hanging out of the machine; the
user interface for taint policy control will need to be made more
suitable for people who aren't primarily interested in PHP core
hacking.
Support for tainted objects is still incomplete. In particular,
conversions from objects to non-objects may lose taint bits.
For now, I manually added taint support to a number of standard
built-ins (file, process, *scanf, *printf, and a subset of the
string functions) and extensions (mysql, mysli). I hope this will
be sufficient to get some experience with taint support.
Right now, taint-unaware extensions will work properly as long as
taint checks are disabled (the default), and as long as they are
recompiled with the patched PHP header files. When taint checking
is turned on, some extensions may cause false alarms when they fail
to use the official macros to initialize zval structures, thereby
leaving some taint bits at uninitialized values.
I still hope that it will somehow be possible to annotate extensions
so that taint support can be added without modifying lots of extension
source code. However, having multiple flavors of taint, instead of
just one, will make the job so much more interesting.
Distant future
Currently, only data is labeled (currently, only with binary
attributes). No corresponding attributes exist for sources and sinks
(files, network, databases, etc.). If we knew that a connection
is encrypted, or whether something is an intranet or extranet
destination, or some other property, then we could implement more
sophisticated policies than the simple MS-DOS like file permissions
that I have implemented now.
But all this is miles beyond the immediate problem that I am trying
to solve today: helping programmers find the holes in their own code
before other people do.
Wietse
Low-level implementation
Taint support is implemented with some of the unused bits in the
zval data structure. The zval is the PHP equivalent of a memory
cell. Besides a type (string, number etc.) and value, each zval has
a reference count and a flag that says whether the zval is a reference
to yet another zval that contains the actual value.
One possible problem area is that I'm using some of those very same
bits for storing GC information. I'm not using very many, so there
should be plenty to go around, but I'm currently using 3 bits in
is_ref. 2 for the "color" (the GC state of the object) and 1 for
whether or not the object is internally buffered. Just a heads up.
Some details: I have the leftmost bit in is_ref be the indicator of
is_ref. This is so I can test is_ref by just testing (z->is_ref >=
0x80). The reason I'm doing it like this is because I find that this
comparison test is a lot faster than doing a bitwise test. I then use
the other top three bits for what I need, so the bottom 4 bits are
free.
We really should conserve space, though. The biggest problem with
increasing the size of the zval struct seems to be (incredibly) L1
cache misses, as verified with cachegrind. Even increasing the size of
the current zval struct by one byte has a measurable impact (my goal
was to keep Zend/bench.php from showing any hit at all).
David
Hello Wietse,
Wednesday, October 3, 2007, 2:35:33 AM, you wrote:
Changing
E_WARNING
intoE_ERROR
causes execution to terminate, as
one would expect. These and other settings can appear in php.ini,
on the PHP command line, or they can be made withini_set()
calls
in the application itself.
Sounds like a pretty good solution to me.
Current status
My plan is to release something in the next few days, and to go
through a several iterations while adjusting the course based on
the feedback I receive.
The performance is good: the overhead for "make test" is within the
measurement error of 1-2% (when comparing user-mode CPU time). I
do realize that a portion of the total "make test" time is spent
in non-PHP processing, but with the current numbers it doesn't
really matter if taint overhead is 1% or 2%. If someone has a
better "macro" benchmark, then of course I'm interested.
Although the code is still green, its present form already shows
potential to do things like labeling privacy sensitive data (social
security or credit card numbers) and preventing such data from
flying out the window via HTML web pages.
~2% sounds even better to me.
Low-level implementation
- TC_MYSQL:
- By default, data with this bit is not allowed in mysql_query()
(parameter: taint_checks_mysql). This requirement is not
enforced with the default setting of taint_error_level = 0.- The mysql_real_escape_string() function produces output without
this bit.- By default, this bit is set on all data from the web (parameter:
taint_marks_egpcs), from DBMS (parameter: taint_marks_dbms) or
elsewhere (parameter: taint_marks_other).
- TC_MYSQLI:
- Like TC_MYSQL, except this one is for the
mysqli_query()
API.
The TC_MYSQLI bit is removed withmysqli_real_escape_string()
.
Couldn't we do a general SQL thing here? Or at leat a small set of different
SQL behaviors to capture all databases we support? However we should at
least have a verison that works for PDO/
- TC_SELF:
- By default, data with this bit cannot be used for internal
control operations such as eval, include, as file name argument,
as network destination, or in other contexts where someone could
take away control from the application (parameter:
taint_checks_self). This requirement disappears with
taint_error_level = 0.- I haven't yet implemented a dedicated conversion function;
to shut up the warnings, this data needs to be marked as
"safe" with a low-level untaint($foo, TC_SELF) call.- By default, this bit is set on all data from the web (parameter:
taint_marks_egpcs).
To me this sounds more like TC_PHP, TC_SELF just doesn't sounds right.
- TC_USER1, TC_USER2: These are labels that an application can set
on specific data. For example, it could set these bits when credit
card or social security numbers come out of a database. The
taint_checks_html policy for HTML output (see above) would then
be configured to disallow data with not only with the TC_HTML
property, but also with TC_USER1 or TC_USER2. Obviously some
polished user interface would need to be built on top of this to
make application-defined attributes usable.
Taint propagation policy
While adding taint propagation I found that a lot of PHP source
code fails to use the official macros when initializing a zval. In
these cases I added another line of code to initialize the taint
bits by hand. Also, more internal documentation (other than empty
man page skeletons) could have reduced development time significantly.
Right. The same is the problem with the GC patches. Once again I think we
should disallow all direct access to zval members by inserting different
prefixes to zval members. That way wrong usage would make all other compiles
fail. So you'd just need to check whether debug and non debug works and be
safe. Another approach is to hide the actual zval struct and only provide
access functions. But that is slower.
Best regards,
Marcus
(Wietse Venema) wrote:
To give an idea of the functionality, consider the following program
with an obvious HTML injection bug:<?php $username = $_GET['username']; echo "Welcome back, $username\n"; ?>
With default .ini settings, this program does exactly what the
programmer wrote: it echos the contents of the username request
attribute, including all the malicious HTML code that an attacker
may have supplied along with it.When I change one .ini setting:
taint_error_level = `E_WARNING`
the program produces the same output, but it also produces a warning:
Warning: echo(): Argument contains data that is not converted with `htmlspecialchars()` or `htmlentities()` in /path/to/script on line 3
A PHP application doesn't always generate HTML : it can generate JSON,
CSV, PDF etc.. In this case, we don't have to call htmlspecialchars etc..
Is this warning appearing also when you want to output datas other than
HTML ? If no, how your code guess the output type ? If yes, how can we
disable this warning in pages which produce JSON etc. ?
Laurent