Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:95244
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Received-SPF: pass (pb1.pair.com: domain ohgaki.net designates 180.42.98.130 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <7795ca21-bd70-fe65-9519-af95fdfee33f@gmail.com>
References: <CAGa2bXarw2jTj0yhXpywWNQOJLR2pdhfaqSdwj_btT2giEL15Q@mail.gmail.com>
 <CAGa2bXadVagTwBzEym9PZROmXu3SKxg=huCAQNK3k1-jZeG_pg@mail.gmail.com>
 <CAGa2bXY-vmEgEFrE0OxQPSx=FKHeHdCayrDXxtTkHUVaJbV+-A@mail.gmail.com>
 <CA+kxMuRiOBQpmTeKqNyV8rX0GKCLrYixi--y5TcYUkdqpT746w@mail.gmail.com>
 <CAGa2bXa78fkZw8gtepmBDu+VDwHW_WiDDv65=zzgLTFqYxF+DA@mail.gmail.com>
 <CADyq6s+Y-OeKEyMus5s7xp9MqY7Fq7udqd5BRwtyGXbYCY1Uag@mail.gmail.com>
 <CADyq6sJuHCrLOoL22MWHv+cfrQQx9z=avswCpDVmnuyBCAchsA@mail.gmail.com>
 <CADyq6sKs=e_1Pc_CCLuThsU-qsvRseQcpx=sWKB_uhWsg=aRmQ@mail.gmail.com> <7795ca21-bd70-fe65-9519-af95fdfee33f@gmail.com>
Date: Wed, 17 Aug 2016 08:09:12 +0900
Message-ID: <CAGa2bXZMrO__dVd=qVoKjuPmcvVJ-q4xa8XKc0+aj1kH7Xd2Fw@mail.gmail.com>
To: Stanislav Malyshev <smalyshev@gmail.com>
Cc: Marco Pivetta <ocramius@gmail.com>, Dan Ackroyd <danack@basereality.com>, 
	PHP Internals List <internals@lists.php.net>
Content-Type: text/plain; charset=UTF-8
Subject: Re: [PHP-DEV] Re: [RFC][VOTE] Add validation functions to filter module
From: yohgaki@ohgaki.net (Yasuo Ohgaki)

Hi Stas,

On Mon, Aug 15, 2016 at 2:17 PM, Stanislav Malyshev <smalyshev@gmail.com> wrote:
>> It seems there is misunderstanding.
>> These new functions are intended for "secure coding input validation" that
>> should never fail. It means something unexpected in input data that
>> cannot/shouldn't keep program running. Why do you need to parse
>> message?
>
> I think the problem here is as follows: assume you accept use input. You
> want it to conform to some set of rules. If it does not, you may want to
> inform the user that the input is wrong, in an informative way. Now, if
> you say these functions "should never fail", it implies that before
> them, there would be other functions filtering user input (because user
> input could always violate whatever rules you'd have) - and then the
> question is, would you really want *two* sets of validators? You'd
> probably want one.
> Now, when you have one, you probably want it to validate the data and
> return some information that would be useful for informing the user what
> has gone wrong. That seems to be the issue here.
> I do think having strong input validation is a good thing. However, we'd
> also need to have them in a way that would make them useful in above
> scenario - otherwise people would avoid them because they fail "too
> hard" and the app does not retain enough control over the outcome.

I think this discussion relates to the questions that follow.
I'll try to explain there.

>
>> There is misunderstanding on this.
>> As I wrote explicitly in the RFC, input validation and user input
>> mistakes must be handled differently.
>>
>> "The input validation (or think it as assertion or requirement) error"
>> that this RFC is dealing, is should never happen conditions (or think
>> it as contract should never fail).
>
> This is what I'm not sure I understand - when this approach would be
> used? I.e. if I get data from the user, I surely can not claim I can
> impose any conditions on the data that would never fail. Is it assumed
> I'd pre-filter the data before passing it to this filter?

Which rules can be imposed on inputs, and how, depends on what kind of
data is expected from outside the software, including from human users.

Let's say your app validates the user-written/chosen "Date" on the client
side with JavaScript. Then the browser must send the "Date" in whatever
format you impose on the client. It may be "YYYYMMDD", for example.

Then the programmer should not accept any "Date" format other than
"YYYYMMDD", because any other format is invalid. Accepting other formats
does only harm and increases the risk of the program malfunctioning, e.g.
through all kinds of injections: JavaScript, SQL, null char, newline, etc.
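
As a minimal sketch of such a strict check (the function name
is_valid_yyyymmdd() is hypothetical and not part of the RFC; it uses only
preg_match() and checkdate(), which already exist in PHP):

```php
<?php
// Hypothetical helper: accept only the "YYYYMMDD" format that the client
// is required to send, and nothing else.
function is_valid_yyyymmdd(string $value): bool
{
    // \A and \z anchor the whole string. Unlike $, \z does not tolerate
    // a trailing newline.
    if (preg_match('/\A[0-9]{8}\z/', $value) !== 1) {
        return false;
    }
    $year  = (int) substr($value, 0, 4);
    $month = (int) substr($value, 4, 2);
    $day   = (int) substr($value, 6, 2);
    // checkdate() rejects impossible calendar dates such as 20160230.
    return checkdate($month, $day, $year);
}
```

Anything that fails this check could not have come from the unmodified
client, so it can be rejected without a detailed message.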

The basic idea of secure coding input validation is to remove all unnecessary
security risks at "Input Validation".

Even when the "Date" field is a plain <input> in which the user can write
any chars, null chars, CR/LF, TAB, or any other CNTRL chars should not be
in there. No user will type 100 chars into a "Date" field unless they are
trying to tamper with the application.
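
A rough sketch of that kind of check for a free-text field follows. The
helper name require_plain_text() and the 100-byte limit in the usage note
are assumptions for illustration only, not the functions proposed in the
RFC:

```php
<?php
// Hypothetical helper: throw on input that no legitimate user could
// produce. The message is deliberately uninformative.
function require_plain_text(string $value, int $maxBytes): string
{
    if (strlen($value) > $maxBytes) {
        throw new UnexpectedValueException('invalid input');
    }
    // Reject NUL, CR, LF, TAB and every other ASCII control character.
    if (preg_match('/[\x00-\x1F\x7F]/', $value) === 1) {
        throw new UnexpectedValueException('invalid input');
    }
    return $value;
}
```

For example, require_plain_text($date, 100) would return the value
unchanged for "2016/08/17" and throw for a value containing "\0" or for
a 101-byte string.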

"Input validation" should reject all of them and does not have to inform
users (attackers) that "there is invalid input". If you need to tell
legitimate users "There is invalid input", that should be handled by
"Business logic", not by "Input validation".

>
>> The point of having the input validation is accept only inputs that
>> program expects and can work correctly. Accepting unexpected
>> data that program cannot work correctly is pointless.
>
> Well, that depends on what you mean by "accepting". The program should
> exhibit sane behavior (i.e., useful error message, not whitescreen or
> something like that) on bad input. That behavior can be different -
> i.e., if you are given wrong password, you shouldn't be too helpful and
> say "this password is wrong, the right password is this: ...." (you'd
> laugh but there *was* a real application doing this, no, I have no idea
> what the developers were thinking :) but at least you could say
> "authentication details are wrong".

User authentication can treat "User name" and "Password" much like the
"Date" field.

"User name" and "Password" shouldn't contain CNTRL chars or invalid char
encoding. Even when the fields are plain <input>, there shouldn't be
500-char-long inputs for them.

Anything else for "User name" and "Password" should be handled by
"Business logic". The logic part should display nice and proper error
messages like

 - User name is too long; the limit is 100 chars.
 - Password is too long; the limit is 100 chars.
 - User name and/or Password is wrong and failed to authenticate.
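
The two layers could be sketched like this. Both function names are
hypothetical, and $authenticate stands in for whatever credential check
the application really uses:

```php
<?php
// Input validation layer (hypothetical helper): hard limits that no
// legitimate user hits. A failure here is not explained to the client.
function is_acceptable_credential(string $value): bool
{
    return strlen($value) <= 500                          // far above any real length
        && preg_match('//u', $value) === 1                // fails on broken UTF-8
        && preg_match('/[\x00-\x1F\x7F]/', $value) !== 1; // no CNTRL chars
}

// Business logic layer: friendly messages for legitimate mistakes.
function login_errors(string $user, string $password, callable $authenticate): array
{
    $errors = [];
    if (strlen($user) > 100) {
        $errors[] = 'User name is too long; the limit is 100 chars.';
    }
    if (strlen($password) > 100) {
        $errors[] = 'Password is too long; the limit is 100 chars.';
    }
    // Only try to authenticate when the basic checks passed.
    if ($errors === [] && !$authenticate($user, $password)) {
        $errors[] = 'User name and/or Password is wrong.';
    }
    return $errors;
}
```

A 300-char user name passes the first layer and gets a proper message
from the second; a user name containing a null char never reaches the
second layer at all.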


>> Don't misunderstood me. I'm not saying "You should reject user input
>> mistakes".
>> "User input mistakes" and "input validation error" is totally different
>> error.
>
> Here, again, I am not sure I understand the difference.

The reason I propose dividing input error checks into "Input validation"
and "Business logic" is simplicity and maintainability.

"Input validation" should be done not only for human-entered inputs, but
also for inputs generated automatically by the system.

Generally speaking, developers should not accept a request that has any of
the following.

Invalid browser headers:
 - REFERER that contains illegal/CNTRL chars and/or too many chars.
 - ACCEPT-CHARSET that contains illegal/CNTRL chars and/or too many chars.
 - ACCEPT-ENCODING that contains illegal/CNTRL chars and/or too many chars.
 - ACCEPT-LANGUAGE that contains illegal/CNTRL chars and/or too many chars.
 - and so on.
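
A header check along these lines might look like the sketch below. The
function name and the byte limits are assumptions picked for
illustration, and the keys follow the usual $_SERVER naming:

```php
<?php
// Hypothetical helper: returns the names of headers that are present in a
// $_SERVER-like array but contain illegal chars or are too long.
function invalid_headers(array $server): array
{
    // Assumed limits: header key => maximum byte length.
    $limits = [
        'HTTP_REFERER'         => 2048,
        'HTTP_ACCEPT_CHARSET'  => 256,
        'HTTP_ACCEPT_ENCODING' => 256,
        'HTTP_ACCEPT_LANGUAGE' => 256,
    ];
    $invalid = [];
    foreach ($limits as $name => $maxBytes) {
        if (!isset($server[$name])) {
            continue; // absent headers are not judged here
        }
        $value = $server[$name];
        $ok = is_string($value)
            && strlen($value) <= $maxBytes
            && preg_match('/\A[\x20-\x7E]*\z/', $value) === 1; // printable ASCII only
        if (!$ok) {
            $invalid[] = $name;
        }
    }
    return $invalid;
}
```

A non-empty result from invalid_headers($_SERVER) would be a reason to
reject the whole request.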

Invalid POST/GET request:
 - Lacks a field required by your program, e.g. you always set a CSRF
   token for POST, but it's missing.
 - Multi-page form inputs that lack data, or contain invalid data, that
   should have been validated on a previous page. Note: where and how to
   deal with such invalid inputs is a design choice.
 - Program-written data is invalid, e.g.
   //php.net/show_bug.php?id=[string containing CNTRL chars and/or 100
   chars or more]
 - $_POST/$_GET has more than 20 elements. Note: most apps/code would
   not have this many elements.

Invalid COOKIE:
 - $_COOKIE has more than 20 elements. Note: normal apps would not
   have this many cookies.
 - Lacks a field required by your program.
 - Contains invalid chars, e.g. CNTRL chars.
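
The element-count, required-field, and program-written-data checks above
could be sketched as follows. The function names, the "csrf_token" key in
the usage note, and the 10-digit id limit are assumptions for
illustration:

```php
<?php
// Hypothetical check for a $_POST/$_GET/$_COOKIE-like array: few enough
// elements, and every required field present as a non-empty string.
function has_valid_shape(array $params, array $requiredKeys, int $maxElements = 20): bool
{
    if (count($params) > $maxElements) {
        return false;
    }
    foreach ($requiredKeys as $key) {
        if (!isset($params[$key]) || !is_string($params[$key]) || $params[$key] === '') {
            return false;
        }
    }
    return true;
}

// Program-written values such as a bug id should match exactly what the
// program itself emits: here, a positive integer of at most 10 digits.
function is_valid_id(string $value): bool
{
    return preg_match('/\A[1-9][0-9]{0,9}\z/', $value) === 1;
}
```

For a POST handler this might be called as
has_valid_shape($_POST, ['csrf_token']), and the same function could be
applied to $_COOKIE with its own list of required keys.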

All of these have a history of abuse by attackers, and programs should
not accept them. Please note that secure coding also requires secure
output. Input validation and output sanitization should be treated as
separate tasks, e.g. escape all variables in the "Output" code when you
output something to other software. Never assume, "This var is
validated at input, so it is safe without escaping."
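
For HTML output, that separation might be sketched as below. The helper
names are hypothetical; htmlspecialchars() is the existing PHP function:

```php
<?php
// Escape at the point of output, no matter what validation ran earlier.
function html_escape(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// Even a name that passed input validation is escaped when rendered.
function render_greeting(string $validatedName): string
{
    return '<p>Hello, ' . html_escape($validatedName) . '</p>';
}
```

Other output targets (SQL, shell, HTTP headers) need their own escaping
or parameter binding in the same way.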

It's the developer's choice how to validate inputs, e.g. they may not use
the "CONNECTION" HTTP header at all and not care about it, but all of the
secure coding guides that I know of recommend/require validating
"all inputs".

Validating inputs that are irrelevant to "Business logic" inside the
business logic makes programs complicated and hard to maintain. Broken
char encoding, too long/short values, and CNTRL chars in <form> inputs
are better handled by "Input validation", because the same checks would
otherwise be repeated across different <form>s.

There are many possibilities for software design. This RFC is designed
to encourage certain validation. However, it does not force developers
to do any particular validation; it provides the tools that validation
needs.

I would not encourage users to disable exceptions from
filter_require_var()/filter_require_var_array(), but as a last-minute
change I've made raising the exception optional. This allows developers
to use the new validators for wider purposes.
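
To illustrate the idea of optional exceptions only, here is a stand-in
sketch built on the existing filter_var(). It is NOT the proposed
filter_require_var(); the wrapper name require_int() and its $throw
parameter are assumptions:

```php
<?php
// Stand-in sketch, not the RFC's API: wraps the existing filter_var() so
// that a failed validation either throws or returns null.
function require_int($value, int $min, int $max, bool $throw = true): ?int
{
    $result = filter_var($value, FILTER_VALIDATE_INT, [
        'options' => ['min_range' => $min, 'max_range' => $max],
        'flags'   => FILTER_NULL_ON_FAILURE, // null instead of false on failure
    ]);
    if ($result === null && $throw) {
        throw new UnexpectedValueException('input validation failed');
    }
    return $result;
}
```

With $throw left at true, a bad value stops the request; with $throw set
to false, the caller gets null and can decide what to do.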

Regards,

P.S. I'll extend the vote period because there is ongoing discussion.

BTW, ISO 27000/ISMS requires/recommends the proposed input validation.
The latest ISO 27000 mentions it as "adopt secure programming". Older
ISO 27000 explained how to validate inputs. The new ISO 27000 removed
the detailed explanation of input validation methods because secure
programming is widely adopted and standardized.

--
Yasuo Ohgaki
yohgaki@ohgaki.net