8 years ago by Martin "eto" Misuth

This is a concept for an RFC to improve the filter extension further.

The aim of this RFC is to give the filter extension a small nudge.

Introduction:

Status quo

Status quo: input "variables" source control

The author of this RFC is one of those "three weird people on the internet"
who (for their own php projects) almost completely disable "automagic"
request variable registration in php.ini, by setting:
<code>
variables_order = "CS"
request_order = ""
</code>
or even:
<code>
variables_order = "S"
request_order = ""
</code>
depending on deployment type. That way, most "magic" variables are not
even created. This roughly equates to the "deny by default" principle.

For example, because the REQUEST source for the filter extension is not even
implemented yet, it is always crystal clear which "input source"
each variable is fetched from. No programmer on the team can accidentally
mix or swap two variables' sources, unless consciously trying.

Some processing time is also "shaved off", because interpreter is not doing
any (useless) string processing for variables, that don't even make sense for
given request handler (php script behind given URL). Each request handler is
expected to be fully aware of variables it requires for further processing.

Status quo: input variable filtering

All variables entering the php application are filtered through
filter_*() calls, making great use of this extension.

For the majority of input values, default filters are sufficient and are
heavily used, but for some variables, specialty filters are needed.
For those, the currently provided FILTER_CALLBACK filter is suboptimal,
as it occupies the whole 'options' field with the callable, "breaking
generalisation" of the filter API and leaving no way to pass custom
options to the callback.
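
To illustrate the limitation, here is a minimal sketch of how a configurable callback filter has to be written against the current API. Because 'options' must hold the callable, the range limits are smuggled in through a closure. The helper name make_range_filter() is hypothetical and not part of the filter extension.

```php
<?php
// Hypothetical helper: binds min/max into a closure, because FILTER_CALLBACK
// leaves no room for custom options next to the callable.
function make_range_filter(int $min, int $max): callable
{
    return function ($value) use ($min, $max) {
        $int = filter_var($value, FILTER_VALIDATE_INT);
        if ($int === false || $int < $min || $int > $max) {
            return null; // treat non-integer or out-of-range input as invalid
        }
        return $int;
    };
}

// With FILTER_CALLBACK, the 'options' key carries the callable itself.
$ok  = filter_var('50', FILTER_CALLBACK, ['options' => make_range_filter(-1, 64)]);
$bad = filter_var('100', FILTER_CALLBACK, ['options' => make_range_filter(-1, 64)]);
```

Every distinct min/max pair needs its own closure instance, which is the kind of workaround the proposal tries to make unnecessary.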

Status quo: 'options' confusion

Because the functions' parameter is named '$options', and the array passed
in that parameter has a key that is also named 'options', there is some
confusion among users about how to "construct" filters.
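
A short sketch of the confusion with the current API: filter-specific settings only take effect when nested under the 'options' key. Settings placed at the top level of the $options argument are ignored.

```php
<?php
// Filter-specific settings belong under the 'options' key of the $options argument.
$nested = filter_var('7', FILTER_VALIDATE_INT, [
    'options' => ['min_range' => 1, 'max_range' => 5],
]);

// Common mistake: settings placed at the top level are silently ignored,
// so the range check never runs and the value passes.
$flat = filter_var('7', FILTER_VALIDATE_INT, ['min_range' => 1, 'max_range' => 5]);
```

In the first call the value fails the range check; in the second it is accepted, without any warning about the misplaced keys.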

Status quo: array/object duality

In php, both array and object (properties) are essentially built
around the core HashTable structure. From a simplistic "viewpoint",
an object can be seen as a "glorified" array. This is an advantage.
It allows one to pass an object as a parameter to functions that
accept arrays.

Although php is equipped with "interface" and "trait", both are
orthogonal and quite useless when an object is used "as array".
"interface" is missing machinery to express public properties,
"trait" is not a standalone entity on its own.

However, none of this is a problem if we consider an object as a special
case of an array. When an object is used as an array in php, one can
"intuitively" assume a structural (property based) type
system. This is a feature.

One just needs to pass an object, without having to muck around
with interfaces and whatnot. If object has required public
properties set, it is processed as such, if not, it's same, as
if array, with keys missing or keys having null values, was provided.

Magic properties, and other 'specials', are not usually processed.
By defining a class, one can easily enforce that required fields
exist, even if null. Many array consuming APIs can consume an object
of any class. Thanks to this, these APIs end up pretty general.

Unfortunately current filter extension doesn't allow use of objects
(instead of array) in all contexts.
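
As an illustration of what this costs today, the sketch below converts an object-based definition into the nested array form that filter_var() currently expects. The helper name definition_to_array() is hypothetical; it is exactly the kind of conversion layer that native object support would remove.

```php
<?php
// Hypothetical helper: recursively converts a stdClass-based definition into
// the nested array form that filter_var() currently expects.
function definition_to_array($definition): array
{
    $out = [];
    foreach ((array) $definition as $key => $value) {
        $out[$key] = $value instanceof stdClass ? definition_to_array($value) : $value;
    }
    return $out;
}

$dfn = new stdClass();
$dfn->filter  = FILTER_VALIDATE_INT;
$dfn->options = new stdClass();
$dfn->options->min_range = 1;
$dfn->options->max_range = 10;

$as_array = definition_to_array($dfn);
$inside   = filter_var('5', $as_array['filter'], $as_array);
$outside  = filter_var('50', $as_array['filter'], $as_array);
```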

Proposed improvements:

1. Introduction of filter 'definition' concept and parameters cleanup.
2. Introduction of new 'callback_extended' filter
   (while keeping compatibility with old code).
3. Ability to consume both arrays and objects in 'definition'
   parameters.

1. Introduction of filter 'definition' concept and cleanup.

The ambiguous parameter $type is renamed to $input_source.
Parameter $filter is renamed to $filter_or_definition.
Parameter $options is renamed to $definition.
For each function call where an 'int $filter' value is currently
passed, new logic for processing $filter_or_definition
is employed:

Parameter $filter_or_definition can be of type (int), (array)
or (object).

Filter "usability" validation algorithm is as follows:

check if $filter_or_definition is an (int)
- if yes, it is expected to be filter_id
  - if $definition is passed it is processed
    same way as $options were before
check if $filter_or_definition is (array|object)
- if yes, check it for an (int) property/key 'filter'
  - if present, filter_id is extracted from the
    $filter_or_definition->filter property (or
    'filter' key) and the filter "definition" is
    considered "usable"
    - internally $definition is made to point to
      $filter_or_definition, and the $definition
      passed in the function call is ignored.
for everything else, the function call fails

Further processing continues as currently, extracting flags
and so on. Modification of C function code is minor.
Because none of these parameters is named $options anymore,
confusion is lessened.
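
The validation algorithm above can be sketched in userland as follows. This is only an emulation under stated assumptions: the function name resolve_filter() and the choice of InvalidArgumentException for the failure case are hypothetical, and the real logic would live in the extension's C code.

```php
<?php
// Hypothetical userland emulation of the proposed resolution logic.
// Returns [filter_id, definition] or throws when the input is unusable.
function resolve_filter($filter_or_definition, $definition = null): array
{
    // Case 1: a plain filter id; $definition is treated as $options was before.
    if (is_int($filter_or_definition)) {
        return [$filter_or_definition, $definition];
    }

    // Case 2: an array or object carrying an (int) 'filter' key/property.
    if (is_array($filter_or_definition) || is_object($filter_or_definition)) {
        $fields = (array) $filter_or_definition;
        if (isset($fields['filter']) && is_int($fields['filter'])) {
            // The definition itself replaces any separately passed $definition.
            return [$fields['filter'], $filter_or_definition];
        }
    }

    // Everything else: the call fails.
    throw new InvalidArgumentException('Unusable filter definition');
}

$from_int   = resolve_filter(FILTER_VALIDATE_INT, ['flags' => FILTER_NULL_ON_FAILURE]);
$from_array = resolve_filter(['filter' => FILTER_VALIDATE_BOOLEAN], ['ignored' => true]);
```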

Function signatures thus become:
<code>

filter_has_var(int $input_source, $variable_name)

filter_input(int $input_source,
string $variable_name,
$filter_or_definition,
$definition = null)

filter_var($variable,
$filter_or_definition,
$definition = null)

filter_input_array(int $input_source,
$definition = null,
$add_empty = true)

filter_var_array(array $data,
$definition = null,
$add_empty = true)

filter_list()

</code>

filter_input call now has following possible invocations:
<code>

filter_input(INPUT_GET, 'MY_VAR',
    FILTER_VALIDATE_BOOLEAN);

filter_input(INPUT_GET, 'MY_VAR',
    FILTER_VALIDATE_BOOLEAN,
    FILTER_NULL_ON_FAILURE);

filter_input(INPUT_GET, 'MY_VAR',
    ['filter'  => FILTER_VALIDATE_BOOLEAN,
     'flags'   => FILTER_NULL_ON_FAILURE,
     'options' => ['default' => false]
    ]);

// The same definition expressed as an object.
$defn = new stdClass();
$defn->filter = FILTER_VALIDATE_BOOLEAN;
$defn->flags = FILTER_NULL_ON_FAILURE;
$defn->options = new stdClass();
$defn->options->default = false;

filter_input(INPUT_GET, 'MY_VAR', $defn);
</code>

2. Introduction of new 'callback_extended' filter

A new filter, FILTER_CALLBACK_EXTENDED ("callback_extended"),
is introduced. It expects a 'definition' of this shape:
<code>

$defn = [
    'filter'   => (int) FILTER_CALLBACK_EXTENDED,
    'flags'    => (int) FILTER_NULL_ON_FAILURE,
    'callback' => (callable) $callable_ex,
    'options'  => [
        'default' => 42,
        'min'     => -1,
        'max'     => 64,
    ],
];
</code>

This filter has new id (FILTER_CALLBACK_EXTENDED=FILTER_CALLBACK++).

Instead of "abusing" field 'options', for storing callable, it
inspects 'definition' itself, searching for new field 'callback',
that is "outside" of 'options' subcomponent. Field 'options' is
passed as is, as second parameter to $callable_ex callable.

Thus callable prototype call looks like this:
$filtered_value = $callable_ex($value, $options)
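
A callable with that prototype could look like the sketch below. Since FILTER_CALLBACK_EXTENDED does not exist in current PHP, the dispatch is emulated by a hypothetical userland function, filter_var_callback_extended(); clamp_int() is likewise an illustrative name. Flags handling is left out.

```php
<?php
// A callable following the proposed prototype: ($value, $options).
function clamp_int($value, $options)
{
    $int = filter_var($value, FILTER_VALIDATE_INT);
    if ($int === false || $int < $options['min'] || $int > $options['max']) {
        return $options['default'];
    }
    return $int;
}

// Hypothetical userland stand-in for the proposed filter: it reads 'callback'
// from the definition itself and passes 'options' through as second argument.
function filter_var_callback_extended($value, array $defn)
{
    return call_user_func($defn['callback'], $value, $defn['options'] ?? []);
}

$defn = [
    'callback' => 'clamp_int',
    'options'  => ['default' => 42, 'min' => -1, 'max' => 64],
];

$in_range  = filter_var_callback_extended('7', $defn);
$out_range = filter_var_callback_extended('100', $defn);
```

The same clamp_int() can serve any number of variables, each with its own 'options', without creating a new closure per configuration.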

This design immensely(!) simplifies development of
'per input variable type' configurable callback filters.

It also allows user to tie everything related to variable
filtering, validation and sanitization with single unified
API interface provided by filter extension.

In essence, in case of FILTER_CALLBACK_EXTENDED, filter identity
is unique value, actually composed from two subvalues:
filter_id (FILTER_CALLBACK_EXTENDED) and
$callback callable signature.

Besides allowing "huge form" processors that use a nested
$definition array, as in the filter_input_array() API,
it allows other, much more flexible uses.
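
For reference, this is roughly what a nested definition looks like with the current API, using filter_var_array() so the sketch runs without request data. The field names and the sample input are made up for illustration. Note how the callback entry has to give up its 'options' slot to the callable.

```php
<?php
// Sample input standing in for request data (illustrative values only).
$data = ['age' => '31', 'email' => 'not-an-email', 'zip' => ' 811 01 '];

$definition = [
    'age'   => ['filter' => FILTER_VALIDATE_INT,
                'options' => ['min_range' => 0, 'max_range' => 150]],
    'email' => FILTER_VALIDATE_EMAIL,
    // FILTER_CALLBACK: 'options' is the callable, nothing else fits here.
    'zip'   => ['filter' => FILTER_CALLBACK,
                'options' => function ($value) {
                    return preg_replace('/\s+/', '', $value);
                }],
    'missing' => FILTER_DEFAULT,
];

// Keys absent from $data are added as null because $add_empty defaults to true.
$result = filter_var_array($data, $definition);
```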

For example, if using objects to store filter 'definitions',
highly expressive, "composable" and reusable "filter libraries"
can be constructed:
<code>

$def_v1 = (object) [
'filter' => FILTER_CALLBACK_EXTENDED,
'callback' => $filter_v1_handler,
'options' => (object) ['x'=>1,'y'=>2],
];
$def_v2 = (object) [
'filter' => FILTER_CALLBACK_EXTENDED,
'callback' => $filter_v2_handler,
];
$def_s1 = (object) [...];
$def_s2 = (object) [...];

$usr_validating_filters = [UVFLT_1=>$def_v1, UVFLT_2=>$def_v2];
$usr_sanitizing_filters = [USFLT_1=>$def_s1, USFLT_2=>$def_s2];

filter_input(INPUT_GET, 'MY_VAR', $usr_validating_filters[UVFLT_1]);
</code>

By moving the callback's callable storage outside of the 'options' component,
proper semantic separation is achieved, and a sensible hierarchy of the filter
'definition' is maintained, while at the same time the callback is allowed
much needed invocation customisations.

Actual implementation is relatively straightforward, requiring only one
new internal function, while reusing much of the filter extension
machinery already present (with slight modification).

3. Ability to consume both arrays and objects in 'definition' parameters.

Extension code was reread, and what could be called 'definition' processing,
was modified, to allow both array and object consumption, by means of
HASH_OF() macro.

Conclusion

Experimental implementation seems pretty usable, passing all current
ext/filter/tests (with small modifications due to modified semantics).

More experiments are to be done, especially stress testing memory access
for usage and corruption. So far debug+maintainer-zts builds
have not found problems.

Logic, usability and compatibility were prioritised over performance.

Still, some small performance gains might actually be observed,
as parameter parsing was converted to FAST_ZPP, especially
for a high cadence of successive filter_has_var() calls.
However, no effort was made on this front.

Compared to advantages gained, code changes are relatively minor.

Attempt was made to maintain backwards compatibility, when using
FILTER_CALLBACK, although users will be suggested to "upgrade" to
FILTER_CALLBACK_EXTENDED.

Hidden errors in legacy scripts, due to the change in $filter
(now $filter_or_definition) parameter processing, have not been evaluated,
and are considered severe bugs anyway. $filter should have
been an (int).

Reflection API using sniffers will break (if they expect certain
filter API layout), but that is expected (or should be expected),
by reflection consumers and thus is not considered a problem.

Probably nobody should be using the Reflection API to drive a 'decision
tree' in production code that is invoked several hundred (or thousand)
times per second.

Should this RFC pass, filter documentation is going to be updated
to match new semantics.

By posting this draft, I am asking for comments.

Should this draft be considered worth inclusion among RFCs, I am asking
for karma, to be able to add it into wiki.

After that, git fork will be provided, for reviewers, to evaluate the code.

After successful review, I am asking for final voting.

My intended upstream inclusion target window is "before" PHP_7.2.

However, I am not as interested in speed of inclusion
as I am in sensibly improving the quality of the (awesome) filter extension.

It would be great, if it went through, given advantages it has for userland
consumers.

Thank you for reading and consideration, in advance.

eto

8 years ago by Dan Ackroyd

By posting this draft, I am asking for comments.

What is the argument for doing this as part of PHP core, rather than
doing it as a userland library?

cheers
Dan

(apologies for possible duplicate reply)

8 years ago by Martin "eto" Misuth

By posting this draft, I am asking for comments.

What is the argument for doing this as part of PHP core,

Sorry, when asked to explain myself, I am prone to "infodumping"
and am often told that I sound quite "combative", as if having
low self-esteem. While that is not my conscious intention, I
apologise if this is hard to read because of that.

rather than doing it as a userland library?

cheers
Dan

(apologies for possible duplicate reply)

We actually have userland library exactly like that. By library
I mean "shelves" of functions doing various things to input vars.

I believe my proposal would make maintaining such library
much easier while making filter extension "better", still
keeping backward compatibility at the same time.

First I will explain our situation:

We have "CMS system" operating several thousand domains by now.
Some domains have thousands of subpages and relatively big images.
Origins date back more than decade. System is currently operating
at php 7.1. As such, it is not really that "big", but it's not
that "small" either. It acts somewhat like "static site" generator,
but with a twist.

Instead of using markdown or another more "current" templating
language, the whole thing is built on top of "old" xml data files
and an xslt preprocessor. More "dynamic" data are stored as json fragments.

At administration side, php acts like kind of mixer/preprocessor.
It prepares xml variables, xml data files, json data files
and db sources and transforms the data through xslt templates into "the site".
Pages are processed in batches.

Even though xslt pipeline might not be fastest there is, the point is,
eventually static site output is generated (with few special dynamic handlers).
This is "published" to "outside servers".

There, data is "handed out" to clients very quickly, but naturally some
requests still have to be processed. Or said other way: what can be served
statically is served as such, with dynamic stuff handling only special cases.

This dynamic stuff includes ajax handlers, form handlers and search.

Xml and xslt combination proved to be extremely resilient,
withstood test of time, and survived various webdesign trend shifts
very well. Many customers have some xml data files (containing content)
with mtimes dating several years back.

Users have completely dynamic web administration at their disposal,
that allows them to construct sites, point and click.

One of the major features is ability to build arbitrary forms.

As you can imagine, relatively simple xml data file, with 50-200 "virtual
controls", can be very easily "blown up" by xslt processor, into quite
complex html markup containing ajaxy controls, "subwidgets",
plenty of css doodads and whatnot, scattering various parts of "the
thing" into various targets (markup body, head, external scriptfiles,
cssfiles, php array filter template, and so on). Thus output
forms can get very complex, very quickly.

Major customers, that are getting major hits, as well, also love to
make these things huge (them being so easy to make).

Most handling of this output naturally happens on client side, at least
until one submits.

For that, we implemented a universal form handler that does nothing more
than process these (sometimes humongous) forms, using 'filter
definitions' cached in deeply nested arrays. On submit it also
does substitutions in the final output template.

Due to mentioned user editability, any form can end up having many
specialty validators, that have to be implemented using callbacks.

These validators and sanitizers are, for example, for zip codes of various
countries (these things are not compatible with each other and each one
can be a special snowflake), for their various display modes, or for various
resource schemes and stuff like that.

So that is the current situation. I apologise for this overly verbose
description, but I hope it conveyed our situation as clearly as possible.

The whole system is not perfect, but it's not completely horrible either;
there seems to be something to it.

Now, the thing is:

Even if I store "form definition" arrays in some memory cache (I tried apcu
and ramdisk), so that their instantiation is relatively quick, at rate of
requests we are getting, I see with debugging output enabled, that we are
constantly marshalling and unmarshalling things from definition arrays
(roughly mapping to xml tree), to accommodate various
filter quirks (especially callback).

I am reminding you that this is happening on the "output" side of things, where there is no
reason to burn cycles generating markup unless the form is completely validated
(for confirmation display).

It would be immensely valuable to me if we could have the filter extension
just intelligent enough that it would process all request data
according to a template, from each input source in succession, on its own,
in one go (for each input source), only occasionally consulting a callback,
instead of building a layer of hacks on top of it (filter) that
marshal and unmarshal data from a "nice" structure to a "badly nested" one,
using foreach, array_walk or whatever.

I believe this '$definition concept' + 'callback_extended' handler allows one
to do exactly this.

This way, filter can walk the hierarchy on its own, while the same filter
definition "object" (it's really just a hierarchy of hashes) can be passed
to filter_input|var() directly.

Anyway, I also believe, this modified behaviour would be useful to other
filtering library writers as well, otherwise I wouldn't, naturally, bother
you here.

Some filter based userland libraries are ridiculous (using classes
with FILTER_CALLBACK_METHODS_LIKE_THIS, or massively abusing regexps,
with catastrophic backtracking and such), just because callback (I came to
believe) sucks. It doesn't mix well with other filters that can take options
and flags. Also, there is no way to "pull" the filter currently used at a given
definition array index and hand it to a single filter_var() call (as in the
case of ajax) without transforming its structure.
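
A minimal sketch of the transformation meant here, assuming a definition array in the shape filter_var_array() accepts. The helper name filter_one() and the sample form are hypothetical. An entry may be a bare filter id or a nested array, so a single-value call has to normalise it first.

```php
<?php
// Illustrative definition array, as it would be passed to filter_var_array().
$form = [
    'age'   => ['filter' => FILTER_VALIDATE_INT,
                'options' => ['min_range' => 0, 'max_range' => 150]],
    'email' => FILTER_VALIDATE_EMAIL,
];

// Hypothetical helper: applies one entry of such a definition to one value.
function filter_one($value, $entry)
{
    if (is_int($entry)) {
        // Bare filter id: no flags or options to forward.
        return filter_var($value, $entry);
    }
    // Nested entry: the id lives under 'filter', the rest is passed as $options.
    return filter_var($value, $entry['filter'] ?? FILTER_DEFAULT, $entry);
}

$age   = filter_one('31', $form['age']);
$email = filter_one('user@example.com', $form['email']);
```

With the proposed $filter_or_definition parameter, the entry could be handed to filter_var() as it is, without this wrapper.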

Although default filters are pretty powerful on their own, at certain point,
you must resort to hacks like that, to cope with specialty corner cases.

If anything, handling $filter_or_definition costs a single
Z_TYPE_P(filter_or_definition) == IS_LONG test, keeping filter stuff working as
before. Once processing gets past that, I don't know which is worse: searching
for filter_id in a hash on the C side, or building a marshalling layer on top of
it in php. In my eyes, the php userland layer is the wrong place for this.

Experimenting with this, I was surprised, how relatively small amount of C
level modification was needed, to blow off many lines of my php hacks.

But I consider myself a pretty dumb person, so there is a high possibility that
I am doing something wrong or seeing this in the wrong light.

I am not personally confident about my approach either, but I honestly do
believe, that overall concept is pretty sound.

As I am no wizard coder either, I would be very grateful for reviewers.

If the general opinion is that this is a crappy approach overall, I am okay
with being told NO. No problem.

Anyway this sums my opinion and rationale. I hope you are still here.

As you see, I am by default ready for a negative conclusion regarding
this rfc draft, but I am also curious why this might be perceived as
problematic. Is it because of potential compatibility breakage, or because it
goes against the spirit of the filter extension too much?

Thank you in advance for reading and for any reply, and sorry for the
long post; I hope it answers most questions.

eto

8 years ago by Dan Ackroyd

would make maintaining such library much easier

No. Maintaining stuff in PHP core is much more difficult than
maintaining it in userland. 'Maintaining' doesn't just mean fixing
bugs, it includes thinking about how any changes to APIs get rolled
out to users, while minimising BC problems, and keeping code
up-to-date with the PHP engine.

These things are much easier to do in userland than they are in PHP core.

It would be immensely valuable to me, if we could have filter extension
just intelligent enough, that it would process all request data
according to template, from each input source in succession, on its own,
in one go

While I can see that could be useful to you the question, again, is
why can't this be done in userland?

Even if I store "form definition" arrays in some memory cache (I tried apcu
and ramdisk), so that their instantiation is relatively quick,

If you want data/code to be retrieved very quickly, I'd recommend
generating PHP code, and then letting OPCache, cache that code, as
well as possibly use a classloader that is a good fit for your use
case, like this one: https://github.com/Danack/LowMemoryClassloader/
rather than a generic classloader. It loads classes that are already
present in OPCache with zero system calls.
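
A minimal sketch of that suggestion, under stated assumptions: the definitions contain only scalars and arrays (closures cannot be exported with var_export()), and the cache path below is a throwaway temp file chosen for illustration, where a real deployment would use a stable location.

```php
<?php
$definitions = [
    'age'   => ['filter' => FILTER_VALIDATE_INT,
                'options' => ['min_range' => 0, 'max_range' => 150]],
    'email' => ['filter' => FILTER_VALIDATE_EMAIL],
];

// Assumed cache location, unique per process so the sketch cleans up after itself.
$cache_file = sys_get_temp_dir() . '/form_definitions_' . getmypid() . '.php';

// var_export() emits valid PHP source for the array; wrap it in a returning script.
file_put_contents(
    $cache_file,
    '<?php return ' . var_export($definitions, true) . ';' . PHP_EOL
);

// Including the generated file yields the array again. When OPcache is enabled,
// repeated includes of the same file can be served from its compiled cache.
$loaded = require $cache_file;

unlink($cache_file);
```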

that we are
constantly marshalling and unmarshalling things from definition arrays

Well, that sounds like a real problem, but it's really not obvious how
your RFC would solve that problem.

I hope you are still here.

To be honest, I kind of faded out in the middle, but I'm back now.

You might want to consider the phrase "I'm sorry I wrote you such a
long letter; I didn't have time to write a short one". This list is
distributed to thousands of people. Spending some time editing your
email down to a concise one, makes it much more likely to be read.

Also, it's disappointing that you wrote so many words, but didn't
(imho) actually answer why this needs to be in core rather than a
userland library. Even if other people might find it useful, that can
be more easily done through distributing userland code i.e. through
packagist.

If it really can't be done in userland, then possibly writing it as
a PECL extension would be a good way to prove that it is a useful
thing that other people would want to use, as well as being a good way
of iterating the API until it fits everyone's use-cases, rather than
hoping to get the API correct on the first version in PHP core.

cheers
Dan

p.s.

I am reminding this is happening on "output" side of things, where there is no
reason to burn cycles in generating markup, unless form is completely validated
(for confirmation display).

It's hard to know for sure, without seeing your actual code, but this
also sounds like a problem that varnish cache, with edge side includes,
is designed to solve.

8 years ago by Martin "eto" Misuth

On Fri, 28 Apr 2017 11:06:10 +0100
Dan Ackroyd danack@basereality.com wrote:

would make maintaining such library much easier

No. Maintaining stuff in PHP core is much more difficult than
maintaining it in userland. 'Maintaining' doesn't just mean fixing
bugs, it includes thinking about how any changes to APIs get rolled
out to users, while minimising BC problems, and keeping code
up-to-date with the PHP engine.

Got it. So this is to minimize the php team's maintenance burden. Makes sense.

These things are much easier to do in userland than they are in PHP core.

It would be immensely valuable to me, if we could have filter extension
just intelligent enough, that it would process all request data
according to template, from each input source in succession, on its
own, in one go

While I can see that could be useful to you the question, again, is
why can't this be done in userland?

Understood. Still, from my point of view, filter is missing tiny nudges that
have to be done on module side.

Even if I store "form definition" arrays in some memory cache (I tried apcu
and ramdisk), so that their instantiation is relatively quick,

If you want data/code to be retrieved very quickly, I'd recommend
generating PHP code, and then letting OPCache, cache that code, as
well as possibly use a classloader that is a good fit for your use
case, like this one: https://github.com/Danack/LowMemoryClassloader/
rather than a generic classloader. It loads classes that are already
present in OPCache with zero system calls.

Thank you, I will investigate that suggestion as well.

Well, that sounds like a real problem, but it's really not obvious how
your RFC would solve that problem.

I believe my approach provides scaffolding that will allow filter to
be extended later, to advance through definition arrays on its own,
without the need to wrap the initial "walk" of the definition array in userland
"iterators" (functions that unpack data from custom definition arrays
and transform them into a form suitable for filter consumption).

To be honest, I kind of faded out in the middle, but I'm back now.

You might want to consider the phrase "I'm sorry I wrote you such a
long letter; I didn't have time to write a short one". This list is
distributed to thousands of people. Spending some time editing your
email down to a concise one, makes it much more likely to be read.

Also, it's disappointing that you wrote so many words, but didn't
(imho) actually answer why this needs to be in core rather than a
userland library.

This might sound like an empty excuse, but I projected my personal experience
with new hires (who all left after a very short time working with our codebase)
onto you. After observing that several times, I came to understand that our
system is "somehow weird" and needs thorough explanation. I had to explain
the system again and again, and got caught up in this loop. I will refrain from
doing this again from now on.

If it really can't be done in userland, then possibly writing it as
a PECL extension would be a good way to prove that it is a useful
thing that other people would want to use, as well as being a good way
of iterating the API until it fits everyone's use-cases, rather than
hoping to get the API correct on the first version in PHP core.

Can I interpret that paragraph this way:

- it's okay for me to take ext/filter with my modifications
  (even though they modify only part of the extension)
- and publish them somewhere else (like github for example)
- while refining the api further to be "perfect"
- while adding intended usecases and examples
- and only after that ask for evaluation by others and attempt RFC again

Correct?

If yes, there is one thing I don't know what to do about, and need your advice.

What about module name clash?

I would prefer, for this alternative module, to be named filter as well
(as there is intent to get modifications back into core, eventually).

Or should I go and rename the thing to something like filteri ("filter
improved") or filtero ("filter other"), even though big parts of the code are
still the original filter? I would like to avoid doing that.

cheers
Dan

Thank you for your advice.

p.s.

I am reminding this is happening on "output" side of things, where there is
no reason to burn cycles in generating markup, unless form is completely
validated (for confirmation display).

It's hard to know for sure, without seeing your actual code, but this
also sounds like a problem that varnish cache, with edge side includes
is designed to solve.

Ah, well, form pages are "static" (aka first client render). Form handlers
actually deal with the dynamic part of those forms, and are tied to "private"
user sessions. I fail to see how I can reliably and securely cache those, if
they are unique per session, form and request.

eto

8 years ago by Dan Ackroyd

there is one thing I don't know what to do about, and need your advice. What
about module name clash? I would prefer, for this alternative module, to be
named filter as well

IMO that violates the principle of least astonishment - a function
that has the same name as an internal PHP function doing something
different.

However if you really want to do that, 'overloading' internal
functions can be achieved by (ab)using namespaces:

namespace Foo {
    function filter_var($input, $filter = FILTER_DEFAULT, $options = []) {
        echo "I am a custom filter function";
        return $input;
    }
}

namespace Bar {
    use function Foo\filter_var as filter_var;
    $val = filter_var("someInput");
}

Output is "I am a custom filter function"

cheers
Dan

[RFC] concept: further improvement of filter extension, "generalising" filter definitions while adding new callback filter type
