Date: Fri, 28 Apr 2017 04:49:10 +0200
From: et.code@ethome.sk ("Martin \"eto\" Misuth")
To: internals@lists.php.net
Reply-To: et.code@ethome.sk
Message-ID: <20170428044910.01fc1d66@eto-mona.office.smartweb.sk>
References: <20170427115041.06339340@eto-mona.office.smartweb.sk>
Subject: Re: [PHP-DEV] [RFC] concept: further improvement of filter extension, "generalising" filter definitions while adding new callback filter type

> On 27 April 2017 at 10:50, Martin "eto" Misuth wrote:
> >
> > By posting this draft, I am asking for comments.
> >
> What is the argument for doing this as part of PHP core,
> rather than doing it as a userland library?
>
> cheers
> Dan
>
> (apologies for possible duplicate reply)

Sorry, when asked to explain myself I am prone to "infodumping", and I am often told I sound quite "combative", as if I had low self-esteem. That is not my conscious intention, so I apologise if this is hard to read because of it.

We actually have a userland library exactly like that. By "library" I mean "shelves" of functions doing various things to input vars. I believe my proposal would make maintaining such a library much easier while making the filter extension "better", still keeping backward compatibility at the same time.

First I will explain our situation:

We have a "CMS system" operating several thousand domains by now. Some domains have thousands of subpages and relatively big images. Its origins date back more than a decade; the system currently runs on PHP 7.1. As such, it is not really that "big", but it is not that "small" either.

It acts somewhat like a "static site" generator, but with a twist. Instead of using Markdown or another more "current" templating language, the whole thing is built on top of "old" XML data files and an XSLT preprocessor.
More "dynamic" data are stored as JSON fragments. On the administration side, PHP acts as a kind of mixer/preprocessor: it prepares XML variables, XML data files, JSON data files and DB sources, and transforms the data through XSLT templates into "the site". Pages are processed in batches. Even though the XSLT pipeline might not be the fastest there is, the point is that eventually a static site output is generated (with a few special dynamic handlers). This is "published" to "outside servers". There, data is handed out to clients very quickly, but naturally some requests still have to be processed. Said another way: what can be served statically is served as such, with the dynamic stuff handling only the special cases. This dynamic stuff includes AJAX handlers, form handlers and search.

The XML and XSLT combination proved to be extremely resilient, withstood the test of time, and survived various webdesign trend shifts very well. Many customers have XML data files (containing content) with mtimes dating several years back.

Users have a completely dynamic web administration at their disposal that allows them to construct sites by pointing and clicking. One of the major features is the ability to build arbitrary forms. As you can imagine, a relatively simple XML data file with 50-200 "virtual controls" can very easily be "blown up" by the XSLT processor into quite complex HTML markup containing AJAX-y controls, "subwidgets", plenty of CSS doodads and whatnot, scattering various parts of "the thing" into various targets (markup body, head, external script files, CSS files, **the PHP array filter template**, and so on). Thus output forms can get very complex, very quickly. Major customers, who also get major hits, love to make these things huge (them being so easy to make).

Most handling of this output naturally happens on the client side, at least until one submits. For that we implemented a universal form handler that does nothing more than process these (sometimes humongous) forms, using 'filter definitions' cached in deeply nested arrays. On submit it also does substitutions in the final output template.

Due to the mentioned user editability, any form can end up with many specialty validators that have to be implemented using callbacks. These validators and sanitizers are, for example, for zip codes of various countries (these are not compatible with each other and each one can be a special snowflake), for their various display modes, or for various resource schemes and stuff like that.

So that is the current situation. I apologise for the overly verbose description, but I hope it conveys our situation as clearly as possible. The whole system is not perfect, but it is not completely horrible either; there seems to be something to it.

Now, the thing is: even if I store the "form definition" arrays in some memory cache (I tried APCu and a ramdisk), so that their instantiation is relatively quick, at the rate of requests we are getting I can see, with debugging output enabled, that we are constantly marshalling and unmarshalling things from the definition arrays (roughly mapping to the XML tree) to accommodate various filter quirks (especially callback). I remind you this is happening on the "output" side of things, where there is no reason to burn cycles generating markup unless the form is completely validated (for the confirmation display).
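To make that marshalling concrete, here is a minimal, hypothetical sketch of the userland pattern described above (the field names, the zip-code callback and the dotted-key flattening are illustrative, not taken from the actual code). filter_var_array() only understands one flat level of keys, and FILTER_CALLBACK takes nothing but the callable itself, so a tree-shaped definition has to be flattened, together with the matching input, on every request:

<?php
// Hypothetical nested definition, roughly mirroring the XML form tree.
$definition = [
    'contact' => [
        'email' => ['filter' => FILTER_VALIDATE_EMAIL],
        'zip'   => [
            'filter'  => FILTER_CALLBACK,
            // FILTER_CALLBACK accepts only a callable in 'options'; there is
            // no way to pass extra flags or options through to the callback.
            'options' => function ($value) {
                // Illustrative validator: "123 45" style zip codes only.
                return preg_match('/^\d{3}\s?\d{2}$/', (string) $value) ? $value : false;
            },
        ],
    ],
];

// filter_var_array() understands only one flat level of keys, so a userland
// layer has to walk the tree and marshal both the definition and the input
// into that flat shape on every request (and map the results back afterwards).
function flatten(array $tree, $prefix = '')
{
    $flat = [];
    foreach ($tree as $key => $spec) {
        $name = $prefix === '' ? $key : $prefix . '.' . $key;
        if (is_array($spec) && !isset($spec['filter'])) {
            $flat += flatten($spec, $name);   // subtree: recurse
        } else {
            $flat[$name] = $spec;             // leaf: filter spec or raw value
        }
    }
    return $flat;
}

$input  = ['contact' => ['email' => 'user@example.com', 'zip' => '841 04']];
$result = filter_var_array(flatten($input), flatten($definition));

var_dump($result); // ['contact.email' => ..., 'contact.zip' => ...]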
It would be immensely valuable to me if we could have the filter extension just intelligent enough that it would process all request data according to a template, from each input source in succession, **on its own**, **in one go** (for each input source), only occasionally consulting a callback, instead of building a layer of hacks on top of it (the filter extension) that marshals and unmarshals data from the "nice" structure to the "badly nested" one using foreach, array_walk or whatever.

I believe this '$definition concept' plus the 'callback_extended' handler allows one to do exactly that. This way the filter can walk the hierarchy on its own, while the same filter definition "object" (it is really just a hierarchy of hashes) can be passed to filter_input()/filter_var() directly.

Anyway, I also believe this modified behaviour would be useful to other filtering library writers as well; otherwise I would naturally not bother you here. Some filter based userland libraries are ridiculous (using classes with FILTER_CALLBACK_METHODS_LIKE_THIS, or massively abusing regexps, with catastrophic backtracking and such), just because the callback filter, I have come to believe, sucks. It does not mix well with other filters that can take options and flags. Also, there is no way to "pull" the filter currently used at a given definition array index and hand it to a singular call to filter_var() (like in the AJAX case) without transforming its structure. Although the default filters are pretty powerful on their own, at a certain point you must resort to hacks like that to cope with specialty corner cases.

If anything, $filter_or_definition is a single Z_TYPE_P(filter_or_definition) == IS_LONG test, keeping the existing filter behaviour working as before. Once processing gets past that, I do not know which is worse: searching for the filter_id in a hash on the C side, or building a marshalling layer on top of it in PHP. In my eyes it is the PHP userland layering that is the wrong place for this.

Experimenting with this, I was surprised how relatively small an amount of C level modification was needed to blow away many lines of my PHP hacks. But I consider myself a pretty dumb person, so there is a high possibility that I am doing something wrong or seeing this in the wrong light. I am not personally confident about my approach either, but I honestly do believe that the overall concept is pretty sound. As I am no wizard coder either, I would be very grateful for reviewers. If the general opinion is that this is a crappy approach overall, I am okay with being told NO. No problem.

Anyway, this sums up my opinion and rationale. I hope you are still here. As you can see, I am by default ready for a negative conclusion regarding this RFC draft, but I am also curious why it might be perceived as problematic. Is it because of potential compatibility breakage, or because it goes against the spirit of the filter extension too much?

Thank you for reading, and for any reply in advance, and sorry for the long post; I hope it answers most questions.

eto
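P.S. To make the intended semantics concrete, here is a rough userland approximation of the dispatch described above. This is a sketch only: the helper name is made up, the draft's actual change lives on the C side inside ext/filter, and the 'callback_extended' handler itself is not modelled here.

<?php
// Userland approximation of the proposed filter_var() dispatch, just to
// illustrate the idea; the real patch would do this inside ext/filter in C.
// $filter_or_definition may be a plain filter ID, a leaf spec array,
// or a nested "hierarchy of hashes".
function filter_var_or_definition($value, $filter_or_definition = FILTER_DEFAULT)
{
    // The Z_TYPE_P(filter_or_definition) == IS_LONG idea from above:
    // a plain integer filter ID keeps today's behaviour untouched.
    if (is_int($filter_or_definition)) {
        return filter_var($value, $filter_or_definition);
    }

    // A leaf spec: usable as-is, also for a singular call (the AJAX case).
    if (isset($filter_or_definition['filter'])) {
        return filter_var($value, $filter_or_definition['filter'], $filter_or_definition);
    }

    // A subtree: walk the hierarchy and filter each child in one pass.
    $out = [];
    foreach ($filter_or_definition as $key => $spec) {
        $out[$key] = filter_var_or_definition(
            isset($value[$key]) ? $value[$key] : null,
            $spec
        );
    }
    return $out;
}

// With something like that built in, the same nested definition could serve
// both whole-tree filtering and pulling out a single leaf, e.g.:
// $clean = filter_var_or_definition($_POST, $definition);
// $zip   = filter_var_or_definition($zipFromAjax, $definition['contact']['zip']);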