Sql Object Model Parser & Sanitizer (was [RFC] is_literal)

4 years ago by Mike Schinkel — view source

unread

Le 22/06/2021 à 11:28, Dan Ackroyd a écrit :

Should(n't?) PHP add a basic SQL builder class that can be extended for special cases, e.g. different flavors of SQL?

No. Or at least not yet.

This type of thing is much better done in userland, where the api can
evolve at a fast rate, rather than being limited by the fixed release
schedule of PHP.

Agreed, PHP is probably not the right place for an SQL builder, there's too many dialects, too many standard or non-standard features, in the end SQL builders almost always end-up being opinionated by its designer's original need, philosophy, or SQL usage habits, and tailored for users which use certain paradigms.

An SQL query builder is already business domain tied, in some way. Of course most are very generic, but often don't handle properly SQL advanced or modern features, whereas some other where built for ORMs and such and may a reduced advanced SQL features that fits the original need.

I don't wish to see an SQL query builder in PHP core, instead of favoring usage of modern SQL, it'll in my opinion (of course, that's subjective) lower the expectation of people and probably normalize SQL-92 being the only SQL spoken by people writing PHP (just exaggerating a bit).

Let me clarify by giving what I am referring to a different name: A SQL Object Model with parser and sanitizer functionality. One with a simple interface and one that would need not be tied to or limited by any specific dialect of SQL.

To illustrate what I mean I created the following straw man example. It assumes a hypothetical SqlObjectModel classes that needs to be instantiated with an objects that implement hypothetical SqlDialectInterface and SqlSchemaInterface instances passed to its constructor. The former defines the rules for the MySql dialect, and the latter accesses the schema to validate data and data types.

Some "dialects" and "validators" for very well-known and widely used database servers such as MySQL could be implemented in PHP core, while others could be implemented in userland (the database vendors would be good candidate to implement them, actually.) I don't yet know what those interfaces should look like but I am pretty sure that could be figured out.

So here is the example (I omitted error handling logic for brevity and clarity):

$sql = <<<SQL
SELECT
c.id AS company_id,
c.company_name,
jo.job_title,
COUNT(*) AS job_openings
FROM companies c
JOIN jobs_openings jo ON c.id=jo.company_id
WHERE 1=1
AND city_name = 'Nantes'
AND country_name='France'
AND job_openings > {openings}
GROUP BY
c.company_name
LIMIT
{limit}
SQL;

$conn = mysqli_connect(...);
$som = new SqlObjectModel(new MySqlDialect(), new MySqlSchema($conn));
$som->parse($sql);
$som->set( "limit", $_GET['limit'] );
$som->set( "openings", $_GET['openings'] );
$result = mysqli_query( $conn, $som->sql() );

The SQLObjectModel could potentially have additional features like the ability to get and set the table name, list of fields, joins, where clauses, etc. but the core value-add is not in building SQL but rather parsing and sanitizing with properly escaped and quoted arguments so that known-safe SQL could be generated.

Existing or new SQL builders could use this to take SqlObjectModel() and pass its generated SQL through to ensure their SQL they will output is safe, which by the way the string could now be tagged as "trusted" before the builder returns the SQL to its developer, assuming an is_trusted() RFC were to pass.

Similarly, any ORM could ensure its SQL is fully sanitized and then pass on to whatever functions are needed to execute the SQL.

So, absolutely nothing here would keep people from using "modern SQL," and/or cause people to lower the expectation to normalize SQL-92. This could be a generic query parser and sanitizer that could be configured by the "dialect" and "schema" objects passed to it and its scope it small, simple and straightforward.

I hope that clarifies what I was thinking?

-Mike

4 years ago by Pierre — view source

unread

Le 22/06/2021 à 14:45, Mike Schinkel a écrit :

I think we are talking two different things here.  And I will
admit I am probably the one that is not using exactly the right
term.  I had envisioned something, and then Dan said "SQL Builder"
I used that term.

Let me clarify by giving what I am referring to a different name:
 A SQL Object Model with parser and sanitizer functionality.  One
with a simple interface and one that would need not be tied to or
limited by any specific dialect of SQL.

To illustrate what I mean I created the following straw man
example.  It assumes a hypothetical SqlObjectModel classes that
needs to be instantiated with an objects that implement
hypothetical `SqlDialectInterface` and `SqlSchemaInterface`
instances passed to its constructor.  The former defines the rules
for the MySql dialect, and the latter accesses the schema to
validate data and data types.

Some "dialects" and "validators" for very well-known and widely
used database servers such as MySQL could be implemented in PHP
core, while others could be implemented in userland (the database
vendors would be good candidate to implement them, actually.)  I
don't yet know what those interfaces should look like but I am
pretty sure that could be figured out.

So here is the example (I omitted error handling logic for brevity
and clarity):

$sql = <<<SQL
SELECT
c.id <http://c.id> AS company_id,
    c.company_name,
    jo.job_title,
    COUNT(*) AS job_openings
FROM companies c
    JOIN jobs_openings jo ON c.id <http://c.id>=jo.company_id
WHERE 1=1
    AND city_name = 'Nantes'
    AND country_name='France'
    AND job_openings > {openings}
GROUP BY
    c.company_name
LIMIT
    {limit}
SQL;

$conn = mysqli_connect(...);
$som = new SqlObjectModel(new MySqlDialect(), new MySqlSchema($conn));
$som->parse($sql);
$som->set( "limit", $_GET['limit'] );
$som->set( "openings", $_GET['openings'] );
$result = mysqli_query( $conn, $som->sql() );


The SQLObjectModel *could* potentially have additional features
like the ability to get and set the table name, list of fields,
joins, where clauses, etc. but the core value-add is not in
building SQL but rather parsing and sanitizing with properly
escaped and quoted arguments so that known-safe SQL could be
generated.

Existing or new SQL builders could use this to take
SqlObjectModel() and pass its generated SQL through to ensure
their SQL they will output is safe, which by the way the string
could now be tagged as "trusted" before the builder returns the
SQL to its developer, assuming an is_trusted() RFC were to pass.

Similarly, any ORM could ensure its SQL is fully sanitized and
then pass on to whatever functions are needed to execute the SQL.

So, absolutely nothing here would keep people from using "modern
SQL," and/or cause people to lower the expectation to normalize
SQL-92.  This could be a generic query parser and sanitizer that
could be configured by the "dialect" and "schema" objects passed
to it and its scope it small, simple and straightforward.

I hope that clarifies what I was thinking?

-Mike

Actually, it makes more sense.

It's even funnier if you consider it, it's more or less what I'm doing
in my own SQL builder for find and replace placeholders, after the query
builder actually built the query. I'm not replacing placeholders with
values, but I'm replacing my own API arbitrary placeholders with those
expected by the underlaying dialect or low-level API (I'm not naive
enough to concatenate user strings in SQL, I'm leaving the hard escaping
work to the server or lower-level db connector API using prepared
queries when the low level driver doesn't handle it properly).

But that said, what you describe already intersects with most SQL
builders own features, and since they all have very different
architecture, I'm not sure this kind of helper would be that much
beneficial. But that something that could make sense to be discussed.

In all cases, thanks for answering.

Regards,

Pierre

4 years ago by Kamil Tekiela — view source

unread

Hi Mike,

Please don't do this. We already have PDO with prepared statements. The
data must be bound. This is the secure way of writing SQL queries. The idea
behind SQL builder is to generate SQL, not to allow the data to be
sanitized.
Every time I hear the word sanitize I get goose bumps. You can't remove any
characters from a string to make it safe. If you want to use escaping, then
you need to do it context aware and properly formatted. Don't sanitize
anything. Format the SQL properly instead.

On a general note. Implementing SQL builder in PHP would be an enormous
task, which is not feasible. There are so many dialects, so many options,
and even then it won't ever be accurate as you don't have the full context
in PHP. SQL is a very powerful language, and building a parser for it in
PHP would mean that we either limit it to a subset of valid SQL commands,
or we try to create a super tool that is more powerful than MySQL, Oracle,
PostgreSQL, etc. combined.
There's absolutely nothing wrong with writing SQL in PHP and preparing it
on the server. For database servers that don't support prepared statements
we already have PDO which is an abstraction library that tries to escape
and format data within SQL. It works 99% of the time.

The example you suggested already has a simple syntax in PHP.

$conn = mysqli_connect(...);
$stmt = $conn->prepare($sql);
$stmt->execute([$_GET['openings'], $_GET['limit']]);

Regards,
Kamil

4 years ago by Pierre — view source

unread

Le 22/06/2021 à 15:00, Kamil Tekiela a écrit :

Hi Mike,

Please don't do this. We already have PDO with prepared statements. The
data must be bound. This is the secure way of writing SQL queries. The idea
behind SQL builder is to generate SQL, not to allow the data to be
sanitized.
Every time I hear the word sanitize I get goose bumps. You can't remove any
characters from a string to make it safe. If you want to use escaping, then
you need to do it context aware and properly formatted. Don't sanitize
anything. Format the SQL properly instead.

On a general note. Implementing SQL builder in PHP would be an enormous
task, which is not feasible. There are so many dialects, so many options,
and even then it won't ever be accurate as you don't have the full context
in PHP. SQL is a very powerful language, and building a parser for it in
PHP would mean that we either limit it to a subset of valid SQL commands,
or we try to create a super tool that is more powerful than MySQL, Oracle,
PostgreSQL, etc. combined.
There's absolutely nothing wrong with writing SQL in PHP and preparing it
on the server. For database servers that don't support prepared statements
we already have PDO which is an abstraction library that tries to escape
and format data within SQL. It works 99% of the time.

The example you suggested already has a simple syntax in PHP.

$conn = mysqli_connect(...);
$stmt = $conn->prepare($sql);
$stmt->execute([$_GET['openings'], $_GET['limit']]);

Regards,
Kamil

I fully agree with you. Any attempt to be smart in doing this will
eventually end-up in no-so-smart corner-case bugs that will be very,
very hard to deal with. It's almost an impossible mission. And if by any
luck you'd succeed in making this work, it probably would be a
maintenance nightmare.

Regards,

Pierre

4 years ago by Mike Schinkel — view source

unread

Hi Mike,

Please don't do this. We already have PDO with prepared statements. The data must be bound. This is the secure way of writing SQL queries.

The problem is that over 40% of the web currently runs on PHP code that using mysqli. That CMS does not support PDO nor prepared statements, and is unlikely to switch to it anytime some in the foreseeable future.

A SQL object model parser and sanitizer could more easily be used incrementally by that CMS since PDO does not share connections with mysqli (AFAIK, anyway.)

The idea behind SQL builder is to generate SQL, not to allow the data to be sanitized.

That is why the title of this email said "Parse and Sanitizer" not "Builder."

Every time I hear the word sanitize I get goose bumps. You can't remove any characters from a string to make it safe. If you want to use escaping, then you need to do it context aware and properly formatted. Don't sanitize anything. Format the SQL properly instead.

I believe you are latching onto the word sanitizing as you understand it and ignoring the context of the discussion.

What I believe I am proposing is to implement an object that would be SQL-syntax-aware and — to use your words — "format the SQL properly."

But maybe I am wrong? Can you explain how what I am proposing would not be:

"Removing the characters" needed to make it safe?
Context aware?

Note that I did not specifically talk about removing characters, and I did talk about context. I talked about parsing the SQL so as to ensure it is safe.

On a general note. Implementing SQL builder in PHP would be an enormous task, which is not feasible.

Is it possible you just scanned the first email in this thread and did not fully read it?

For the email to which you replied I literally wrote:

I think we are talking two different things here. And I will admit I am probably the one that is not using exactly the right term. I had envisioned something, and then Dan said "SQL Builder" I used that term (wrongly.)Let me clarify by giving what I am referring to a different name: A SQL Object Model with parser and sanitizer functionality.

What I was discussing would not be the enormous task you are referring to. I was not proposing creating anything at all like these first two Google results for "php sql query builder: [1] and [2]. I was proposing a SQL object model, parser and "safe SQL generator" (I changed the last word since you triggered on it.)

There are so many dialects, so many options, and even then it won't ever be accurate as you don't have the full context in PHP.

My email addressed how dialects and options would be handled. Using dependency injection of objects that define each dialect and provide access to the schema.

SQL is a very powerful language, and building a parser for it in PHP would mean that we either limit it to a subset of valid SQL commands, or we try to create a super tool that is more powerful than MySQL, Oracle, PostgreSQL, etc. combined.

That is a false binary. It would not be a huge undertaking to create a generic query parser that used an object implementing an interface to provide it with the information needed to correctly parse a query enough to sanitize it.

And classes defining any given SQL subset could, if not in PHP core, be written in userland. And the DB vendor is a likely candidate to write them too.

There's absolutely nothing wrong with writing SQL in PHP and preparing it on the server. For database servers that don't support prepared statements we already have PDO which is an abstraction library that tries to escape and format data within SQL. It works 99% of the time.

"Nothing wrong with" (per se), is true.

But for applications that have 15 years of legacy code using mysql(i), PDO is a non-starter.

The example you suggested already has a simple syntax in PHP.

$conn = mysqli_connect(...);
$stmt = $conn->prepare($sql);
$stmt->execute([$_GET['openings'], $_GET['limit']]);

If PHP were to add functionality to core to use prepared statements with the mysqli connection, this discussion _might) be different.

But, as I said, for the CMS that controls >40% of the web PDO is a non-starter.

OTOH, what about when a developer needs to parameterize field and table names in PDO[3]? PDO does not address that aspect of assembling a known-safe SQL string and requires manual sanitization. What I am proposing would address this.

IOW, PDO is a tool for a different albeit overlapping use-case, and could even leverage PDO which as applicable to achieve its objectives.

-Mike

[1] https://github.com/nilportugues/php-sql-query-builder
[2] https://github.com/ClanCats/Hydrahon
[3] https://stackoverflow.com/a/182353/102699

4 years ago by Stephen Reay — view source

unread

Hi Mike,

Please don't do this. We already have PDO with prepared statements. The data must be bound. This is the secure way of writing SQL queries.

The problem is that over 40% of the web currently runs on PHP code that using mysqli. That CMS does not support PDO nor prepared statements, and is unlikely to switch to it anytime some in the foreseeable future.

A SQL object model parser and sanitizer could more easily be used incrementally by that CMS since PDO does not share connections with mysqli (AFAIK, anyway.)

(Resending from on-list address)

Apparently you didn't know mysqli supports parameterised queries?
Wordpress could have adopted parameterised queries when they grudgingly switched to mysqli, years after both it and PDO were introduced.
There’s zero reason to believe they would adopt this unless forced to.

4 years ago by Mike Schinkel — view source

unread

Hi Mike,

Please don't do this. We already have PDO with prepared statements. The data must be bound. This is the secure way of writing SQL queries.

The problem is that over 40% of the web currently runs on PHP code that using mysqli. That CMS does not support PDO nor prepared statements, and is unlikely to switch to it anytime some in the foreseeable future.

A SQL object model parser and sanitizer could more easily be used incrementally by that CMS since PDO does not share connections with mysqli (AFAIK, anyway.)

(Resending from on-list address)

Apparently you didn't know mysqli supports parameterised queries?
Wordpress could have adopted parameterised queries when they grudgingly switched to mysqli, years after both it and PDO were introduced.

Well, yes and no.

I admit I had forgotten it because frankly the WordPress wp-db object wrapper used by 99.7%[1] of WordPress developers does not support it and given I had long ago realized I could not use them in WP I guess my memory blocked out the existence of prepared statements in mysqli.

But it is not relevant in WordPress. Here is where (almost?) all SQL queries go to run in WordPress:

https://github.com/WordPress/WordPress/blob/master/wp-includes/wp-db.php#L2050

Note there are no bind_param() or execute() calls anywhere in that file. Their is a prepare(), but WordPress rolled their own:

https://github.com/WordPress/WordPress/blob/master/wp-includes/wp-db.php#L1302

So saying that parameterized query is a solution leaves out 40% of the web. Because some other guys are the intransigent ones here.

But also because of much legacy code exists in the form of plugins and themes that do not support parameterized queries.

HOWEVER, whether mysqli supports parameterised queries or not is all a moot point because parameterised queries do not allow for parameterizing field names or table names. And the point of this thread was to discuss how to mark SQL that has been composed at run-time to be "safe." Without being able to parameterize field names and table names parameterised queries are not a sufficiently complete solution.

At least not from an is_literal()/is_trusted() perspective.

There’s zero reason to believe they would adopt this unless forced to.

Exactly! WordPress is not going to change, so that leaves it up to PHP developers using WordPress to protect themselves.

Since I am assuming that everyone on this list wants to minimize the number of security issues in PHP applications across the web by whatever reasonable means necessary, I also assume that pointing at WordPress and saying "It's their fault not ours" does not help achieve those security goals? So I am also assuming that finding a pragmatic solution that would allow PHP to be able to empower the developer in a way that WordPress won't might be worth considering.

But tell me if I am assuming wrongly here.

-Mike

[1] Yes, I made that up out of whole cloth. But honestly in 10+ years working with WP I have almost never seen anyone use anything but the $wpdb provided by WP — except for one single plugin — and $wpdb does not support parameterized queries.

4 years ago by Pierre — view source

unread

Le 22/06/2021 à 17:35, Mike Schinkel a écrit :

https://github.com/WordPress/WordPress/blob/master/wp-includes/wp-db.php#L2050
Sorry for the discussion pollution here but, but ouch, plugins are still
using this unsafe form ? Reminds when I was a student, I learnt to
parametrize queries there, it was 20 years ago. I never understood
people not doing that in the first place.
But also because of much legacy code exists in the form of plugins and themes that do not support parameterized queries.
Yes I agree, legacy is legacy, you have to deal with it. But all legacy
code cannot be fixed, and doing a highly complex SQL parsing / escaping
/ vulnerability detection code that explicitly targets legacy code and
not modern code seems weird to me.
HOWEVER, whether mysqli supports parameterised queries or not is all a moot point because parameterised queries do not allow for parameterizing field names or table names. And the point of this thread was to discuss how to mark SQL that has been composed at run-time to be "safe." Without being able to parameterize field names and table names parameterised queries are not a sufficiently complete solution.

Not being able to parametrize table or field names is not only a problem
for mysqli, but it is for PDO and pgsql too. That's a place where
userland query-builders and others DBALs, even the most basic ones do
shine, and brings a real added-value.

But having anyone, writing SQL with user-given table names or column
names, and executing it using something like WP's _do_query() method
seems like they WANT to be hacked. I'm not sure how you will succeed
in plugging the is_trusted() / is_literal() / is_wathever() method
correctly in an SQL Model Parser & Sanitizer anyway, knowing that at
this point, all you'll receive is a huge string issued by some plugin
API which has already done crazy dark (and probably bugguy as well) magic.

I don't see how adding magic in PHP core will avoid the need to fix all
those legacy plugins, they probably would need themselves to use this
new shiny API to benefit from it ? In the opposite, if something alters
the behavior of mysqli implicitly for everyone in order to make it safe,
it sounds like there will be a lot of BC ? In both case, it seems that
it will not do any shiny magic to me.

But, I might be wrong, this thread becomes harder and harder to read,
and I may have missed a few points.

Regards,

Pierre

4 years ago by Mike Schinkel — view source

unread

Le 22/06/2021 à 17:35, Mike Schinkel a écrit :

https://github.com/WordPress/WordPress/blob/master/wp-includes/wp-db.php#L2050
Sorry for the discussion pollution here but, but ouch, plugins are still using this unsafe form ? Reminds when I was a student, I learnt to parametrize queries there, it was 20 years ago. I never understood people not doing that in the first place.
But also because of much legacy code exists in the form of plugins and themes that do not support parameterized queries.
Yes I agree, legacy is legacy, you have to deal with it. But all legacy code cannot be fixed, and doing a highly complex SQL parsing / escaping / vulnerability detection code that explicitly targets legacy code and not modern code seems weird to me.
HOWEVER, whether mysqli supports parameterised queries or not is all a moot point because parameterised queries do not allow for parameterizing field names or table names. And the point of this thread was to discuss how to mark SQL that has been composed at run-time to be "safe." Without being able to parameterize field names and table names parameterised queries are not a sufficiently complete solution.

Not being able to parametrize table or field names is not only a problem for mysqli, but it is for PDO and pgsql too. That's a place where userland query-builders and others DBALs, even the most basic ones do shine, and brings a real added-value.

But having anyone, writing SQL with user-given table names or column names, and executing it using something like WP's _do_query() method seems like they WANT to be hacked.

Are you not familiar with PHPMyAdmin[1] and/or Adminer[2]?

Do they, or anyone else who has a similar use-case want to be hacked?

I'm not sure how you will succeed in plugging the is_trusted() / is_literal() / is_wathever() method correctly in an SQL Model Parser & Sanitizer anyway, knowing that at this point, all you'll receive is a huge string issued by some plugin API which has already done crazy dark (and probably bugguy as well) magic.

I don't see how adding magic in PHP core will avoid the need to fix all those legacy plugins, they probably would need themselves to use this new shiny API to benefit from it ? In the opposite, if something alters the behavior of mysqli implicitly for everyone in order to make it safe, it sounds like there will be a lot of BC ? In both case, it seems that it will not do any shiny magic to me.

It would not affect any of those plugins by itself. They would be left to fend for their own.

What it would do is allow those developers writing new sites and/or plugins or maintaining old ones who proactively choose to sanitize their SQL to be able to do so before they pass their SQL to $wpdb->query(), especially if the RFC is_trusted() and/or is_literal() passes.

But, I might be wrong, this thread becomes harder and harder to read, and I may have missed a few points.

It just started! Note I created a new thread from the is_literal()/is_trusted() thread to talk about SQL.

-Mike

[1] https://www.phpmyadmin.net/
[2] https://www.adminer.org/

4 years ago by Larry Garfield — view source

unread

Hi Mike,

Please don't do this. We already have PDO with prepared statements. The data must be bound. This is the secure way of writing SQL queries.

The problem is that over 40% of the web currently runs on PHP code that
using mysqli. That CMS does not support PDO nor prepared statements,
and is unlikely to switch to it anytime some in the foreseeable future.

WordPress is not going to leverage anything we do here until and unless there is a major change of leadership and culture at that project. Please don't waste any mental effort on it; they clearly waste no mental effort on what the rest of the PHP community considers good, secure practices. Anything involving them is tilting at windmills.

Mike, speaking as someone who has written an SQL abstraction layer and query builder with significant usage (Drupal 7-9), you are grossly under-estimating the complexity of what you describe. It might be possible to hack together for SQL92, aka "what most PHP devs actually use because they haven't noticed that it's not 1992 anymore", but that's already been done. We have DBTNG in Drupal, we have Doctrine, problem solved.

Modern SQL, though, is a stupidly complex and stupidly inconsistent beast. Most of the syntax beyond the basics is different on every damned database. The official spec is not even publicly available, and costs a lot of money to access. And no DB engine actually supports all of it; they all support different subsets with their own different extensions that may or may not be comparable.

Building a tool that parses an arbitrary string to an AST for a spec that is inconsistent, inaccessible, and not implemented correctly by anyone is a fool's errand, and that's just the first part of it. That's not even getting into designing an API for people to modify it, or questions of performance, or compiling the AST back into a DB-specific string, AND then doing parameter binding which varies from one database to another.

You're talking about reimplementing major portions of MySQL, PostgreSQL, Oracle, etc. themselves in PHP, all at the same time. Well, good luck, you're going to need it.

Personally I've long since concluded that database portability is no longer an achievable or even desirable feature. SQL is just too fragmented a language, leaving you with a least common denominator that is grossly under-whelming for modern needs. If you want more than SQL92, it's not really viable anymore.

--Larry Garfield

4 years ago by Mike Schinkel — view source

unread

Hi Mike,

Please don't do this. We already have PDO with prepared statements. The data must be bound. This is the secure way of writing SQL queries.

The problem is that over 40% of the web currently runs on PHP code that
using mysqli. That CMS does not support PDO nor prepared statements,
and is unlikely to switch to it anytime some in the foreseeable future.

WordPress is not going to leverage anything we do here until and unless there is a major change of leadership and culture at that project. Please don't waste any mental effort on it; they clearly waste no mental effort on what the rest of the PHP community considers good, secure practices. Anything involving them is tilting at windmills.

You misunderstand. What I am (likely) "wasting" my mental effort is to discuss features that I would be able to use WITHOUT WordPress having to make ANY changes.

HOWEVER, these would be beneficial beyond WordPress, since parameterized queries cannot parameterize table names nor field names.

This is all trying to address the concerns that Craig Francis brought up off list when he said that you cannot escape or sanitize without knowing context when I was asking his to provide a make_literal() function, add support for a IsLiteral attribute, or support an IsLiteralInterface so that people don't latch on to using is_literal() and make certain edge cases impossible.

When I am trying to address a problem, I come at it from many angles until I find a solution. One potential solution then is to have a class built into PHP that can ensure the SQL returned is indeed safe.

Mike, speaking as someone who has written an SQL abstraction layer and query builder with significant usage (Drupal 7-9), you are grossly under-estimating the complexity of what you describe. It might be possible to hack together for SQL92, aka "what most PHP devs actually use because they haven't noticed that it's not 1992 anymore", but that's already been done. We have DBTNG in Drupal, we have Doctrine, problem solved.

I think what is happening here is you are making an assumption I am proposing a much larger scope than I am.

Think of the scope I am proposing being on par with $mysqli->prepare(), but a bit more to be able to handle more than just values.

Modern SQL, though, is a stupidly complex and stupidly inconsistent beast. Most of the syntax beyond the basics is different on every damned database. The official spec is not even publicly available, and costs a lot of money to access. And no DB engine actually supports all of it; they all support different subsets with their own different extensions that may or may not be comparable.

Both Modern SQL and legacy SQL are both still text-based query languages and they all have grammars that can be represented by BNF rules.

https://github.com/ronsavage/SQL

Those rules could be abstracted into a form accessible via a "dialect" interface and that is how these would literally any version of SQL could be supported.

Could we finish all of a given dialect at once? No. Iteration based on what is found to be supported is how this could be approached. Remember, these dialects could be implemented in userland. By any PHP developer.

Could we ever get them to be perfect? Probably not. But they would be good at the start and then very good and there would like to be many people providing PRs to fix new edge cases for the dialects that get the most use.

Building a tool that parses an arbitrary string to an AST for a spec that is inconsistent, inaccessible, and not implemented correctly by anyone is a fool's errand, and that's just the first part of it.

That is not what I am proposing. I am proposing to build a tool that knows how to parse based on a simplified grammar and then sanitize based on those rules.

There really are only a few contexts in a SQL query to be concerned about. Keywords, field names, table name, table aliases, and typed values (plus maybe a few other things I missed?) The grammars themselves can evolve independently as userland implementations of interfaces. This is not rocket science.

That's not even getting into designing an API for people to modify it,

No API per se, an interface. (Yes that is also an API, but not one that you call; instead it calls you.)

or questions of performance,

Parsing a SQL query should not take nearly as long as executing the query itself, especially if the SqlObjectModel were written in C.

or compiling the AST back into a DB-specific string

Sounds like you are envisioning something I am not.

AND then doing parameter binding which varies from one database to another.

Again, that would be handled by the dialect class.

You're talking about reimplementing major portions of MySQL, PostgreSQL, Oracle, etc. themselves in PHP, all at the same time. Well, good luck, you're going to need it.

You seem to be focused on that fact you think this would be hard instead of if it would be a viable solution if implemented?

Personally I've long since concluded that database portability is no longer an achievable or even desirable feature. SQL is just too fragmented a language, leaving you with a least common denominator that is grossly under-whelming for modern needs. If you want more than SQL92, it's not really viable anymore.

I agree with you. That is why portability is not relevant to what I am proposing.

Maybe this will help. There are a billion XML schemas, but DomDocument and its related classes can process them all. A SqlObjectModel would be similar; it would know how to process text queries where the dialect interface implementors would be the equivalent of the XML schema.

It is not a perfect analogy, but it is close enough.

-Mike

P.S. I put this at the end, because it is important. There is no reason we would have to implement all syntax for a given SQL dialect. I'll bet 5% of any SQL dialect is used 95% of the time. Implement that 5% and then tell developers who need the other 95% of edge cases — like DML —that they have to sanitize those queries manually.

And that ~5% would provide injection protection for that ~95% of most common use-cases. Why let perfect be the enemy of the good?

4 years ago by Kamil Tekiela — view source

unread

While it's true that a lot of the internet is using mysqli due to
WordPress, this doesn't change the fact that PHP already offers a solution
to the problem. Both PDO and mysqli support server-side prepared
statements.

We don't talk about WordPress. They should not hold back PHP. That project
is so behind that they still offer mysql_* API in their abstraction
library. Nonetheless, whatever solution we offer, they're unlikely to pick
it up. They haven't used prepared statements for whatever reason, and they
created their own client-side prepared statements. What makes you think
that if we write "Parser & Sanitizer", they're going to start using it? The
proposal for is_literal is going to be completely useless to WordPress as
they have no place to use it in their codebase. It's a pity we can't help
them, but it's true.

To clarify the proposal from Craig, I would like to point out that
is_literal() will not help with the data being passed to SQL. User input
cannot be trusted in any sense, no matter where it comes from. The proposal
for is_literal() helps libraries that try to build the SQL dynamically,
without the user input. For example, it would help to write a secure method
like this:

function select(mysqli $mysqli, string $table) {
if(!is_literal($table)) {
throw new RuntimeException('The table name has to be a hardcoded string');
}

return $mysqli->query("SELECT * FROM {$table}")->fetch_all();
}

This code would prevent the developer from misusing this function by
passing user input to it.

That is why the title of this email said "Parse and Sanitizer" not
"Builder."

Then we definitely don't need that. Both PDO and mysqli can escape string
literals in a context aware manner (to some extent at least). PDO tried to
implement a parser, but it proved challenging and to this day doesn't work
correctly. As Larry explained, SQL is not a standardised language anymore.
The flavours diverged and created a widely inconsistent language.

To build a context-aware parser for SQL, you would need to know:

The SQL flavour
The database type (MySQL or MariaDB)
The database version
The settings maintained on that database server

For example, in MySQL, you need to know which character is used to escape
quotes in string literals, and you need to know the SQL_MODE. Then you need
to know which syntax is allowed and understood by that specific MySQL
version. The SQL syntax differs a lot between MySQL and MariaDB. Building,
parsing and compiling the SQL in PHP would involve rebuilding the same
parser that MySQL server is using. And you'd have to do it for every single
possible database vendor/version. On top of that, you need to have direct
access to the database/table/column/SP/functions names, so that your parser
can apply the business logic. Not to mention escaping the data if you don't
want to use prepared statements.

OTOH, what about when a developer needs to parameterize field and table
names in PDO

You don't parameterize field and table names. Period. You hardcode these
names in your business logic. Here's an example from DBAL:

$queryBuilder
->select('id', 'name') // <-- Hardcoded
->from('users') // <-- Hardcoded
->where('email = ?') // <-- Hardcoded
->setParameter(0, $userInputEmail) // <-- Parameterized
;

Irrelevant of which parameterization technique is used here, the idea is
that you never parameterize SQL code. You parameterize the data. WordPress
does exactly the same with their prepare() method. If you need to have
dynamic field identifiers, then it's your job as a developer to write the
code that can safely handle such requirement.

A native PHP "SQL Object Model with parser and sanitizer functionality"
wouldn't prevent SQL injection in the same way that userland DB abstraction
libraries don't. You can still create SQL injection with wordpress $wpdp
class, you can do it with DBAL, and you can do it with PDO.

To summarize, we are not looking for a solution to prevent SQL injection or
to help Wordpress. We are looking for a solution that would prevent the
misuse of libraries like DBAL, so that methods like select(), from(), and
where() could generate a warning when used with non-hardcoded data.

Regards,
Kamil

P.S. By "sanitization" I understand removing unwanted information from a
piece of data. SQL injection cannot be prevented by removing data. It can
only be prevented by strict separation of SQL and data, or by careful
formatting of the data when included in the SQL string itself. This
includes quoting, type casting and escaping of quotes. Sanitization cannot
prevent SQL injection!

4 years ago by Mike Schinkel — view source

unread

While it's true that a lot of the internet is using mysqli due to WordPress, this doesn't change the fact that PHP already offers a solution to the problem. Both PDO and mysqli support server-side prepared statements.
<snip>

Hi Kamil,

Clearly for whatever reason your arguments and Larry's arguments are misunderstanding and thus misrepresenting what I was proposing. But given how long our respective emails are getting I do not have confidence we will get to clarification if we continue, so I am going to suggest we bring the original point of this thread to a close.

I will just have to implement a proof-of-concept so that what I have been proposing will not be so hard to misunderstand.

-Mike

4 years ago by Kamil Tekiela — view source

unread

Sounds good to me. Looking forward to your PoC.

4 years ago by Larry Garfield — view source

unread

Maybe this will help. There are a billion XML schemas, but DomDocument
and its related classes can process them all. A SqlObjectModel would be
similar; it would know how to process text queries where the dialect
interface implementors would be the equivalent of the XML schema.

It is not a perfect analogy, but it is close enough.

It's a very helpful analogy, actually. SQL is 100x more complex of a language than XML is. XML is trivial by comparison. That's exactly why what you describe is vastly harder than you think it is, even if you limit yourself to the rudimentary ancient SQL standards.

And a tool that breaks down and becomes unusable as soon as you start using anything even vaguely modern is a tool I would rather not exist, quite frankly, as it only further discourages anyone learning SQL beyond MySQL 3.

--Larry Garfield

4 years ago by Kamil Tekiela — view source

unread

This is open to SQL injection:

$queryBuilder
->select(...$_GET['columns'])
->from($_GET['table'])
->where($_GET['where'])
;

All below statements produce 42. This is valid SQL:

SELECT 42 FROM TABLE() FROM dual;
SELECT ⠀ FROM ⠀;
SELECT * FROM "42"; -- With ANSI_QUOTES
SELECT * FROM """""";

This is valid in MySQL:

VALUES ROW(42)

This is valid in MariaDB:

VALUES (42);

This is not a valid SQL:

SELECT * FROM """";

There are also windows functions, CTE, Stored procedures, and a bunch of
new features.