Hi there,
I am researching the possibility of developing a shared library which can
perform database queries in parallel to multiple databases. One important
requirement is that I will be able to use this functionality from PHP.
Because I know PHP is not thread-safe (because of other libraries it links
against), I am wondering what would be the best way to implement this. Right
now I can imagine three solutions:
- Use multiple threads to connect to the databases, but let the library
export a blocking single-threaded API. So, PHP calls a function in the
library, this function spawns new threads, which do the real work. Meanwhile
the function waits for the threads to finish, and when all threads are done
it returns the final result back to PHP.
- Use a single thread and asynchronous socket communication. So, PHP calls
the library function and this function handles all connections within the
same thread using asynchronous communication, and returns the result to PHP
when all communication is completed.
- Use a daemon on localhost. Make a connection from PHP to the daemon,
the daemon handles all the connections to the databases and passes the
result back to the connection made from PHP. (A rough sketch of this
option, from the PHP side, follows below.)
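To illustrate the third option, the PHP side might be no more than this
(the socket path and wire format are made up for the example):

    <?php
    // Hypothetical daemon option, seen from PHP. The daemon behind this
    // socket would fan the query out to all databases and send back the
    // merged rows (the protocol here is invented for the sketch).
    $daemon = stream_socket_client('unix:///var/run/pqd.sock', $errno, $errstr, 5);
    fwrite($daemon, "QUERY SELECT id, value FROM big_table\n");
    $rows = unserialize(stream_get_contents($daemon));
    fclose($daemon);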
Can someone give me some advice about the advantages of one approach over
another? Please keep in mind that I'm hoping for a solution that is both
stable and low in overhead.
Thanks,
Arend.
--
Arend van Beelen jr.
"If you want my address, it's number one at the end of the bar."
I would prefer to have a function which checks whether the requested data
is already available (if it is not, I would still be able to do something
useful while waiting).
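Purely to illustrate the shape of such an API (every pq_* function below is
hypothetical; nothing like this exists yet):

    <?php
    // Hypothetical polling-style API -- the pq_* functions are invented
    // here only to sketch what a non-blocking interface could look like.
    $handle = pq_send_query($databases, 'SELECT id, value FROM big_table');

    while (!pq_result_ready($handle)) {
        // do something useful while the queries are still running,
        // e.g. render parts of the page that don't need this data
    }

    $rows = pq_fetch_all($handle);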
--
Alexey Zakhlestin
http://blog.milkfarmsoft.com/
While I can see the theoretical advantage of this, I wonder how much there is to gain in practice (at least for us, that is).
In our current codebase, when a database query is issued, PHP can only continue once it has the result anyway, so it would require serious code modifications to make use of such functionality. Also, while it may theoretically shorten page load times, our webservers are already constrained by CPU load, so we would probably not be able to get more pageviews out of it either.
-----Original Message-----
From: Alexey Zakhlestin [mailto:indeyets@gmail.com]
Sent: Sat 10-11-2007 11:31
To: Arend van Beelen
CC: internals@lists.php.net
Subject: Re: [PHP-DEV] Making parallel database queries from PHP
Hi,
A few pointers and ideas:
- ext/pgsql has support for asynchronous queries (pg_send_query()
and friends); a small sketch is below
- maybe you can create something out of MySQL Proxy that splits out
a single query into multiple queries and then rejoins them
- since MySQL AB is actively developing a new libmysql replacement
for PHP only, you might want to talk to them about implementing
something like this
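For illustration, the ext/pgsql route can look roughly like this (the
connection strings and the query are placeholders):

    <?php
    // Parallel queries via ext/pgsql's asynchronous API.
    $conns = array(
        pg_connect('host=db1 dbname=app user=web'),
        pg_connect('host=db2 dbname=app user=web'),
    );

    // Send a query on every connection without waiting for results.
    foreach ($conns as $conn) {
        pg_send_query($conn, 'SELECT id, value FROM big_table');
    }

    // Collect the results; the queries have been running in parallel.
    $results = array();
    foreach ($conns as $i => $conn) {
        while (pg_connection_busy($conn)) {
            usleep(1000); // crude polling, good enough for a sketch
        }
        $results[$i] = pg_fetch_all(pg_get_result($conn));
    }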
regards,
Lukas
Hi Lukas,
Spot on! What you're suggesting, splitting queries and combining the results, is exactly the direction I'm thinking in. We did have a short look at MySQL Proxy, though, and basically it does not seem to do the job for us. We want to avoid adding another proxy layer between our webservers and the database servers, as it would just mean additional overhead and possible bottlenecks and points of failure. This is why I want to move this functionality onto the webservers themselves, to achieve minimum overhead and to guarantee it will scale with the number of webservers.
Nevertheless, MySQL Proxy does appear to provide some of the functionality we will be needing and it might indeed be a good idea to contact MySQL and try to reuse some of Proxy's components if possible.
Thanks!
Arend.
Hi Arend -
If your webserver CPUs are already maxed out, that problem won't go away
on its own, but once you've solved that (optimized your code or added
more webservers), the curl_multi_* functions might help you out.
A cheap way to parallelize your database or data-object access is to
implement a kind of service-oriented architecture, where you have one
PHP script* that does little except get data from a database, serialize
that data, and return it to your main PHP script.
The main PHP script uses the curl_multi_init, curl_multi_add_handle,
etc. functions to call this script multiple times in parallel, returning
different data objects for each call.
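A bare-bones version of that fan-out could look like this (the data.php
endpoint and its JSON output are assumptions for the example):

    <?php
    // Fan out two requests in parallel with the curl_multi_* functions.
    $urls = array(
        'http://data-tier.local/data.php?object=user&id=42',
        'http://data-tier.local/data.php?object=orders&user=42',
    );

    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $ch);
        $handles[] = $ch;
    }

    // Drive all transfers; both requests are in flight at the same time.
    $running = 0;
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh); // wait for activity instead of spinning
    } while ($running > 0);

    $results = array();
    foreach ($handles as $ch) {
        $results[] = json_decode(curl_multi_getcontent($ch), true);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);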
Because this introduces latency into the data retrieval trip, it will be
slower for most applications. Some circumstances that might make it
viable include:
- you have > 1 data store
- you have multiple slow queries that aren't interdependent
- you have to do expensive processing on the data you retrieve
- you have lots of slack (CPU, RAM, processes) on the webservers
In its favor - it should take just a couple of hours to prototype. If
you have a single canonical data store, you might find that as soon as
you enable parallel queries against the database, your database becomes
the bottleneck, and throughput doesn't actually increase. This technique
should reveal that as a potential problem without much development cost.
Interested to know how you proceed.
Donal McMullan
Donal @ Catalyst.Net.NZ PO Box 11-053, Manners St, Wellington
WEB: http://catalyst.net.nz/ PHYS: Level 2, 150-154 Willis St
OFFICE: +64(4)803-2372 MOB: +64(21)661-254
*actually - Java's a pretty good option for this tier too.
Hi Donal,
thanks for your suggestion. While I think this approach might provide some quick short-term solutions, there is actually a much bigger problem we are trying to attack. I don't know exactly how much detail I can give, but I will provide some background information to give more insight into the situation...
We are dealing with literally hundreds of webservers and hundreds of database servers, and are expanding both on a frequent basis. Whenever we increase the number of webservers, the databases become our bottleneck, and vice versa. We realize we won't magically solve any of these bottlenecks by introducing parallel querying on the databases.
We have lots of tables which are divided over more than a dozen database clusters, and we are getting more and more tables which become so big they have to be spread out over multiple databases. Because of the distribution of these tables, querying them becomes increasingly hard, and we are approaching a limit where further distribution will become virtually unworkable with our current approach. That approach is to query the various databases serially from PHP and merge the results manually. If we continue down this path, our PHP application will have to do more and more queries serially, and latencies will add up more and more. Not to mention the code maintenance required for finding the correct databases to query and merging all the results.
Therefore we will need parallelization techniques that can transparently handle communication with the databases, to keep our latencies low, but also to relieve our PHP application from having to deal with all the distributed databases.
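To make that concrete, the serial pattern we want to get rid of looks
roughly like this (shard_hosts_for() and the query are stand-ins for real
application code):

    <?php
    // Our current serial scatter-gather, simplified. Every shard adds its
    // full round-trip latency to the page load, one after the other.
    $rows = array();
    foreach (shard_hosts_for('big_table') as $host) { // hypothetical helper
        $conn = mysql_connect($host, 'web', 'secret');
        mysql_select_db('app', $conn);
        $res = mysql_query('SELECT id, value FROM big_table', $conn);
        while ($row = mysql_fetch_assoc($res)) {
            $rows[] = $row; // manual merging of partial results
        }
        mysql_close($conn);
    }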
Thanks!
Arend.
-----Original Message-----
From: Donal McMullan [mailto:donal@catalyst.net.nz]
Sent: Sat 10-11-2007 13:43
To: Arend van Beelen
CC: internals@lists.php.net; Alexey Zakhlestin
Subject: Re: [PHP-DEV] Making parallel database queries from PHP