[RFC] Measuring maximum execution time based on wall-time

4 years ago by kocsismate90@gmail.com

Dear Internals,

Our company is currently migrating from HHVM to PHP, and we are
experiencing some performance regressions because the
max_execution_time ini setting has different semantics in the two runtimes.

By default, HHVM measures wall-time, but the behaviour can be controlled by
an ini setting (
https://github.com/facebook/hhvm/commit/9a9b42e3610cdf242f16ddb8936ce34adfa0be9e)
if compatibility with PHP is important. On the other hand, PHP measures the
CPU time on most systems (with the most notable exception of Windows),
which means that neither sleep(), nor network/system calls are counted
towards the timeout. This is really a big pain for distributed systems with
high traffic, where proper timeout settings prevent cascading failures.
Even if all the external calls have their own timeout settings, the script
itself has no control over the real execution time (e.g. there can be
dozens/hundreds of such calls).
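
To make the difference between the two metrics concrete, here is a minimal sketch in Python (not PHP, and not taken from php-src). It assumes that Python's time.process_time() is a fair stand-in for the CPU-time clock and time.perf_counter() for the wall clock; the helper names measure, sleepy and busy are made up for the illustration.

```python
import time


def measure(work):
    """Run work() and return (wall_seconds, cpu_seconds) spent in it."""
    wall_start = time.perf_counter()  # wall clock: keeps ticking while blocked
    cpu_start = time.process_time()   # CPU clock: user + system time of this process
    work()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


def sleepy():
    # Blocks without using the CPU, like sleep() or waiting on a socket.
    time.sleep(0.2)


def busy():
    # Spins until roughly 0.2 seconds of CPU time have been consumed.
    deadline = time.process_time() + 0.2
    while time.process_time() < deadline:
        pass


sleep_wall, sleep_cpu = measure(sleepy)
busy_wall, busy_cpu = measure(busy)
print(f"sleep: wall={sleep_wall:.2f}s cpu={sleep_cpu:.2f}s")
print(f"busy:  wall={busy_wall:.2f}s cpu={busy_cpu:.2f}s")
```

The sleeping call advances the wall clock but barely touches the CPU clock, so a limit based on CPU time would not notice it, while the busy loop advances both.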

That's why I'd like to add support for measuring the execution timeout
based on the wall-time. There are a couple of ways to approach the problem
though:

1. by measuring wall-time on all platforms
2. by adding a new "max_execution_time_type" or so ini setting for
   optionally changing the meaning of max_execution_time (this is what HHVM is
   doing)
3. by adding a new "max_execution_wall_time" ini setting for being able to
   timeout based on both the real execution time and the CPU time.

My POC implementation at https://github.com/php/php-src/pull/6504 currently
uses the third solution, but I would be okay with the other possibilities
as well (especially with the first one). I would also be very curious if
anyone is aware of the reasons why the CPU time metric was chosen back
then? In my opinion, wall-time is much more useful, but maybe I'm just
missing some technical limitations (?).

Please note that wall-time timeouts would take effect on a best-effort
basis, only after the network/system call exceeding the time limit has
finished.
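
As an illustration of what "best effort" means here, the following Python sketch checks a wall-clock budget only between blocking calls. It is an analogy under stated assumptions, not the logic of the POC: run_calls, slow_call and WallTimeLimitExceeded are hypothetical names, and time.sleep() stands in for a network or system call.

```python
import time


class WallTimeLimitExceeded(Exception):
    """Raised when the wall-clock budget has been used up."""


def run_calls(calls, limit_seconds):
    """Run blocking calls in order, checking the wall clock between them.

    A call that is already running is never interrupted; the limit is
    only enforced once that call has returned.
    """
    start = time.monotonic()
    completed = 0
    for call in calls:
        call()  # blocks for as long as it needs
        completed += 1
        elapsed = time.monotonic() - start
        if elapsed > limit_seconds:
            raise WallTimeLimitExceeded(
                f"{completed} call(s) finished after {elapsed:.2f}s, "
                f"limit was {limit_seconds:.2f}s"
            )
    return completed


def slow_call():
    # Stand-in for an external call that takes longer than the limit.
    time.sleep(0.15)


try:
    run_calls([slow_call, slow_call, slow_call], limit_seconds=0.1)
except WallTimeLimitExceeded as exc:
    print("timed out:", exc)
```

With a 0.1 second limit and calls of 0.15 seconds each, the first call still runs to completion; the timeout fires right after it, and the remaining calls are never started.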

Regards,
Máté

4 years ago by Rowan Tommins

> I would also be very curious if
> anyone is aware of the reasons why the CPU time metric was chosen back
> then? In my opinion, wall-time is much more useful, but maybe I'm just
> missing some technical limitations (?).

For most users, the max execution time is not really a precise metric,
but a backstop to guard against a process running forever - for
instance, it will trap an accidental infinite loop (accidental infinite
recursion, meanwhile, is often caught by the memory limit in my experience).

For that use case, it can actually be desirable to exclude things like
database and network calls (i.e. to use CPU time rather than wall time)
because they will vary for completely different reasons and on different
scales.

For instance, if an SQL query takes 30 seconds, you might want to log
that and work on optimising it, but your application is still
functional. However, if a normal web page uses 30 seconds of CPU time in
the main thread, something has probably gone seriously wrong, and you
probably want to kill the process to stop it using all the server's
resources.

I think your proposal to allow both limits to be set independently is
sensible, and imagine that most users will want a much higher clock-time
limit than the CPU-time limit.

Regards,

--
Rowan Tommins (né Collins)
[IMSoP]

4 years ago by Andreas Leathley

> That's why I'd like to add support for measuring the execution timeout
> based on the wall-time. There are a couple of ways to approach the problem
> though:
>
> 1. by measuring wall-time on all platforms
>
> 2. by adding a new "max_execution_time_type" or so ini setting for
>    optionally changing the meaning of max_execution_time (this is what HHVM is
>    doing)
>
> 3. by adding a new "max_execution_wall_time" ini setting for being able to
>    timeout based on both the real execution time and the CPU time.
>
> My POC implementation at https://github.com/php/php-src/pull/6504 currently
> uses the third solution, but I would be okay with the other possibilities
> as well (especially with the first one). I would also be very curious if
> anyone is aware of the reasons why the CPU time metric was chosen back
> then? In my opinion, wall-time is much more useful, but maybe I'm just
> missing some technical limitations (?).

For my applications the current behavior is the more important one, but
implementing both (and being able to set both limits independently)
would be an interesting improvement.

Besides hard limits, something similar to FPM's
request_slowlog_timeout in PHP itself would be a useful addition in my
opinion: a way to detect slow requests/scripts and report them, as that can
be an early warning and something worth analyzing. Basically, set a time
limit for either CPU or wall time, or both, and if that limit is reached call
a PHP callable to report it or handle it in some way (similar to how
pcntl_signal can act on signals in an async way). This would open up
more options, as the current max_execution_time or a new
max_execution_wall_time would be a last resort, but most of the time I
would rather know about a problem early on and log it.
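
A rough sketch of the soft-limit idea, in Python rather than PHP: two independent limits (CPU and wall time) that call a reporting callback once when passed, without aborting anything. Note the assumptions: SoftLimits and its poll() method are hypothetical names, and the check is cooperative (the script has to call poll() itself), which is simpler than the async, signal-driven delivery described above.

```python
import time


class SoftLimits:
    """Report (but do not abort) when a CPU or wall-time soft limit is passed."""

    def __init__(self, cpu_limit, wall_limit, on_exceeded):
        # A limit of None disables that check.
        self.limits = {"cpu": cpu_limit, "wall": wall_limit}
        self.clocks = {"cpu": time.process_time, "wall": time.monotonic}
        self.started = {kind: clock() for kind, clock in self.clocks.items()}
        self.on_exceeded = on_exceeded
        self.reported = set()

    def poll(self):
        """Check both limits; call on_exceeded at most once per limit."""
        for kind, limit in self.limits.items():
            if limit is None or kind in self.reported:
                continue
            used = self.clocks[kind]() - self.started[kind]
            if used > limit:
                self.reported.add(kind)
                self.on_exceeded(kind, used, limit)


reports = []
limits = SoftLimits(
    cpu_limit=5.0,
    wall_limit=0.05,
    on_exceeded=lambda kind, used, limit: reports.append(kind),
)
time.sleep(0.1)  # stands in for a slow external call
limits.poll()
limits.poll()    # already reported, so the callback is not called again
print(reports)
```

Here only the wall-time limit is reported, since sleeping uses almost no CPU time. A hard limit could still sit behind this as the last resort.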