Hi Everyone,
Thanks for all the insightful feedback so far. I went ahead and wrote a
proper
RFC: https://wiki.php.net/rfc/max_execution_wall_time
Regards:
Máté
Hi Máté,
I wonder if you could give some more detailed examples of what you would use this for.
You write:
[The current functionality] can have serious consequences for distributed systems with high traffic, where finishing a request in a timely manner is essential for avoiding cascading failures.
It feels like "finishing in a timely manner" is rather different from "being forcefully killed after a fixed time limit".
I'm struggling to picture when I'd want a hard wall-time limit, rather than:
- checking the script's duration at key points where I know it can gracefully exit, ensure a consistent state, and return an appropriate message to the caller
- enforcing a network timeout on the calling end so that I don't need to rely on all services cooperatively exiting in good time
Regards,
--
Rowan Tommins
[IMSoP]
Hi Máté
we talked about this before and I think it's a good addition. Two
suggestions to improve the RFC:
The RFC could include more details on how exactly the behavior works in
combination with long-running I/O. Let's say the default socket timeout is
60 seconds and the max execution wall time is 10 seconds. What happens if
an HTTP call via file_get_contents that takes 60 seconds gets started
9 seconds into the request?
I think you should mention that you also need to configure the timeouts of
all I/O operations for an effective strategy, so that one gets a good
picture of how this puzzle piece fits into production.
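To make it concrete, something like the following is what I have in mind on
the stream side (just a sketch; the URL and the numbers are made up):

// Cap how long socket-based streams (the http:// wrapper, fsockopen, etc.)
// may block by default.
ini_set('default_socket_timeout', '5');

// Or set a per-call read timeout through a stream context.
$context = stream_context_create([
    'http' => ['timeout' => 5.0], // seconds
]);

// Without such a timeout, this single call could block far longer than
// whatever wall-time budget the request has left.
$response = file_get_contents('https://example.com/slow-endpoint', false, $context);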
greetings
Benjamin
Hi Rowan and Benjamin,
we talked about this before and I think it's a good addition. Two
suggestions to improve the RFC:
Thanks for the ideas, I'll incorporate more details about the interaction
of these settings. And I've just collected a few examples of when my
proposal would be useful:
https://gist.github.com/kocsismate/dbcc4ba81b27cfb2e25b949723bb7c79 . Their
intention is to illustrate that no matter how tight the individual timeout
settings are, if the number of external calls during the same request is
high enough, nothing (*) can prevent the response time from skyrocketing
until all the workers become busy, causing an outage because the web server
can't serve any new connections.
* except for some more complex or non-clean mechanisms, which are outlined
in the RFC.
I'm struggling to picture when I'd want a hard wall-time limit, rather than:
- checking the script's duration at key points where I know it can
gracefully exit, ensure a consistent state, and return an appropriate
message to the caller
- enforcing a network timeout on the calling end so that I don't need to
rely on all services cooperatively exiting in good time
In my opinion, the main problem with the first suggestion is that in
most cases, non-trivial production code can barely do anything like
"checking the script's duration at key points". If there are enough levels
of abstraction, the outermost layers (e.g. controllers) can't do any
checks; they simply don't have enough control.
On the other hand, making assumptions about the request duration in any
inner layer (e.g. the model, if we take MVC as an example) is a bad idea in
my experience. I admit though that there are some cases when it is possible
to add the checks in question, but I believe that would result in very
noisy code, while a simple ini setting would serve much better.
Do the above examples and this reasoning make sense to you? Also, I'll
improve the wording I used when describing the consequences.
Regards,
Máté
Hi Máté,
I've just collected a few examples of when my proposal would be useful:
https://gist.github.com/kocsismate/dbcc4ba81b27cfb2e25b949723bb7c79[...]
Do the above examples and this reasoning make sense to you? Also, I'll
improve the wording I used when describing the consequences.
The part I'm curious about is not so much when the setting would have an
effect, as when it would be the desired solution.
Your point about the server not being able to serve any more connections
is reasonable, and I can see this being a useful last resort mechanism
to keep the server responding. I'd be likely to treat it similarly to
the Linux OOM killer: if it was ever actually invoked, I would be
investigating how to fix my application.
The wording in your e-mails and RFC suggests you see it as something
you'd use instead of application-level solutions, rather than as well as
them, and that's what I'm trying to picture the use case for.
One scenario I can think of would look a bit like this:
foreach ( $ordersToProcess as $orderDetails ) {
    $result = send_order_to_remote_system($orderDetails);
    save_confirmation_to_db(
        $orderDetails->customerOrderRef,
        $result->supplierOrderRef
    );
}
If the remote system responded very slowly, that loop might take many
times the expected duration. If we simply kill the process when a fixed
wall time has elapsed, we're very likely to create an order on the
remote system, but exit without saving its reference. It is however easy
to see where in that loop we could safely call a routine like
throw_exception_if_time_limit_reached().
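To be concrete, the helper I have in mind would be something like this
(entirely hypothetical names, just a sketch):

function throw_exception_if_time_limit_reached(float $startedAt, float $budgetSeconds): void {
    // Compare elapsed wall time against an application-level budget and bail
    // out with a catchable exception instead of being forcefully killed.
    if (microtime(true) - $startedAt > $budgetSeconds) {
        throw new RuntimeException('Request time budget exhausted');
    }
}

Called at the top of each iteration, it can never fire between sending the
order and saving the confirmation, so the loop always exits in a consistent
state.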
If rather than placing orders, the loop was just gathering search
results, killing the process would be less dangerous, but cleanly
exiting would still be preferable, because we could return a warning and
partial results, rather than a 500 error.
Other scenarios I can think of have similar problems:
- If the database locks up inside a transaction, killing the process
probably won't roll that transaction back cleanly
- If the process was working through a large import of data, it may have
written a bunch of temporary data to disk or holding tables which won't
be cleaned up
In other words, I can only think of cases where "the cure would be worse
than the disease".
Regards,
--
Rowan Tommins
[IMSoP]
Hi Rowan,
I'd be likely to treat it similarly to
the Linux OOM killer: if it was ever actually invoked, I would be
investigating how to fix my application.
I think this is mostly not about application-level issues, but rather
infrastructure-level ones, at least according to our use-cases. For
example, if your application is under a heavy DOS attack, or there are
other kinds of network connection problems, then your service(s) may
experience slow database/cache/API response times. It is also possible that
a 3rd party API you depend on faces such issues. All these scenarios could
severely harm the availability of your application, unless you have a hard,
wall-clock time based timeout as a way to short-circuit too-slow responses.
So we are not talking only about application (design) issues, like the n+1
query problem.
If the remote system responded very slowly, that loop might take many
times the expected duration. If we simply kill the process when a fixed
wall time has elapsed, we're very likely to create an order on the
remote system, but exit without saving its reference. It is however easy
to see where in that loop we could safely call a routine like
throw_exception_if_time_limit_reached().
In my opinion, if the proposed ini setting causes consistency issues for an
application, then it is already vulnerable to other factors which can make
it halt execution at random places: fatal errors, power outages, etc.
I think developers of distributed systems should be aware of this - and I
think they usually are; just take the CAP theorem - so they have to accept
and consider these risks. Please also note that "max_execution_time"
already measures wall time on a few platforms, so we already have a
precedent for the proposed behavior.
If rather than placing orders, the loop was just gathering search
results, killing the process would be less dangerous, but cleanly
exiting would still be preferable, because we could return a warning and
partial results, rather than a 500 error.
If returning a 50x response with a custom message is a requirement, then
sure, developers can just ignore the new ini setting. Although, e.g. Apache
and nginx already allow custom error pages, and I think that should be good
enough for most use-cases.
- If the database locks up inside a transaction, killing the process
probably won't roll that transaction back cleanly
Since there can be many other causes of a killed process, I think this
particular problem is unrelated to my proposal, and if such a thing
happens, then it's a bug in the database server. Also, please be aware that
the timeout is a clean shutdown mechanism, so shutdown handlers and the
already mentioned RSHUTDOWN functions are triggered. On the other hand,
FPM's timeout doesn't invoke any of them.
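For example (just a sketch), a handler registered like this still runs
during the clean shutdown triggered by the engine's timeout, while an FPM
kill skips it entirely:

register_shutdown_function(function () {
    // Runs during the clean shutdown triggered by the execution time limit,
    // so it can log the aborted request or release application-level locks.
    // It is not run when FPM forcefully terminates the worker.
    error_log('Request was terminated by the execution time limit');
});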
In other words, I can only think of cases where "the cure would be worse
than the disease".
To be honest, my impression is that you either underestimate the "disease"
or overestimate the current "cures". Speaking about the latter, nginx +
FPM, one of the most popular web server setups (if not the most popular
one), doesn't provide an easy-to-use and safe way to shut down execution
after a dynamically configurable amount of time.
While one can use the "if (time() - $startTime > $timeout) { /* ... */ }"
based approach instead, this won't scale well when used in non-trivial
codebases. Thus, I believe my suggestion offers a more convenient, safer,
and more universal way to solve the underlying problem of controlling the
real execution time than the currently available options do.
Regards:
Máté
Hi Máté,
I think this is mostly not about application-level issues, but rather
infrastructure-level ones, at least according to our use-cases.
I think we may actually just be saying the same thing in different
terms: in this message, you refer to a "heavy DOS attack", which I
totally agree is the kind of scenario where "just kill some processes
and hope that's enough to ride out the storm" would be useful.
Some of your earlier messages, though, made it sound like you were
relying on it as an everyday thing - you talked about seeing
"regressions" during your migration, for instance. But maybe I just read
too much into that wording?
In my opinion, if the proposed ini setting causes consistency issues
for an application, then it is already vulnerable to other factors
which can make it halt execution at random places: fatal errors, power
outages, etc.
Certainly, I would like my application to be robust to a power outage
in the middle of a request; but I would prioritise that robustness based
on how likely it is to happen. If one request in a billion is terminated
unexpectedly, I will probably take certain risks; if it's going to
happen on a regular basis, I'm going to have to spend a lot more time on
that set of problems.
So, again, it comes down to whether this is a "last resort" setting, or
something you'd expect to be invoked regularly.
I think developers of distributed systems should be aware of this -
and I think they usually are
Actually, this may be part of the confusion: I'm not sure what you're
referring to by "distributed systems", so don't know whether I'm
included in the set of developers you're picturing here.
If returning a 50x response with a custom message is a requirement,
then sure, developers can just ignore the new ini setting.
Although, e.g. Apache and nginx already allow custom error pages, and
I think that should be good enough for most use-cases.
You missed the key part of the quote you replied to here, which is
"return partial results". My point was not about the formatting of the
error, but that its content might need to be application-specific.
Also, please be aware that the timeout is a clean shutdown mechanism,
so shutdown handlers and the already mentioned RSHUTDOWN functions are
triggered.
It might be useful to expand on this in the RFC, remembering that
"RSHUTDOWN" doesn't mean anything to a userland developer. Will
"finally" blocks be run? Destructors? Will the error handler be invoked?
I know the answers are probably "the same as the current timeout
setting", but spelling it out would help to picture when this feature
might or might not be useful.
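For example, a tiny probe script like this (an untested sketch) run with a
one-second limit would answer most of those questions at once:

set_time_limit(1);

class Probe {
    public function __destruct() { echo "destructor ran\n"; }
}

register_shutdown_function(function () { echo "shutdown function ran\n"; });
set_error_handler(function (int $errno, string $errstr): bool {
    echo "error handler ran: $errstr\n";
    return false;
});

$probe = new Probe();
try {
    while (true) {
        // Busy-wait until the timeout fires.
    }
} finally {
    echo "finally block ran\n";
}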
Regards,
--
Rowan Tommins
[IMSoP]
Also, please
be aware that the timeout is a clean shutdown
mechanism, so shutdown handlers and the already
mentioned RSHUTDOWN functions are triggered. On the other hand, fpm's
timeout doesn't invoke any of them.
Tangentially, can this be considered a bug in FPM's handling? I appreciate
the speed boost FPM brought over CGI, but the more I work with it the less
I like the way it functions (but that is a separate conversation).
Peter
Hi Rowan and Peter,
Tangentially, can this be considered a bug in FPM's handling? I appreciate
the speed boost FPM brought over CGI, but the more I work with it the less
I like the way it functions (but that is a separate conversation).
No, I don't think so. Since FPM terminates the child process by sending a
SIGTERM (rather than a SIGKILL), it seems to do the right thing based on
the source code. But I'd appreciate it if someone more familiar with FPM or
process management could verify my assumptions.
Actually, this may be part of the confusion: I'm not sure what you're
referring to by "distributed systems", so don't know whether I'm
included in the set of developers you're picturing here.
My definition of "distributed system" is when there is a network boundary
between components of an application. Practically speaking, this is the case
when the application code and the database/cache/sessions etc. are on
different servers.
So, again, it comes down to whether this is a "last resort" setting, or
something you'd expect to be invoked regularly.
Yes, I certainly imagine it as a last-resort possibility which could
supersede FPM's "request_terminate_timeout" or the other external timeout
mechanisms that are currently in use. In fact, "max_execution_wall_time" is
not effective alone; it has to be used in cooperation with other timeout
settings either on the caller (PHP) or the callee (e.g. database) side,
since external calls cannot be cancelled midway by PHP. That said, setting
cURL, socket, etc. timeouts is still highly encouraged as a regular
countermeasure. My purpose with this RFC is to improve the last safety net
we have (the real execution timeout), which I find suboptimal in its
current form.
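To make the layering explicit, this is roughly what I mean (a sketch with
made-up numbers and URL):

$ch = curl_init('https://api.example.com/orders');
// First line of defense: per-call timeouts on the client side.
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 2);
curl_setopt($ch, CURLOPT_TIMEOUT, 5);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);
// Last line of defense: the proposed wall-clock limit caps the request as
// a whole, for when many individually "fast enough" calls add up.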
You missed the key part of the quote you replied to here, which is
"return partial results". My point was not about the formatting of the
error, but that its content might need to be application-specific.
Ah, sorry, I really did miss it! I think my answer still holds though: if
there is such a requirement, developers can ignore this ini setting (or
explicitly set its value to 0, if necessary). This is probably a good
example of why we should not make wall-clock time based measurement the
default for "max_execution_time". Otherwise, I think my proposal would
offer a solution to 90% of the problems we currently have due to the way
"max_execution_time" works.
It might be useful to expand on this in the RFC, remembering that
"RSHUTDOWN" doesn't mean anything to a userland developer. Will
"finally" blocks be run? Destructors? Will the error handler be invoked?
I know the answers are probably "the same as the current timeout
setting", but spelling it out would help to picture when this feature
might or might not be useful.
I absolutely agree, and I'll try to elaborate as soon as I touch the RFC
again, most probably today. Your answer is correct, it is "the same as the
current timeout setting". However, I was considering the possibility of
throwing an exception in case of the new timeout, but I think this task
would be more suitable for a PHP 9.0 release, where we could convert fatal
errors to exceptions where it makes sense.
Máté
Hi Everyone,
Thanks for all the insightful feedback so far. I went ahead and wrote a
proper RFC: https://wiki.php.net/rfc/max_execution_wall_time
Regards:
Máté
Something potentially worth pointing out (and assuming I'm inferring the
correct behavior here): If max_execution_wall_time is exceeded during an
internal function call (which seems quite likely, as that's where there is
the most potential for something to hang) and the function does not return
within hard_timeout seconds, then a process abort will be triggered.
The hard_timeout is 2s by default. If any of the individual call timeouts
are >= 2s, then it's not unlikely that this situation occurs.
Regards,
Nikita
Hi Nikita,
Something potentially worth pointing out (and assuming I'm inferring the
correct behavior here): If max_execution_wall_time is exceeded during an
internal function call (which seems quite likely, as that's where there is
the most potential for something to hang) and the function does not return
within hard_timeout seconds, then a process abort will be triggered.
The hard_timeout is 2s by default. If any of the individual call timeouts
are >= 2s, then it's not unlikely that this situation occurs.
Thanks for bringing this to my attention, I didn't realize this
consequence. I'm wondering though whether it would be a sane idea to use
the current, CPU-time-based timer for hard_timeout even when
max_execution_wall_time times out? This way the process abort could be
avoided for most of the functions in question, if I got it right.
Nevertheless, I'll update the RFC with its relation to hard_timeout in the
coming days.
As I'd like to proceed with this RFC soon, I'd appreciate any review and
constructive feedback. For example, I have two questions:
- Is max_execution_wall_time really the best setting name we can come up
with? What about something like max_execution_real_time?
- Would anyone miss a set_wall_time_limit() function?
Regards:
Máté