Hello,
I am seeing a strange performance drop in the PHP components on our
production servers.
It usually happens after 1-3 days of uptime: the response time suddenly
doubles. I have already tried to isolate any unusual requests, but
found nothing. We handle 2k to 4k RPM; the stack is PHP-FPM 7.0.x
(currently 7.0.12, I just updated) running in Docker containers based
on the official PHP-FPM image.
The usual response time is ~25ms, and it surges to ~50ms after this
"event". Here's a typical graph of the issue :
http://i.imgur.com/3WaNBL4.png
After restarting the Docker containers (and thus the PHP-FPM
processes), everything works fine again.
I don't really know how to investigate further. If you have any
pointers on how to figure out what's wrong, I'd love to try them.
Thanks,
Regards,
Jérémie
--
Jérémie BORDIER
On Fri, Oct 14, 2016 at 9:15 PM, Jérémie BORDIER jeremie.bordier@gmail.com
wrote:
I don't really know how to investigate further. If you have any
pointers on how to figure out what's wrong, I'd love to try them.
I would break out the Linux perf command for something like this.
Run it like this:
perf record -p <pid_of_a_php-fpm_process> -g
Let that run for an arbitrary amount of time; start with 10 seconds or
so. Then:
perf report -n
That should give you a nice report of what your php-fpm process did for
those 10 seconds. Now do the same thing when you see the problem and
compare the reports.
-Rasmus
Hello Rasmus,
Thank you for the insight. I ran exactly what you suggested on the
very same php-fpm process: once just after restarting it almost two
days ago, while response times were fast, and once just now, after the
slowdown triggered during the night.
The output of the perf diff is not very informative, I think; here are the main lines:
35.90%  +44.94%  php-fpm            [.] 0x0000000000042412
10.72%   -6.05%  libc-2.19.so       [.] 0x0000000000079030
 9.71%   -9.34%  newrelic.so        [.] 0x0000000000030980
 3.81%   -3.47%  [kernel.kallsyms]  [k] ipt_do_table
 2.56%   -2.02%  [kernel.kallsyms]  [k] nf_iterate
 2.32%   -1.99%  opcache.so         [.] 0x0000000000008cec
 1.44%           [kernel.kallsyms]  [k] update_cfs_shares
 1.42%           [kernel.kallsyms]  [k] __bpf_prog_run
 1.28%           [vdso]             [.] __vdso_gettimeofday
How could I provide more info?
Thanks,
Best regards,
Jérémie
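The diff columns above (baseline %, delta vs. the second profile, DSO, symbol) can also be compared programmatically across many runs. A minimal sketch, assuming the default perf diff column layout; `parse_diff_line` is a hypothetical helper, not part of perf:

```python
import re

# Sketch: parse one line of `perf diff` output into its fields.
# Assumed column layout: baseline %, optional delta %, DSO, [.]/[k], symbol.
LINE_RE = re.compile(
    r"^\s*(?P<base>[\d.]+)%\s+"        # baseline percentage
    r"(?:(?P<delta>[+-][\d.]+)%\s+)?"  # delta vs. the second profile (may be absent)
    r"(?P<dso>\S+)\s+"                 # shared object (DSO)
    r"\[(?P<mode>[.k])\]\s+"           # [.] = user space, [k] = kernel
    r"(?P<symbol>\S+)"                 # symbol, or a raw address if unresolved
)

def parse_diff_line(line):
    """Return a dict of fields for one perf-diff line, or None if it doesn't match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    d = m.groupdict()
    d["base"] = float(d["base"])
    d["delta"] = float(d["delta"]) if d["delta"] else 0.0
    return d
```

Note that the raw hex symbols (e.g. 0x0000000000042412) mean perf could not resolve symbols for the php-fpm binary; a build with symbols (not stripped) would make the report far more readable.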
Yes, not much to go on there. But newrelic.so stands out to me. Can you try
removing newrelic and see if it still happens?
If it does, your next step is to look at the timing of a request. I usually
do that with something like this:
sudo strace -Ff -ttt -T -p <pid> 2>&1 | tee strace_output_file
Do that on a fast process and on a slow one, letting each run long
enough that you think you have captured similar requests in both. Then
compare the timing between syscalls in the two traces and see whether
the delay is isolated to one part of the request or whether it is an
overall slowdown.
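To make the syscall-gap comparison less tedious, here is a minimal sketch that scans the `strace -ttt -T` output for the largest gaps between consecutive entries; `largest_gaps` is a hypothetical helper, and the line format is assumed from strace's -ttt (epoch timestamps) and -T (per-call duration) options:

```python
import re

# Sketch: find the largest time gaps between consecutive entries in
# `strace -ttt -T` output. A large gap between syscalls is time the process
# spent in userland (e.g. executing PHP code) rather than in the kernel.
ENTRY_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?(?P<ts>\d+\.\d+)\s")

def largest_gaps(lines, top=3):
    """Return up to `top` (gap_seconds, line) pairs, largest gap first."""
    gaps = []
    prev_ts = None
    for line in lines:
        m = ENTRY_RE.match(line)
        if m is None:
            continue
        ts = float(m.group("ts"))
        if prev_ts is not None:
            gaps.append((ts - prev_ts, line.strip()))
        prev_ts = ts
    gaps.sort(key=lambda g: g[0], reverse=True)
    return gaps[:top]
```

One caveat: with -f, entries from different pids interleave, so gaps across pids can be misleading; filter the trace down to a single pid before measuring.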
I would also run:
vmstat 1
both before and after. This gives you a picture of the overall health of
your server. The si/so columns, for example, will tell you if you are
swapping. If something has eaten a ton of memory and your system has
started swapping more than before, that could also explain this.
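The si/so check can be scripted as well. A minimal sketch, assuming the default `vmstat` column order (si and so are the 7th and 8th fields of each data row); `swapping` is a hypothetical helper:

```python
# Sketch: flag swap activity in `vmstat 1` output. Assumes the default
# column order, where si (swap-in) and so (swap-out) are the 7th and 8th
# fields of each data row.
def swapping(vmstat_lines):
    """Return True if any data row shows nonzero swap-in or swap-out."""
    for line in vmstat_lines:
        fields = line.split()
        # Header rows start with non-numeric text ("procs", "r"); skip them.
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        si, so = int(fields[6]), int(fields[7])
        if si > 0 or so > 0:
            return True
    return False
```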
-Rasmus