Hello,
I am seeing a strange performance drop in the PHP components on our
production servers.
It usually happens after 1-3 days of uptime: the response time suddenly
doubles. I have already tried to isolate any unusual requests, but
found nothing. We handle 2k to 4k RPM; the stack is PHP-FPM 7.0.x
(currently 7.0.12, I just updated) running in Docker containers based
on the official PHP-FPM image.
The usual response time is ~25ms, and it surges to ~50ms after this
"event". Here's a typical graph of the issue :
http://i.imgur.com/3WaNBL4.png
After restarting the Docker containers (and thus the PHP-FPM
processes), everything works fine again.
I don't really know how to investigate further. If you have any
pointers on how to figure out what's wrong, I'd love to try them.
Thanks,
Regards,
Jérémie
--
Jérémie BORDIER
On Fri, Oct 14, 2016 at 9:15 PM, Jérémie BORDIER jeremie.bordier@gmail.com
wrote:
I don't really know how to investigate further. If you have any
pointers on how to figure out what's wrong, I'd love to try them.
I would break out the Linux perf command for something like this.
Run it like this:
perf record -p <pid_of_a_php-fpm_process> -g
Let that run for an arbitrary amount of time; start with 10 seconds or
so. Then:
perf report -n
That should give you a nice report of what your php-fpm process did for
those 10 seconds. Now do the same thing when you see the problem and
compare the reports.
-Rasmus
Hello Rasmus,
Thank you for the insight. I ran exactly what you suggested on the
very same php-fpm process: once just after restarting it almost two
days ago, while response times were fast, and once just now, after the
slowdown triggered during the night.
The output of the perf diff is not very informative, I think; here are the main lines:
35.90%  +44.94%  php-fpm            [.] 0x0000000000042412
10.72%   -6.05%  libc-2.19.so       [.] 0x0000000000079030
 9.71%   -9.34%  newrelic.so        [.] 0x0000000000030980
 3.81%   -3.47%  [kernel.kallsyms]  [k] ipt_do_table
 2.56%   -2.02%  [kernel.kallsyms]  [k] nf_iterate
 2.32%   -1.99%  opcache.so         [.] 0x0000000000008cec
 1.44%           [kernel.kallsyms]  [k] update_cfs_shares
 1.42%           [kernel.kallsyms]  [k] __bpf_prog_run
 1.28%           [vdso]             [.] __vdso_gettimeofday
How could I provide more info?
Thanks,
Best regards,
Jérémie
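The diff columns above (baseline %, delta vs. the second profile, DSO, symbol) can also be compared programmatically across many runs. A minimal sketch, assuming the default perf diff column layout; `parse_diff_line` is a hypothetical helper, not part of perf:

```python
import re

# Sketch: parse one line of `perf diff` output into its fields.
# Assumed column layout: baseline %, optional delta %, DSO, [.]/[k], symbol.
LINE_RE = re.compile(
    r"^\s*(?P<base>[\d.]+)%\s+"        # baseline percentage
    r"(?:(?P<delta>[+-][\d.]+)%\s+)?"  # delta vs. the second profile (may be absent)
    r"(?P<dso>\S+)\s+"                 # shared object (DSO)
    r"\[(?P<mode>[.k])\]\s+"           # [.] = user space, [k] = kernel
    r"(?P<symbol>\S+)"                 # symbol, or a raw address if unresolved
)

def parse_diff_line(line):
    """Return a dict of fields for one perf-diff line, or None if it doesn't match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    d = m.groupdict()
    d["base"] = float(d["base"])
    d["delta"] = float(d["delta"]) if d["delta"] else 0.0
    return d
```

Note that the raw hex symbols (e.g. 0x0000000000042412) mean perf could not resolve symbols for the php-fpm binary; a build with symbols (not stripped) would make the report far more readable.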
Yes, not much to go on there. But newrelic.so stands out to me. Can you try
removing newrelic and see if it still happens?
If it does, your next step is to look at the timing of a request. I usually
do that with something like this:
sudo strace -Ff -ttt -T -p <pid> 2>&1 | tee strace_output_file
Do that on a fast process and on a slow one, letting each run long
enough that you think you have captured similar requests in both. Then
compare the timing between syscalls in the two traces and see whether
the delay is isolated to one part of the request or whether it is an
overall slowdown.
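To make the syscall-gap comparison less tedious, here is a minimal sketch that scans the `strace -ttt -T` output for the largest gaps between consecutive entries; `largest_gaps` is a hypothetical helper, and the line format is assumed from strace's -ttt (epoch timestamps) and -T (per-call duration) options:

```python
import re

# Sketch: find the largest time gaps between consecutive entries in
# `strace -ttt -T` output. A large gap between syscalls is time the process
# spent in userland (e.g. executing PHP code) rather than in the kernel.
ENTRY_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?(?P<ts>\d+\.\d+)\s")

def largest_gaps(lines, top=3):
    """Return up to `top` (gap_seconds, line) pairs, largest gap first."""
    gaps = []
    prev_ts = None
    for line in lines:
        m = ENTRY_RE.match(line)
        if m is None:
            continue
        ts = float(m.group("ts"))
        if prev_ts is not None:
            gaps.append((ts - prev_ts, line.strip()))
        prev_ts = ts
    gaps.sort(key=lambda g: g[0], reverse=True)
    return gaps[:top]
```

One caveat: with -f, entries from different pids interleave, so gaps across pids can be misleading; filter the trace down to a single pid before measuring.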
I would also run:
vmstat 1
both before and after. This gives you a picture of the overall health of
your server. The si/so columns, for example, will tell you if you are
swapping. If something has eaten a ton of memory and your system has
started swapping more than before, that could also explain this.
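The si/so check can be scripted as well. A minimal sketch, assuming the default `vmstat` column order (si and so are the 7th and 8th fields of each data row); `swapping` is a hypothetical helper:

```python
# Sketch: flag swap activity in `vmstat 1` output. Assumes the default
# column order, where si (swap-in) and so (swap-out) are the 7th and 8th
# fields of each data row.
def swapping(vmstat_lines):
    """Return True if any data row shows nonzero swap-in or swap-out."""
    for line in vmstat_lines:
        fields = line.split()
        # Header rows start with non-numeric text ("procs", "r"); skip them.
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        si, so = int(fields[6]), int(fields[7])
        if si > 0 or so > 0:
            return True
    return False
```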
-Rasmus