Hello,
We are using PHP for our application backends. This works very well because we have developed a simple way to clone them with minimal effort (they can be very similar). For orchestration we are using Kubernetes (>= 1.21). Our application pod generally contains NGINX, php-fpm and Fluent Bit for log shipping. We generally want to have a livenessProbe[1] (put simply, this is a check that is run against our pod to verify that it is alive; if it fails, the particular container is restarted).
This works very well (we are also using Swoole, which is roughly 70-80% better), but in certain unstable situations we see higher application latency (a DB problem or a bug in our application). We then often experience problems because pods are falsely marked as dead (failed liveness probe and restart by the kubelet). This happens when all processes in our static pool are allocated to application requests. For our livenessProbe we tried to use both the fpm.ping and fpm.status endpoints, but both of them behave in the same way because they are handled by the worker processes.
I had a look at the php-src repo to see if, for example, we could use signals to verify that the application server is running as a way to work around our issue. While looking at this I saw fpm-systemd.c, which implements a systemd-specific check: it reports the FPM status to systemd every couple of seconds (configurable). Would you be willing to integrate a similar feature for Kubernetes? It would be based on a pull model, probably with a REST interface.
My idea is the following:
- During startup, if this is enabled, the php-fpm master will open a secondary port pm.health_port (9001) and listen on pm.health_path (/healthz)[2].
- If it receives a GET request, the fpm master process will respond with HTTP code 200 and the string "ok". If anything is wrong, it will respond differently or not at all (we can later add checks/metrics to make sure FPM is in a good state). If we do not get a response or FPM is not OK, our livenessProbe will fail and, based on the configuration, this will trigger a container restart.
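For illustration, a rough sketch of how such an endpoint could be consumed from the pod spec (pm.health_port and pm.health_path are only the proposed directives above, they do not exist in php-fpm today; the probe timings are arbitrary):

  livenessProbe:
    httpGet:
      path: /healthz
      port: 9001
    initialDelaySeconds: 5
    periodSeconds: 10
    failureThreshold: 3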
Would you be interested in integrating a feature like this? Or is there any other way we can achieve similar results?
[1] https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#when-should-you-use-a-liveness-probe
[2] https://kubernetes.io/docs/reference/using-api/health-checks/
Best Regards,
Adam.
Adam Hamšík
Co-founder & CEO
Mobile: +421-904-937-495
www.lablabs.io
Hi Adam,
While I believe that improvements for health checking and other metrics can be added to php-fpm to expose internal status and statistics, I don't know too much about that part, so I want to first discuss the problem you mentioned and the approach.
Based on my experience, it is best to have the health check always go through the application.
You mentioned "certain unstable situations when we see higher application
latency (db problem or a bug in our application)".
Taking these two examples:
- "DB problems". I'm guessing you mean higher latency from the database. The health check itself should of course not connect to the database, so the actual execution of the health check should not be impacted. But you probably mean that requests are piling up because php-fpm cannot handle them as fast as they arrive due to the limited number of child processes. One solution here would be to configure a second listening pool in php-fpm just for the health endpoint, with 1 or 2 child processes, and configure nginx to use it for that specific path (see the sketch after these two examples).
- "A bug in our application". I'm guessing you mean a bug that causes high CPU usage. If the issue is visible immediately once the pod starts, it is good to have the health check fail so that the deployment rollout fails and you avoid bringing bugs into production. If the issue is visible later, some time after the pod starts, I'm thinking this could happen due to a memory leak; a pod restart due to a failed health check would also make sure production stays healthy.
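To make that concrete, here is a minimal sketch of such a dedicated pool and the matching nginx location (the pool name, socket path, /healthz path and front controller location are just assumptions for illustration, not something php-fpm or nginx prescribes):

  ; health.conf - dedicated php-fpm pool reserved for the health endpoint
  [health]
  user = www-data
  group = www-data
  listen = /run/php-fpm-health.sock
  pm = static
  pm.max_children = 2

  # nginx: send only the health path to the dedicated pool,
  # everything else keeps using the main application pool
  location = /healthz {
      access_log off;
      include fastcgi_params;
      fastcgi_param SCRIPT_FILENAME /var/www/app/public/index.php;
      fastcgi_pass unix:/run/php-fpm-health.sock;
  }

This way the probe still executes the application's health endpoint, but it can never be starved by the main pool being busy.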
Having the health check pass through the application makes sure it's actually working.
Based on my experience, it's good to include in the health check all the application bootstrapping that is local and avoid any I/O such as the database, memcache and others.
That way, a missed production configuration dependency that prevents the application from starting up properly would block the deployment rollout and keep uptime high.
A health check that does not use the actual application would report it as healthy even though it cannot handle requests.
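As a rough illustration of that idea, a health endpoint could look something like this (file names and paths are hypothetical; the point is that it exercises the local bootstrapping - autoloader and configuration - without touching the database or memcache):

  <?php
  // public/healthz.php - hypothetical health endpoint script.
  // It exercises local bootstrapping (autoloader, configuration) but does no I/O.

  require __DIR__ . '/../vendor/autoload.php';              // autoloader must be intact
  $config = require __DIR__ . '/../config/parameters.php';  // configuration must parse

  header('Content-Type: text/plain');
  if (is_array($config)) {
      http_response_code(200);
      echo 'ok';
  } else {
      http_response_code(500);
      echo 'config error';
  }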
If I have understood something differently, or there are other cases you encountered where you think a health check not going through the app helps, please share them so we can learn about it.
Regards,
Alex
Hi Alexander,
Please see my answers below.
Best Regards,
Adam.
Adam Hamšík
Co-founder & CEO
Mobile: +421-904-937-495
www.lablabs.io
Hi Adam,
While I believe that improvements for health checking and other metrics can be added to php-fpm to expose internal status and statistics, I don't know too much about that part, so I want to first discuss the problem you mentioned and the approach.
Based on my experience, it is best to have the health check always go through the application.
You mentioned "certain unstable situations when we see higher application latency (db problem or a bug in our application)".
Taking these two examples:
"db problems". I'm guessing you mean, higher latency from the database.
In case of the health check, you should not connect to the database, of course so the actual execution of the healthcheck should not be impacted.
But probably you mean that more requests are piling up as php-fpm is not able to handle them as fast as they are coming due to limited child processes.
One solution here would be to configure a second listening pool for health endpoint on php-fpm with 1 or 2 child processes and configure nginx to use it for the specific path."a bug in our application".I'm guessing you mean a bug that causes high CPU usage.
If the issue is visible immediately once the pod starts, it is good to have the health check fail so that the deployment rollout fails and you avoid bringing bugs into production.
If the issue is visible later, some time after the pod starts, I'm thinking this could happen due to a memory leak; a pod restart due to a failed health check would also make sure production stays healthy.
Both of these problems are usually not big enough by themselves to cause an outage. They just make the application behave slightly worse; however, this can sometimes lead to failed liveness probes -> pod restarts.
Having the health check pass through the application makes sure it's actually working.
Sure, but in our case we go to either fpm.ping or fpm.status, as initializing the whole Symfony application is quite expensive. I'm not sure if this counts as going through the application.
Based on my experience, it's good to include in the health check all the application bootstrapping that is local and avoid any I/O such as the database, memcache and others.
That way, a missed production configuration dependency that prevents the application from starting up properly would block the deployment rollout and keep uptime high.
A health check that does not use the actual application would report it as healthy even though it cannot handle requests.
I agree with this. We initially tried to do a lot in our health checks and gradually reduced their footprint/scope to just the required minimum, because they were too fragile.
If I have understood something differently, or there are other cases you encountered where you think a health check not going through the app helps, please share them so we can learn about it.
Regards,
Alex
Hi
This is already being tracked as https://bugs.php.net/bug.php?id=68678.
You could probably already use pm.status_listen, which should cover what you need. We should eventually add support for ping.listen. It's a bit lower priority on my list because status already sort of covers it, but if someone wants to send a PR for that, then I will be happy to review it - I would imagine that the implementation will be pretty much the same as pm.status_listen.
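Roughly, that could look like the following (the address, path and the exec probe are only illustrative; since the status endpoint speaks FastCGI rather than HTTP, the probe needs a FastCGI client such as cgi-fcgi available in the container, or it has to go through the web server):

  ; php-fpm pool configuration (assuming a PHP version that has pm.status_listen)
  pm.status_listen = 127.0.0.1:9001
  pm.status_path = /status

  # Kubernetes liveness probe calling the status endpoint via a FastCGI client
  livenessProbe:
    exec:
      command:
        - sh
        - -c
        - >-
          SCRIPT_NAME=/status SCRIPT_FILENAME=/status REQUEST_METHOD=GET
          cgi-fcgi -bind -connect 127.0.0.1:9001 | grep -q pool
    periodSeconds: 10
    failureThreshold: 3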
I'm wondering if it would also be useful to add HTTP support for ping, because currently it's just FastCGI, so you still need to hit it through the web server, right? Adding HTTP support could be a bit harder as it would need to be a different listener...
Regards
Jakub