Newsgroups: php.internals
Xref: news.php.net php.internals:116958
Subject: Re: [PHP-DEV] Best way to monitor php-fpm container liveness on Kubernetes
From: adam.hamsik@lablabs.io (Adam Hamsik)
To: Alexandru Pătrănescu
Cc: PHP internals <internals@lists.php.net>
Date: Mon, 31 Jan 2022 15:32:42 +0100

Hi Alexandru,

Please see my answers below.

   Best Regards,
   Adam.

Adam Hamšík
Co-founder & CEO
Mobile: +421-904-937-495
www.lablabs.io

On 23 Jan 2022, 09:09 +0100, Alexandru Pătrănescu wrote:
>
> On Sat, Jan 22, 2022 at 10:00 PM Adam Hamsik wrote:
> > Hello,
> >
> > We are using PHP for our application backends. This works very well, as we have developed a simple way to clone the backends with minimal effort (they can be very similar). For our orchestration we use Kubernetes (>= 1.21). Our application pod generally contains nginx + php-fpm, plus fluent-bit for log shipping. We generally want to have a livenessProbe [1] (in simple terms, a check that is run against the pod to verify it is alive; if it fails, the affected container is restarted).
> >
> > This works very well (we are also using Swoole, which is roughly 70-80% better), but in certain unstable situations with higher application latency (a DB problem or a bug in our application) we often experience problems, because pods are falsely marked as dead (failed liveness probe, restart by the kubelet). This happens when all processes in our static pool are allocated to application requests. For our livenessProbe we tried to use both the fpm.ping and fpm.status endpoints, but both of them behave the same way, as they are served by the worker processes.
> >
> > I had a look at the php-src repo to see whether we could, e.g., use signals to verify that the application server is running, as a way around our issue. While looking at this I saw fpm-systemd.c, a systemd-specific check that reports fpm status to systemd every couple of seconds (configurable). Would you be willing to integrate a similar feature for Kubernetes? It would be based on a pull model, probably with a REST interface.
> >
> > My idea is the following:
> >
> > 1) During startup, if this is enabled, the php-fpm master opens a secondary port, pm.health_port (9001), and listens on pm.health_path (/healthz) [2].
> > 2) On a GET request, the fpm master process responds with HTTP code 200 and the string "ok" (we can later add checks/metrics to make sure fpm is in a good state). If we do not respond, or fpm is not ok, our livenessProbe fails and, based on configuration, this triggers a container restart.
> >
> > Would you be interested in integrating a feature like this? Or is there another way we can achieve similar results?
> >
> > [1] https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#when-should-you-use-a-liveness-probe
> > [2] https://kubernetes.io/docs/reference/using-api/health-checks/
> >
> >    Best Regards,
> >    Adam.
> >
> > Adam Hamšík
> > Co-founder & CEO
> > Mobile: +421-904-937-495
> > www.lablabs.io
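To make the idea above more concrete, this is roughly what I have in mind. Note that pm.health_port and pm.health_path are only the proposed directive names and do not exist in php-fpm today; the Kubernetes side is a standard httpGet liveness probe:

    ; php-fpm.conf (proposed, hypothetical global directives)
    [global]
    ; the master process itself would answer, independent of any worker pool
    pm.health_port = 9001
    pm.health_path = /healthz

    # pod spec excerpt: the kubelet would call the fpm master directly
    livenessProbe:
      httpGet:
        path: /healthz
        port: 9001
      periodSeconds: 10
      failureThreshold: 3

With something like this, the probe would keep succeeding even while every worker is busy with application requests, which is exactly the failure mode we are hitting.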
> Hi Adam,
>
> While I believe that improvements for health checking and other metrics can be added to php-fpm to expose internal status and statistics, I want to say that I don't know too much about that, and I want to first discuss the problem you mentioned and the approach.
>
> Based on my experience, it is best to have the health check always go through the application.
> You mentioned "certain unstable situations when we see higher application latency (db problem or a bug in our application)".
> Taking these two examples:
>
> - "db problems". I'm guessing you mean higher latency from the database.
> The health check should of course not connect to the database, so the actual execution of the health check should not be impacted.
> But you probably mean that requests pile up because php-fpm cannot handle them as fast as they arrive, due to the limited number of child processes.
> One solution here would be to configure a second listening pool for the health endpoint on php-fpm, with 1 or 2 child processes, and configure nginx to use it for that specific path.
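As I understand that suggestion, it would look something like the following (untested sketch; the pool name, socket path and probe path are made up):

    ; health pool: dedicated workers, so probes never queue behind app traffic
    [health]
    listen = /run/php-fpm-health.sock
    pm = static
    pm.max_children = 2
    ping.path = /healthz
    ping.response = ok

    # nginx: route only the probe path to the dedicated pool
    location = /healthz {
        include fastcgi_params;
        fastcgi_param SCRIPT_NAME /healthz;
        fastcgi_pass unix:/run/php-fpm-health.sock;
    }

The drawback I see is that the probe then only proves that some fpm workers can answer, not that the main pool has free children, but at least busy application workers can no longer fail the liveness probe.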
> - "a bug in our application". I'm guessing you mean a bug that causes high CPU usage.
> If the issue is visible immediately once the pod starts, it's good to have the health check fail, so the deployment rollout fails and we avoid bringing bugs into production.
> If the issue becomes visible later, some time after the pod starts, I'm thinking this could happen due to a memory leak. A pod restart due to a failed health check would also make sure production stays healthy.

Neither of these problems is usually big enough by itself to cause an outage. They just make the application behave slightly worse; however, this can sometimes lead to failed liveness probes -> pod restarts.

> Having the health check pass through the application makes sure it's actually working.

Sure, but in our case we go to either fpm.ping or fpm.status, as initializing the whole Symfony application is quite expensive. I'm not sure whether that counts as going through the application.

> Based on my experience, it's good to include in the health check all the application bootstrapping that is local, and to avoid any I/O such as database, memcache and others.
> A missed new production configuration dependency that would make the application not start up properly would then block the deployment rollout and keep uptime high.
> A health check that does not use the actual application would report it healthy while it cannot actually handle requests.

I agree with this. We initially tried to do a lot in our health checks and gradually reduced their footprint/scope to the required minimum, because they were too fragile.

> If I understand things differently, or you encountered other cases where you think a health check that does not go through the app helps, please share so we can learn about it.
>
> Regards,
> Alex