OK, first of all I apologise for not posting this in the "right place",
but this is an unreproducible bug (the worst kind...), and I need some
educated guesses as to what is causing it. This thing has me at my wits'
end...
The situation is this: Apache on our main dynamic web server keeps on
suddenly eating all 3+GB of available virtual memory. When this happens,
the server stops responding to all requests, and basically freezes until
the kernel OOM killer gets around to killing enough httpd processes so
we can get in to kill the rest and restart it. This happens every couple
of minutes to couple of hours - there's no pattern to it. You can see an
attractive graph of the occurrence here:
http://static.last.fm/phpbug/mem.gif
As you can see, the memory usage shoots up suddenly - it doesn't appear
to be a conventional memory leak. This is accompanied by a similar spike
in the number of apache processes - right up to the MaxClients limit.
We've been running PHP with debug support enabled for the last couple of
days, and we've noticed that a series of errors is always logged just
before the spike in memory usage. A log snippet is available here (note
that these errors carry on for several pages - although I suspect the
first one is the only relevant one - this is only the first page or so):
http://static.last.fm/phpbug/log.txt
The error doesn't just happen with that script; however, the initial
error always occurs in the same place (zend_variables.c:44).
This machine serves around 500,000 hits daily, and 99% of them are
PHP-parsed. It's running Debian 3.0 with backported kernel 2.6.7. The
bug manifests itself with both Apache 1.3 and 2, and both PHP 4.3.8 and
4.3.9RC2. Compile options are as follows:
'./configure' '--with-apxs2=/web/apache2/bin/apxs' '--without-mysql'
'--with-zlib-dir=/usr' '--enable-gd-native-ttf' '--with-gettext'
'--enable-mbstring' '--with-pgsql=/usr/local/pgsql' '--enable-sysvmsg'
'--with-gd' '--with-jpeg-dir=/usr' '--enable-debug'
The only third-party module we're using is Turck mmcache - removing it
is kind of difficult since running without any cache brings the machine
to its knees :).
Sorry about the length of this message, I had to fit all the details
in... I'd appreciate it if you have any suggestions at all, this is
really annoying me now. It's problems like this with PHP which make me
consider moving to Java ;)... Anyhow.
Thanks in advance,
Russ Garrett
russ@last.fm
> The only third-party module we're using is Turck mmcache - removing it
> is kind of difficult since running without any cache brings the machine
> to its knees :).
But you should be able to trivially replace it with pecl/apc, as the
performance of the two is very similar, and that would at least eliminate one
variable.
-Rasmus
> This machine serves around 500,000 hits daily, and 99% of them are
> PHP-parsed.
By the way, that is not a lot of hits. Less than 6 requests per second.
I tend to get worried when my servers can't do at least 80-100
requests/second. And you certainly shouldn't need an opcode cache to
do 6 req/sec. What exactly do these PHP scripts of yours do?
-Rasmus
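The arithmetic behind the "less than 6 requests per second" figure can be checked with a quick sketch. It assumes the hits are spread evenly across the day, which real traffic is not, so peak rates will be higher than this average. The function name is only illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def average_requests_per_second(hits_per_day: int) -> float:
    """Average request rate, assuming hits are spread evenly over the day."""
    return hits_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    # 500,000 hits/day is the figure quoted in the thread.
    print(f"{average_requests_per_second(500_000):.2f} req/sec")
```

The same function can be fed any other daily total to see how far the average sits from the 80-100 requests/second mentioned above.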
Is your server really unusable w/o a compiled code cache? If not, try to
remove it and see if the problem persists. One of the problems of most
opcode caches is that a crash bug in PHP or one of its modules can end up
resulting in a full server crash.
I have to say though that it doesn't look that way to me. From first
glance, it appears to be the standard 'spiraling crash'. What that
basically means is:
- For whatever reason, the number of Apache processes rises (typically
due to increased end user load, but sometimes also because some
administration script is being run, database slowdown, etc.).
- The machine hits the swap threshold, which causes it to slow down much
more (typically by an order of magnitude, at least).
- Because of the slowdown, the increased number of Apache processes
quickly becomes saturated (it takes each one more time to serve the
request), and with new requests flowing in, the number of Apache processes
increases even more.
- More swap is necessary for the increased number of Apache processes,
and a hopeless spiral begins, typically ending only when the server dies.
If that's indeed what happens on your system (and it happens to almost
everyone, sooner or later) - then it means your system has a value of
MaxClients that's not backed by its CPU power and more importantly,
available memory. You need to either decrease that number or add more memory.
Generally, your machine should have enough memory to run Apache when it
reaches MaxClients without hitting swap. You can test it by setting
StartServers to the same number as MaxClients, and then hitting some of
your PHP-based pages with a high-concurrency ab.
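As a rough sketch of that sizing rule: divide the memory left over for Apache by the resident size of one httpd child. The function name and all three figures below are assumptions for illustration; the per-child size in particular should be measured under load (for example during the ab run described above), not guessed.

```python
def max_clients_for_memory(total_ram_mb: int, reserved_mb: int, per_child_mb: int) -> int:
    """Largest MaxClients whose children fit in RAM without touching swap."""
    if per_child_mb <= 0:
        raise ValueError("per_child_mb must be positive")
    usable_mb = total_ram_mb - reserved_mb
    return max(usable_mb // per_child_mb, 0)


if __name__ == "__main__":
    # Hypothetical figures: 2 GB of RAM, 512 MB held back for the OS and
    # the opcode cache's shared memory, 20 MB resident per httpd child.
    print(max_clients_for_memory(2048, 512, 20))
```

If the result is below the configured MaxClients, either the setting or the amount of memory needs to change.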
It might be possible that the crash is somehow related, especially if it
corrupts the compiled code cache and results in frequent crashes of Apache
processes, which will cause Apache to fork more and more processes that can
be the initial slowdown trigger, but still, a properly configured server
should not die out of memory because of that.
Zeev
Thanks for all the prompt responses, most appreciated.
Firstly I forgot to add in a fairly crucial subdomain to my hits
estimate (I'm half asleep today). It's closer to 2 million dynamic hits
per day, all added in, which makes my numbers a little more reasonable...
I doubt the spiralling-crash theory because sometimes the server will
run fine for hours, using less than 1GB of the 2GB of RAM, and then
suddenly die. It tends to die as frequently during off-peak times as it
does during peak times. Plus, we're running at a modest MaxClients
setting of 100, which with dual 2.4 Xeons and 2GB of RAM should be more
than reasonable.
I tend to agree that it may be a case of the crash causing the opcode
cache to be corrupted, and causing the rest of the apache processes to hang.
APC doesn't seem to work at all as a DSO, I'll try it statically later.
We can't run without an opcode cache, I just tried it and it completely
maxes out the CPU and causes the load to go over 100. We run two fairly
heavy Smarty-based sites.
Cheers,
Russ
> APC doesn't seem to work at all as a DSO, I'll try it statically later.
I run it on thousands of servers as a DSO. What are you seeing that would
make you think this?
Also, if you are running PHP as a DSO and pushing your CPU you might want
to compile it non-pic. Use this patch and reconfigure/recompile:
http://lerdorf.com/non-pic.txt
-Rasmus
Rasmus Lerdorf wrote:
>> APC doesn't seem to work at all as a DSO, I'll try it statically later.
> I run it on thousands of servers as a DSO. What are you seeing that would
> make you think this?
I didn't really want to hang around with the site offline to find out.
Load shot up, I couldn't get a page out of Apache at all. It may well
have been due to the debug build of PHP, although it appeared to load OK.
I've just installed the 30 day trial of Zend Performance Suite, so I
shall see if that fixes it.
> Also, if you are running PHP as a DSO and pushing your CPU you might want
> to compile it non-pic. Use this patch and reconfigure/recompile:
> http://lerdorf.com/non-pic.txt
Noted. Thanks.
Russ
OK, the situation seems a lot more stable with Zend Accelerator instead
of mmcache, and we're regularly getting quite a few "checksum failed"
errors in the logs, which does tend to indicate that shared memory
corruption was (and still is) happening. But now I don't have to restart
the damn thing every hour, at least for the duration of the 30-day trial ;).
However, now we've eliminated this problem, another becomes obvious.
Namely that there does seem to be a small amount of memory leaking -
likely due to the crashes which are still occurring (i.e. those detailed
here: http://static.last.fm/phpbug/log.txt).
This results in some Apache children taking up 200MB+ of RAM and
lingering there, not serving any requests, until they're killed or the
server is restarted. Regrettably I can't be more specific because the
location in our code where the crashes happen is random (the location in
PHP always appears to be zend_variables.c line 44).
Since the httpd processes appear to just hang, the Apache
MaxRequestsPerChild setting is useless against this.
Thanks so much for your help so far, it is most appreciated. We seem
to be ridiculously unlucky when it comes to these sorts of things...
Cheers,
Russ
You are going to have to narrow this down further for us to have any
chance to help you. Put your stuff on a development server and hit your
various pages looking for that error or the request that causes your httpd
to grow to 200M (use Apache1, not Apache2 for this). Or if all else fails
replay the log to it to recreate the situation. Then replay it slower
without an opcode cache and get it down to a single script and then a
specific part of that script.
-Rasmus
I can confirm it. It's a problem with the cacher.
mmcache is rather complex and NOT stable. Many people run it happily, but they're not under heavy load.
Anywhere from 1 hour to 1 day after Apache is restarted, mmcache ends up with every page crashing randomly (shared memory corruption).
APC works with the Apache 2 DSO, and the optimizer is stable ONLY with my patches.
Check them out here: http://pecl.php.net/bugs/search.php?cmd=display&status=Open&bug_type[]=APC
I've used APC from the time my last patch was posted until now with 0 crashes. (If I clear the cache after a long time running, there is about a 1 in 10 chance of a crash.)
FYI: my scripts never seem to exceed the cache size.
phpa was quite stable until the author stopped releasing new versions.
Even when I installed a phpa that can't work with my latest PHP, my pages were still OK.
This is because it has a "crash recover" scheme: it marks the shared memory as "reset" on a crash, and resets it when it next gets the write lock on the shared memory.
phpa falls back to non-caching whenever it fails to operate on the shared memory, so there is no hanging.
phpa stops itself but won't take PHP down or hang it if the shared memory is deadlocked, messed up, or even unrecoverable.
(I know this from reading the log when phpa crashes; some of the above is based on guessing.)
Neither mmcache nor APC has "crash recover".
Do Zend products implement it?
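The fall-back-to-non-caching behaviour described above can be sketched roughly as follows. This is not phpa's actual code (the description itself is partly guesswork); every class and function name here is hypothetical, and plain Python objects stand in for the shared memory segment.

```python
class CacheUnavailable(Exception):
    """The cache's shared memory could not be used (hypothetical error type)."""


class DictCache:
    """Stand-in for a healthy shared-memory cache."""

    def __init__(self):
        self._store = {}

    def get(self, path):
        return self._store.get(path)

    def put(self, path, compiled):
        self._store[path] = compiled


class BrokenCache:
    """Stand-in for a cache whose shared memory is corrupted or deadlocked."""

    def get(self, path):
        raise CacheUnavailable(path)

    def put(self, path, compiled):
        raise CacheUnavailable(path)


class FallbackLoader:
    """Serve from the cache when possible; compile directly when it fails."""

    def __init__(self, cache, compile_fn):
        self.cache = cache
        self.compile_fn = compile_fn
        self.cache_disabled = False

    def load(self, path):
        if not self.cache_disabled:
            try:
                compiled = self.cache.get(path)
                if compiled is None:
                    compiled = self.compile_fn(path)
                    self.cache.put(path, compiled)
                return compiled
            except CacheUnavailable:
                # Stop touching the cache instead of hanging or crashing.
                self.cache_disabled = True
        return self.compile_fn(path)


if __name__ == "__main__":
    loader = FallbackLoader(BrokenCache(), lambda path: f"opcodes for {path}")
    print(loader.load("index.php"), loader.cache_disabled)
```

The point of the design is that a dead cache costs performance but never availability: requests are still served, only uncached.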
----- Original Message -----
From: "Russ Garrett" russ@last.fm
To: internals@lists.php.net
Sent: Monday, September 06, 2004 3:35 AM
Subject: [PHP-DEV] Really odd PHP problem
> APC works with the Apache 2 DSO, and the optimizer is stable ONLY with my patches.
> Check them out here: http://pecl.php.net/bugs/search.php?cmd=display&status=Open&bug_type[]=APC
> I've used APC from the time my last patch was posted until now with 0 crashes. (If I clear the cache after a long time running, there is about a 1 in 10 chance of a crash.)
> FYI: my scripts never seem to exceed the cache size.
I have fixed a number of problems related to running out of shared memory
in APC lately. If you grab the current CVS version I think you will find
that it is less likely to fill up shared memory, and when it does, it is
smarter about handling that scenario when it happens.
I really haven't done much to the optimizer. I tend to just leave it off.
I would be interested in seeing your patches.
> Neither mmcache nor APC has "crash recover".
The concept of a crash recover is somewhat flawed in my opinion. The only
way to really do this is to catch SIGSEGV, SIGBUS and other such fatal
signals and twiddle a knob somewhere in shared memory that tells other
processes to flush the cache. The problem with doing this is that once
you get a SEGV, it really isn't safe to do anything like that. You run a
very serious risk of ending up in an infinite crash loop where you catch
the crash, try to set the crash-recover flag, crash trying to do that,
catch the crash, etc.
-Rasmus
Thanks for taking care of my bug reports.
My optimizer patch is at http://pecl.php.net/bugs/bug.php?id=1678
I guess you've seen it just now. The changes required by the fix aren't as extensive as my patch.
I reorganized the blocks of code into macros (I personally don't like too much boring repetition);
this should make the code easier to update and less error-prone. I don't know if this breaks any coding rules.
Feel free to keep the original structure, but make the changes carefully :)
The most unstable code is the constant folding.
IIRC, long ago the Zend Engine disabled computed static values:
php -r 'function a(){static $a=1+1;}'
Parse error: parse error, expecting ','' or ';'' in Command line code on line 1
to avoid being unstable (crashing?).
I wonder how mmcache managed to do it.
How about other optimizers?
----- Original Message -----
From: "Rasmus Lerdorf" rasmus@php.net
To: "Xuefer" Xuefer@hotmail.com
Cc: internals@lists.php.net; "Russ Garrett" russ@last.fm
Sent: Tuesday, September 07, 2004 12:05 PM
Subject: Re: [PHP-DEV] Really odd PHP problem
>> Neither mmcache nor APC has "crash recover".
> The concept of a crash recover is somewhat flawed in my opinion. The only
> way to really do this is to catch SIGSEGV, SIGBUS and other such fatal
> signals and twiddle a knob somewhere in shared memory that tells other
> processes to flush the cache. The problem with doing this is that once
> you get a SEGV, it really isn't safe to do anything like that. You run a
> very serious risk of ending up in an infinite crash loop where you catch
> the crash, try to set the crash-recover flag, crash trying to do that,
> catch the crash, etc.
> -Rasmus
Without crash recovery, corrupted shared memory will trigger a crash in another process too.
Sorry for my limited experience with C and shared memory,
but IMHO it's not that hard.
It's easy to put a reset_flag at the top or bottom of the shared memory.
The flag is just an int, not a pointer.
It won't crash unless the shared memory is unavailable or the pointer to the shared memory is corrupted. Is that possible?
After all, we can reset the signal handler when we're about to operate on the flag.
Remember to log something when the shared memory is going to be reset (whether or not the write lock can be obtained for the reset).
There are different ways of doing it, but using a signal handler to catch
a SEGV is not a good idea. You can't count on any code of any sort
working after a SEGV. You could turn it around and have processes check
in and out as they handle requests and if a process doesn't check out
within some allotted time, assume a crash and reset. Or you could have an
external mechanism monitor for crashes and do the reset externally. But
having the process itself that crashed do anything is just asking for
trouble. It doesn't matter if the flag is an int or what it is. Any code
at all executed after a SEGV is unsafe.
-Rasmus
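The check-in/check-out alternative could look roughly like the sketch below. It is only an illustration under stated assumptions: a plain dictionary stands in for a table that would really live in shared memory with proper locking, and RequestRegistry, monitor_once and reset_cache are hypothetical names. The key property is that the reset is performed by a separate monitor, never by the process that crashed.

```python
import time
from typing import Callable, Dict, List, Optional


class RequestRegistry:
    """Tracks which worker processes are currently inside a request."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self._checked_in: Dict[int, float] = {}  # pid -> check-in time

    def check_in(self, pid: int, now: Optional[float] = None) -> None:
        """Called by a worker just before it starts handling a request."""
        self._checked_in[pid] = time.monotonic() if now is None else now

    def check_out(self, pid: int) -> None:
        """Called by a worker once the request has finished normally."""
        self._checked_in.pop(pid, None)

    def overdue(self, now: Optional[float] = None) -> List[int]:
        """Pids that checked in but have not checked out within the timeout."""
        now = time.monotonic() if now is None else now
        return sorted(
            pid
            for pid, started in self._checked_in.items()
            if now - started > self.timeout_seconds
        )


def monitor_once(
    registry: RequestRegistry,
    reset_cache: Callable[[], None],
    now: Optional[float] = None,
) -> bool:
    """One pass of an external monitor; returns True if a reset was done."""
    stale = registry.overdue(now)
    if not stale:
        return False
    reset_cache()  # assume a crash and flush the cache
    for pid in stale:
        registry.check_out(pid)  # forget the workers presumed dead
    return True


if __name__ == "__main__":
    registry = RequestRegistry(timeout_seconds=30)
    registry.check_in(101, now=0.0)
    registry.check_in(102, now=0.0)
    registry.check_out(101)  # worker 101 finished; 102 never checks out
    print(monitor_once(registry, lambda: print("cache reset"), now=31.0))
```

The timeout has to be longer than the slowest legitimate request, otherwise a slow page is mistaken for a crash and the cache is flushed needlessly.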