Hi everybody!
Three months ago one of my PHP applications started to hang webserver
processes. The schema of the application is as follows (it's a webservice):
-Main server, with LINUX, Apache and PHP, handles the requests from the
clients.
-Once the client is authenticated, and the request is valid, it relays the
request to one of the four slave servers (let's call them nodes).
-One of the nodes uses PHP4 (works fine), and the other three use LINUX,
Apache and PHP 5.1.6 with the following SSL configuration:
OpenSSL support => enabled
OpenSSL Version => OpenSSL 0.9.8e-fips-rhel5 01 Jul 2008
-The nodes send the request to a remote webservice, and reply it back to the
main server.
-The main server processes the response and sends back the response to the
client.
The problem is that, eventually, the PHP5 nodes ends with blocked (sleeping)
processes from Apache. Finally the system runs out of resources because all
the blocked-sleeping processes have a ESTABLISHED socket.
First I noticed that, at least in that version of PHP, the blocking sockets
can have a send and recive timeout. Then I switched to non blocking sockets
and select. Now the application works fine in the upper level of the
communication with the other webservices, but I noticed that it still ends
with sleeping processes. I "straced" the Apache and this is what I found
(the bound IP and the remote webservice IP are not real; the second one is
from google.com):
Block 1 -
5503 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 20
5503 bind(20, {sa_family=AF_INET, sin_port=htons(0),
sin_addr=inet_addr("10.6.0.11")}, 16) = 0
5503 fcntl64(20, F_GETFL) = 0x2 (flags O_RDWR)
5503 fcntl64(20, F_SETFL, O_RDWR|O_NONBLOCK) = 0
5503 connect(20, {sa_family=AF_INET, sin_port=htons(443),
sin_addr=inet_addr("74.125.127.100")}, 16) = -1 EINPROGRESS (Operation now
in progress)
5503 poll([{fd=20, events=POLLIN|POLLOUT|POLLERR|POLLHUP}], 1, 1370327) = 1
([{fd=20, revents=POLLOUT}])
Block 2 -
5503 getsockopt(20, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
5503 fcntl64(20, F_SETFL, O_RDWR) = 0
5503 time(NULL) = 1251911098
5503 brk(0xb9e39000) = 0xb9e39000
5503 time(NULL) = 1251911098
5503 write(20,
"\200d\1\3\1\0K\0\0\0\20\0\0009\0\0008\0\0005\0\0\26\0\0\23\0\0\n\7\0\300"...,
- = 102
5503 read(20, "\26\3\1\0J\2\0", 7) = 7
5503 time(NULL) = 1251911098
5503 time(NULL) = 1251911098
5503 read(20,
"\0F\3\1J\236\245\273^\v\375\357<\205\223\334n}\226k\232\\311\300\21Q\256\267\304\257o#"..., - = 72
5503 read(20, "\26\3\1\6\252", 5) = 5
5503 read(20,
"\v\0\6\246\0\6\243\0\3\2400\202\3\2340\202\3\5\240\3\2\1\2\2\4<\260Eh0\r\6"..., - = 1376
5503 read(20, " 2 CA1\r0\v\6\3U\4\3\23\4CRL10+\6\3U\35\20\4$0"\200"..., - = 330
5503 read(20, "\26\3\1\1^", 5) = 5
5503 read(20, "\r\0\1V\3\1\2@\1P\0M0K1\v0\t\6\3U\4\6\23\2ES1\0170\r\6"..., - = 350
5503 write(20,
"\26\3\1\0\7\v\0\0\3\0\0\0\26\3\1\0\206\20\0\0\202\0\200\215\30
\tT\312\313\255\210"..., 210) = 210
5503 read(20, "\24\3\1\0\1", 5) = 5
5503 read(20, "\1", 1) = 1
5503 read(20, "\26\3\1\0000", 5) = 5
5503 read(20,
"\364\35k\17\25\224\314]|\263Y\307z\310\223\314\210\235h\224t\345\305\373\267\241\377\t\1s\345\250"..., - = 48
Block 3 -
5503 fcntl64(20, F_GETFL) = 0x2 (flags O_RDWR)
5503 fcntl64(20, F_SETFL, O_RDWR|O_NONBLOCK) = 0
Block 4 -
5503 select(21, [], [20], [], {44320, 0}) = 1 (out [20], left {44320, 0})
5503 gettimeofday({1251911098, 885029}, NULL) = 0
5503 write(13, "[Wed Sep 02 19:04:58 2009] [erro"..., 161) = 161
5503 write(20,
"\27\3\1\2\240\10\260\325k\v\37{\251\355t_R\274\312\16\354r\315\1\277\204b\342xU\30L"...,
- = 677
5503 select(21, [20], [], [], {44320, 0} <unfinished ...>
5503 <... select resumed> ) = 1 (in [20], left {44319, 828000})
5503 read(20, "\27\3\1\0\240", 5) = 5
5503 read(20,
"\304\310\273s\370\304e\335\273\1\346\376N\234\376\226\276\0\fG9T\313\235\2&\31\340B=\f""..., - = 160
5503 select(21, [20], [], [], {44320, 0}) = 1 (in [20], left {44320, 0})
5503 read(20, "\27\3\1\4P", 5) = 5
5503 read(20,
"\333`\341\212},\215\3476\370\325\273"\260\340\234\365\324\324\10a\25v\327\266'+\340\34\1\317\364"..., - = 1104
Block 5 -
5503 write(20, "\25\3\1\0
\305\10\236\5\10\226JS\330\365\204\21\230\246\362\247R\332\220:\210o\2\333\10.\27"...,
- = 37
5503 brk(0xb9e2d000) = 0xb9e2d000
5503 close(20) = 0
Block 6 -
5503 brk(0xb9e2b000) = 0xb9e2b000
5503 chdir("/") = 0
5503 setitimer(ITIMER_PROF, {it_interval={0, 0}, it_value={0, 0}}, NULL) =
0
5503 writev(19,
[{"\27\3\1\0\260n?\301\204\263\0341yi\204\v\273\332\335\263\277\335M\t\223\364$\266\214\212c\362"...,
714}], 1) = 714
5503 gettimeofday({1251911099, 58549}, NULL) = 0
5503 write(14, "10.6.0.1 - noone [02/Sep"..., 97) = 97
5503 gettimeofday({1251911099, 58696}, NULL) = 0
5503 times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) =
369262016
5503 gettimeofday({1251911099, 58799}, NULL) = 0
5503 writev(19, [{"\25\3\1\0
MJi\215B\321Q\2657E\30cw\341n\264o\215\17\237{ej\342n\222\202"..., 37}], 1)
= 37
5503 brk(0xb9e2a000) = 0xb9e2a000
5503 shutdown(19, 1 /* send */) = 0
5503 poll([{fd=19, events=POLLIN}], 1, 2000) = 1 ([{fd=19,
revents=POLLIN}])
5503 read(19, "\25\3\1\0
\230\220\315\210n\342zc\201\367H\r'xPt\277\243\346\227\320g\2223;U\352"...,
- = 37
5503 gettimeofday({1251911099, 59317}, NULL) = 0
5503 gettimeofday({1251911099, 59372}, NULL) = 0
5503 poll([{fd=19, events=POLLIN}], 1, 2000) = 1 ([{fd=19,
revents=POLLIN|POLLERR|POLLHUP}])
5503 read(19, 0xbf81c504, 512) = -1 ECONNRESET (Connection reset by
peer)
5503 close(19) = 0
5503 read(7, 0xbf81c713, 1) = -1 EAGAIN (Resource temporarily
unavailable)
5503 close(18) = 0
5503 close(8) = 0
5503 close(7) = 0
5503 exit_group(0) = ?
As you can see I devided in 6 blocks this piece of strace output. In the
first block, the socket it's created, its bound to a network interface, and
it's connected to the remote host; fine. But in the block 2 the socket it's
switched back to blocking and then starts the SSL communication. In the
block 3 the socket it's switched to non blocking (this is the first time
that our application have chenged the socket options, because we don't have
the stream until the stream_socket_client function returns). Then, in the
block 4, the PHP that I wrote starts to work and the send / receive takes
place. In the block 5 there is a write without select... why? Well, I think
that is another "feature" of the underlaying socket communications, related
to the fclose of the stream. At last, the application ends in the block 6
sending the result back to the main server and closing resources.
After saying all of this, my question is... why the SSL implementation uses
potentially blocking code? The only work around that I found (inside
PHP/Apache) for facing this issue is setting a max execution time of the
apache processes because the PHP max_execution_time doesn't count the
sleeping time. I can't test right now this code in a latest version of PHP,
but I'm afraid that the strace output will be the same.
Regards,
Eloy Bote.
I found same problem in IIS recently.
When SSL connection established and read to eof,I can't write anything
to stream again.
It's trow :SSL operation failed with code 1. OpenSSL Error messages:
error:140D00CF:SSL routines:func(208):reason(207).
My php version is 5.2.9
2009/9/8 Eloy Bote Falcon eloybote@gmail.com:
Hi everybody!
Three months ago one of my PHP applications started to hang webserver
processes. The schema of the application is as follows (it's a webservice):-Main server, with LINUX, Apache and PHP, handles the requests from the
clients.
-Once the client is authenticated, and the request is valid, it relays the
request to one of the four slave servers (let's call them nodes).
-One of the nodes uses PHP4 (works fine), and the other three use LINUX,
Apache and PHP 5.1.6 with the following SSL configuration:
OpenSSL support => enabled
OpenSSL Version => OpenSSL 0.9.8e-fips-rhel5 01 Jul 2008-The nodes send the request to a remote webservice, and reply it back to the
main server.
-The main server processes the response and sends back the response to the
client.The problem is that, eventually, the PHP5 nodes ends with blocked (sleeping)
processes from Apache. Finally the system runs out of resources because all
the blocked-sleeping processes have a ESTABLISHED socket.
First I noticed that, at least in that version of PHP, the blocking sockets
can have a send and recive timeout. Then I switched to non blocking sockets
and select. Now the application works fine in the upper level of the
communication with the other webservices, but I noticed that it still ends
with sleeping processes. I "straced" the Apache and this is what I found
(the bound IP and the remote webservice IP are not real; the second one is
from google.com):Block 1 -
5503 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 20
5503 bind(20, {sa_family=AF_INET, sin_port=htons(0),
sin_addr=inet_addr("10.6.0.11")}, 16) = 0
5503 fcntl64(20, F_GETFL) = 0x2 (flags O_RDWR)
5503 fcntl64(20, F_SETFL, O_RDWR|O_NONBLOCK) = 0
5503 connect(20, {sa_family=AF_INET, sin_port=htons(443),
sin_addr=inet_addr("74.125.127.100")}, 16) = -1 EINPROGRESS (Operation now
in progress)
5503 poll([{fd=20, events=POLLIN|POLLOUT|POLLERR|POLLHUP}], 1, 1370327) = 1
([{fd=20, revents=POLLOUT}])Block 2 -
5503 getsockopt(20, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
5503 fcntl64(20, F_SETFL, O_RDWR) = 0
5503 time(NULL) = 1251911098
5503 brk(0xb9e39000) = 0xb9e39000
5503 time(NULL) = 1251911098
5503 write(20,
"\200d\1\3\1\0K\0\0\0\20\0\0009\0\0008\0\0005\0\0\26\0\0\23\0\0\n\7\0\300"...,
- = 102
5503 read(20, "\26\3\1\0J\2\0", 7) = 7
5503 time(NULL) = 1251911098
5503 time(NULL) = 1251911098
5503 read(20,
"\0F\3\1J\236\245\273^\v\375\357<\205\223\334n}\226k\232\\311\300\21Q\256\267\304\257o#"...,- = 72
5503 read(20, "\26\3\1\6\252", 5) = 5
5503 read(20,
"\v\0\6\246\0\6\243\0\3\2400\202\3\2340\202\3\5\240\3\2\1\2\2\4<\260Eh0\r\6"...,- = 1376
5503 read(20, " 2 CA1\r0\v\6\3U\4\3\23\4CRL10+\6\3U\35\20\4$0"\200"...,- = 330
5503 read(20, "\26\3\1\1^", 5) = 5
5503 read(20, "\r\0\1V\3\1\2@\1P\0M0K1\v0\t\6\3U\4\6\23\2ES1\0170\r\6"...,- = 350
5503 write(20,
"\26\3\1\0\7\v\0\0\3\0\0\0\26\3\1\0\206\20\0\0\202\0\200\215\30
\tT\312\313\255\210"..., 210) = 210
5503 read(20, "\24\3\1\0\1", 5) = 5
5503 read(20, "\1", 1) = 1
5503 read(20, "\26\3\1\0000", 5) = 5
5503 read(20,
"\364\35k\17\25\224\314]|\263Y\307z\310\223\314\210\235h\224t\345\305\373\267\241\377\t\1s\345\250"...,- = 48
Block 3 -
5503 fcntl64(20, F_GETFL) = 0x2 (flags O_RDWR)
5503 fcntl64(20, F_SETFL, O_RDWR|O_NONBLOCK) = 0Block 4 -
5503 select(21, [], [20], [], {44320, 0}) = 1 (out [20], left {44320, 0})
5503 gettimeofday({1251911098, 885029}, NULL) = 0
5503 write(13, "[Wed Sep 02 19:04:58 2009] [erro"..., 161) = 161
5503 write(20,
"\27\3\1\2\240\10\260\325k\v\37{\251\355t_R\274\312\16\354r\315\1\277\204b\342xU\30L"...,
- = 677
5503 select(21, [20], [], [], {44320, 0} <unfinished ...>
5503 <... select resumed> ) = 1 (in [20], left {44319, 828000})
5503 read(20, "\27\3\1\0\240", 5) = 5
5503 read(20,
"\304\310\273s\370\304e\335\273\1\346\376N\234\376\226\276\0\fG9T\313\235\2&\31\340B=\f""...,- = 160
5503 select(21, [20], [], [], {44320, 0}) = 1 (in [20], left {44320, 0})
5503 read(20, "\27\3\1\4P", 5) = 5
5503 read(20,
"\333`\341\212},\215\3476\370\325\273"\260\340\234\365\324\324\10a\25v\327\266'+\340\34\1\317\364"...,- = 1104
Block 5 -
5503 write(20, "\25\3\1\0
\305\10\236\5\10\226JS\330\365\204\21\230\246\362\247R\332\220:\210o\2\333\10.\27"...,
- = 37
5503 brk(0xb9e2d000) = 0xb9e2d000
5503 close(20) = 0Block 6 -
5503 brk(0xb9e2b000) = 0xb9e2b000
5503 chdir("/") = 0
5503 setitimer(ITIMER_PROF, {it_interval={0, 0}, it_value={0, 0}}, NULL) =
0
5503 writev(19,
[{"\27\3\1\0\260n?\301\204\263\0341yi\204\v\273\332\335\263\277\335M\t\223\364$\266\214\212c\362"...,
714}], 1) = 714
5503 gettimeofday({1251911099, 58549}, NULL) = 0
5503 write(14, "10.6.0.1 - noone [02/Sep"..., 97) = 97
5503 gettimeofday({1251911099, 58696}, NULL) = 0
5503 times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) =
369262016
5503 gettimeofday({1251911099, 58799}, NULL) = 0
5503 writev(19, [{"\25\3\1\0
MJi\215B\321Q\2657E\30cw\341n\264o\215\17\237{ej\342n\222\202"..., 37}], 1)
= 37
5503 brk(0xb9e2a000) = 0xb9e2a000
5503 shutdown(19, 1 /* send */) = 0
5503 poll([{fd=19, events=POLLIN}], 1, 2000) = 1 ([{fd=19,
revents=POLLIN}])
5503 read(19, "\25\3\1\0
\230\220\315\210n\342zc\201\367H\r'xPt\277\243\346\227\320g\2223;U\352"...,
- = 37
5503 gettimeofday({1251911099, 59317}, NULL) = 0
5503 gettimeofday({1251911099, 59372}, NULL) = 0
5503 poll([{fd=19, events=POLLIN}], 1, 2000) = 1 ([{fd=19,
revents=POLLIN|POLLERR|POLLHUP}])
5503 read(19, 0xbf81c504, 512) = -1 ECONNRESET (Connection reset by
peer)
5503 close(19) = 0
5503 read(7, 0xbf81c713, 1) = -1 EAGAIN (Resource temporarily
unavailable)
5503 close(18) = 0
5503 close(8) = 0
5503 close(7) = 0
5503 exit_group(0) = ?As you can see I devided in 6 blocks this piece of strace output. In the
first block, the socket it's created, its bound to a network interface, and
it's connected to the remote host; fine. But in the block 2 the socket it's
switched back to blocking and then starts the SSL communication. In the
block 3 the socket it's switched to non blocking (this is the first time
that our application have chenged the socket options, because we don't have
the stream until the stream_socket_client function returns). Then, in the
block 4, the PHP that I wrote starts to work and the send / receive takes
place. In the block 5 there is a write without select... why? Well, I think
that is another "feature" of the underlaying socket communications, related
to the fclose of the stream. At last, the application ends in the block 6
sending the result back to the main server and closing resources.After saying all of this, my question is... why the SSL implementation uses
potentially blocking code? The only work around that I found (inside
PHP/Apache) for facing this issue is setting a max execution time of the
apache processes because the PHP max_execution_time doesn't count the
sleeping time. I can't test right now this code in a latest version of PHP,
but I'm afraid that the strace output will be the same.Regards,
Eloy Bote.