Sunday, November 28, 2010

Troubleshooting php and xinetd with strace

In a previous entry I mentioned I was trying to track down a difference in a php script between two machines. I've now got closer, close enough in fact to fix the problem but not to understand it. As a troubleshooter, the big discovery for me was how useful strace can be.

First, a PHP code snippet:
while(1){
stream_set_blocking(STDIN,true);
$s=trim(fgets(STDIN));
if($GLOBALS['log_fp'])fwrite($GLOBALS['log_fp'],$s."\n");
$params=explode(" ",$s);
$command=array_shift($params);
...
}

The first two lines say listen on stdin for a command, and wait forever for it to arrive. When it does we log it and then split it up into a command and parameters.

This script is used over xinetd. Xinetd listens on a port for us, and when it gets something it feeds it to stdin; php script output is then fed back over the socket to the client.

Here is the setup:
  • machine A: ubuntu 8.04, 32-bit, php 5.2
  • machine B: ubuntu 10.04, 64-bit, php 5.3

Both machines run an indentical version of xinetd: "xinetd Version 2.3.14 libwrap loadavg".

Machine A runs fine, machine B sometimes sits there. Analyzing this some more, by connecting over telnet to machine B:
  1. I send a command
  2. It replies straightaway
  3. After 60 seconds or so it gives an error message
  4. If I do nothing this error message appears again every 60 seconds.

On machine A, or if I run the php script directly on machine B, it behaves like this:
  1. I send a command
  2. It replies straightaway
  3. It sits there forever waiting for input

Out of desperation I ran strace (using -p you can attach it to a running process; use ps -a | grep php to find the PID), and I was pleasantly surprised to see it was not too verbose.

Over xinetd on machine B (the problem configuration) the interesting snippet is:
...
write(1, "mydata\n", 7)                 = 7
fcntl(0, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(0, F_SETFL, O_RDWR)               = 0
poll([{fd=0, events=POLLIN|POLLERR|POLLHUP}], 1, 60000) = 0 (Timeout)
write(3, "\n", 1)                       = 1
write(5, "\n", 1) 
...

On machine A over xinetd it instead looks like this:
...
write(1, "mydata\n", 7)                 = 7
fcntl64(0, F_GETFL)                     = 0x8002 (flags O_RDWR|O_LARGEFILE)
fcntl64(0, F_SETFL, O_RDWR|O_LARGEFILE) = 0
read(0,
(Yes nothing comes after the "0," until I send some input.)

And direct access to the php script on machine B it practically the same:
...
write(1, "mydata\n", 7)                 = 7
fcntl(0, F_GETFL)                       = 0x8002 (flags O_RDWR|O_LARGEFILE)
fcntl(0, F_SETFL, O_RDWR|O_LARGEFILE)   = 0
read(0,

So, the cause is clear: fgets() blocks for only 60 seconds instead of blocking forever, but only on this machine and only over xinetd. And the bug in my script then becomes clear: I'm not prepared to handle blank input!

Here is my fix:
while(1){
stream_set_blocking(STDIN,true);
$s=trim(fgets(STDIN));
if($s=='')continue;
if($GLOBALS['log_fp'])fwrite($GLOBALS['log_fp'],$s."\n");
$params=explode(" ",$s);
$command=array_shift($params);
...
}


As an ironic postscript, the real problem was elsewhere (a mismatch in another program version and the configuration being used for it!), and I also used strace to track that mistake down. But, what was so important about fixing the above was that it stopped distracting me with an error message every 60 seconds on machine B, when the configuration error was actually on machine A!

1 comment:

darren said...

By the way, the full php versions are:
A: PHP 5.2.4-2ubuntu5.12
B: PHP 5.3.2-1ubuntu4.5

I was going to look at the fgets() source before submitting a bug report; I feel I ought to understand why it only happens over xinetd first.