Sunday, November 28, 2010

Troubleshooting php and xinetd with strace

In a previous entry I mentioned I was trying to track down a difference in a php script between two machines. I've now got closer, close enough in fact to fix the problem but not to understand it. As a troubleshooter, the big discovery for me was how useful strace can be.

First, a PHP code snippet:
while(1){
stream_set_blocking(STDIN,true);
$s=trim(fgets(STDIN));
if($GLOBALS['log_fp'])fwrite($GLOBALS['log_fp'],$s."\n");
$params=explode(" ",$s);
$command=array_shift($params);
...
}

The first two lines say listen on stdin for a command, and wait forever for it to arrive. When it does we log it and then split it up into a command and parameters.

This script is used over xinetd. Xinetd listens on a port for us, and when it gets something it feeds it to stdin; php script output is then fed back over the socket to the client.

Here is the setup:
  • machine A: ubuntu 8.04, 32-bit, php 5.2
  • machine B: ubuntu 10.04, 64-bit, php 5.3

Both machines run an indentical version of xinetd: "xinetd Version 2.3.14 libwrap loadavg".

Machine A runs fine, machine B sometimes sits there. Analyzing this some more, by connecting over telnet to machine B:
  1. I send a command
  2. It replies straightaway
  3. After 60 seconds or so it gives an error message
  4. If I do nothing this error message appears again every 60 seconds.

On machine A, or if I run the php script directly on machine B, it behaves like this:
  1. I send a command
  2. It replies straightaway
  3. It sits there forever waiting for input

Out of desperation I ran strace (using -p you can attach it to a running process; use ps -a | grep php to find the PID), and I was pleasantly surprised to see it was not too verbose.

Over xinetd on machine B (the problem configuration) the interesting snippet is:
...
write(1, "mydata\n", 7)                 = 7
fcntl(0, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(0, F_SETFL, O_RDWR)               = 0
poll([{fd=0, events=POLLIN|POLLERR|POLLHUP}], 1, 60000) = 0 (Timeout)
write(3, "\n", 1)                       = 1
write(5, "\n", 1) 
...

On machine A over xinetd it instead looks like this:
...
write(1, "mydata\n", 7)                 = 7
fcntl64(0, F_GETFL)                     = 0x8002 (flags O_RDWR|O_LARGEFILE)
fcntl64(0, F_SETFL, O_RDWR|O_LARGEFILE) = 0
read(0,
(Yes nothing comes after the "0," until I send some input.)

And direct access to the php script on machine B it practically the same:
...
write(1, "mydata\n", 7)                 = 7
fcntl(0, F_GETFL)                       = 0x8002 (flags O_RDWR|O_LARGEFILE)
fcntl(0, F_SETFL, O_RDWR|O_LARGEFILE)   = 0
read(0,

So, the cause is clear: fgets() blocks for only 60 seconds instead of blocking forever, but only on this machine and only over xinetd. And the bug in my script then becomes clear: I'm not prepared to handle blank input!

Here is my fix:
while(1){
stream_set_blocking(STDIN,true);
$s=trim(fgets(STDIN));
if($s=='')continue;
if($GLOBALS['log_fp'])fwrite($GLOBALS['log_fp'],$s."\n");
$params=explode(" ",$s);
$command=array_shift($params);
...
}


As an ironic postscript, the real problem was elsewhere (a mismatch in another program version and the configuration being used for it!), and I also used strace to track that mistake down. But, what was so important about fixing the above was that it stopped distracting me with an error message every 60 seconds on machine B, when the configuration error was actually on machine A!

Saturday, November 27, 2010

php 5.3, ticks, pcntl_signal, pcntl_signal_dispatch

I'm trying to track down a problem with a script that works on php 5.2 but behaves strangely on php 5.3 (there are lots of differences between the environments, and I suspect php version will actually turn out to be completely unrelated). php 5.3 introduced pcntl_signal_dispatch() which processes outstanding signals and I've been investigating if that could somehow explain the behaviour differences I see.

The confusing part is that I've seen people saying that the old way of "declare(ticks=1)" is now deprecated in 5.3, and you must use pcntl_signal_dispatch(). This seemed very silly as you'd have to litter your code with calls to pcntl_signal_dispatch(), as well as have very different code for php 5.2 and 5.3.

If you're confused, like I was, here is what you need to know:
1. declare(ticks=1) still works: no deprecated message (with E_ALL|E_STRICT error reporting);

2. My script, using ticks, behaves identically under 5.2 and 5.3 when I send it a ctrl-C or a kill signal;

3. The docs don't mention it being deprecated; I realized everywhere saying this was user-contributed comments or blogs!

There is a performance aspect with using declare(ticks=1). I believe it is minor, but I think pcntl_signal_dispatch() has been introduced so you can use it instead of ticks if you want to take fine-control over when signals get considered.

Thursday, November 25, 2010

bash tips: meld on program output

Here is the challenge: I want to run diff on two large log files, but I'm only interested in the entries at a certain time in each log file.

This used to require four commands:
grep "ABC" a.txt >tmp1.txt
  grep "ABC" b.txt >tmp2.txt
  diff tmp1.txt tmp2.txt
  rm tmp1.txt tmp2.txt
(Imagine "ABC" is a datestamp, but it could be any other way to filter your log file.)

Thanks to the gurus on TLUG's mailing list I can now do this as a one-liner:
diff <(grep ABC a.txt) <(grep ABC b.txt)
It works perfectly for meld (a wonderful visual diff program) too! Here is another way to use it, to compare the output of a program on two different machines (here I'm comparing the php configuration):
diff <(php -i) <(ssh someserver 'php -i')
We're using the form of ssh that runs a program on the remote server. The command in the brackets can get quite complex. Here is an example where I needed to compare datestamps in two csv files, but the first field was an id number, arbitrary, and therefore different for all records.
diff <(egrep -o '[0-9]{8} [0-9]{2}:[0-9]{2}:[0-9]{2}' dir1/abc.csv)
  <(egrep -o '[0-9]{8} [0-9]{2}:[0-9]{2}:[0-9]{2}' dir2/abc.csv)
The -o flag to egrep tells it to only output the matching part, not the whole line. This next version shows the complete rest of the line starting with my datestamp field; i.e. this version just excludes the csv field(s) before the datestamp field:
diff <(egrep -o '[0-9]{8} [0-9]{2}:[0-9]{2}:[0-9]{2}.+' dir1/abc.csv)
  <(egrep -o '[0-9]{8} [0-9]{2}:[0-9]{2}:[0-9]{2}.+' dir2/abc.csv)