Sunday, July 25, 2010

Wine broke, or PHP, or xinetd??

Sometimes it can be really educational to sit on the shoulder of brilliant developers and watch how they troubleshoot and debug. So, here is your chance to sit on my shoulder,... and snigger as failure reduces me to tears.
[2010-11-18 UPDATE: a mere 5 months after originally posting, I think I have the answer; see the bottom of this article.]

A program worked fine Saturday. This is on my Ubuntu 8.04 (hardy heron) machine.

Sunday I updated firefox and ghostscript only. I then rebooted, which means kernel 2.6.24-28 finally became active (I updated it July 11th, but hadn't had chance to reboot until yesterday).

Today that program doesn't run. It is one particular wine program: a go program. Another wine program (also a go program) runs fine. And here is the real killer: if I start that problem wine program from gogui (gogui.sf.net) it works. Exactly the same commandline, but it works from gogui, and doesn't work when started from my php script. That PHP script was last changed last Wednesday.

I rebooted into the previous kernel (2.6.24-27), and the problem is exactly the same.

Have you ever done those logic problems where you get a list of clues and have to work out who did what? Applied to the above we discover the only logical explanation is that... reality is warped. It can't be: wine, the go program, php, my php script or the kernel.

Here is some more evidence. The first time I run it after a reboot the program (called valhallgtp.exe) fails to start with a stack dump and backtrace. Here are the first few and last few lines of that:
=>1 0x7bc3b23c __regs_RtlRaiseException+0x4c() in ntdll (0x014ded68)
2 0x7bc76de3 in ntdll (+0x66de3) (0x014df0cc)
3 0x7bc3a936 RtlRaiseException+0x6() in ntdll (0x014df144)
4 0x00415833 in valhallgtp (+0x15833) (0x014df2ac)
5 0x004159dd in valhallgtp (+0x159dd) (0x014dfa24)
... (6..20 are all in valhallgtp)
21 0x0049b02c in valhallgtp (+0x9b02c) (0x014dff08)
22 0x7b8773a7 in kernel32 (+0x573a7) (0x014dffe8)

I also see: "err:seh:raise_exception Exception frame is not in stack limits => unable to dispatch exception."

The 2nd and subsequent times I try to start it I get:
Failed to start wine ValhallGTP.exe ...
(this is an error from my php script; i.e. proc_open() is failing.)

The CWD (current working directory) is correct, the parameters are correct; both are exactly what gogui is using. And running it from a bash shell is fine too.

The fingers are pointing at PHP. But no PHP upgrades either yesterday or in the July 11th batch. And that same PHP is still successfully starting 4 other programs, including another one that uses wine. (BTW, go to synaptic package manager, file menu, history; this is where you can see exactly what Ubuntu updated and when.)

Disk space is fine ("df"), memory is fine ("cat /proc/meminfo"). Machine load is low ("w"). Running it as root has same problem.

When I ran from the -28 kernel I also got errors about no X DISPLAY. And then when I started the other wine program it opened but had no display. And running wine config was the same. I've not seen that again. Let's pretend it never happened and concentrate on what we can reproduce.
(UPDATE: it happened again (this time with the -27 kernel). I had the gtpmfgo.exe running fine under wine; then starting mfgo.exe it came up but with no display. Close and retry showed it was repeatable. After closing gtpmfgo.exe it worked, and then I could open gtpmfgo.exe fine. Note: not repeatable; I haven't managed to get it to happen again.)

Okay, where to next? What I don't understand is how come I cannot reproduce the problem I get when I start it the first time. That seems like a small enough challenge to be achievable: how to get it to crash consistently.

...time passes...

Aha! It seems it is crashing in the same way; it was just crashing and dying before I got chance to read the crash message. So the first time after a reboot it must take longer than a second to start up (perhaps wine initialization).

...aha, aha! Narrowed down further. To explain this I need to give more background: I have a script called frontend.php that calls backend.php over sockets. backend.php uses stdin/stdout and uses xinetd to turn it into a socket server. When I run backend.php directly it works!

Could the updates/reboot have affected xinetd somehow? ...probably not. The xinetd files are all dated Dec 2007 except my own config file which is June 30th. And, the other 4 programs are started in exactly the same way and work fine.

Starting backend.php in exactly the same way as xinetd does works fine. (I've tried it both from bash and from sh.)

...time passes...

I seem to have a solution. (I had started working on a xinetd replacement php script, but that doesn't seem needed now.) My solution was to set xinetd to listen on another port, and use that port for just the problem program. This is working. I cannot explain it however: I had already tried connecting to only that program, so xinetd wouldn't have been doing anything different to now.

UPDATE: 7 weeks later and I do another reboot and it broke in exactly the same way!!
And again the only solution I could find was to give xinetd a new port for just this particular problem.
Weird, very, very weird, ...

2010-11-18 UPDATE: It was bothering me again, but on a different machine, and the trick with setting up xinetd on a different port didn't work. But adding this code to the top of backend.php seems to have solved it:
if(trim(getenv('DISPLAY'))=='')putenv('DISPLAY=:0.0');

The problem program has a GUI you see. I switch the GUI off with a commandline option ("hidewindows") but I think it actually creates windows and just hides them. And yes, none of the other programs uses a GUI... sigh, the clues were there.

Curiously the "hidewindows" option no longer works! That is probably going to annoy me, but not as annoying as not working at all!