So, what went wrong the other day?

If you’re a user of one of the services provided by the library systems office (e.g., the library website or our online catalog), you probably noticed that neither of those was working last night. In fact, some pieces of our remarkably intertwined web of services didn’t return until early this morning. I would like to take this opportunity to explain what caused the problem, but I never did figure it out.

I can describe the symptom: our instance of Apache httpd had been running for about 65 days. I needed to insert a couple of redirects to launch our new ‘database portal’, which meant a few small edits in httpd.conf and a server restart. Did the edit, stopped the server and restarted to pick up the changes. Unfortunately, the server refused to start back up. I ran syntax checks on the configuration files: no errors. Reverted to a series of different backup versions (just in case some mod had been made since I last rebooted): didn’t help. Examined system and httpd log files: nothing. ‘Trussed’ the startup: nothing helpful. The only indication I had that something was wrong (beyond noticing that it didn’t run) was: “exit status: 3”

…the sysadmin equivalent of the “check engine” light.
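
For what it’s worth, the check-then-restart sequence described above can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used that night: it assumes an `apachectl` script is on the PATH, and the helper names `run_step` and `check_then_restart` are made up for the example. As the story shows, a clean syntax check is no guarantee the server will actually come back up, so the sketch reports the exit status of each step rather than assuming success.

```python
import subprocess


def run_step(cmd):
    """Run one command and return (exit_status, combined stdout/stderr)."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return proc.returncode, proc.stdout


def check_then_restart(apachectl="apachectl"):
    """Restart httpd only if the config syntax check passes.

    `apachectl` is an assumed path; adjust it for the local install.
    Returns the exit status of the last step that ran.
    """
    # Step 1: syntax check of httpd.conf and included files.
    status, output = run_step([apachectl, "configtest"])
    if status != 0:
        print("configtest failed (exit status %d):\n%s" % (status, output))
        return status

    # Step 2: restart, and surface whatever exit status comes back.
    status, output = run_step([apachectl, "restart"])
    if status != 0:
        print("restart failed (exit status %d):\n%s" % (status, output))
    return status
```

Calling `check_then_restart()` from a small script at least puts the exit status and any output in one place, which beats scrolling back through the console while the phone is ringing.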

My current theory is that something in the patching I did to Solaris in the intervening 65 days may have broken this particular (Sun-supplied) instance of Apache. It’s only a theory and a flimsy one at that.

Unable to just take the server to the dealer, I had to scramble to move all the services to other machines and then reconnect them, a process that took most of the evening and the better part of today to complete. It gets complicated when you serve up a form from one virtual host, normalize it on another, do a query of MySQL that’s sitting on yet another box, then feed the results to yet another machine that adds a bit “more value” as they say in the marketing brochures. This server’s been a departmental workhorse for three or four years, so even figuring out exactly what applications it plays a part in takes time. I came away from this experience realizing that Web 2.0 is really just a buzzword to mask what’s fast becoming a geometrical increase in our potential points of failure. Wasn’t this out-of-sync universe the reason we all wanted out of client/server?

But now, everything’s running again…and that low hum you’d hear if you were to walk in the office is probably just the swarms of web redirects flying around the room, giving the audience the sense that everything has meshed neatly into a seamless experience…

Over the next week, I’m going to build a new V240 (Solaris 10) server and migrate this ‘patched together’ illusion from three machines back to one, upgrading some software in the process. The unexpected meltdown reminded me of why it’s a good idea to update software once in a while, even if it’s running fine. When you’re glued to the console and the phone’s ringing with people wondering what happened to their favorite application, that’s not the perfect time to begin trolling Google looking for source so you can recompile DBI support into a copy of Perl and find a version of the PHP source that matches the quirks you know an old version of the Apache module is counting on.

My other takeaway from this: I really do need to go back and finish porting our in-house e-reserves system to PHP so I can forget about Perl/DBI.
