So, what went wrong the other day?


If you’re a user of one of the services provided by the library systems office (e.g., the library website or our online catalog), you probably noticed that neither of those was working last night. In fact, some pieces of our remarkably intertwined web of services didn’t return until early this morning. I’d like to take this opportunity to explain what caused the problem, but I never did figure it out.

I can describe the symptom: our instance of Apache httpd had been running for about 65 days. I needed to insert a couple of redirects to launch our new ‘database portal’, which meant a few small edits in httpd.conf and a server restart. Did the edit, then stopped and restarted the server to pick up the changes. Unfortunately, the server refused to start back up. I ran syntax checks on the configuration files: no errors. Reverted to a series of different backup versions (just in case some mod had been made since I last rebooted): didn’t help. Examined system and httpd log files: nothing. ‘Trussed’ the startup: nothing helpful. The only indication I had that something was wrong (beyond noticing that it didn’t run) was: “exit status: 3”
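For the curious, here is a minimal sketch of the kind of check described above: run a command and capture both its exit status and anything it prints. It is written in Python purely for illustration. The helper name `run_and_report` is hypothetical, and this is not the procedure that was actually followed that night.

```python
import subprocess
import sys


def run_and_report(command):
    """Run a command and return (exit_status, combined_output).

    Hypothetical helper for illustration only; not part of any Apache tooling.
    """
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # fold stderr into stdout so no message is lost
        text=True,
    )
    return result.returncode, result.stdout


if __name__ == "__main__":
    # Stand-in for a process that exits with status 3 and says nothing at all.
    status, output = run_and_report(
        [sys.executable, "-c", "import sys; sys.exit(3)"]
    )
    print(f"exit status: {status}")
    print(f"output: {output!r}")
```

Pointed at a real server, the same helper could be called as `run_and_report(["apachectl", "configtest"])`, assuming `apachectl` is on the PATH. A clean syntax check followed by a silent non-zero exit on startup is exactly the unhelpful combination described here.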

…the sysadmin equivalent of the “check engine” light.

My current theory is that something in the patching I did to Solaris in the intervening 65 days may have broken this particular (Sun-supplied) instance of Apache. It’s only a theory and a flimsy one at that.

Unable to just take the server to the dealer, I had to scramble to move all the services to other machines and then reconnect them, a process that took most of the evening and the better part of today to complete. It gets complicated when you serve up a form from one virtual host, normalize it on another, do a query of MySQL that’s sitting on yet another box, then feed the results to yet another machine that adds a bit “more value” as they say in the marketing brochures. This server’s been a departmental workhorse for three or four years, so even figuring out exactly which applications it plays a part in takes time. I came away from this experience realizing that Web 2.0 is really just a buzzword to mask what’s fast becoming a geometric increase in our potential points of failure. Wasn’t this out-of-sync universe the reason we all wanted out of client/server?
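To put a rough number on that points-of-failure worry, here is a back-of-the-envelope sketch. It assumes each hop in a request chain fails independently and that a request needs every hop to be up. The 99% figure and the function name `chain_availability` are illustrative assumptions, not measurements from our servers.

```python
import math


def chain_availability(availabilities):
    """Probability that every hop in a serial chain is up at the same time.

    Assumes hops fail independently, which real systems rarely do exactly.
    """
    return math.prod(availabilities)


if __name__ == "__main__":
    # Illustrative only: form host, normalizer, MySQL box, "value add" box,
    # each assumed to be up 99% of the time.
    hops = [0.99, 0.99, 0.99, 0.99]
    print(f"one box:   {hops[0]:.4f}")
    print(f"four hops: {chain_availability(hops):.4f}")
```

Under those assumptions, four machines that are each up 99% of the time give a chain that is up only about 96% of the time, and every extra hop multiplies the figure down further.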

But now, everything’s running again…and that low hum you’d hear if you were to walk in the office is probably just the swarms of web redirects flying around the room—giving the audience the sense that everything has meshed neatly into a seamless experience…

Over the next week, I’m going to build a new V240 (Solaris 10) server and migrate this ‘patched together’ illusion from three machines back to one, upgrading some software in the process. The unexpected meltdown reminded me of why it’s a good idea to update software once in a while, even if it’s running fine. When you’re glued to the console and the phone’s ringing with people wondering what happened to their favorite application, that’s not the perfect time to begin trolling Google looking for source so you can recompile DBI support into a copy of Perl and find a version of the PHP source that matches the quirks you know an old version of the Apache module is counting on.

My other takeaway from this: I really do need to go back and finish porting our in-house e-reserves system to PHP so I can forget about Perl/DBI.