Back to MARS


[Image: MarsEdit icon]
For the past year and a half, Dorothea has been our go-to person for MARS (our DSpace installation) and she’s handled most of the sysadmin duties on our Xserve as well. She’ll be leaving us at the end of this month (heading back to Madison to straighten out their statewide Minds@UW institutional repository). OK, I’m kidding about the “straightening out” part, but unfortunately the rest is quite true.

This means, of course, that DSpace falls back into my lap until we go through a successful “search and hire” process for Dorothea’s replacement (stay tuned for information).

To help me ease back into all things DSpace, today I decided to try to fix a problem that’s been hanging around for months. About once every seven days, idle Postgres connections accumulate in such numbers that Postgres bumps up against its ‘active process limit.’ DSpace, unable to grab a new connection to the Postgres database, instead turns its attention to spamming me with multiple error-message emails. The behavior seems to correspond to the frequency with which the Google crawlers visit us (their deep-burrowing crawl triggers multiple connections that never seem to terminate).

Doing my pre-coding homework, I launched DEVONagent to see if someone else had already solved this problem. Not really, but I did find a workaround hack that Cory Snavely posted on the DSpace-Tech mailing list a few weeks ago. He suggested building a cron script that uses pgrep to find any processes with “idle in transaction” in the process description, pipes that list to wc to count them, and then uses pkill to kill off the oldest one if there are more than 20.
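For the curious, here is roughly what that approach looks like as a shell sketch. It is reconstructed from the description above and is not Cory's actual script. It assumes a pgrep/pkill that supports -f (match the full command line) and -o (pick the oldest match); the function names are made up for illustration.

```shell
#!/bin/sh
# Sketch of the approach described above; NOT Cory's actual script.
# Assumes a pgrep/pkill with -f (match the full command line) and
# -o (select the oldest match), as on Solaris and most Linux systems.

PATTERN='idle in transaction'   # text Postgres shows in the process title
LIMIT=20                        # threshold quoted in the description

# Succeeds (exit status 0) when count $1 is strictly greater than limit $2.
over_limit() {
    [ "$1" -gt "$2" ]
}

# Print how many running processes match the pattern.
# pgrep prints one PID per line; tr strips the padding some wc builds add.
count_idle() {
    pgrep -f "$PATTERN" | wc -l | tr -d ' '
}

# Kill only the oldest matching process, and only when over the limit.
reap_oldest_if_needed() {
    if over_limit "$(count_idle)" "$LIMIT"; then
        pkill -o -f "$PATTERN"
    fi
}
```

A cron job would run a file like this with a final call to `reap_oldest_if_needed`. Killing one process per run keeps the treatment gentle: the count drifts back under the limit instead of every idle connection vanishing at once.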

Sounded like a plan. In fact, it’s a great script if you’re running DSpace/Postgres on Solaris or most versions of Linux. We’re on OS X so of course I get to think different(ly).

For starters, pgrep and pkill don’t ship with OS X Server. Annoying (they’ve been on Solaris since 2.7) but not a showstopper. I found a port for Darwin (OS X) on SourceForge (proctools) and compiled it.

First try didn’t work because I forgot that my desktop Mac Pro is Intel and our server runs on G5s. Moving the source over and compiling on the server fixed that problem. My next surprise was that this ported version of pkill doesn’t support the “-o” flag that Cory was using to kill off the oldest idle process. So, my version kills off the newest match instead. (Tip: don’t add the “-v” switch to “-n” thinking it will flip newest to oldest. The “-v” switch inverts the selection, so -nv kills off all EXCEPT the newest matching process. That could get sort of dangerous.)
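If killing the oldest match really matters, one way to fake “-o” is to ask ps how long each candidate has been running and pick the largest value. The sketch below is just that, a sketch: it assumes a ps that supports the POSIX `etime` output field, and both function names are hypothetical helpers, not part of proctools.

```shell
# Convert a ps "etime" value ([[dd-]hh:]mm:ss) to a number of seconds.
etime_to_seconds() {
    echo "$1" | awk '{
        days = 0
        t = $1
        # Optional leading "dd-" day count.
        if (index(t, "-") > 0) { split(t, d, "-"); days = d[1]; t = d[2] }
        n = split(t, p, ":")
        if (n == 3)      secs = p[1] * 3600 + p[2] * 60 + p[3]
        else if (n == 2) secs = p[1] * 60 + p[2]
        else             secs = p[1]
        print days * 86400 + secs
    }'
}

# Read PIDs (one per line) on stdin and print the longest-running one.
# Assumption: "ps -o etime= -p PID" prints that process's elapsed time.
oldest_pid() {
    best_pid=''
    best_age=-1
    while read -r pid; do
        etime=$(ps -o etime= -p "$pid" 2>/dev/null | tr -d ' ')
        [ -n "$etime" ] || continue      # process already gone
        age=$(etime_to_seconds "$etime")
        if [ "$age" -gt "$best_age" ]; then
            best_age=$age
            best_pid=$pid
        fi
    done
    [ -n "$best_pid" ] && echo "$best_pid"
}
```

Used as `pgrep -f '127.0.0.1' | oldest_pid`, this would print a single PID that could then be handed to kill. It costs one ps call per candidate, which is fine for a couple dozen processes.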

At any rate, I ended up with a little script using test, pgrep and pkill and decided to use launchd instead of cron to call it (that’s the new way, after all). That added a 45-minute excursion into launchd documentation and multiple iterations of testing, but I finally got that working (thanks to the Lingon utility).
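For anyone skipping Lingon, a launchd job that runs a script on a fixed interval needs only a few keys: Label, ProgramArguments and StartInterval. Here is a minimal sketch that prints such a property list. The label and script path in the usage note are placeholders I made up, not the ones on our server.

```shell
# Print a minimal launchd property list that runs a script every N seconds.
#   $1 = job label, $2 = absolute path to the script, $3 = interval in seconds
print_launchd_plist() {
    label=$1
    script=$2
    interval=$3
    cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>${label}</string>
    <key>ProgramArguments</key>
    <array>
        <string>${script}</string>
    </array>
    <key>StartInterval</key>
    <integer>${interval}</integer>
</dict>
</plist>
EOF
}
```

Something like `print_launchd_plist edu.example.reap-idle /usr/local/bin/reap_idle.sh 60 > /Library/LaunchDaemons/edu.example.reap-idle.plist` followed by `launchctl load` on that file should register the job, assuming the script is executable and the plist is owned by root.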

I think I’ll make one last modification to my script once I’m sure it’s working well enough to leave in place: have it email me once a week to remind me that it’s running. I can just imagine that in a few weeks I’ll forget all about it and someday the fact that the newest postgres idle process seems to disappear every 60 seconds will have me scratching my head.

Here’s the money line in the OS X version of the script:

/bin/test `/usr/local/bin/pgrep -f '127.0.0.1' | \
/usr/bin/wc -l ` -gt 20 && /usr/local/bin/pkill -n -f '127.0.0.1'

I’m logging the results of my script so I can better track how the idle processes rise and fall. Here’s what the log looks like:

07Feb2007:15:40::Idle: 7
07Feb2007:15:41::Idle: 7
07Feb2007:15:42::Idle: 7
07Feb2007:15:43::Idle: 7
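The logging itself takes only a few lines of shell. This is a sketch of one way to produce lines in that format, not a copy of my script; the function names are invented and the log path is whatever you pass in.

```shell
# Build one log line in the format shown above: <timestamp>::Idle: <count>
format_log_line() {
    printf '%s::Idle: %s\n' "$1" "$2"
}

# Timestamp like 07Feb2007:15:40 (day, abbreviated month, year, hour, minute).
# LC_ALL=C keeps the month abbreviation in English regardless of locale.
log_timestamp() {
    LC_ALL=C date '+%d%b%Y:%H:%M'
}

# Append the current idle count ($1) to the log file ($2).
log_idle_count() {
    format_log_line "$(log_timestamp)" "$1" >> "$2"
}
```

Called once per run with the pgrep count and a log file path, this leaves one line per minute, which is enough to chart how the idle count rises and falls.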

Will be interesting to see what happens when Google starts bumping up the idle processes and my script is killing them off, one every 60 seconds. With luck it will work well enough to keep the server below the error-spam threshold.

I’ll be sure to update this post if it turns out this brute-force, symptomatic treatment creates any new problems (or fails to fix the problem it’s been tasked to solve).

* thanks to MarsEdit (a great off-line blog editor) for use of the logo.