Update on PKP Harvester

Late last month (Jan 22, 2007) the Public Knowledge Project released an updated version (2.0.1) of their PKP OAI Harvester software. I wrote about this tool a few months ago, but hadn’t really spent much time under the hood once I got the 2.0.0 release working in November. In the intervening months, I’ve discovered that I need to beef up my OAI harvesting chops, so a new PKP release was timely.

Unpacking the software, I quickly abandoned the “upgrade” option (the documentation didn’t track with what I was seeing on the screen) and just did a clean install. In less than 15 minutes I had a working copy of 2.0.1 running.

As you probably know, OAI (Open Archives Initiative) is all about achieving interoperability between systems via the exchange of metadata. Put another way (minus buzzwords), it’s just an agreed-upon way for one system (a harvester) to ask another system (a provider) about the contents of the provider’s database and make sense of the answer. The harvester asks the question via a specially crafted URL and the provider responds with an XML file. There’s more to it than that, but that simple description captures the essence of why the OAI-PMH (Protocol for Metadata Harvesting) exists.
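To make the “specially crafted URL” a little more concrete, here is a minimal Python sketch of how a harvester might assemble its questions. The `verb` and `metadataPrefix` arguments and the `oai_dc` format come from the OAI-PMH specification; the base URL and the `build_oai_request` helper are made up for illustration and have nothing to do with PKP’s own code.

```python
from urllib.parse import urlencode


def build_oai_request(base_url, verb, **params):
    """Return an OAI-PMH request URL for the given verb and arguments."""
    query = urlencode({"verb": verb, **params})
    return f"{base_url}?{query}"


# Hypothetical provider endpoint, used only for illustration.
BASE_URL = "http://repository.example.edu/oai/request"

# "Who are you?" -- asks the provider to describe itself.
identify_url = build_oai_request(BASE_URL, "Identify")

# "What do you hold?" -- asks for records in unqualified Dublin Core.
list_url = build_oai_request(BASE_URL, "ListRecords", metadataPrefix="oai_dc")

print(identify_url)
print(list_url)
```

Pasting either URL into a browser (with a real provider’s base URL swapped in) should bring back the XML answer directly.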

As an aside, the next iteration (OAI-ORE) has the potential to get really interesting. OAI-ORE (Open Archives Initiative-Object Reuse and Exchange) asks why we should be content with just exchanging metadata. Why not develop a protocol that enabled exchange of actual digital objects from various repositories—creating the opportunity for new and potentially more interesting intellectual products? Scholarly mashups? A functioning OAI-ORE infrastructure won’t arrive anytime soon (e.g., at the first meeting of the OAI-ORE Technical Committee last month, one goal was “reaching a shared problem statement”) but it is a promising idea and a project worth tracking.

But returning to the topic at hand, the nice thing about the PKP package is that it comes not only with an OAI harvesting module but also a MySQL backend to store and index the parsed information as well as a template-based user interface for the search and retrieval function. Point the harvesting module at a couple of OAI-compliant sites and in no time at all you’ve built your own local version of OAIster.

I began building our database by calling up the Administrative module on the Harvester’s web-based admin interface and filling out a form for our MARS system (base OAI URL, metadata format, index method, and so on). Clicked the “Update Metadata Index” button and harvesting began. At about 450 records, the process stopped with an XML error displayed in my browser. Dorothea found the record in MARS and noticed an errant control character embedded in one of its metadata fields. She cleaned that up and let me know that XML display errors almost always mean some sort of garbage in the data. After I restarted the harvest from scratch, PKP retrieved just over 1300 records.
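For what it’s worth, here is a rough Python sketch of how one could hunt for that kind of garbage before a harvester trips over it. XML 1.0 allows only tab, line feed, and carriage return among the low control characters, so anything else in that range will break a parser. The helper names and the sample title are my own inventions; this is not how Dorothea tracked down the record, just one way to do it.

```python
import re

# C0 control characters that XML 1.0 does not allow
# (tab \x09, line feed \x0a, and carriage return \x0d are permitted).
_ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def find_illegal_chars(text):
    """Return (position, hex code) for each illegal control character."""
    return [(m.start(), hex(ord(m.group())))
            for m in _ILLEGAL_XML_CHARS.finditer(text)]


def strip_illegal_chars(text):
    """Return the text with illegal control characters removed."""
    return _ILLEGAL_XML_CHARS.sub("", text)


# Made-up metadata value with a stray vertical tab in it.
title = "Annual report\x0b 2006"

print(find_illegal_chars(title))
print(strip_illegal_chars(title))
```

Fixing the record at the source, as Dorothea did, is the better cure; stripping on the way out only hides the problem.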

Next step was to identify OAI-compliant servers operated by higher education institutions in Virginia (my goal was to build a sort of regional gateway). I didn’t know of any OAI registries so I started my quest at OAIster. Spent a few minutes working through their 700+ OAI contributors but quickly saw that OAI base URLs weren’t included. A targeted lazyweb request (an email to the OAIster “contact us” link) was immediately productive. Within minutes I received a very helpful reply from Kat Hagedorn, the OAIster Metadata Harvesting Librarian, pointing me to the OAI-PMH Data Provider Registry operated by Tom Habing at the Grainger Engineering Library of the University of Illinois.

The registry is a great resource. Not only does it gather information on 1400+ repositories, it offers additional services as well. There’s an RSS feed of recent additions/edits as well as an SRU service. For example, this query returns information about our MARS system (the URL is too long to display inline):

SRU query for Mason

Armed with this information I located a few sites in Virginia and began building the index. As it was harvesting, I kept trying to figure out why the item count from our MARS system was 1300 (when our current handle count is closer to 2,000). Couldn’t figure it out so I deleted the MARS collection from within PKP and reharvested, wondering if it would hit 1300 again. Nope…this time it stopped at 939. Hmmm.

RTFM

While there’s no mention on the web-based “Administration screen”, the documentation warns that for larger sites you need to run the harvest.php program from the server’s command line—to eliminate the possibility of a web timeout. Tried it and all MARS items were harvested (1900 items). I can now turn my attention to making a few tweaks in the user interface (out of the box, the PKP software does not report the contributing repository when displaying matches).
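A bit of background on why a big harvest takes long enough to hit a timeout: OAI-PMH providers can hand back a large result set in pages, each ending with a `resumptionToken` that the harvester sends back to ask for the next page. The Python sketch below counts the records on one page and pulls out the token. The sample XML is invented, the `summarize_page` helper is mine, and I’m only assuming this paging is what made the web-based harvest stop partway; I haven’t checked how PKP handles it internally.

```python
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def summarize_page(xml_text):
    """Return (record count, resumption token or None) for one response page."""
    root = ET.fromstring(xml_text)
    records = root.findall(".//oai:record", OAI_NS)
    token = root.find(".//oai:resumptionToken", OAI_NS)
    # An empty or missing token means this was the last page.
    next_token = token.text.strip() if token is not None and token.text else None
    return len(records), next_token


# Invented, heavily trimmed ListRecords response for illustration.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example:1</identifier></header></record>
    <record><header><identifier>oai:example:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

count, next_token = summarize_page(SAMPLE)
print(count, next_token)
```

Adding up the counts page by page until the token comes back empty gives an independent total to compare against what the harvester reports.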

Our test system is here:

http://furbo.gmu.edu/OAIharvester

YAZ PROXY/SRU/VOYAGER/SOLARIS

Looking for a reader who has experience with the YAZ proxy on Solaris. I have a few questions about configuring the software to provide SRU services to the Voyager Z39.50 interface. Hit the “Email Me” link over on the right if you don’t mind a couple of quick questions.