Harvesting tips


Today’s post consists of a simple UNIX tip and a note about an interesting piece of software. Given that graphic, let’s start with the tip…

Recursive grep

Most anyone comfortable with the command line knows how to use grep to find a file that contains a particular bit of text—typically to see the line(s) where the match(es) occur. The next level of complexity (and the thing that was giving me trouble today until I figured this out) is searching recursively through a directory tree with grep to find files that reside in subdirectories below your starting point. To save you the trouble of searching this out, here’s one way to do it:

find [StartPoint] -depth -print | xargs grep [LookFor] <return>

To start in the current directory and then also search all files below that directory for the string “Harvester” you’d enter:

find . -depth -print | xargs grep "Harvester" <return>

Basically you’re using “find” to list every file and directory name under the directory you’re in, then using a pipe (the | character) to feed those names to xargs, which packs them into argument lists and hands them to grep in batches. If you’re searching for something common, the output could be quite lengthy, so you might want to send it to a text file (just add a > and a path and filename to the end of the command string shown).
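If your file names contain spaces, or you get tired of grep complaining about directories, a slightly more defensive variant of the same idea might look like the sketch below. The throwaway directory and file names are made up for the demo; the flags (-type f, -print0, xargs -0, grep -n and -l) are standard in the GNU and BSD versions of these tools.

```shell
# Build a small throwaway tree to search (hypothetical file names).
demo=$(mktemp -d)
mkdir -p "$demo/templates/common sub"
printf 'OAI Harvester title\n' > "$demo/templates/common sub/header.tpl"
printf 'nothing to see\n' > "$demo/index.php"

# -type f skips directories; -print0 and -0 keep names with spaces intact.
# The extra /dev/null argument makes grep print the file name even when
# xargs happens to hand it a single file.
matches=$(cd "$demo" && find . -type f -print0 | xargs -0 grep -n "Harvester" /dev/null)
printf '%s\n' "$matches"

# Same search, but keep only the names of matching files and save them
# to a text file outside the tree being searched.
(cd "$demo" && find . -type f -print0 | xargs -0 grep -l "Harvester" /dev/null) > "$demo.hits.txt"
```

Many grep builds also accept grep -r "Harvester" . as a shortcut for the whole pipeline, but the find version still earns its keep whenever you want to filter by name, age or type first.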

Basic stuff but I’m sure I’ll look back here in a few months to remember the syntax. As an aside, any time you spend studying the “find” command will repay you many times over in keystrokes and time saved.

OAI Harvester

This next “tip” is a great illustration of what the government can accomplish when it decides to help libraries become more powerful stewards of the digital realm. Unfortunately (for my American readers), I’m talking about the Canadian government but given the proximity we can hope this enlightenment might eventually begin to trickle down.

Here’s an excerpt from the PKP website that explains what they’re about:

The Public Knowledge Project is a federally funded research initiative at the University of British Columbia and Simon Fraser University on the west coast of Canada. It seeks to improve the scholarly and public quality of academic research through the development of innovative online environments. PKP has developed free, open source software for the management, publishing, and indexing of journals and conferences. Open Journal Systems and Open Conference Systems increase access to knowledge, improve management, and reduce publishing costs.

We’ve installed their Open Journal Systems (OJS) package and have been quite happy with it. It offers a simple but well-designed platform for hosting the “backoffice” aspects of e-journal publishing as well as managing the presentation of the content for readers. Recommended.

Over the past month or so I’ve been thinking about ways to incorporate OAI services to build different front-ends to digital storage/archiving systems like DSpace, EAD collections, and so on. Following a tip from a colleague, I downloaded and installed another piece of software from the PKP group: the OAI Harvester. Why this amazing app isn’t listed on the “Tools” page maintained by the openarchives.org site escapes me.

It’s a LAMP application (Linux, Apache, MySQL, PHP) but runs quite nicely as a MAMP installation on OS X Server (happily using the MySQL, Apache and PHP installations that ship with 10.4.x). Also plays well with the APC cache I talked about the other day. It took only a few minutes to install, a couple more minutes to configure and was almost immediately useful.

After typing in the OAI query URLs for two DSpace installations (Mason and WRLC), I waited, and within a minute or two I had a nice “union” catalog of the content from both systems. I then added another neighbor’s collections (University of Maryland). Here’s what it looks like (I’m tweaking/testing this installation so it may or may not be running when you follow the link):

http://furbo.gmu.edu/OAIharvester

Right now I’m playing around with “branding” the setup with some Mason-specific mods (which explains why I was trying to grep the string “Harvester” out of the various sub-directories) and generally learning how the package is put together. I may next expand my test database to cover OAI-compliant systems within Virginia and see how that goes. To find the appropriate OAI-query URLs, I’ll start at the OAI provider registry maintained by the University of Illinois at Urbana-Champaign.
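Before handing a provider’s URL to the harvester, it can help to poke at it by hand. Here’s a rough sketch of how the request URLs are put together. The base URL is a placeholder I made up, and oai_url is just a hypothetical helper for this post; the verb names and the oai_dc metadata prefix come from the OAI-PMH protocol itself.

```shell
# Hypothetical base URL; substitute the OAI base URL of a real repository.
base="http://repository.example.edu/oai/request"

# Print the request URL for an OAI-PMH verb plus optional key=value arguments.
oai_url() {
  verb=$1; shift
  url="$base?verb=$verb"
  for arg in "$@"; do
    url="$url&$arg"
  done
  printf '%s\n' "$url"
}

# Identify asks the repository to describe itself.
oai_url Identify

# ListRecords with the oai_dc prefix asks for Dublin Core records.
oai_url ListRecords metadataPrefix=oai_dc

# To fetch for real (needs network access and a live repository):
# curl -s "$(oai_url Identify)"
```

If the Identify request comes back as a sensible chunk of XML, the URL is probably a reasonable candidate to give to the harvester.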

Once I figure out all the ways to fix and break this package, and then sort out a group of OAI providers that it makes sense to collect, I think we’ll roll out a production version of this software.

http://pkp.sfu.ca/?q=harvester