Mechanical Turk as Collection Development Tool


Poking around Amazon’s Mechanical Turk today, I found this “HIT” (Human Intelligence Task) available to webworkers.

The author/publisher is offering $4.00 if you request the book from your library (which I guess they hope will trigger a wave of purchases). I don’t know why it surprised me to see that this sort of thing happens…


Mason Tweets


Earlier today a tweet from Dan Cohen pointed me to an interesting service offered by NC State:

http://twitter.ncsu.edu/

They were nice enough to offer a link to their Zend Framework-based PHP code on the site, so I spent a few minutes today building a Mason tweet aggregator. It still needs a bit of work, and I appreciate the fact that it has that Web 1.0 look that seems to come so effortlessly to me, but it does work and I’ll eventually get around to “styling” it.
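
For the curious, the aggregation idea is simple enough to sketch. Here’s a toy version in Python (the real aggregator reuses NC State’s PHP/Zend code; the feed URLs and the feedparser dependency here are illustrative assumptions, not details from their implementation):

    # Toy tweet aggregator: merge several accounts' feeds into one timeline.
    # Feed URLs are placeholders; feedparser is a third-party library
    # (pip install feedparser).
    import time
    import feedparser

    FEEDS = [
        "http://example.edu/feeds/account1.rss",  # placeholder URLs
        "http://example.edu/feeds/account2.rss",
    ]

    entries = []
    for url in FEEDS:
        entries.extend(feedparser.parse(url).entries)

    # Sort newest-first across all accounts.
    epoch = time.gmtime(0)
    entries.sort(key=lambda e: e.get("published_parsed") or epoch, reverse=True)

    for e in entries[:50]:
        stamp = time.strftime("%Y-%m-%d %H:%M", e.get("published_parsed") or epoch)
        print(stamp, "-", e.title)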

http://tweet.gmu.edu

OCR, Image/Text PDFs and the Mac


This week I’ve been staring at a collection of just over 29,000 PDFs: image-only copies of thousands of documents created with “…the software that came with the scanner.”

My task? Figuring out the right tools and workflow to get these PDFs through an OCR process so we can unlock the content and make them more accessible. A number of these documents will end up in our MARS system, so exposing the text to the PDFBox indexing code that ships with DSpace is critical. (As an aside, I’ve heard that Xpdf is a really nice replacement for PDFBox, but I haven’t had time to slot it into our DSpace install yet.)
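
Before any OCR runs, it’s worth triaging which PDFs actually lack a text layer. A minimal sketch, assuming the third-party pypdf library (my choice for illustration only) and a hypothetical directory name:

    # Triage: flag PDFs with no extractable text layer (the OCR candidates).
    from pathlib import Path
    from pypdf import PdfReader  # third-party: pip install pypdf

    def has_text_layer(pdf_path, pages_to_check=3):
        """True if any of the first few pages yields extractable text."""
        reader = PdfReader(str(pdf_path))
        for i in range(min(pages_to_check, len(reader.pages))):
            if (reader.pages[i].extract_text() or "").strip():
                return True
        return False

    incoming = Path("incoming_pdfs")  # hypothetical directory
    pdfs = sorted(incoming.glob("*.pdf"))
    needs_ocr = [p for p in pdfs if not has_text_layer(p)]
    print(f"{len(needs_ocr)} of {len(pdfs)} PDFs have no text layer")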

I don’t have a precise OCR accuracy threshold in mind, but I assume that if we can hit the mid-90% range, retrieval won’t suffer.

A 2001 study from Harvard University Library found that 96.6% of searches succeed on uncorrected OCR’d text. Also worth a look is Rose Holley’s recent article in D-Lib Magazine (“How Good Can It Get?”); she offers a number of interesting ideas on improving OCR accuracy in a large-scale digitization project. For some reason, most of the literature on OCR accuracy and retrieval focuses on scientific literature, where it appears to make very little difference. [article behind paywall] [freely viewable version]

An ideal workflow would look something like this: fill a directory with image-only PDFs and point some sort of OCR process toward it. The final product would be yet another directory that contains “image-over-text” versions of the original PDFs (wherein the OCR’d text resides ‘inside’ the PDF as an extra ‘layer’ of content).
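
As a sketch of that workflow in Python (with a purely hypothetical ocr-tool command standing in for whichever OCR engine wins out, and made-up directory names), the driver could be this simple:

    # Walk a directory of image-only PDFs and write image-over-text versions
    # to an output directory. "ocr-tool" is a hypothetical placeholder CLI.
    import subprocess
    from pathlib import Path

    SRC = Path("image_only_pdfs")   # hypothetical input directory
    DST = Path("ocr_pdfs")          # hypothetical output directory
    DST.mkdir(exist_ok=True)

    for pdf in sorted(SRC.glob("*.pdf")):
        out = DST / pdf.name
        if out.exists():
            continue  # skip finished files so the run is restartable
        result = subprocess.run(["ocr-tool", str(pdf), str(out)],
                                capture_output=True, text=True)
        if result.returncode != 0:
            print(f"FAILED {pdf.name}: {result.stderr.strip()}")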

I’m trying out Mac-based solutions first, knowing that if it ends up being a Windows-based workflow we’ll likely use OmniPage, a product we already use with our ATIZ book scanner.


JavaScript speed


I’ve long thought that if you wanted the fastest browser experience on a Mac, you went with the nightly WebKit build from http://nightly.webkit.org/.

So I was surprised today when I happened on the SunSpider JavaScript benchmark site and put several browsers through their paces.

One caveat: this test measures only the core JavaScript engine, not other browser APIs or features. The results (smaller number is better):

Machine: Mac Pro (dual 2.8 GHz quad-core); OS X 10.6.1

  • Firefox 3.5.3 (32-bit): 1036.8 ms
  • WebKit Nightly r49008 (64-bit): 434.8 ms
  • Google Chrome 4.0.212.1 (32-bit): 434.4 ms
  • Safari 4.0.3 (64-bit): 364.6 ms

Fix it till it breaks – into 64 bits


I like to fix things till they break. Today’s post is a cautionary tale for that admittedly small niche of sysadmins running OS X Server on Xserves upgraded in place from Leopard to Snow Leopard…

For the past few weeks I’ve been tweaking the JSP interface of our MARS system and doing some overdue “authority control” cleanup on subjects and authors. That’s been going so well that late this afternoon I decided to take a crack at updating a few packages originally installed via MacPorts back when the server was running Leopard Server (the in-place Snow Leopard upgrade didn’t disturb the code in the /opt/local destination for MacPorts installs).

I pulled down version 3.2 of Apple’s Developer Tools (to ensure 10.6 compatibility) and went to work. In no time at all I had upgraded ant, maven, postgres, bison, wget, openssl and a host of other dependencies. Rebooted, and the fun began. First up, Postgres:

FATAL: incorrect checksum in control file

Never saw that before.

Found a web posting on a Linux site explaining that this could easily happen if you tried to open a database with a 64-bit version of Postgres when it had been closed by a 32-bit version. Then it hit me: of course, on an Xserve, Snow Leopard Server defaults to 64-bit builds. Under Leopard, I had built a 32-bit version of Postgres.

Recommended solution from the Linux posting: forget about repairing it in place. The only fix is to open the database under a 32-bit version of Postgres, dump the data, and reimport it into a new database created by a 64-bit version.
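
For the record, the dump-and-reload dance would look roughly like this (a sketch only; the install paths, Postgres version, and database name are placeholders):

    # Dump with the old 32-bit binaries (the only ones that can open the
    # cluster), then recreate and reload under the 64-bit server.
    import subprocess

    OLD_BIN = "/opt/local32/lib/postgresql83/bin"  # hypothetical 32-bit install
    NEW_BIN = "/opt/local/lib/postgresql83/bin"    # hypothetical 64-bit install
    DB = "marsdb"                                  # placeholder database name

    # 1. With the 32-bit server running, dump the database to plain SQL.
    subprocess.run([f"{OLD_BIN}/pg_dump", "-f", "/tmp/marsdb.sql", DB], check=True)

    # 2. Switch to the 64-bit server (fresh data directory), then reload.
    subprocess.run([f"{NEW_BIN}/createdb", DB], check=True)
    subprocess.run([f"{NEW_BIN}/psql", "-d", DB, "-f", "/tmp/marsdb.sql"],
                   check=True)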

I backed out my 64-bit upgrades, manually uncommented the “build_arch i386” line in macports.conf to force 32-bit builds, then started rebuilding 32-bit versions of all the code. That fixed most everything, but not Postgres: I still had at least one library mismatch crashing that compilation.

As a last-ditch effort, I tarred up the entire /opt/local tree and did a nearly full replacement from a sparse image clone of the machine’s boot drive, made with SuperDuper just before the Snow Leopard upgrade (meaning all that code was 32-bit). I didn’t disturb /opt/local/var/db (that’s where my Postgres database lived), but deleted and then restored these three directories from the sparse image backup:

  • /opt/local/lib
  • /opt/local/bin
  • /opt/local/share

Rebooted…success!

To enable use of the “port” command on this box, I then reinstalled the Snow Leopard version of MacPorts (restoring selected parts of /opt/local from the backup had broken the port command). That went smoothly and “port” now works.

My takeaway: Say ‘no’ to that little voice in your head that suggests you should “improve” a system that’s running well…and don’t ever say anything bad about sparse image backups.

iPhone / iTouch / Android enabled


Got an iPhone 3GS the other day (clearly I’m behind the mobile curve but then I hate talking on the phone so it took me a while to “get it”).

Anyway, after just a few days with the thing, I realize I need to begin tweaking some of the library’s web-based content.

First (easy) step? A touch-device-friendly theme for this weblog. I applied it to the library’s news blog as well. Thus far it’s working well, and it couldn’t be any easier to implement: just drop the code in your plugins folder and activate it. The plugin detects mobile devices automatically and serves the touch theme when appropriate.

http://www.bravenewcode.com/wptouch/

Tip: to do a screen capture on your 3G/3GS iPhone, hold down the home button, then hit the “top” (sleep/wake) button. The screen flashes and the image goes into your photos folder.

MARS update (final)


Completed upgrading the new MARS (DSpace) server from 10.5.8 to 10.6 (Snow Leopard Server). Went very smoothly. Here’s the sequence:

  • run SuperDuper to clone boot drive to second drive in XServe
  • shut down server, pull boot drive and set it aside (for disaster recovery)
  • move second drive to boot drive spot
  • boot off cloned drive
  • insert new disc
  • click on “install Snow Leopard Server”
  • enter serial number when prompted
  • reboot
  • all systems running normally
  • shut down a second time and reinsert the original boot drive in the second drive slot
  • after running updated OS for two days, will clone new boot drive to second drive

One quirk I’ve noticed (haven’t seen any mention of this elseweb): the /etc/rc.local that was faithfully launching Tomcat under 10.5.8 doesn’t seem to work under 10.6. I’m launching Tomcat manually for now, until I figure out enough launchd magic to re-enable autostarting (my last session with Lingon wasn’t that successful).
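
For what it’s worth, I’d expect the launchd job to look something like the sketch below (untested; the label, Tomcat path, and JAVA_HOME are my guesses, not values from this box):

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
      "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
    <plist version="1.0">
    <dict>
        <!-- Unique job label; file lives in /Library/LaunchDaemons -->
        <key>Label</key>
        <string>edu.gmu.tomcat</string>
        <!-- "catalina.sh run" keeps Tomcat in the foreground for launchd -->
        <key>ProgramArguments</key>
        <array>
            <string>/usr/local/tomcat/bin/catalina.sh</string>
            <string>run</string>
        </array>
        <key>EnvironmentVariables</key>
        <dict>
            <key>JAVA_HOME</key>
            <string>/Library/Java/Home</string>
        </dict>
        <key>RunAtLoad</key>
        <true/>
    </dict>
    </plist>

Loaded once with “sudo launchctl load -w /Library/LaunchDaemons/edu.gmu.tomcat.plist”, that should bring Tomcat up at every boot.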

I’ll also begin testing whether this new OS might solve the problem(s) that left our Manakin interface too unstable for everyday use.

You can check on that interface via this link:

http://digilib.gmu.edu:8080/xxmlui

MARS update


Finally have a tolerable JSP (Java Server Pages) interface up and running. I’m no web designer, but after firing up CSSEdit and reading (then re-reading) the DSpace 1.5.2 manual, I was able to chase away most of the “boxy-ness” of the default JSP interface. Some things have changed since I last understood DSpace internals (roughly the 1.2 release), but it was similar enough that in no time at all I was back in that “find DSpace and replace with MARS” groove.

I’m certainly ready to say the JSP interface is more stable than Manakin; it seems to run smoother and use fewer resources too. Downside? It’s more work to bend it into something visually interesting, and it’s clearly less flexible.

Given the experience of the past few weeks, today I’m thankful for stable.

http://mars.gmu.edu

Catching Up


Yes, those are in fact cobwebs hanging from the corners. It’s not that I’ve been too busy to write something, just distracted, I guess. Vacation…a few other short trips…couple of deadlines…a digression into the twittersphere…it’s pretty easy to fall behind on one’s blogging…

Award winning

A few weeks ago we received one of the 2009 Campus Technology Innovation Awards. I wasn’t really expecting that (I heard there were 349 nominees), so it came as a nice surprise even as it managed to burn up the last of my 15 minutes of fame. (I lost the bulk in June when this AP story appeared in over 2,000 places. My favorite was this one, which nicely showcased my sudden-onset facility with Spanish.)


History MetaFinder


It’s not ‘feature complete’ but our new Mason Metafinder is up and running. The idea is to build small federated search engines for our various research portals. My testbed has been development of a search engine for an Early American history portal. Right now, it covers these seven sources:

  • Historical Abstracts
  • American Memory Project
  • Arts & Humanities Search
  • JSTOR
  • America: History & Life
  • OAIster
  • WorldCat

As of June 28th, we’ve added:

  • Early American Imprints (1801-1819)
  • Google Scholar
  • Library of Virginia
  • National Archives web site

We’ll likely add a few more sources as we go forward, but for now there’s enough content to begin testing the system and to understand (and appreciate) both the power and the limitations of federated searching.

A couple of quick points:

  • Google is a “just in case” sort of search product. Content is collected and indexed by Google just in case you ask. By contrast, Metafinder is a “just in time” sort of thing. When you launch a search, targets are searched in real time and results flow back at an unpredictable pace. For that reason, you’ll see Metafinder pop up a “do you want to add these results to your set” message from time to time during a search session. (A toy sketch of this just-in-time fan-out appears after this list.)
  • Metafinder isn’t doing an exhaustive search. Faster responders quickly reach our threshold (roughly 100 citations). Slower systems (hi there, OAIster) give us only 10 results before the clock runs out. To cope with this, Metafinder offers a “Collection Status” button on the results page. Click it and you’ll see how many matches Metafinder got from that source, and how many more the source reported it might eventually deliver. Where there’s a great discrepancy, you need to go to the native interface for that source to do a more deliberate search.
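
To make the “just in time” fan-out concrete, here’s a toy Python sketch of the general pattern (nothing here reflects Deep Web Technologies’ actual implementation; search_source is a placeholder connector):

    # Fan a query out to every source at once, cap each source's results,
    # and stop waiting when the deadline hits.
    from concurrent.futures import ThreadPoolExecutor, as_completed
    from concurrent.futures import TimeoutError as FuturesTimeout

    THRESHOLD = 100   # rough per-source citation cap mentioned above
    DEADLINE = 30.0   # seconds before the clock runs out

    def search_source(source, query):
        """Placeholder: would call the source's API and parse citations."""
        raise NotImplementedError

    def metafind(sources, query):
        results = {}
        pool = ThreadPoolExecutor(max_workers=len(sources))
        futures = {pool.submit(search_source, s, query): s for s in sources}
        try:
            for future in as_completed(futures, timeout=DEADLINE):
                results[futures[future]] = future.result()[:THRESHOLD]
        except FuturesTimeout:
            pass  # slow sources contribute only what arrived in time
        finally:
            pool.shutdown(wait=False)  # don't block on stragglers
        return results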

We’re working with Deep Web Technologies on this project and I’ll just note here that they’ve been very good sports as I send them notes asking for a change here and there in their already excellent search system. They’re the code behind sites like science.gov and biznar.

You can try a search in the text box below. Once you retrieve a results page, the “Home” link will take you to the ‘normal’ launch page for this particular instance of Metafinder: the Colonial History research portal (also under construction).
 

[Search box: Search Metafinder (History)]