iNode – Page 4 – Digital Programs and Systems

Beefing up Discovery

Wally December 17, 2012 Comments Off

In preparation for tomorrow’s launch of inPrimo, we just added an important new service to the system: the ability to discover from within inPrimo whether any library in the WRLC holds a particular journal in their collection(s).

A few screen grabs will show how the service works:

[Image: 2012 12 18 15 29 38]

Let’s say an inPrimo search brings up this citation of interest. At this point, our user has clicked the “View Online” link and now sees an in-place-reveal window (is there a better term for this?). It is filled with information we provide based on a response from our OpenURL link resolver (360Link).

For the first of what will be a number of name checks in this post, we rely heavily on the work of Matthew Reidsma (Grand Valley State University) and 22K of local JavaScript to up-style and manipulate the output delivered by the 360Link resolver.

It’s easy to reflect Mason’s holdings in this environment–the “Full Text Online” links show where Mason affiliates can access content published in Molecular Ecology. Of course, there are times when Mason does not have licensed access to content, which tends to complicate things.

Until today, the “next best” option was opening a new browser window and manually running a search on the WRLC union catalog to see if other libraries in our consortium might hold the journal issue of interest. Or, you could click the “Search Aladin by Journal Title” link and wade through the results, trying to make sense of the Voyager display of individual library holdings.

As of today, we can answer that question with a single click–all without leaving inPrimo and without disturbing the results set of the search just run.

How? Here’s the more complete version of the MasonLink+ window we saw in the first image. In this view, you can see all the linking options we offer for this particular title. And yes, I did just add the large red arrow for purposes of this post…

[Image: 2012 12 18 15 16 18]

Clicking the “Search other WRLC libraries…” link launches a Django application developed by Daniel Chudnov, Joshua Gomez, Laura Wrubel and others over at George Washington University–work done to support their any-day-now implementation of Summon. Reliably forward thinking, their software offers a really simple API and that open design is the key for our purposes. While we don’t need anything extra to identify Mason’s holdings, without the service they’ve developed we’d be without a method for seamless discovery of holdings locked inside the WRLC’s Voyager system.

On our end, a local PHP application (developed by Khatwanga Siva Kiran Seesala with a mentoring assist from our resident PHP expert Andrew Stevens) queries their “findit” service with an ISSN and then reformats the JSON output for “pretty print” in the browser.

So, for Molecular Ecology (0962-1083), the end result looks a bit like the image below. The output is a long scrollable display but I’ve shortened it here to show just a couple of the libraries. Notice for print runs we offer an SMS message link (to send yourself the metadata you need if traveling to another library to retrieve the journal). We also offer a “Request” link to use the Consortium Loan Service (for delivery of hard copy).

[Image: 2012 12 18 15 54 35]

We’ve styled the service to run within the inPrimo environment, but supply an ISSN and it works fine by itself. Remember Dr. Dobbs Journal (ISSN 1044-789X)?

Wonder if any library in the Washington Research Library Consortium has a copy?

http://infowiz.gmu.edu/findit/?issn=1044-789X
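Since the service is driven entirely by an ISSN, a quick sanity check on the identifier can catch typos before a lookup is attempted. The post doesn't say whether the PHP application does anything like this, so treat the following as a hedged sketch: `issn_is_valid` is a hypothetical helper name, and the only thing it relies on is the standard ISSN check-digit rule (weights 8 down to 1 across the eight characters, `X` standing for 10, total divisible by 11).

```shell
# Hypothetical helper, not part of the findit service or the PHP app.
# Succeeds (exit status 0) when the argument is an ISSN with a valid check digit.
issn_is_valid() {
  # Drop the hyphen and normalize a lowercase check character.
  issn=$(printf '%s' "$1" | tr -d '-' | tr 'x' 'X')
  case $issn in
    [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9X]) ;;
    *) return 1 ;;
  esac
  sum=0
  weight=8
  # Peel off one character at a time, weighting positions 8, 7, ... 1.
  while [ -n "$issn" ]; do
    rest=${issn#?}
    digit=${issn%"$rest"}
    [ "$digit" = X ] && digit=10
    sum=$((sum + digit * weight))
    weight=$((weight - 1))
    issn=$rest
  done
  [ $((sum % 11)) -eq 0 ]
}

# The two ISSNs mentioned in this post.
issn_is_valid 0962-1083 && echo "0962-1083 passes the check"
issn_is_valid 1044-789X && echo "1044-789X passes the check"
```

The function uses only POSIX shell features, so its working variables are global; rename them if that matters in a larger script.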

The uncirculated monograph?

Wally December 7, 2012 Comments Off

An interesting social media experiment is underway at Mason–each week a different member of the Mason community takes over the university’s official Twitter feed. This week, Mark Sample had the honor and one of his tweets got me thinking about local numbers…

[Image: 2012 12 07 12 27 23]

The report Mark cites can be found here. Curious how Mason might compare, we ran a few SQL queries against our local catalog/circulation system.

The highlights:

If we limit ourselves to books that might have been expected to circulate (meaning we’re not counting reference books, e-books or government documents), we found that since December 31, 1989, we have added 958,303 books to our collection. Nearly one million real books placed on real shelves.

By library:

699,114 Fenwick
186,547 Johnson Center
42,080 Prince William Campus
30,505 Arlington Campus

285,799 or 29% of those 958,303 monographs have never circulated.

Of course, that doesn’t necessarily mean they’ve never left the shelf or that they’ve never been used. Walk past a “reshelving” area in any of our libraries and you’ll see that many items are in fact used but inside the library. And though it would be bad form to speculate in any detail, surely some small number of items bypass the circulation desk altogether as they’re taken on a one-way trip out the door. Nevertheless, the circulation numbers are surely an adequate proxy for usage.

One more bit of fine print: We’re counting items added to our collection since 1990, which isn’t precisely the same thing as items published since 1990. I doubt the difference affects the numbers much. Every year, purchasing an older imprint becomes more difficult, so the monographs we acquire in book form today were most probably published fairly recently.

So which is better? A higher number (like the 55% at Cornell) or something lower (like Mason’s 29%)?

There’s much to be said for the lower number, particularly if it tracks along with at least a moderate-sized collection. It suggests we’re good stewards of collection development funds but with a cushion that indicates we’re not just purchasing the fads of the moment (isn’t that a downstream problem with patron-driven acquisitions?). Further, I think it shows we’re not paying huge utility costs or stealing collaboration space just so we can warehouse books that were last touched the day we put them in the stacks. Probably safe to say it also suggests we’re likely taking advantage of alternative formats (e-whatever) as we build our collections.

A higher number…well, if a library’s never-circulated percentage is high enough (and I’ve seen estimates as high as 80% for some) you reach the researcher’s library nirvana: a collection so deep that even the most obscure request will likely be immediately satisfied. That is certainly a worthy aspirational goal for any library.

Slowly building…

Wally December 7, 2012 Comments Off

We’re coming up on our go-live date, and I’ve been tracking the results of our stealth promotion activity. The graph below shows the number of Primo searches each day–still quite small, but over the past 30 days we’re averaging 543 daily searches (and that includes a few really down days over Thanksgiving). For the thirty days prior to that (Oct 6 – Nov 6) the daily average was 406, so I guess that means we’re making slow but steady progress…

[Image: 2012 12 07 12 05 56]

Discovery = Disaggregation = Disruption

Wally November 20, 2012 Comments Off

Now that we’re about done with our Primo implementation, I thought it might be useful to share a few observations …

It’s Ready

We resolved the last of the show-stoppers about six weeks ago and announced to staff that our new Primo discovery system was ready. Supported by what proved to be a great in-house implementation group and with solid vendor support (I was surprised to find that included very helpful weekly WebEx meetings), it wasn’t too hard to deliver a usable service even on a somewhat aggressive schedule.

[Image: Tiltedprimo]

As an aside, I have promised our implementation team that after they did such a great job navigating the really difficult stages of a project, I’ll do what I can to make sure they’re not forgotten when the final stage kicks in.

The clock began ticking in June

As I look back on it now, I think I failed to appreciate that in this context the word “disruptive” had a very particular and actually rather narrow scope. It has indeed proven to be a disruptive technology…but chiefly for staff. Based on what I’ve seen, from users it’s more like “what took you so long.”

The work of getting Primo ready began in midsummer. From the moment the ink dried on our contract, I felt the sooner we could get it in front of our users the happier (and more productive) they would become. An early and continual concern was making sure we got the introduction of this disruptive technology right. More on that in a bit…

Discovery = Disaggregation = Disruption

So, why is discovery disruptive inside the library? Not really sure but let me offer a theory:

Continue reading →

What are they e-reading? (2011)

Wally January 6, 2012 Comments Off

Based on usage, here are the ten most popular e-books in our Safari Online Books collection for 2011. Each title includes [number of accesses during the year]. In total, an e-book was pulled from our virtual shelves 356,563 times last year. The top 10:

Program Development in Java: Abstraction, Specification, and Object-Oriented Design [26,372]
Internet & World Wide Web: How to Program (4th edition) [20,417]
Effective Java: 2nd edition [15,853]
Computer Security: Art and Science [8,165]
Joomla! 1.5: A User’s Guide: Building a Successful Joomla! Powered Website, 2nd Edition [6,914]
Test Driven: Practical TDD and Acceptance TDD for Java Developers [4,131]
Head First Servlets and JSP, 2nd edition [4,061]
Head First HTML with CSS & XHTML [3,791]
Head First Design Patterns [3,465]
OCA Oracle Database 11g: Administration: Exam Guide (Exam 1Z0-052) [2,713]

And the least popular e-book? Hard to say. We had 1,054 titles that saw only one virtual access during the year and 2,983 that just gathered cosmic dust. 5,097 (roughly 63%) of the titles in this particular e-book collection had at least one access.

What are they e-reading?

Wally March 1, 2011 2 Comments

I often wonder who reads the e-books we link into our catalog. While I love reading on my Kindle, I can’t go more than a few “pages” into a web-based e-book before my head starts splitting. Of course, I have to overcome years of paper-based reading to be satisfied with these webified versions so perhaps my reaction isn’t typical.

We offer just over 8,070 Safari e-books in our catalog and during the first two months of this year 1,370 of those titles were accessed (roughly 17% usage). So, what’s hot?

Top ten titles for 2011 (with year-to-date access in parentheses):

Program Development in Java: Abstraction, Specification, and Object-Oriented Design (3587)
Computer Security: Art and Science (3202)
CCIE Professional Development Routing TCP/IP (1180)
Head First Design Patterns (1146)
Internet & World Wide Web: How to Program, 4th ed. (1062)
Learning SAS® by Example: A Programmer’s Guide (877)
Effective Java™, Second Edition (772)
CompTIA A+® Certification All-in-One Exam Guide, 7th Edition (671)
Head First HTML with CSS & XHTML (660)
Test Driven: Practical TDD and Acceptance TDD for Java Developers (658)

Least popular title that had at least one use thus far in 2011?

Zend Framework in Action

What happens to the mid-major library?

Wally November 11, 2010 1 Comment

I was listening to the latest Digital Campus podcast on my way home yesterday when the discussion began to hit close to home…

Talking about a day in the maybe not-so-distant future when most books are available in one e-form or another:

Dan Cohen: “I wonder what will happen to libraries of the size we have here at Mason, you know, the one to two million volumes, pretty much recent collection (the past 100 years), doesn’t have a deep catalog of rare books? What happens in a world of all digital book content to that kind of library?

I still get the Library of Congress or Harvard or the University of Michigan, but it’s hard to give a rationale for why a library like Fenwick here at Mason sticks around. It’s a lot of heating, a lot of physical plant…and it’s a lot of people. And I love libraries, but aside from that fact, the sort of Upstairs/Downstairs ‘Well this is where the poor people go to get their sad, old printed books’ …you know, what happens to it? Even now it’s not a place where people start their research…”

Dan, I’ve been asking myself some form of that question for at least ten years.

While I surely have a salary-driven bias, I’ve always assumed there will still be something we’ll call a library when the e-future arrives. But I sometimes wonder–will it have evolved from today’s library or have been created as a replacement for it? Thinking about how we get from here to there, I worry:

Will we, as a profession, spend too much energy chasing improvement in the transactional metrics of success (items circulated, reference questions asked, gatecount, etc.)? Once tried-and-true measures of library utility, they’re in irrevocable and ever-accelerating decline. Shouldn’t we accept that and begin redeploying resources in pursuit of new opportunities?
Will we recognize and be able to exploit transformational moments as they appear? Or will we pass on them as “not something libraries traditionally do?” Put another way, how far is it from “heart of the university” to “vestigial organ?”

Today one ‘mid-major’ library has roughly the same collection as the next one on the list. That sameness, combined with the trend toward outsourcing what were once considered core enterprise-level services (e.g., campus email systems moving to a vendor-supplied cloud), seems a dangerous mix for the library. Let me share my own “worst case” scenario:

The world has gotten past the friction that limits universal satisfaction with today’s e-readers and e-content, and into that environment a large ‘web-scale’ vendor appears…offering the university a subscription that provides e-access to all e-content along with a strategically-priced bundle of e-reference services.

Think it can’t happen? The ears of that cat are already peeking out of the bag. Consider a product like Summon. The ProQuest business model for Summon is surely based on two facts:

the leased (or licensed if you prefer) e-content of each library is roughly the same
it actually resides on the servers of vendors outside the library

Why not engage in a bit of corporate cooperation and then sell access to a cloud-based index of that content over and over to each and every one of those libraries? To the degree that you can vertically integrate content leases with the search mechanism–well, that’s what they call “lock-in” gravy.

So, back to Dan’s question. What does ‘the library’ do for a second act? I’m guessing:

Our footprint (buildings and staff) will be much, much smaller
We’ll offer very fast and ubiquitous networking on site and focus on high-end equipment/tools for manipulating and reworking digital content
We’ll offer on-demand services like “find and print” or “find and import” so users can build their own libraries
We’ll develop special tools and services to aggregate e-content in locally relevant ways (a 21st century analog to the old “finding aid”)
We’ll put much more emphasis on supporting teaching and learning
We’ll focus as much energy on data-driven research as we do today on the bibliographic-driven counterpart
We’ll offer more service and financial support for the front-end of the scholarly communication process (e.g., paying fees for campus authors in OA journals, helping authors secure their rights and protect the value of their intellectual investment, etc.)
We’ll still be doing the “special collections and archives” thing as that will be a large part of what differentiates libraries

We’ll surely still find that students are starting their research elseweb…

Life imitates Art

Wally November 1, 2010 Comments Off

Speaking of librarianship and technology (as happens here from time to time), I want to highlight a couple of signs from the rally down on the Mall this past weekend. Here’s a frame from some video I shot:

Thanks to a tweet from Dorothea, I now know the inspiration for that one…

Years of wacky vt100 emulation have given me an appreciation for my other favorite (sorry, don’t have a photo of it but this pretty much captures the idea):

What different sort algorithms sound like

Wally October 29, 2010 Comments Off

from andrut:

This particular audibilization is just one of many ways to generate sound from running sorting algorithms. Here on every comparison of two numbers (elements) I play (mixing) sin waves with frequencies modulated by values of these numbers. There are quite a few parameters that may drastically change resulting sound – I just chose parameteres that imo felt best.

Fun with the 245 tag

Wally October 28, 2010 Comments Off

Over the past few months I have, on more than one occasion, found myself making a full extract of the bibliographic (MARC) records in our library’s catalog. Turns out, this sort of thing happens frequently when you run your own ILS locally but also belong to a consortium where individual members are busily adding different “discovery” layers to the underlying catalog that they all share.

Some of this work is constant, sometimes it comes in spurts. Nightly, for example, I have a script that updates an AquaBrowser instance our consortium operates and then sends a second copy of those changes to Serials Solutions so another member’s Summon instance will reflect more current information. Less frequently, I respond to a request for an extract to populate trial instances of another product someone else is considering.

Let’s not even mention the skunkworks instance of VuFind that I run as a sort of library geek hobby.

Yesterday, I decided to take one of those extract files and try a text mining experiment on my desktop Mac. To begin, I ran the file of nearly 1.7 million MARC records through a Windows VM so MarcEdit could produce a plain text version of the data.

As a first cut, I extracted all the 245 tags (titles) from the file.

grep "=245 " MasonBibs.txt > 245tags.txt [return]

which yielded:

=245  10$aAdolescence.$cPrepared by the society's committee. Edited ...
=245  10$aGuidance in educational institutions.
=245  14$aThe teaching of reading: a second report.
=245  10$aHighways into the Upper Amazon Basin.
=245  10$aProust et le roman,$bessai sur les formes et techniques ...
=245  10$aCreative management in banking.

Interesting, but clearly more processing was needed. With a short perl script I removed the tag labels, subfield codes and most punctuation. Seconds later, I had a 169MB text file that looked like this short excerpt (a 245 tag on each line):

Adolescence Prepared by the society s committee Edited by Nelson B Henry
Guidance in educational institutions
The teaching of reading a second report
Highways into the Upper Amazon Basin
Proust et le roman essai sur les formes et techniques du roman dans
Creative management in banking
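The post doesn't include that first cleanup script. As a rough stand-in (not the original Perl), the same three steps can be sketched with `sed`. The function name `clean_245` is made up for illustration, and the sketch assumes MarcEdit's mnemonic layout as shown in the excerpt above: `=245`, spaces, two indicator characters, then `$`-prefixed subfield codes.

```shell
# Hypothetical stand-in for the first cleanup script (not the author's Perl).
# Reads MarcEdit mnemonic 245 lines on stdin and writes bare title text.
clean_245() {
  sed -E \
    -e 's/^=245 +[0-9\\]{2}//' \
    -e 's/\$[a-z0-9]/ /g' \
    -e 's/[[:punct:]]+/ /g' \
    -e 's/ +/ /g' \
    -e 's/^ //' \
    -e 's/ $//'
}

# Try it on one of the sample lines from the excerpt.
printf '%s\n' '=245  14$aThe teaching of reading: a second report.' | clean_245
```

The order of the expressions matters: subfield codes are replaced before punctuation is stripped, because removing the `$` first would leave the code letter glued to the front of the following word.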

A second perl script normalized the capitalization then split out and counted the words. I used the “%08d” construct in the “printf” statement to ensure I’d have a list sortable by usage when the script finished.

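#!/opt/local/bin/perl
use strict;
use warnings;

# Tally how many times each word appears across all input lines.
my %count_of;
while (my $line = <>) {
    $line =~ tr/A-Z/a-z/;    # normalize capitalization (ASCII letters only)
    foreach my $word (split ' ', $line) {
        $count_of{$word}++;
    }
}

# Zero-pad each count to eight digits so a plain text sort orders by frequency.
for my $word (sort keys %count_of) {
    printf "%08d : %s\n", $count_of{$word}, $word;
}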

countwords.pl < 245tags.out > wordlist.txt [return]

Here’s an excerpt of the output:

00000002 : salinewater
00000001 : saling
00000001 : salingar
00000064 : salinger
00000002 : salinghi
00000002 : salinian
00000001 : salinisation
00000004 : salinities

Final step was to sort the 453,672 words/lines in this file by the number of occurrences:

sort < wordlist.txt > 245tags_sorted.txt [return]

Voilà! I now know that these are the four most common words used in titles represented in our catalog:

the (1,343,112)
of (1,190,200)
and (918,245)
by (522,495)

then two outliers:

resource (450,200)
electronic (448,118)

and then back to prepositions and other unsurprising terms:

in (363,788)
a (346,909)
to (286,701)
on (252,793)
for (229,914)
edited (155,248)
states (126,221)
from (125,065)
with (124,319)
united (123,889)
committee (86,990)
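A plain `sort` puts the smallest counts first, so the most common words land at the bottom of the sorted file. Because the counts are zero-padded to a fixed width, a reverse text sort is enough to bring them to the top. Here is a small sketch of that idea; `top_words` is a hypothetical helper name, and the sample lines are taken from the word-count excerpt earlier in the post.

```shell
# Print the N most frequent entries from zero-padded "count : word" lines on stdin.
# Fixed-width counts mean a reverse text sort is also a descending numeric sort.
top_words() {
  LC_ALL=C sort -r | head -n "$1"
}

# Sample lines from the word-count excerpt.
printf '%s\n' '00000002 : salinian' '00000064 : salinger' '00000004 : salinities' | top_words 2
```

Against the real file, the equivalent would be something like `top_words 20 < wordlist.txt`.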

Obviously, there’s not much of interest here and the point of the post is really to share the methodology and code snippets for anyone interested in running other experiments. However, I did find it odd that the words “electronic” and “resource” reached what I’d consider stopword status. Could we really be moving that close to the digital library I’ve been working toward for all these years?

Well, I’d like to think so but I’m guessing it’s the fact that not too long ago we loaded 306,000+ records from the Lexis-Nexis US Serial Set and that has skewed the frequency count for several terms. My sense is that a large load of these sorts of specialized records also has a negative effect on most users; that is, it helps build an ever-larger haystack for those seeking a needle of information that has nothing to do with that particular set of records.

Of course, that’s a problem we need to solve with better search tools, not by restricting the scope of our content.

Beyond developing a workflow that might yet yield an interesting outcome, there was one small spinoff benefit to this little wordcount experiment. Skimming the list of words that appeared only once across all titles, I was able to easily spot a number of misspellings.

For example, if you look at that little excerpt of my original word count file, you’ll see:

00000001 : salinisation

I checked the catalog and yes, it’s a misspelling (although the variant title and subject headings saved the “word anywhere” searcher on this one):

Title:             Salinisation of land and water resources : human causes, extent...
Variant Title:     Salinization of land and water resources
Primary Material:  Book
Subject(s):        Salinization --Control.
                   Salinization --Control --Case studies.
                   Soil salinization.
                   Water salinization.

Now I’m wondering if there’s a way I can use this tag-extraction work with a spell-checker to assist in some automated way with our never-ending quest for perfect metadata.
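One low-tech way to start on that idea, offered only as a sketch: pull the words that occur exactly once and keep those that don't appear in a dictionary word list. `rare_unknown_words` is a made-up name, the demo files are invented, and which dictionary file to use (many Unix systems ship one, but its location varies) is an assumption left to the reader.

```shell
# Sketch only: list words that occur once and are absent from a dictionary file.
#   $1 = word-count file with "00000001 : word" lines (the countwords.pl format)
#   $2 = dictionary file, one word per line (whatever your system provides)
rare_unknown_words() {
  awk -F' : ' '$1 + 0 == 1 { print $2 }' "$1" | grep -vixFf "$2"
}

# Tiny made-up demo files so the function can be tried without a real catalog.
demo=$(mktemp -d)
printf '%s\n' '00000064 : salinger' '00000001 : salinisation' '00000004 : salinities' > "$demo/wordlist.txt"
printf '%s\n' 'salinger' 'salinization' 'salinities' > "$demo/dict.txt"
rare_unknown_words "$demo/wordlist.txt" "$demo/dict.txt"
rm -r "$demo"
```

Expect plenty of false positives from a real catalog: proper names, foreign-language titles and legitimate spelling variants will all show up, so the output is a list of candidates for a human to review, not a list of errors.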