What happens to the mid-major library?

I was listening to the latest Digital Campus podcast on my way home yesterday when the discussion began to hit close to home…

Talking about a day in the maybe not-so-distant future when most books are available in one e-form or another:

Dan Cohen: “I wonder what will happen to libraries of the size we have here at Mason, you know, the one to two million volumes, pretty much recent collection (the past 100 years), doesn’t have a deep catalog of rare books? What happens in a world of all digital book content to that kind of library?

I still get the Library of Congress or Harvard or the University of Michigan, but it’s hard to give a rationale for why a library like Fenwick here at Mason sticks around. It’s a lot of heating, a lot of physical plant…and it’s a lot of people. And I love libraries, but aside from that fact, the sort of Upstairs/Downstairs ‘Well this is where the poor people go to get their sad, old printed books’ …you know, what happens to it? Even now it’s not a place where people start their research…”

Dan, I’ve been asking myself some form of that question for at least ten years.

While I surely have a salary-driven bias, I’ve always assumed there will still be something we’ll call a library when the e-future arrives. But I sometimes wonder–will it have evolved from today’s library or have been created as a replacement for it?  Thinking about how we get from here to there, I worry:

  • Will we, as a profession, spend too much energy chasing improvement in the transactional metrics of success (items circulated, reference questions asked, gatecount, etc.)? Once tried-and-true measures of library utility, they’re in irrevocable and ever-accelerating decline. Shouldn’t we accept that and begin redeploying resources in pursuit of new opportunities?
  • Will we recognize and be able to exploit transformational moments as they appear? Or will we pass on them as “not something libraries traditionally do?” Put another way, how far is it from “heart of the university” to “vestigial organ?”

Today one ‘mid-major’ library has roughly the same collection as the next one on the list.  That sameness, combined with the trend toward outsourcing what were once considered core enterprise-level services  (e.g., campus email systems moving to a vendor-supplied cloud), seems a dangerous mix for the library. Let me share my own “worst case” scenario:

The world has gotten past the friction that limits universal satisfaction with today’s e-readers and e-content. Into that environment, a large ‘web-scale’ vendor appears…offering the university a subscription that provides e-access to all e-content along with a strategically-priced bundle of e-reference services.

Think it can’t happen? The ears of that cat are already peeking out of the bag. Consider a product like Summon. The ProQuest business model for Summon is surely based on two facts:

  • the leased (or licensed if you prefer) e-content of each library is roughly the same
  • it actually resides on the servers of vendors outside the library

Why not engage in a bit of corporate cooperation and then sell access to a cloud-based index of that content over and over to each and every one of those libraries?  To the degree that you can vertically integrate content leases with the search mechanism–well, that’s what they call “lock-in” gravy.

So, back to Dan’s question. What does ‘the library’ do for a second act? I’m guessing:

  • Our footprint (buildings and staff) will be much, much smaller
  • We’ll offer very fast and ubiquitous networking on site and focus on high-end equipment/tools for manipulating and reworking digital content
  • We’ll offer on-demand services like “find and print” or “find and import” so users can build their own libraries
  • We’ll develop special tools and services to aggregate e-content in locally relevant ways (a 21st century analog to the old “finding aid”)
  • We’ll put much more emphasis on supporting teaching and learning
  • We’ll focus as much energy on data-driven research as we do today on the bibliographic-driven counterpart
  • We’ll offer more service and financial support for the front-end of the scholarly communication process (e.g., paying fees for campus authors in OA journals, helping authors secure their rights and protect the value of their intellectual investment, etc.)
  • We’ll still be doing the “special collections and archives” thing as that will be a large part of what differentiates libraries

We’ll surely still find that students are starting their research elseweb…

Life imitates Art

Speaking of librarianship and technology (as happens here from time to time), I want to highlight a couple of signs from the rally down on the Mall this past weekend. Here’s a frame from some video I shot:

Thanks to a tweet from Dorothea, I now know the inspiration for that one…

Years of wacky vt100 emulation have given me an appreciation for my other favorite (sorry, don’t have a photo of it but this pretty much captures the idea):

What different sort algorithms sound like

from andrut:

This particular audibilization is just one of many ways to generate sound from running sorting algorithms. Here on every comparison of two numbers (elements) I play (mixing) sin waves with frequencies modulated by values of these numbers. There are quite a few parameters that may drastically change resulting sound – I just chose parameters that imo felt best.

Fun with the 245 tag

Over the past few months I have, on more than one occasion, found myself making a full extract of the bibliographic (MARC) records in our library’s catalog. Turns out, this sort of thing happens frequently when you run your own ILS locally but also belong to a consortium where individual members are busily adding different “discovery” layers to the underlying catalog that they all share.

Some of this work is constant; some comes in spurts. Nightly, for example, I have a script that updates an AquaBrowser instance our consortium operates and then sends a second copy of those changes to Serials Solutions so another member’s Summon instance will reflect more current information. Less frequently, I respond to a request for an extract to populate trial instances of another product someone else is considering.

Let’s not even mention the skunkworks instance of VuFind that I run as a sort of library geek hobby.

Yesterday, I decided to take one of those extract files and try a text mining experiment on my desktop Mac. To begin, I ran the file of nearly 1.7 million MARC records through a Windows VM so MarcEdit could produce a plain text version of the data.

As a first cut, I extracted all the 245 tags (titles) from the file.

grep "=245 " MasonBibs.txt > 245tags.txt [return]

which yielded:

=245  10$aAdolescence.$cPrepared by the society's committee. Edited ...
=245  10$aGuidance in educational institutions.
=245  14$aThe teaching of reading: a second report.
=245  10$aHighways into the Upper Amazon Basin.
=245  10$aProust et le roman,$bessai sur les formes et techniques ...
=245  10$aCreative management in banking.

Interesting, but clearly more processing was needed. With a short perl script I removed the tag labels, subfield codes and most punctuation. Seconds later, I had a 169MB text file that looked like this short excerpt (a 245 tag on each line):

Adolescence Prepared by the society s committee Edited by Nelson B Henry
Guidance in educational institutions
The teaching of reading a second report
Highways into the Upper Amazon Basin
Proust et le roman essai sur les formes et techniques du roman dans
Creative management in banking
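
The clean-up script itself isn’t reproduced in this post. As a rough stand-in (not the original Perl), the same transformation can be sketched with sed. The function name clean_245 is made up for illustration, and the sketch assumes MarcEdit’s mnemonic layout: "=245", two spaces, two indicator characters, then "$"-prefixed subfield codes.

```shell
#!/bin/sh
# clean_245 (hypothetical name): read MarcEdit-style 245 lines on stdin and
# write bare title text, one title per line, on stdout.
# Steps, in order: drop the "=245  NN" label, turn subfield codes ($a, $b, ...)
# into spaces, turn punctuation into spaces, squeeze runs of spaces, trim ends.
clean_245() {
  sed -E \
    -e 's/^=245  ..//' \
    -e 's/\$[a-z0-9]/ /g' \
    -e 's/[[:punct:]]+/ /g' \
    -e 's/ +/ /g' \
    -e 's/^ //' \
    -e 's/ $//'
}

# Example: clean_245 < 245tags.txt > 245tags.out
```

Like the excerpt above, this turns an apostrophe into a space, so “society's” comes out as “society s”.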

A second perl script normalized the capitalization then split out and counted the words. I used the “%08d” construct in the “printf” statement to ensure the zero-padded counts would give me a list sortable by usage when the script finished.

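use strict;
use warnings;

# Count how many times each lowercased word appears in the input.
my %count_of;
while (my $line = <>) {
  $line =~ tr/A-Z/a-z/;                    # normalize capitalization
  foreach my $word (split ' ', $line) {    # split on whitespace
    $count_of{$word}++;
  }
}

# Zero-padded counts make the output sortable as plain text.
for my $word (sort keys %count_of) {
  printf "%08d : %s\n", $count_of{$word}, $word;
}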

countwords.pl < 245tags.out > wordlist.txt [return]

Here’s an excerpt of the output:

00000002 : salinewater
00000001 : saling
00000001 : salingar
00000064 : salinger
00000002 : salinghi
00000002 : salinian
00000001 : salinisation
00000004 : salinities

Final step was to sort the 453,672 words/lines in this file by the number of occurrences:

sort < wordlist.txt > 245tags_sorted.txt [return]

Voilà! I now know that these are the four most common words used in titles represented in our catalog:

the (1,343,112)
of (1,190,200)
and (918,245)
by (522,495)

then two outliers:

resource (450,200)
electronic (448,118)

and then back to prepositions and other unsurprising terms:

in (363,788)
a (346,909)
to (286,701)
on (252,793)
for (229,914)
edited (155,248)
states (126,221)
from (125,065)
with (124,319)
united (123,889)
committee (86,990)

Obviously, there’s not much of interest here and the point of the post is really to share the methodology and code snippets for anyone interested in running other experiments. However, I did find it odd that the words “electronic” and “resource” reached what I’d consider stopword status. Could we really be moving that close to the digital library I’ve been working toward for all these years?

Well, I’d like to think so but I’m guessing it’s the fact that not too long ago we loaded 306,000+ records from the Lexis-Nexis US Serial Set and that has skewed the frequency count for several terms. My sense is that a large load of these sorts of specialized records also has a negative effect on most users; that is, it helps build an ever-larger haystack for those seeking a needle of information that has nothing to do with that particular set of records.

Of course, that’s a problem we need to solve with better search tools, not by restricting the scope of our content.

Beyond developing a workflow that might yet yield an interesting outcome, there was one small spinoff benefit to this little wordcount experiment. Skimming the list of words that appeared only once across all titles, I was able to easily spot a number of misspellings.

For example, if you look at that little excerpt of my original word count file, you’ll see:

00000001 : salinisation

I checked the catalog and yes, it’s a misspelling (although the variant title and subject headings saved the “word anywhere” searcher on this one):

Title:             Salinisation of land and water resources : human causes, extent...
Variant Title:     Salinization of land and water resources
Primary Material:  Book
Subject(s):        Salinization --Control.
                   Salinization --Control --Case studies.
                   Soil salinization.
                   Water salinization.

Now I’m wondering if there’s a way I can use this tag-extraction work with a spell-checker to assist in some automated way with our never-ending quest for perfect metadata.
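
One possible starting point, offered as a sketch rather than a tested workflow: pull the words that occur exactly once out of wordlist.txt and keep only those missing from a dictionary file. The function name and the dictionary path (/usr/share/dict/words, found on many Unix-like systems) are assumptions. A U.S. word list will also flag British spellings and proper names, so the output is a list of candidates to review, not a list of confirmed errors.

```shell
#!/bin/sh
# rare_unknown_words (hypothetical name)
#   $1 = count file with lines like "00000001 : salinisation"
#   $2 = dictionary file, one word per line
# Prints words that occur exactly once and do not appear in the dictionary
# (the comparison is whole-line and case-insensitive).
rare_unknown_words() {
  awk -F' : ' '$1 + 0 == 1 { print $2 }' "$1" |
    grep -v -i -x -F -f "$2"
}

# Example: rare_unknown_words wordlist.txt /usr/share/dict/words > candidates.txt
```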

what they’re reading…

Extracted the titles from readings our faculty have placed in our e-reserves system this semester, removed terms like “Chapter” and “Ch.”, normalized case and fed the result into Wordle.

Just curious, I performed the same operations on readings from Fall 2005:

I’m beginning to suspect it’s our social scientists who make the greatest use of our e-reserves service.
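
The title clean-up described above might be sketched like this. It assumes the e-reserves titles have already been exported one per line, and it drops only the two terms named in the post (“Chapter” and “Ch.”); the function name is made up for illustration.

```shell
#!/bin/sh
# normalize_titles (hypothetical name): lowercase each title on stdin and drop
# the filler tokens "chapter" and "ch." before the text goes to a word cloud.
normalize_titles() {
  awk '{
    out = ""
    for (i = 1; i <= NF; i++) {
      w = tolower($i)
      if (w == "chapter" || w == "ch.") continue
      out = (out == "" ? w : out " " w)
    }
    print out
  }'
}

# Example: normalize_titles < ereserves_titles.txt > wordle_input.txt
```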


Backups

It was once common for newspapers to reprint an important article from time to time.

Of course, this was before Google and in those days it wasn’t all that easy to lay your hands on an article that you just remembered seeing somewhere. Heck, it wasn’t easy if you remembered exactly where you saw it.

Reference librarians were gods then.

In that spirit, I’d like to “reprint” this article from Jamie Zawinski. It has one of my favorite quotes in it (“the universe tends toward maximum irony”), offers great advice, and does it in a way that’s easily absorbed and implemented. I’ve tweaked it just enough to skirt the explicit tag:

Hello, this is a public service announcement. I am here to tell you about backups. It’s very simple.

Option 1: Learn not to care about your data. Don’t save any old email, use a film camera, and only listen to physical CDs and not MP3s. If you have no possessions, you have nothing to lose.

Option 2 goes like this:

You have a computer. It came with a hard drive in it. Go buy two more drives of the same size or larger. If the drive in your computer is SATA2, get SATA2. If it’s a 2.5″ laptop drive, get two of those. Brand doesn’t matter, but physical measurements and connectors should match.

Get external enclosures for both of them. The enclosures are under $30.
Put one of these drives in its enclosure on your desk. Name it something clever like “Backup”. If you are using a Mac, the command you use to back up is this:

sudo rsync -vaxE --delete --ignore-errors / /Volumes/Backup/

If you’re using Linux, it’s something a lot like that. If you’re using Windows, go f*ck yourself.

If you have a desktop computer, have this happen every morning at 5AM by creating a temporary text file containing this line:

0 5 * * * rsync -vaxE --delete --ignore-errors / /Volumes/Backup/

and then doing sudo crontab -u root that-file

If you have a laptop, do that before you go to bed. Really. Every night when you plug your laptop in to charge.

If you’re on a Mac, that backup drive will be bootable. That means that when (WHEN) your internal drive scorches itself, you can just take your backup drive and put it in your computer and go. This is nice.

When (WHEN) your backup drive goes bad, which you will notice because your last backup failed, replace it immediately. This is your number one priority. Don’t wait until the weekend when you have time, do it now, before you so much as touch your computer again. Do it before goddamned breakfast. The universe tends toward maximum irony. Don’t push it.

That third drive? Do a backup onto it the same way, then take that to your office and lock it in a desk. Every few months, bring it home, do a backup, and immediately take it away again. This is your “my house burned down” backup.

“OMG, three drives is so expensive! That sounds like a hassle!” Shut up. I know things. You will listen to me. Do it anyway.

Addendum A:

Mac users: for the backup drive to be bootable, you need to do two things:
When you first format the drive, set the partition type to “GUID”, not “Apple Partition Map”;
Before doing your first backup, Get Info on the drive and un-check “Ignore ownership on this drive” under “Ownership and permissions.”

You can test whether it’s bootable by holding down Option while booting and selecting the external drive.

Addendum B:

RAID is a waste of your goddamned time and money. Is your personal computer a high-availability server with hot-swappable drives? No? Then you don’t need RAID, you just need backups.

I follow this procedure for the most part, but I use a newer version of rsync (installed 3.0.7 via MacPorts) than the one that ships with Snow Leopard (2.6.9) and launch the nightly backups by placing a script in the /etc/periodic/daily directory. With this somewhat newer version of rsync, I use this set of switches:

/opt/local/bin/rsync -vaxAX --delete --ignore-errors / /volumes/backup
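
For anyone wanting to try the /etc/periodic/daily approach, a script along these lines might work. It is a sketch, not the actual script: the function name and the check for a mounted backup volume are assumptions added for illustration, and the real call is left commented out so the sketch is safe to paste.

```shell
#!/bin/sh
# Sketch of a nightly backup script for /etc/periodic/daily.
# RSYNC defaults to the MacPorts path mentioned above; override it if needed.
RSYNC="${RSYNC:-/opt/local/bin/rsync}"

# backup_if_mounted (hypothetical name): run rsync only when the target
# directory exists, so a missing drive does not fill the boot volume.
backup_if_mounted() {
  src="$1"
  dest="$2"
  if [ ! -d "$dest" ]; then
    echo "backup target $dest not found; skipping" >&2
    return 1
  fi
  "$RSYNC" -vaxAX --delete --ignore-errors "$src" "$dest"
}

# Uncomment to run the real backup:
# backup_if_mounted / /Volumes/Backup/
```

The guard matters because rsync will happily create and fill an ordinary directory at the destination path if the external drive isn’t mounted.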

Anthologize it

My pragmatic friends over at the Center for History and New Media are at it again. This time, with support from the National Endowment for the Humanities, they’ve come up with an interesting (and I hope trendmaking) take on a sponsored project. Let’s use the funding to support the actual creation of something useful…


Here’s the result of their ‘One Week | One Tool’ project: a WordPress plugin that converts a series of blog posts (and optionally, feeds from other RSS sources as well) into an e-book.

I know, on one level this seems a bit retro (did Henry Ford spend time trying to figure out a way to turn an auto back into a horse-drawn buggy?) but on another level it’s quite brilliant. For example, I can see a great future for this tool as a way to archive a blog.

Congratulations to CHNM and the “One Week | One Tool” participants.

A few year-end statistics

The fiscal year ended June 30, so it seems a good time to report some of the numbers we saw over the past year (July 1, 2009 – June 30, 2010). In no particular order:

  • 1,604,862 – number of times the library’s home page was loaded.
  • 29,914,095 – items served from main library website (pages, images, etc.)
  • 2,855,193 – number of searches run against our OPAC via the web interface
  • 13,665,355 – number of ‘pages’ served by our catalog during the year
  • 93,013,174 – number of ‘hits’ on our off-campus authentication (proxy) server.
  • 818,878 – number of items “viewed” in our MARS system.
  • 90,401 – number of pdf files retrieved from our e-reserves system.

There are a few other stats I haven’t listed, but if I were to add them to these numbers we’d end up with just over 142 million ‘hits’ on library servers during the past year. That’s an average of about 4.5 hits per second.
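
The per-second figure is simple division, shown here as a quick check. It assumes a 365-day year and rounds the total to 142 million; the function name is only for illustration.

```shell
#!/bin/sh
# hits_per_second (hypothetical name): average hits per second over a
# 365-day year, rounded to one decimal place.
hits_per_second() {
  awk -v hits="$1" 'BEGIN { printf "%.1f\n", hits / (365 * 24 * 60 * 60) }'
}

hits_per_second 142000000   # prints 4.5
```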

Actually, that’s low-balling the number to some degree.

Not included are visits to e-journals and databases when the query is launched from a machine on the campus network (e.g., in the library or a faculty desktop or in a dorm room).   Those requests don’t go through our proxy server and are thus much more difficult to quantify.   We’re also skipping our InfoGuides service (hosted elseweb) and  not counting our various research portals in this set of numbers.  Finally,  we’re also neglecting to count the non-stop traffic that goes in and out of the university’s PeopleFinder every day (it may surprise some to learn that we’ve run that service on a library server since roughly 1994–when we first launched a website that eventually became the university’s official web presence (http://www.gmu.edu)).

Comicbook Fonts

Without remuneration or discounts of any kind, I still feel a duty to do what I can to improve the look of the web…and that means I should point out that for the next 10 days or so (through July 31) all fonts over at Comicbookfonts.com are half price. Not as attractive as their New Year’s promotion (any font for the price of the year, e.g., $20.11 next January if they run it again), but still a pretty sweet deal.


New Chrome beta for Mac

The latest Google Chrome browser beta is making me rethink my ‘default browser’ choice. Two reasons:

1) The new V8 JavaScript engine is very quick. In fact, it easily wins this race on the SunSpider JavaScript Benchmark (smaller number is better):

Testbed: Mac Pro, 2 x 2.26 GHz, OS X 10.6.3

Firefox 3.6.3 861.0ms

Safari 4.0.5 432.0ms

Chrome 5.0.375.29 beta  320.4ms

2) 1Password has a plugin extension that now puts your passwords a click away.   It’s not full-featured (you still have to hit return after selecting the appropriate set of credentials) but it works.


Update (6/7/2010)

Safari 5.0 322.6 ms

Google Chrome (5.0.375.55)  324.6 ms

So, for the moment, Safari’s faster…wonder what Google did in the move from 5.0.375.29 to 5.0.375.55 to slow the JavaScript down a few milliseconds?