We have no perfect way of assessing e-content usage by our students even though we're now spending 75% or more of our collections budget on this sort of material. We do receive and analyze COUNTER statistics, but those focus on what's being used and collapse all activity by students, faculty and staff into a single number for each source. Fine as far as it goes, but I'm also interested in who's using content. Not down to the individual (I value the library's reputation for privacy) but at least to some meaningful though suitably anonymous aggregation. Until I get a better tool, here's how I go about answering a question like "how do the different majors use our e-content collections?"
I am a relative newcomer to the topic of OERs (Open Educational Resources). Not unaware of the topic—our Mason Publishing Group has been working with faculty interested in affordable educational materials for some time now—but until recently, I hadn't really been terribly involved in those efforts.
That changed one afternoon this summer as I grabbed my laptop and tagged along with them to a meeting with the Associate Provost for Undergraduate Education to talk about OERs.
As the meeting progressed (and moved ever further from my area of expertise) I started stealing moments to jump in and out of various OER aggregation sites, curious to see the sorts of resources already available on the net.
If you’ve spent much time with OERs, you won’t be surprised to hear that I discovered:
- so many dissimilar aggregations of content;
- so many wildly different interfaces;
- so much duplication across these aggregations;
- and such inconsistent metadata.
As I poked around, I could easily envision a faculty member—excited by the idea of OERs—feeling the enthusiasm drain away as she dove in and out of the various content silos. Soon I found myself thinking much less about OERs and far more about how improving their discoverability might boost OER adoption…
A local library made news in 2010 when it announced that it would archive every tweet ever posted. With Twitter generating 500 million tweets a day, can we really be surprised that it's proving to be a challenge?
Of course, that doesn’t mean there aren’t a host of smaller services we can build around social media. By way of example, here are three social media services we offer the Mason community. One’s pretty simple while the other two require a bit more infrastructure.
Mason Tweets (http://tweet.gmu.edu)
This curated feed from “official” and “near-official” twitter accounts from across the university offers a quick and easy way to take the “Mason Nation” pulse.
To produce this service, we created a MasonTweeter account on Twitter to follow Mason-related feeds. The web presence is simply a page that embeds the MasonTweeter timeline.
An archive of every tweet from Mason’s President, Ángel Cabrera.
This service stems from a discussion I had with Dr. Cabrera a few years ago. At that time, Twitter did not offer users an archive of their tweets (they do now), so we were looking into how we might save his tweets for future university historians. We settled on a method that offers a searchable database of tweets stored locally in a MySQL database (suitable for future archiving). Thanks to Andrew M. Whalen for the code that helped build this LAMP-based archiving service.
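Nothing fancy is needed on the storage side. As a sketch (the column names here are hypothetical, not necessarily what our service uses), the heart of such an archive is a single MySQL table:

create table tweets (
  tweet_id   bigint unsigned primary key,   -- Twitter's own status id
  created_at datetime not null,             -- when the tweet was posted
  tweet_text varchar(280) not null,         -- the tweet body itself
  index (created_at)                        -- speeds up date-range browsing
);

A FULLTEXT index on tweet_text is one easy way to deliver the "searchable" part.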
Social Feed Manager (SFM) (https://gwu-libraries.github.io/sfm-ui/)
Just the other day, I set up our most ambitious social media service yet: Social Feed Manager.
SFM is a Django application developed by George Washington University Libraries to collect social media data from Twitter. It connects to Twitter’s approved API to collect data in bulk and makes it possible for scholars, students, and librarians to identify, select, collect, and preserve Twitter data for research purposes. We’re running SFM in a Docker container (using Docker for Mac) which simplifies installation and abstracts away much of the underlying complexity.
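If you're curious what the setup involves, the broad strokes look something like this (a sketch based on the gwu-libraries/sfm-docker repository; exact file names can vary between releases):

git clone https://github.com/gwu-libraries/sfm-docker.git
cd sfm-docker
cp example.env .env       # supply Twitter API keys and admin credentials here
docker-compose up -d      # start the web app and its supporting containers

Once the containers come up, the SFM web interface is available in your browser and you can begin defining collections.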
We have added Social Feed Manager to the suite of data services we offer out of the new Digital Scholarship Center we’ve been shaking down in beta since late January.
Had an email exchange yesterday with a group that wants to archive a few of their web projects in our MARS system. Thanks to what I hope is a temporary vacancy in our staffing, I had to take the lead in explaining how we handle archiving a website and what the end result of that process looks like.
I had a solid understanding of the what we do part–it's pretty straightforward: we point a webcrawler at the up-and-running version of the site and pack what we get back into a WARC file. We then upload that WARC file to the DSpace instance that delivers our MARS service and flesh out the metadata. We've done a number of these and it works well. My immediate challenge was mastering the how do we do this part.
Until recently, the actual work of producing and archiving the WARC file was handled by Jeri Wieringa (who has since moved on from her still-spoken-of-in-reverent-tones stint with our Mason Publishing Group to new challenges). As I started down the WARC-enlightenment path, I was glad to see that our MARS entries for archived sites point the user to WARC-viewing tools for both Windows and macOS. Seems we recommend Ilya Kreymer's webarchiveplayer, available at https://github.com/ikreymer/webarchiveplayer. So I started there…downloaded and installed the Mac version, pointed it at one of our archived WARC files, and in minutes had a pretty firm grasp of this end of the process.
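If you'd rather work from the command line, the player was also pip-installable at the time; a sketch, with the exact invocation possibly varying by version:

pip install webarchiveplayer
webarchiveplayer our-archived-site.warc.gz   # replays the archive locally in your browser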
What I still didn't know was precisely how we had been producing these WARC files. While on Ilya's GitHub repository for webarchiveplayer I noticed a link to webrecorder—which I later learned corresponds to the service deployed at https://webrecorder.io. That looks like a large-scale solution and one I'll set up and test soon. Until then, I'm going with a very lightweight method: wget.
This command, run from a unix shell or terminal window, produces a usable WARC of this blog:
wget --recursive --warc-file=inodeblog --execute robots=off --domains=inodeblog.com --user-agent=Mozilla http://inodeblog.com
Building the WARC across the net (wget was running on my office iMac at Mason and inodeblog.com lives at Reclaim Hosting) took roughly 7 minutes. If you're going to sites you don't own, you'll probably want to add a --wait=X switch to pause for 'X' seconds between requests and add your email to your --user-agent= string so people know who to contact if they don't want to be crawled.
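For example, a more considerate crawl of someone else's site might look like this (everything here is a placeholder):

wget --recursive --wait=2 \
     --warc-file=theirsite \
     --domains=example.org \
     --user-agent="Mozilla (archiving crawl; contact you@example.edu)" \
     http://example.org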
You can decipher my “switches” and see many other possibilities at this wget documentation site: https://www.gnu.org/software/wget/manual/html_node/index.html#SEC_Contents
I’ll update this post with our “official” method for producing WARCs as soon as I either find a note where Jeri explained our process to me or learn enough to define and document a new standard for us.
Not long ago I started using NetNewsWire once again–shortly after I discovered that across-device sync is finally back in the rewrite of this long-time favorite RSS reader. As expected, my back-to-the-future fling with RSS revealed that many of my go-to feeds had shuttered in the intervening years, but happily many good ones were still around…
It didn't take me long to rediscover that a decent news reader–and a good set of links–is a great and relatively painless way to stay current. After a bit of pruning and with a few new additions, I thought it might be useful to post my current feed list. It is a bit idiosyncratic–reflecting my interest in most aspects of digital librarianship. If you have additions I should know about (and I hope there are many I'm missing), send me a link or, better yet, an OPML file of your current feed list.
send to: wallyg at gmu dot edu
My list as an OPML file is here.
Assessment on the Ground
Data Pub Blog
Data Science ish (text mining, APIs, etc.)
Devil’s Tale (Duke U.)
dh+lib (Where the digital humanities and librarianship meet)
Digital Library Blog (Stanford)
- site: https://library.stanford.edu/blogs/digital-library-blog
- feed: https://library.stanford.edu/feeds/blog/276
Digital Preservation Matters (Chris Erickson)
Digital scholarship blog
- site: http://blogs.bl.uk/digital-scholarship/
- feed: http://britishlibrary.typepad.co.uk/digital-scholarship/atom.xml
Disruptive Library Technology Jester
Dueling Data (Data Visualization)
Go To Hellman
- site: http://go-to-hellman.blogspot.com/
- feed: http://go-to-hellman.blogspot.com/feeds/posts/default
Hack Library School
Here and There
- site: http://hereandthere123.blogspot.com/
- feed: http://hereandthere123.blogspot.com/feeds/posts/default
In the Library with the Lead Pipe
- site: http://www.inthelibrarywiththeleadpipe.org
- feed: http://www.inthelibrarywiththeleadpipe.org/feed/
Information is Beautiful (Data Visualization)
Information Technology and Libraries
- site: http://ejournals.bc.edu/ojs/index.php/ital
- feed: http://ejournals.bc.edu/ojs/index.php/ital/gateway/plugin/
Inside Higher Ed | Blog U
- site: https://www.insidehighered.com/blogs/library-babel-fish
- feed: https://www.insidehighered.com/blogs/feed/Library%20Babel%20Fish
iNode (George Mason University Digital Programs)
IO: In The Open
Archives and Special Collections – James Hardiman Library
Jaime Mears (Notes from a Nascent Archivist)
Library & Technology Blog
Library Lost & Found
Library Tech Talk (University of Michigan)
- site: https://www.lib.umich.edu/blogs/library-tech-talk
- feed: https://www.lib.umich.edu/blogs/library-tech-talk/rss.xml
A Library Writer’s Blog
- site: http://librarywriting.blogspot.com/
- feed: http://librarywriting.blogspot.com/feeds/posts/default
Lorcan Dempsey’s Weblog
Information Wants to Be Free – Meredith Farkas
Metadata Blog (ALA)
Miriam Posner’s Blog
Musings about Librarianship (Aaron Tay)
- site: http://musingsaboutlibrarianship.blogspot.com/
No Shelf Required
Pattern Recognition (Jason Griffey)
Pinboard (Dorothea Salo) – you can learn a lot stalking the right bookmarker!
Planet Code4Lib – this one aggregates a number of blogs
Preservation Underground (Duke U.)
- site: http://blogs.library.duke.edu/preservation
- feed: http://blogs.library.duke.edu/preservation/feed/
The Signal: Digital Preservation (Library of Congress)
TechSoup for Libraries – Blog
Temple University Digital Scholarship Center
Text Mining, Analytics & More
- site: http://text-analytics101.rxnlp.com/
- feed: http://text-analytics101.rxnlp.com/feeds/posts/default
Thoughts from Carl Grant
- site: http://thoughts.care-affiliates.com/
- feed: http://thoughts.care-affiliates.com/feeds/posts/default?alt=rss
Was recently asked to help come up with data that would give the library a better handle on who is using our e-resources–to complement the numbers we have on what's being used. Hard to be precise given all the variables, but I did hit upon something I think serves as a reasonable proxy (no pun intended).
Step one was to pull together three different datasets:
- an enormous (70+ million lines) proxy server log that covered all off-campus activity for restricted resources for the entire Fall 2015 semester. This log has NetIDs for users.
- our campus directory e-file, which includes both a NetID (email address) and the major or departmental affiliation for students, faculty and staff
- a spreadsheet from a friend in Admissions which gave meaningful descriptions to the many four-letter codes for majors we use here at Mason
The proxy server logs and campus directory were imported into MySQL tables (using the methodology I outlined a couple of years ago in a post about parsing EZProxy logs).
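If you're reproducing this at home, the directory import can be as simple as MySQL's LOAD DATA (a sketch only; the table layout and file name here are hypothetical, and the proxy log needs the parsing pass described in that earlier post first):

create table users (
  username    varchar(64),   -- NetID
  affiliation varchar(8)     -- four-letter major/department code
);

load data local infile 'directory.csv'
into table users
fields terminated by ',' enclosed by '"'
ignore 1 lines;   -- skip the header row

With both tables loaded, I then ran a simple SQL query: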
select users.affiliation, count(*)
from users
join proxy on proxy.username = users.username
group by users.affiliation
order by count(*) desc;
Ended up with a result like this:
…and so on.
Exported the MySQL results to a CSV file and launched Tableau. Within Tableau, I joined the results of my MySQL work with the CSV file containing “major codes” and their descriptions (so I could flesh out those NURS, GLOA, PUBP codes). I hit one of Tableau’s “visualization” buttons.
Got a chart like the excerpt you see below. Eventually, the complete visualization became an appendix in a document we prepared for internal reporting. If you’re curious, you can download that final chart as a PDF.
Earlier today I was running a few SQL queries against our local Voyager system–preparing for the upcoming metadata migration to a consortial implementation of Alma. My tool of choice for this sort of thing is Navicat and as I worked through a series of “count this for me” queries, like…
- how many bib records have NULL in the NETWORK_NUMBER field? 54,995
- how many have an OCLC number in that field? 1,640,304
- exactly how many bib records are there in the database? 3,490,929
…I realized that Navicat made the export of data in a variety of formats a reasonably trivial exercise. Thinking it might prove useful for people sharpening their text-mining chops in our new Digital Scholarship Center (2nd floor, Fenwick Library), I decided to build a text file of brief bibliographic data (author, title, publisher, date, etc.) from the 3+ million records in our Voyager database. A simple click in a checkbox produced both JSON and XML versions of the metadata.
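For the record, counts like the ones above are one-liners. Against Voyager's Oracle reporting schema they would look something like this (assuming the usual BIB_TEXT table; your local views may vary):

-- total bib records in the database
select count(*) from bib_text;

-- bib records with nothing in the NETWORK_NUMBER field
select count(*) from bib_text where network_number is null;

-- bib records carrying an OCLC number in that field
select count(*) from bib_text where network_number like '(OCoLC)%';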
The zipped versions of the JSON and XML files are roughly 200MB each.
Click the link below to retrieve the JSON recordset.
XML? Click below…
Sample record in the JSON version of the file
The XML version has a couple more data fields (LCCN and SERIES) if available in a record.
If you end up using this data for anything useful (or need a slightly different extract), send me a tweet
The other day we started talking about joining arXiv.org as a way to help support Open Access and the valuable service that arXiv offers. As you’d expect, the question arose, “How often do we think it’s being used?”
One exploratory email to support at arxiv.org and I had an answer almost within the hour. For accesses from the gmu.edu domain (meaning the user was on the campus network) we see this activity:
A steady increase. Graphed, it’s easy to see. I’m going to guess that the jump in use around 2013 coincides with our rollout of the Primo discovery service (metadata for arXiv documents is included in the underlying database).
When you consider that we also have users who visit arXiv from off-network devices, it’s an easy decision for us–we are joining arXiv. Membership fee works out to about 14 cents per download (orders of magnitude better than what we see with Elsevier).
University Libraries belongs to several cooperatives that provide our students, faculty and staff with enhanced access to materials beyond our local print and digital collections. The big four for Mason are:
Local – WRLC (Washington Research Library Consortium)
Statewide – VIVA (Virtual Library of Virginia)
Regional – ASERL (Association of Southeastern Research Libraries)
National – CRL (Center for Research Libraries)
A Mason reader can request an item held in the collection of any library affiliated with one of these cooperatives and its delivery will be expedited thanks to something rather akin to the “favored nation” status that exists between members of each collective.
As you might guess, thanks to ease of networked discovery and physical resource proximity, researchers at Mason borrow a fair amount of material from fellow member institutions in the WRLC. What might surprise you is that Mason is a net lender to our fellow WRLC institutions (i.e., they borrow more from us than we borrow from them) and we have a similar status with the ASERL membership as well. I view this as a marker of the strength and currency of Mason's collections…though I know the danger of basing an argument on data viewed in isolation (e.g., a contrarian might argue that the collections of others are equally strong and also heavily used and we're just racking up lending numbers as the backup source for that 2nd or 3rd copy of something). "Data-informed" is safer than "data-driven."
Wondering how that activity is distributed across potential users and our collections, I decided to generate a "snapshot" dataset of every item currently checked out to an affiliate of WRLC. Then I graphed the level of 'borrowing-from-Mason' activity by staff, undergraduate, graduate and faculty users at each member library:
Graduate students at George Washington are our most active borrowers (with 115 items in circulation). It seems odd, but GW faculty are tied, with an identical 115 items charged at the moment. As you move down the left column, you find AU faculty coming in third with 89 items, followed by Georgetown graduate students at 87, and so on down the list.
What sections of our library collection most interest WRLC borrowers? The next graph shows by Call Number stem just how many items are currently in circulation to various WRLC borrowers. The PR classification (English Literature) leads the list with 160+ items. QA (Mathematics…but you’ll find most of the computer books carry a QA classification as well) comes in a close second. Love the fact that Z (Library Science and Bibliography) trails the list 🙂