Final Data for our DNS Query project

Here’s the final set of numbers on our analysis of DNS query logs (detailed in an earlier post).

The graph covers activity between July 3 and December 9, 2019.

This reflects only the DNS query activity on the campus network.  Mason affiliates using off-campus networks are not included in this chart.

We can assume, I think, that most traffic to Sci-Hub is about finding content.   ResearchGate also fills this role, but because it is also a social networking platform, there are other reasons for traffic to its site.   We can see that ResearchGate and Google Scholar are heavily visited.   What we can’t readily see is the degree to which they serve as an alternative source for otherwise restricted content.

Final Count for DNS Queries



You can download our dataset here (roughly 8 MB):


Fun with our Traffic Counter’s API

We installed counters over all the entrances in Fenwick Library a while back.  Smart little devices that offer an API as well.  Click the image to check out this particular experiment.

What does local use of Sci-Hub look like?

Product bundling thrives in markets with few competitive options.  Your cable company knows that and so do large academic publishers.   For years, they’ve sold collections of e-journals at a discount over what you’d pay to subscribe to each individual title in the bundle (though you probably wouldn’t if given the choice).  That’s a big deal, right?

But as the cost of these bundles has risen, far outpacing inflation, libraries have begun looking for alternatives. Some are letting their “big deals” expire while others are developing strategies to help inform those looming (and often fraught) renewal decisions.

SPARC (the Scholarly Publishing and Academic Resources Coalition) has been carefully tracking this activity and their work provides an easy way to keep up-to-date on most aspects of this issue.

But one question I’ve had for some time is what sort of gravitational pull are sites like Sci-Hub or ResearchGate exerting on the already disrupted orbits of users, libraries and publishers?  Put another way, if researchers are satisfying their content needs outside the library/publisher channel, shouldn’t we factor that into our strategy around these big deals?

I realize I’m not the first to ask who’s using Sci-Hub.  Here are just a few of the many articles that get at this topic:

Each talks about usage activity and traffic patterns but in a way that is little more than anecdotal background noise if you’re trying to fashion a local strategy and need to focus on what your local users are actually doing.  Simply asking who’s using these sites poses all sorts of problems.

I finally settled on analyzing DNS queries to our campus nameservers as a reasonable metric.  When a user on our campus network points a browser at one of these sites, our campus nameserver logs the query.  An imperfect measure to be sure (e.g., it ignores traffic to “shady” sites from off-campus affiliates using their ISP’s nameserver) but it does let me compare on-campus traffic to “pirate” sites with on-campus traffic to sites provided via our library’s subscriptions.

Mindful of privacy issues, I asked a friend in campus IT to take a list of 6 or 7 domains and derive an extract file from the DNS query logs, providing just date, time and query string for anything that matched the domain information I provided.  Here’s an excerpt of the result:

Producing this extract is now part of a weekly cron job so I’ll be able to monitor the relative use of these sites over the coming months.  In this one particular instance, I can’t wait for the Fall term to begin…  [ update:  You can see subsequent months here ]
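For anyone who wants to try something similar, here is a minimal sketch of how an extract like this could be tallied by day and domain. It is not the script we run. The line layout (date, time and query string separated by whitespace), the function names and the watch list are all assumptions for illustration, and sci-hub.example is a placeholder rather than a real domain.

```python
from collections import Counter

# Hypothetical watch list; the actual 6 or 7 domains are not listed in this post.
WATCHED_DOMAINS = ("sci-hub.example", "researchgate.net")


def watched_domain(query, watched=WATCHED_DOMAINS):
    """Return the watched domain that `query` falls under, or None."""
    name = query.lower().rstrip(".")  # DNS names are case-insensitive
    for domain in watched:
        # Match the domain itself or any of its subdomains, but not
        # unrelated names that merely end with the same characters.
        if name == domain or name.endswith("." + domain):
            return domain
    return None


def count_queries(lines, watched=WATCHED_DOMAINS):
    """Tally queries per (date, domain) from 'DATE TIME QUERY' lines."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        date, query = parts[0], parts[2]
        domain = watched_domain(query, watched)
        if domain is not None:
            counts[(date, domain)] += 1
    return counts


# Made-up sample lines in the assumed layout.
sample = [
    "2019-07-03 09:15:02 www.researchgate.net",
    "2019-07-03 09:15:40 sci-hub.example.",
    "2019-07-03 10:01:11 researchgate.net",
    "2019-07-04 08:30:00 notsci-hub.example",
]

for (date, domain), n in sorted(count_queries(sample).items()):
    print(date, domain, n)
```

Keeping only date, time and query string in the extract means a tally like this never touches anything that identifies an individual user.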

So what did I find by monitoring DNS queries between July 3rd and July 13th?

The graph shows activity for users on the campus network.  A better name for this post might be, “What does local use of ResearchGate look like?”

You don’t always have to write code

Sometimes what appears to be a programming task doesn’t actually require firing up your editor.

Consider this problem: we have two fixed-length text files, one with 42,000 lines and the other with 13,000.  In each file, a single line represents information about a particular user. If a person with status ‘X’ in the first file also appears in the second file with a status of ‘Y’, we need to keep the line in the first file and delete the line for that user from the second file. We can match up the user between files via the person’s ID# field, which appears in both files.

Real-world example: We have two different fixed-length files that we receive weekly from the campus computer center. For years we just sent those files (in Voyager SIF format) directly into our Voyager system to update patrons (students with one file, faculty and staff with the other). Moving to Alma, we decided the best course for our accelerated implementation was to let the computer center continue producing those SIF files and we’d take on the task of converting the information into the XML form that Alma was expecting. That did require a bit of code but it’s been working pretty well.  But not perfectly…

E-Content Usage Update for Fall 2017

We have no perfect way of assessing e-content usage by our students even though we’re now spending 75% or more of our collections budget on this sort of material.  We do receive and analyze COUNTER statistics but COUNTER stats focus on what’s being used and collapse all activity by students, faculty and staff into a single number for each source.  Fine as far as it goes, but I’m also interested in who’s using content.   Not down to the individual (I value the library’s reputation for privacy) but at least to some meaningful though suitably-anonymous aggregation.  Until I get a better tool, here’s how I go about answering a question like “how do the different majors use our e-content collections?”

The OER Metafinder Origin Story

I am a relative newcomer to the topic of OERs (Open Educational Resources).  Not unaware of the topic (our Mason Publishing Group has been working with faculty interested in affordable educational materials for some time now), but until lately I haven’t really been terribly involved in those efforts.

That changed one afternoon this summer as I grabbed my laptop and tagged along with them to a meeting with the Associate Provost for Undergraduate Education to talk about OERs.

As the meeting progressed (and moved ever further from my area of expertise) I started stealing moments to jump in and out of various OER aggregation sites, curious to see the sorts of resources already available on the net.

If you’ve spent much time with OERs, you won’t be surprised to hear that I discovered:

  • many dissimilar aggregations of content;
  • so many wildly-different interfaces;
  • so much duplication across these aggregations;
  • and such inconsistent metadata.

As I poked around, I could easily envision a faculty member, excited by the idea of OERs, feeling the enthusiasm drain away as she dove in and out of the various content silos.   Soon I found myself thinking much less about OERs and far more about how to improve their discoverability as a way to improve OER adoption…

Dashboarding Google Analytics

One of our skunkworks projects involves taking real-time Google Analytics data and building a visually interesting dashboard to report out activity on various library sites.

Click the image below to take a peek at our ever-evolving sandbox:

Three Social Media Library Services

A local library made news in 2010, announcing that it would archive every tweet ever posted.  With Twitter generating 500 million tweets a day, can we really be surprised that it’s proving to be a challenge?

Of course, that doesn’t mean there aren’t a host of smaller services we can build around social media. By way of example, here are three social media services we offer the Mason community. One’s pretty simple while the other two require a bit more infrastructure.


Mason Tweets

This curated feed from “official” and “near-official” twitter accounts from across the university offers a quick and easy way to take the “Mason Nation” pulse.

To produce this service, we created a MasonTweeter account on Twitter to follow Mason-related feeds.  The web presence is simply a page that embeds the MasonTweeter timeline.



Preztweets

An archive of every tweet from Mason’s President, Ángel Cabrera.

This service stems from a discussion I had with Dr. Cabrera a few years ago.  At that time, Twitter did not offer users an archive of their tweets (they do now), so we were looking into how we might save his tweets for future university historians.  We settled on a method that offers a searchable database of tweets stored locally in a MySQL database (suitable for future archiving).  Thanks to Andrew M. Whalen for the code that helped build this LAMP-based archiving service.


Social Feed Manager (SFM)

Just the other day, I set up our most ambitious social media service yet: Social Feed Manager.

SFM is a Django application developed by George Washington University Libraries to collect social media data from Twitter. It connects to Twitter’s approved API to collect data in bulk and makes it possible for scholars, students, and librarians to identify, select, collect, and preserve Twitter data for research purposes. We’re running SFM in a Docker container (using Docker for Mac) which simplifies installation and abstracts away much of the underlying complexity.

We have added Social Feed Manager to the suite of data services we offer out of the new Digital Scholarship Center we’ve been shaking down in beta since late January.

wget -> WARC

Had an email exchange yesterday with a group that wants to archive a few of their online web projects in our MARS system. Due to what I hope is a temporary vacancy in our staffing, that meant I had to take the lead in explaining how we handle archiving a website and what the end result of that process looks like.

I had a solid understanding of the “what we do” part. It’s pretty straightforward: we point a webcrawler at the up-and-running version of the site and pack what we get back into a WARC file. We then upload that WARC file to the DSpace instance that delivers our MARS service and flesh out the metadata. We’ve done a number of these and it works well. My immediate challenge was mastering the “how do we do this” part.

Until recently, the actual work of producing and archiving the WARC file was handled by Jeri Wieringa (who recently moved on from her still-spoken-of-in-reverent-tones stint with our Mason Publishing Group to new challenges). As I started down the WARC-enlightenment path, I was glad to see that our MARS entries for archived sites point the user to WARC-viewing tools for both Windows and MacOS. Seems we recommend Ilya Kreymer’s webarchiveplayer. So I started there…downloaded and installed the Mac version, pointed it at one of our archived WARC files and in minutes had a pretty firm grasp of this end of the process.

What I still didn’t know was precisely how we had been producing these WARC files. While on Ilya’s GitHub repository for webarchiveplayer I noticed a link to webrecorder, which I later learned is also deployed as a service. That looks like a large-scale solution and one I’ll set up and test soon. Until then, I’m going with a very lightweight method: wget.

This command, run from a Unix shell or terminal window, produces a usable WARC of this blog:

wget --recursive --warc-file=inodeblog --execute robots=off --user-agent=Mozilla

Building the WARC across the net (wget was running on my office iMac at Mason and this blog lives at Reclaim Hosting) took roughly 7 minutes. If you’re crawling sites you don’t own, you’ll probably want to add a --wait=X switch to pause for X seconds between requests and add your email to your --user-agent= string so people know who to contact if they don’t want to be crawled.
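If you end up crawling more than one site, it can help to assemble the command programmatically so the polite options don’t get forgotten. The following is only a sketch under my own assumptions: the function name, the example URL, the WARC name and the contact address are all hypothetical, and the user-agent wording is just one way to include an email. The wget switches themselves are the ones discussed above.

```python
import shlex


def build_wget_warc_command(url, warc_name, wait_seconds=None,
                            contact_email=None, ignore_robots=True):
    """Build a wget argument list that crawls `url` into a WARC file."""
    user_agent = "Mozilla"
    if contact_email:
        # Let site owners know who to contact about the crawl.
        user_agent += f" (contact: {contact_email})"

    args = ["wget", "--recursive", f"--warc-file={warc_name}"]
    if ignore_robots:
        # Mirrors the command above; consider leaving this off
        # for sites you don't own.
        args += ["--execute", "robots=off"]
    args.append(f"--user-agent={user_agent}")
    if wait_seconds is not None:
        # Pause between requests to go easy on the remote server.
        args.append(f"--wait={wait_seconds}")
    args.append(url)
    return args


# Hypothetical site and contact address, for illustration only.
cmd = build_wget_warc_command(
    "https://blog.example.org/",
    "exampleblog",
    wait_seconds=2,
    contact_email="archivist@example.org",
)
print(shlex.join(cmd))  # a shell-quoted command line you can inspect first
```

The list form can be passed straight to subprocess.run(cmd) once you are happy with what it prints.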

If you wonder how a website looks after it’s been WARC-ed, grab a copy of webarchiveplayer and have a look at the 16 megabyte WARC file for this blog: inodeblog.warc.gz

You can decipher my “switches” and see many other possibilities at this wget documentation site:

I’ll update this post with our “official” method for producing WARCs as soon as I either find a note where Jeri explained our process to me or learn enough to define and document a new standard for us.

* 2019 update:  This is the link you want to use for a WARC viewer:

A few Library-related RSS feeds

Not long ago I started using my RSS reader again.  As expected, my back-to-the-future fling with RSS found that many of my go-to feeds had shuttered in the intervening years, but happily many good ones were still around…

It didn’t take me long to rediscover that a decent news reader–and a good set of links–is a great and relatively painless way to stay current.  After a bit of pruning and with a few new additions, I thought it might be useful if I posted my current feed list.   It is a bit idiosyncratic–reflecting my interest in most aspects of digital librarianship.  If you have additions I should know about (and I hope there are many I’m missing), send me a link or better yet, an OPML file of your current feedlist.

send to: wallyg at gmu dot edu

My list as an OPML file is here.

Academic Librarian


Assessment on the Ground

Bibliographic Wilderness

Chicago Librarian

Data Pub Blog

Data Science ish (text mining, APIs, etc.)

Devil’s Tale (Duke U.)

dh+lib (Where the digital humanities and librarianship meet)

Digital Library Blog (Stanford)

Digital Preservation Matters (Chris Erickson)

Digital scholarship blog

Disruptive Library Technology Jester

Distant Librarian

Dueling Data (Data Visualization)

Feral Librarian

Go To Hellman

Hack Library School

Here and There

Impromptu Librarian

In the Library with the Lead Pipe

Information is Beautiful (Data Visualization)

Information Technology and Libraries

Inside Higher Ed | Blog U

iNode (George Mason University Digital Programs)

IO: In The Open

Archives and Special Collections – James Hardiman Library

Jaime Mears (Notes from a Nascent Archivist)

Joeyanne Libraryanne

Krafty Librarian

Laurie Allen

Library & Technology Blog

Library Assessment

Library Juice

Library Lost & Found

Library Tech Talk (University of Michigan)

A Library Writer’s Blog

Lorcan Dempsey’s Weblog

Matthew Reidsma

Information Wants to Be Free – Meredith Farkas

Metadata Blog (ALA)

Miriam Posner’s Blog

Musings about Librarianship  (Aaron Tay)

No Shelf Required

OpenAIRE blog

Pattern Recognition (Jason Griffey)

Pinboard (Dorothea Salo) You can learn a lot stalking the right bookmarker!

Planet Code4Lib This one aggregates a number of blogs

Preservation Underground (Duke U.)


The Signal: Digital Preservation (Library of Congress)

TechSoup for Libraries – Blog

Temple University Digital Scholarship Center

Text Mining, Analytics & More

Thoughts from Carl Grant

Waki Librarian