Fun with the 245 tag

Over the past few months I have, on more than one occasion, found myself making a full extract of the bibliographic (MARC) records in our library’s catalog. Turns out, this sort of thing happens frequently when you run your own ILS locally but also belong to a consortium where individual members are busily adding different “discovery” layers to the underlying catalog that they all share.

Some of this work is constant; some of it comes in spurts. Nightly, for example, I have a script that updates an AquaBrowser instance our consortium operates and then sends a second copy of those changes to Serials Solutions so another member's Summon instance will reflect more current information. Less frequently, I respond to requests for an extract to populate a trial instance of some other product a member is considering.

Let’s not even mention the skunkworks instance of VuFind that I run as a sort of library geek hobby.

Yesterday, I decided to take one of those extract files and try a text-mining experiment on my desktop Mac. To begin, I ran the file of nearly 1.7 million MARC records through MarcEdit in a Windows VM to produce a plain-text version of the data.

As a first cut, I extracted all the 245 tags (titles) from the file.

grep "=245 " MasonBibs.txt > 245tags.txt [return]

which yielded:

=245  10$aAdolescence.$cPrepared by the society's committee. Edited ...
=245  10$aGuidance in educational institutions.
=245  14$aThe teaching of reading: a second report.
=245  10$aHighways into the Upper Amazon Basin.
=245  10$aProust et le roman,$bessai sur les formes et techniques ...
=245  10$aCreative management in banking.

Interesting, but clearly more processing was needed. With a short perl script (sketched just after the excerpt below) I removed the tag labels, subfield codes, and most punctuation. Seconds later, I had a 169MB text file that looked like this short excerpt (one 245 tag per line):

Adolescence Prepared by the society s committee Edited by Nelson B Henry
Guidance in educational institutions
The teaching of reading a second report
Highways into the Upper Amazon Basin
Proust et le roman essai sur les formes et techniques du roman dans
Creative management in banking
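
I didn't include that cleanup script above, so here's a sketch reconstructed from the before-and-after excerpts; the exact substitutions in the original may have differed, and the file name clean245.pl is just what I'm calling it here:

#!/opt/local/bin/perl
use strict;
use warnings;

# Reduce MarcEdit's mnemonic 245 lines to bare words:
# "=245  10$aThe teaching of reading: a second report."
# becomes "The teaching of reading a second report".
while (my $line = <>) {
    chomp $line;
    $line =~ s/^=245\s+\d\d//;      # drop the tag label and indicators
    $line =~ s/\$[a-z0-9]/ /g;      # turn subfield codes ($a, $b, $c ...) into spaces
    $line =~ s/[[:punct:]]/ /g;     # turn punctuation into spaces
    $line =~ s/\s+/ /g;             # collapse runs of whitespace
    $line =~ s/^ | $//g;            # trim the ends
    print "$line\n";
}

Invoked like so, it produces the 245tags.out file that the word-count script below reads:

clean245.pl < 245tags.txt > 245tags.out [return]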

A second perl script normalized the capitalization, then split out and counted the words. I used the "%08d" construct in the "printf" statement to ensure I'd have a list sortable by usage when the script finished.

#!/opt/local/bin/perl
use strict;
use warnings;

# Tally how many times each word appears across all of the titles.
my %count_of;
while (my $line = <>) {
    $line = lc $line;                      # normalize capitalization
    foreach my $word (split /\s+/, $line) {
        $count_of{$word}++;
    }
}

# Zero-padding the counts makes the output sortable by usage.
for my $word (sort keys %count_of) {
    printf "%08d : %s\n", $count_of{$word}, $word;
}

countwords.pl < 245tags.out > wordlist.txt [return]

Here’s an excerpt of the output:

00000002 : salinewater
00000001 : saling
00000001 : salingar
00000064 : salinger
00000002 : salinghi
00000002 : salinian
00000001 : salinisation
00000004 : salinities

The final step was to sort the 453,672 words/lines in this file by number of occurrences. Thanks to the zero-padded counts, a plain lexical sort does the job, with the most frequent words landing at the bottom of the file (add sort's -r flag if you'd rather see them first):

sort < wordlist.txt > 245tags_sorted.txt [return]

Voilà! I now know that these are the four most common words used in titles represented in our catalog:

the (1,343,112)
of (1,190,200)
and (918,245)
by (522,495)

then two outliers:

resource (450,200)
electronic (448,118)

and then back to prepositions and other unsurprising terms:

in (363,788)
a (346,909)
to (286,701)
on (252,793)
for (229,914)
edited (155,248)
states (126,221)
from (125,065)
with (124,319)
united (123,889)
committee (86,990)

Obviously, there’s not much of interest here, and the point of this post is really to share the methodology and code snippets for anyone interested in running other experiments. However, I did find it odd that the words “electronic” and “resource” have reached what I’d consider stopword status. Could we really be moving that close to the digital library I’ve been working toward all these years?

Well, I’d like to think so, but I’m guessing the real cause is that not too long ago we loaded 306,000+ records from the Lexis-Nexis US Serial Set, and that load has skewed the frequency counts for several terms. My sense is that a large load of these sorts of specialized records also has a negative effect on most users; that is, it builds an ever-larger haystack for those seeking a needle of information that has nothing to do with that particular set of records.

Of course, that’s a problem we need to solve with better search tools, not by restricting the scope of our content.

Beyond developing a workflow that might yet yield an interesting outcome, there was one small spinoff benefit to this little wordcount experiment. Skimming the list of words that appeared only once across all titles, I was able to easily spot a number of misspellings.

For example, if you look at that little excerpt of my original word count file, you’ll see:

00000001 : salinisation

I checked the catalog and yes, it’s a misspelling (although the variant title and subject headings saved the “word anywhere” searcher on this one):

Title:              Salinisation of land and water resources : human causes, extent...
Variant Title:      Salinization of land and water resources
Primary Material:   Book
Subject(s):         Salinization --Control.
                    Salinization --Control --Case studies.
                    Soil salinization.
                    Water salinization.

Now I’m wondering if there’s a way to pair this tag-extraction work with a spell-checker, as in the sketch below, to assist in some automated way with our never-ending quest for perfect metadata.
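
As a first pass, something like this might work, assuming the Text::Aspell module (a Perl binding to GNU Aspell) and an English dictionary are installed; the singletons-only filter and the script name are just my first guesses at an approach:

#!/opt/local/bin/perl
use strict;
use warnings;
use Text::Aspell;

# Read the word-count output, keep only the words that appeared once,
# and print any that the spell-checker doesn't recognize.
my $speller = Text::Aspell->new or die "could not create speller";
$speller->set_option('lang', 'en_US');

while (my $line = <>) {
    next unless $line =~ /^00000001 : (\S+)/;   # singletons only
    my $word = $1;
    next unless $word =~ /^[a-z]+$/;            # skip numbers and the like
    print "$word\n" unless $speller->check($word);
}

check_singletons.pl < wordlist.txt > suspects.txt [return]

Foreign-language titles would no doubt generate plenty of false positives, so the resulting list would still need a human skim, but it beats eyeballing 453,672 lines.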