Subject Headings

      1 Comment on Subject Headings

Ever since I saw Jeffrey Zeldman’s comment on tag clouds (“Tag Clouds Are The New Mullets“), I’ve been interested in getting one. A tag cloud? No a mullet (just kidding). What I’ve been interested in is the way that tag clouds help you visualize the relative importance of particular content at a site. A list of subjects (tags) ranked by the number of times they occur would give the same information but like a regular haircut, not much visual impact.

With that bit of background, the other day I decided to build a tag cloud that captured and visually ranked the subject headings used in our catalog (710,000+ unique subjects). I realized that this wouldn’t be as informative as a years-consuming conspectus of the collection but what the heck, it would be visually compelling.

Step one involved running an SQL+ query against our Voyager system to produce a list of all the subject headings in our catalog and the number of bibliographic records attached to each of them. Here’s the SQL piece if you’re interested (it’s Voyager-specific and includes sub-headings):

select normal_heading||’~’||count(*)
from bib_index, bib_master
where bib_master.suppress_in_opac = ‘N’
and bib_master.bib_id = bib_index.bib_id
and substr(index_code,1,1) = ‘6’
group by normal_heading

That created a text file listing the subjects and the count, separated by a ‘~’ delimiter. I wrote a short PERL program to take that list and create a ready-to-import file for MySQL, building a new database that I could manipulate via PHP.

#!/usr/bin/perlopen(INFILE,”SC.txt”);
open(OUTFILE,”>subjex.sql”);

print OUTFILE “use subjex;\n”;
print OUTFILE “CREATE TABLE subjex (\n”;
print OUTFILE “ID int(11) DEFAULT ‘0’ NOT NULL auto_increment,\n”;
print OUTFILE “Subject char(230),\n”;
print OUTFILE “Count int(6),\n”;
print OUTFILE “PRIMARY KEY (ID)\n”;
print OUTFILE “);\n\n”;

$increment = 0;
$InputLine = <INFILE>;
while ($InputLine ne “”) {
$inc++;
@Chunks = split(/~/,$InputLine);
print OUTFILE ‘INSERT INTO subjex VALUES (‘.$inc.’,”‘.$Chunks[0].'”,”‘.substr($Chunks[1],0,5).'”);’.”\n”;
$InputLine = <INFILE>;
}
close OUTFILE;
close INFILE;

Two more commands…

# mysqladmin create subjex
# mysql < subjex.sql

…and I had a database. To make sure it was working right, I tried a few simple queries. In the course of this “research” I made what I think must be an important discovery:

Subject cataloging has some problems
I asked “Select Subject,Count from subjex where Count > 500 order by Count” and found this at the bottom of the output:

| ETHICS | 1378 |
| INTERNATIONAL ECONOMIC RELATIONS | 1388 |
| ECONOMICS | 1524 |
| CALIFORNIA MAPS TOPOGRAPHIC | 1528 |
| UNITED STATES RACE RELATIONS | 1559 |
| VIDEO RECORDINGS FOR THE HEARING IMPAIRED | 1645 |
| BUDGET UNITED STATES | 1859 |
| FEATURE FILMS | 3037 |
| ELECTRONIC BOOKS | 32887 |

OK, so I see we have quite a few Electronic Books but is that really a subject? Or how about our second most popular term Feature Films? Would a book on feature films that got made into a movie be filed under Feature Films—Feature Films? I love recursion but there’s a time and place for it.

I think I’m going to have to do a bit more work before I move to the final piece of my OPAC tag cloud…or maybe I’ll just finish the project and see how it looks. All that’s left is a PHP script that builds the webpage of alphabetized subject headings using a CSS stylesheet and the frequency data to scale the font up or down. I’m not yet sure how I’ll develop the necessary AI to toss font-scaling skews like Electronic Books but now that Web 2.0 has made everyone comfortable with a perpetual beta, I guess I don’t have to worry about that right away…