Subject Headings

Ever since I saw Jeffrey Zeldman’s comment on tag clouds (“Tag Clouds Are The New Mullets“), I’ve been interested in getting one. A tag cloud? No a mullet (just kidding). What I’ve been interested in is the way that tag clouds help you visualize the relative importance of particular content at a site. A list of subjects (tags) ranked by the number of times they occur would give the same information but like a regular haircut, not much visual impact.

With that bit of background, the other day I decided to build a tag cloud that captured and visually ranked the subject headings used in our catalog (710,000+ unique subjects). I realized that this wouldn’t be as informative as a years-consuming conspectus of the collection but what the heck, it would be visually compelling.

Step one involved running an SQL+ query against our Voyager system to produce a list of all the subject headings in our catalog and the number of bibliographic records attached to each of them. Here’s the SQL piece if you’re interested (it’s Voyager-specific and includes sub-headings):

select normal_heading||’~’||count(*)

from bib_index, bib_master

where bib_master.suppress_in_opac = ‘N’

and bib_master.bib_id = bib_index.bib_id

and substr(index_code,1,1) = ‘6’

group by normal_heading

That created a text file listing the subjects and the count, separated by a ‘~’ delimiter. I wrote a short PERL program to take that list and create a ready-to-import file for MySQL, building a new database that I could manipulate via PHP.

#!/usr/bin/perlopen(INFILE,”SC.txt”);

open(OUTFILE,”>subjex.sql”);
print OUTFILE “use subjex;\n”;

print OUTFILE “CREATE TABLE subjex (\n”;

print OUTFILE “ID int(11) DEFAULT ‘0’ NOT NULL auto_increment,\n”;

print OUTFILE “Subject char(230),\n”;

print OUTFILE “Count int(6),\n”;

print OUTFILE “PRIMARY KEY (ID)\n”;

print OUTFILE “);\n\n”;
$increment = 0;

$InputLine = <INFILE>;

while ($InputLine ne “”) {

$inc++;

@Chunks = split(/~/,$InputLine);

print OUTFILE ‘INSERT INTO subjex VALUES (‘.$inc.’,”‘.$Chunks[0].'”,”‘.substr($Chunks[1],0,5).'”);’.”\n”;

$InputLine = <INFILE>;

}

close OUTFILE;

close INFILE;

Two more commands…

# mysqladmin create subjex
# mysql < subjex.sql

…and I had a database. To make sure it was working right, I tried a few simple queries. In the course of this “research” I made what I think must be an important discovery:

Subject cataloging has some problemsI asked “Select Subject,Count from subjex where Count > 500 order by Count” and found this at the bottom of the output:

| ETHICS                                         |  1378 |

| INTERNATIONAL ECONOMIC RELATIONS               |  1388 |

| ECONOMICS                                      |  1524 |

| CALIFORNIA MAPS TOPOGRAPHIC                    |  1528 |

| UNITED STATES RACE RELATIONS                   |  1559 |

| VIDEO RECORDINGS FOR THE HEARING IMPAIRED      |  1645 |

| BUDGET UNITED STATES                           |  1859 |

| FEATURE FILMS                                  |  3037 |

| ELECTRONIC BOOKS                               | 32887 |

OK, so I see we have quite a few Electronic Books but is that really a subject? Or how about our second most popular term Feature Films? Would a book on feature films that got made into a movie be filed under Feature Filmsâ€”Feature Films? I love recursion but there’s a time and place for it.

I think I’m going to have to do a bit more work before I move to the final piece of my OPAC tag cloud…or maybe I’ll just finish the project and see how it looks. All that’s left is a PHP script that builds the webpage of alphabetized subject headings using a CSS stylesheet and the frequency data to scale the font up or down. I’m not yet sure how I’ll develop the necessary AI to toss font-scaling skews like Electronic Books but now that Web 2.0 has made everyone comfortable with a perpetual beta, I guess I don’t have to worry about that right away…

Post Views: 155