Ever since I saw Jeffrey Zeldman’s comment on tag clouds (“Tag Clouds Are The New Mullets“), I’ve been interested in getting one. A tag cloud? No a mullet (just kidding). What I’ve been interested in is the way that tag clouds help you visualize the relative importance of particular content at a site. A list of subjects (tags) ranked by the number of times they occur would give the same information but like a regular haircut, not much visual impact.
With that bit of background, the other day I decided to build a tag cloud that captured and visually ranked the subject headings used in our catalog (710,000+ unique subjects). I realized that this wouldn’t be as informative as a years-consuming conspectus of the collection but what the heck, it would be visually compelling.
Step one involved running an SQL+ query against our Voyager system to produce a list of all the subject headings in our catalog and the number of bibliographic records attached to each of them. Here’s the SQL piece if you’re interested (it’s Voyager-specific and includes sub-headings):
from bib_index, bib_master
where bib_master.suppress_in_opac = ‘N’
and bib_master.bib_id = bib_index.bib_id
and substr(index_code,1,1) = ‘6’
group by normal_heading
That created a text file listing the subjects and the count, separated by a ‘~’ delimiter. I wrote a short PERL program to take that list and create a ready-to-import file for MySQL, building a new database that I could manipulate via PHP.
open(OUTFILE,”>subjex.sql”);
print OUTFILE “use subjex;\n”;
print OUTFILE “CREATE TABLE subjex (\n”;
print OUTFILE “ID int(11) DEFAULT ‘0’ NOT NULL auto_increment,\n”;
print OUTFILE “Subject char(230),\n”;
print OUTFILE “Count int(6),\n”;
print OUTFILE “PRIMARY KEY (ID)\n”;
print OUTFILE “);\n\n”;
$increment = 0;
$InputLine = <INFILE>;
while ($InputLine ne “”) {
$inc++;
@Chunks = split(/~/,$InputLine);
print OUTFILE ‘INSERT INTO subjex VALUES (‘.$inc.’,”‘.$Chunks[0].'”,”‘.substr($Chunks[1],0,5).'”);’.”\n”;
$InputLine = <INFILE>;
}
close OUTFILE;
close INFILE;
Two more commands…
# mysqladmin create subjex
# mysql < subjex.sql
…and I had a database. To make sure it was working right, I tried a few simple queries. In the course of this “research” I made what I think must be an important discovery:
| INTERNATIONAL ECONOMIC RELATIONS | 1388 |
| ECONOMICS | 1524 |
| CALIFORNIA MAPS TOPOGRAPHIC | 1528 |
| UNITED STATES RACE RELATIONS | 1559 |
| VIDEO RECORDINGS FOR THE HEARING IMPAIRED | 1645 |
| BUDGET UNITED STATES | 1859 |
| FEATURE FILMS | 3037 |
| ELECTRONIC BOOKS | 32887 |
OK, so I see we have quite a few Electronic Books but is that really a subject? Or how about our second most popular term Feature Films? Would a book on feature films that got made into a movie be filed under Feature Films—Feature Films? I love recursion but there’s a time and place for it.
I think I’m going to have to do a bit more work before I move to the final piece of my OPAC tag cloud…or maybe I’ll just finish the project and see how it looks. All that’s left is a PHP script that builds the webpage of alphabetized subject headings using a CSS stylesheet and the frequency data to scale the font up or down. I’m not yet sure how I’ll develop the necessary AI to toss font-scaling skews like Electronic Books but now that Web 2.0 has made everyone comfortable with a perpetual beta, I guess I don’t have to worry about that right away…