You don’t always have to write code

Sometimes what appears to be a programming task doesn’t actually require firing up your editor.

Consider this problem: Two fixed-length text files, one has 42,000 lines while the other has 13,000. In each file, a single line represents information about a particular user. If a person with status ‘X’ in the first file also appears in the 2nd file with a status of ‘Y’, we need to keep the line in the first file and delete the line for that user from the 2nd file. We can match up the user between files via the person’s ID# field which appears in both files.

Real word example: We have two different fixed-length files that we receive weekly from the campus computer center. For years we just sent those files (in Voyager SIF format) directly into our Voyager system to update patrons (students with one file, faculty and staff with the other). Moving to Alma we decided the best course for our accelerated implementation was to let the computer center continue producing those SIF files and we’d take on the task of converting the information into the XML form that Alma was expecting. That did require a bit of code but it’s been working pretty well. But not perfectly…

Some people appear in both files and the last file loaded sets the borrower status for that individual. For example, the grad student who also works in a department somewhere on campus ends up having the same borrowing privileges as a wages worker in the food court once we overwrite the student status from Student Information System (SIS) file with the same user’s record in the file we get from HR that shows a status of “wages.”

Understandably, a student complained about this the other day and ever since I’ve been trying to think of a programmatic fix. But before walking through my solution—and to make what I came up with a bit easier to follow–here’s a sample line from each of the two files

From the SIS file (42,000 lines)

00000 ro medm 1 2019.06.102020 03.2 54 G00009705 Jones Seneca T. 2205 Any Street, Anytown, Virginia [line continues…]

From the HR file (13,000 lines)

00000 po ireg 1 2019.06.102020 03.03 2803 G00223705 Everyman, Bob E. 1422 Stafford Street, Apt 7, Arlington, VA 22203 [line continues…]

There must be many ways to solve this problem. My first thought was a Perl program but my brute force algorithm (look for a grad student in the first file, then scan the 2nd file for a match) seemed too slow (stupid?) to pursue. Also considered OpenRefine but that quickly got very fiddly and I was looking for a server-side solution I could add into the script that handles the conversion to XML prior to sending the data to Alma. Finally, I thought, “what about a shell script where awk and grep do the heavy lifting?”

My algorithm ended up looking something like this:

1. Examine each line of the Student Information System file and where we find “medm” on a line in the 3rd column (a code for grad students), extract the Mason ID number (in column 8), and write just that ID# to the output file.

2. Examine each line of the HR file and write an output file that contains just the Mason ID number for anyone with the status of “wages” (matching records have “ireg” in the 3rd column).

3. Go through both files and where the Mason ID number in file one matches a Mason ID number in the 2nd file, add that ID# to a third file (contains only ID numbers that are in both original files).

4. Finally, grep the two files using an inverted match (e.g., -vf ) writing out lines from the original HR file that are not in the file we created in step 3 (where ID appeared in both original files). This is equivalent to deleting lines from file the HR.dat file for the wages person who appeared in SIS file as a grad student. Voilà! The grad status record doesn’t get overwritten when the processed HR file (hr_final.out) is loaded into Alma.

Four-step solution looks like this:

awk '{ if ($3 =="medm") { print $ 8}}' < SIS.dat > grad_IDs.txt

awk '{ if ($3 =="ireg") { print $ 8}}' <  HR.dat > wages_IDs.txt

awk -F\| 'NR==FNR{a[$1]++;next}a[$1]' grad_IDs.txt wages_IDs.txt > both.txt

grep -vf both.txt HR.dat > hr_final.out

Note: That third line has a pipe character after the -F\, not a capital “I” … and yes, for that third line I stand on the shoulders of giants who post answers on Stack Overflow 🙂

How long does it take this to run on a rather old RedHat 6.5 server? 1.5 seconds

Here’s a nice awk cheatsheet:

https://www.cheatography.com/tme520/cheat-sheets/awk-english/pdf/

Post Views: 5,942