The awesomeness of awk

Despite all the stupid things happening in the Node.js community, I'm a big fan of Bryan Cantrill. He has pointed out the awesomeness of awk on several occasions, but I had never really felt the need to dig deeper into it. This all changed last week when I was presented with a 400-megabyte CSV file that I needed to process.

I needed to find the aggregated vote counts for all candidates in the Finnish local government election of 2012. The CSV containing votes for all candidates in the entire country can be found here. As I needed to do some rudimentary processing and aggregation on the data, I thought this would be a nice project for trying to do something useful with awk.

I had done quite a bit of CSV processing in PHP and Node.js. PHP, of course, has a pretty decent CSV parser in its core, and npm has several CSV-related modules for Node.js that are at least as decent. However, I'm absolutely certain that using awk was the most straightforward thing to do here. Have a look at the code:

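#!/usr/bin/awk -f
# Sum the votes per candidate from the semicolon-separated election CSV.
BEGIN {
  FS = ";";  # fields are separated by semicolons, not commas
}

{
  # Strip trailing whitespace from the fields that make up the key.
  sub(/[ \t]+$/, "", $22);
  sub(/[ \t]+$/, "", $16);
  sub(/[ \t]+$/, "", $17);
  municipality = $22;
  firstname = $16;
  lastname = $17;
  votes = $32;
  # One running total per municipality/candidate combination.
  candidates[municipality ";" lastname ";" firstname] += votes;
}

END {
  # The source file also contains the aggregated votes, so halve each sum.
  for (key in candidates) {
    print key ";" (candidates[key] / 2);
  }
}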

And that's it! It aggregates the total vote count for each candidate and outputs a CSV containing only municipalities, names and vote counts. The whole thing takes less than 15 seconds to run and, thanks to the wonderful nature of Unix streams (awk reads the input one record at a time and keeps only the running totals in memory), the awk process never uses more than 3.5 megabytes of memory.
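
To make the aggregation step concrete, here is a minimal sketch of the same pattern on a made-up three-column input. The sample rows, the column layout and the `aggregate_votes` function name are assumptions for illustration only; the real file has many more columns, and this sketch leaves out the division by two.

```shell
# Hypothetical helper: sums the third column per (first;second) key.
aggregate_votes() {
  awk -F';' '
    {
      sub(/[ \t]+$/, "", $1)       # strip trailing whitespace from the key fields
      sub(/[ \t]+$/, "", $2)
      totals[$1 ";" $2] += $3      # one running sum per distinct key
    }
    END {
      for (key in totals) print key ";" totals[key]
    }
  '
}

# Made-up sample rows: municipality;lastname;votes
printf '%s\n' \
  'Helsinki ;Virtanen ;10' \
  'Espoo;Korhonen;7' \
  'Helsinki ;Virtanen ;5' | aggregate_votes | sort
# prints:
# Espoo;Korhonen;7
# Helsinki;Virtanen;15
```

The `sort` at the end is only there because awk makes no promise about the order in which `for (key in totals)` visits the keys.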

Sure, parsing CSV files is not the most difficult task in the world and yes, this was a pretty well-suited case for awk. However, now that I know how efficient a tool awk is, I'm quite eagerly looking forward to using it again soon.


BTW: The 400 MB source CSV seemed to contain the aggregated votes in addition to the individual rows, so every vote ends up in the sum twice. That's why the aggregated vote count is divided by two when writing out the results.
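
The arithmetic behind the division can be sketched on made-up numbers. This assumes each candidate has a few detail rows plus one row that already holds their total; the actual layout of the source file may differ, and `halved_totals` is a hypothetical name.

```shell
# Hypothetical helper: sums the second column per name, then halves the sum.
halved_totals() {
  awk -F';' '
    { sum[$1] += $2 }
    END { for (name in sum) print name ";" sum[name] / 2 }
  '
}

# Two detail rows (10 and 5) followed by the pre-aggregated row (15):
# the raw sum is 30, and halving it recovers the real total.
printf '%s\n' 'Virtanen;10' 'Virtanen;5' 'Virtanen;15' | halved_totals
# prints: Virtanen;15
```

Halving only works when every vote appears exactly twice in the input. If the file marks its aggregate rows in some column, filtering on that would be more robust.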