03 May 2014

A PhD Workflow for Programmers

I’ve been working towards my PhD for a bit over three years now and have tried a lot of different tools for improving my workflow during that time. For the sake of others about to start a PhD or currently working towards one, I’ve decided to document what I’ve found works. I can summarise it in two words: build automation. Ideally you want to be able to build all of your intellectual outputs with a single call on the command line. Do this and you’ll have an unambiguous record of the steps you took to produce your results, while solving all manner of other problems along the way, which is what this post is largely about.

I’d suggest starting your PhD like this:

cd ~
mkdir -p phd
cd phd
echo "echo \"TODO: complete PhD\"" > build_phd.sh
chmod a+x ./build_phd.sh

This sets you up with a ~/phd directory with an executable main build script build_phd.sh.

The invariant to maintain from this point forward is the ability to reproduce all PhD outputs with a single call to build_phd.sh. I’ll get you off to a good start building out build_phd.sh in just a sec, but first:

cd ~/phd
mkdir -p thesis
cd thesis
touch main.tex

This sets you up with a skeleton LaTeX document for your PhD. The ~/phd directory now looks like:

├── build_phd.sh
└── thesis
    └── main.tex

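Now open main.tex and add the following as the body of a minimal document (pdflatex needs a \documentclass line and a document environment around it before it will compile):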

TODO: write my thesis.

Then open build_phd.sh and edit it to read:

#!/bin/bash
# bash is needed for the {1..2} brace expansion below.
echo "compiling the PhD..."
cd thesis
pdflatex main.tex
bibtex main
# Run twice more so citations and cross-references resolve.
for run in {1..2}; do pdflatex main.tex; done
cd ../

Your thesis will now be compiled every time you run build_phd.sh.

The idea is that over time you’ll start fleshing out your computations in build_phd.sh and your thesis in main.tex. A big portion of main.tex will just be your prose. But it will also contain figures, tables and variables: in other words, data. Before we get onto how to handle this properly, let’s talk about how not to.

The greedy algorithm for completing a PhD is to take your interesting results and manually enter/manipulate/copy/paste them into your thesis. The flaw in this strategy is the assumption that you’ll only ever produce and transfer the results into the thesis once. In reality, you’ll produce the results, insert them into the thesis, then later realise they need to be redone for one reason or another. On a timescale of minutes to days, expect the presentation of your results (or the results themselves) to change tens of times. At first this doesn’t seem like much of a problem if you enjoy copy/paste. Laboriousness isn’t the only problem though. The bigger issue is that by violating the DRY (don’t repeat yourself) principle you’ve now got your results in two places: wherever you output them when you produced them, and in your thesis document. You’re essentially relying on yourself to remember this and update the thesis accordingly when the results inevitably change.

Things only get worse when you start submitting your work for peer review. As it turns out, academia is slow as shit when it comes to giving feedback. It typically takes on the order of months to get a reviewer’s feedback on a paper. And guess what—they’re probably going to be critical of at least one of your results or at a minimum how they’re presented. By the time you get reviewer feedback six months have passed and you have NFI how you generated your results in the first place. Revising prose will feel like a walk in the park compared to revising results.

The thing about the copy/paste approach to getting results into your thesis is that it allows you to get lazy on results production as well. I’m telling you this from firsthand experience. I once wrote a set of Java simulations for a paper I submitted. Several months after I submitted the paper I got feedback indicating there were improvements that needed to be made. I couldn’t remember a damn thing about how I generated the results. I couldn’t even remember the name of the simulator I used. I still can’t. Did I run it in Eclipse? Did it use any special configuration file? What tool did I use to plot the results? What format was the intermediate data in between running the simulation and plotting the results? Did I even write a script to plot the results, or did I do it all interactively on the command line? This, folks, is one very good reason to automate your PhD outputs.

With the greedy solution out of the way, onto the optimal solution. The optimal solution is to build out the build_phd.sh script to compute all of your results from the raw input data and then inject those results into the thesis. An example of this would be to add code in build_phd.sh that runs a simulation over your raw input data, generates an important figure my_awesome_figure.pdf and saves it into your thesis directory. You end up with a directory structure like:

├── build_phd.sh
└── thesis
    ├── main.tex
    └── my_awesome_figure.pdf

where main.tex pulls the figure in by name, for example with \includegraphics{my_awesome_figure} from the graphicx package. Now whenever you run build_phd.sh, my_awesome_figure.pdf gets regenerated based on your code and recompiled into your thesis. build_phd.sh is your record of precisely how you did it, which you can refer back to long after you’ve forgotten the specifics.
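As a rough sketch of what such a stage might look like (not a definitive implementation): simulate and plot_results below are hypothetical stand-ins for your own simulator and plotting script, and simulation_results.dat is a made-up name for the intermediate data. The point is only that every step from raw data to figure is written down.

```shell
#!/bin/bash
# Sketch of one results stage. simulate and plot_results are hypothetical
# commands that read stdin and write stdout; swap in your own tools.

build_figure() {
    local raw_data="$1"      # raw input data
    local thesis_dir="$2"    # directory containing main.tex
    local figure="$thesis_dir/my_awesome_figure.pdf"
    # Hypothetical intermediate file, kept so the data behind the figure can be inspected.
    local results="$thesis_dir/simulation_results.dat"

    simulate < "$raw_data" > "$results" || return 1
    # Plot into a temporary name first so a failed run never leaves a half-written figure.
    plot_results < "$results" > "$figure.tmp" || { rm -f "$figure.tmp"; return 1; }
    mv "$figure.tmp" "$figure"
}
```

With this in place, build_phd.sh would call something like build_figure data/trace.txt thesis before the pdflatex step, where data/trace.txt is again an assumed path.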

Another good candidate for automation is any simple variable derived from computation. Say you performed your analysis over a trace with 283 widgets, or at least you think you did, until you found out there were actually 284 widgets. If you’d been following the greedy strategy, you would have hard-coded the value 283 everywhere in your thesis and would now need to go back and update it. To treat it as data instead of prose, you could have a step in build_phd.sh that emits a file widgets.tex containing the number of widgets. So now your directory looks like:

├── build_phd.sh
└── thesis
    ├── main.tex
    ├── my_awesome_figure.pdf
    └── widgets.tex

In the thesis you then replace the literal 283 with \input{widgets}. At first widgets.tex reads 283, and so 283 ends up in the thesis. Later on when you fix up the computation, widgets.tex reads 284 and this automatically makes its way into the thesis. There’s no worrying about forgetting to replace the value anywhere because you’ve explicitly tied it to the computation.
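Here is a minimal sketch of that step, assuming the trace is a plain text file with one widget per line. The function name and the file format are made up for illustration; your own data will need its own counting logic.

```shell
#!/bin/bash
# Sketch: derive a number from the data and write it where LaTeX can \input it.
# Assumes one widget per line in the trace file.

emit_widget_count() {
    local trace_file="$1"   # raw input data (assumed: one widget per line)
    local tex_file="$2"     # e.g. thesis/widgets.tex
    local count
    # Redirecting the file into wc prints only the number, not the file name.
    count=$(wc -l < "$trace_file")
    # Some wc implementations pad the number with spaces; arithmetic expansion strips them.
    count=$((count))
    # The trailing % stops LaTeX from turning the final newline into a space.
    printf '%s%%\n' "$count" > "$tex_file"
}
```

Calling emit_widget_count data/trace.txt thesis/widgets.tex from build_phd.sh (data/trace.txt being a hypothetical path) keeps \input{widgets} in step with the data on every build.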

You can obviously make your directory structure as simple or as complicated as you like. The key point is to try and automate the whole sequence of events starting at input data and ending at the presentation of results in your thesis.

Aside from the aforementioned benefits of build automation, there’s another nice by-product you get for free, and the research community will thank you for it. By automating your work, you make your results trivially reproducible. Other fields don’t have this luxury. They have to settle for describing their methods in prose. You, on the other hand, get an unambiguous description of your method that someone else can execute, analyze and modify. Even if you need to add anonymizing features to your data before distributing your code, you’ve still got a very solid foundation to build on.

So automation has largely been covered. There’s still the question of how this all works when you have to collaborate with others, primarily your supervisors. The good and/or bad news is, your supervisors will probably never read or execute a single line of your code. They will read and review your written outputs though. I’ve found the best way to collaborate on academic works (papers, technical reports, theses) is through Scribtex. Scribtex lets you write and compile LaTeX documents in a web browser and share them with collaborators. The good thing about this is, your collaborators can view and edit the document in a web browser. This means you’re not relying on your collaborators having LaTeX set up on their machines in order to modify the document. The other nice thing about Scribtex is that it includes a version history feature, so when your supervisor asks what changed since the last time they saw it, you’ve got a URL to point them to.

At this point I know what you’re thinking: how am I meant to automate my PhD build when my thesis document is hosted on the web? Well, the good news is, Scribtex supports Git. This means you can sync the document to your own computer into a subdirectory of your phd directory, run your build script and push the changes back to Scribtex when you’re ready. This is actually good for other reasons too. Firstly, it’s now backed up locally on your machine, decorrelating your failures. You see, it’s 2014, and losing critical data because you only ever kept a single copy, or multiple copies in one geographic location, is becoming embarrassing. The second reason is, you shouldn’t be writing in a web browser; leave that to your supervisors, who will be making minor edits. You want to be using a proper text editor. Seriously, you’re going to write a lot throughout your PhD, and you’ll start realising productivity gains after a few weeks of using a decent text editor. So bite the bullet, pick one, and learn it. If you can’t decide which, use Emacs.

A little note re. “you should definitely use Scribtex”: you might be able to get away with using Dropbox. Dropbox has a kind-of versioning system if you look hard enough. Personally I prefer the Git workflow (that your supervisors never have to know about) that comes with Scribtex and find it’s better set up for the task at hand. Seriously, you nominate a main file for each project on Scribtex, and there’s literally a big “View as PDF” button that your supervisors can then press to see the document. There’s also a very prominent “History” button, which as an added bonus will include commit comments alongside each version if you’ve been using Git.

While we’re on the topic of Git, one more thing: you should also use it for your main phd directory. I won’t try and explain Git itself here; there’s heaps of good material you can look up on Google. The way I like to do this though is to have a main Git repository for my PhD, i.e. I turn the phd directory into a Git repository. Then, for each paper, I add the corresponding Scribtex Git repository as a submodule of the phd repository. Essentially the way submodules work is that you tie a specific version (commit) of the submodule to a specific version of the parent project when you make a commit in the parent project. This means versions of your code get tied to versions of your papers in each of the main repository’s commits. The way things have played out, GitHub has become synonymous with Git and seems to be the cloud host of choice for syncing Git repositories. At the time of writing, GitHub offers free micro accounts to students, meaning you can host your code privately online.
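The setup above might be sketched roughly as follows. The function names and the papers/<name> layout are my own invention for illustration, and the remote URL is whatever Git address your document host gives you; nothing here is specific to Scribtex.

```shell
#!/bin/bash
# Sketch of the repository layout described above.
# Function names and the papers/ subdirectory are illustrative assumptions.

# Turn the phd directory into a Git repository (safe to re-run).
init_phd_repo() {
    local phd_dir="$1"
    mkdir -p "$phd_dir"
    git -C "$phd_dir" init --quiet
}

# Attach a paper's repository as a submodule at papers/<name>.
add_paper_submodule() {
    local phd_dir="$1"
    local remote_url="$2"   # placeholder: the Git URL of the paper's repository
    local name="$3"
    git -C "$phd_dir" submodule add "$remote_url" "papers/$name" || return 1
    # The parent commit records exactly which commit of the paper goes with this code.
    git -C "$phd_dir" commit --quiet -m "Add $name as a submodule"
}
```

After that, committing in the phd repository whenever you pull new paper edits keeps each version of the code paired with the version of the paper it produced.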

So that’s pretty much it. Once you get the basic tooling in place to automate your PhD build and ease collaboration you just keep building out the build_phd.sh script with your computations. Of course this doesn’t mean shoving all your code in one file—you should be calling subscripts, but you get the general idea.
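One way to keep build_phd.sh thin is to have it run numbered sub-scripts in order. This is only a sketch under assumptions: the steps directory and the 01_, 02_ naming scheme are hypothetical, not something prescribed by any tool.

```shell
#!/bin/bash
# Sketch: build_phd.sh as a thin driver over numbered sub-scripts.
# The directory passed in and its naming scheme are assumptions.

run_steps() {
    local steps_dir="$1"
    local step
    # Glob expansion is sorted, so 01_simulate.sh runs before 02_plot.sh.
    for step in "$steps_dir"/*.sh; do
        [ -e "$step" ] || continue   # no steps yet: the glob stays unexpanded
        echo "running $(basename "$step")..."
        # Stop at the first failing step so the thesis is never built from stale results.
        bash "$step" || { echo "step failed: $step" >&2; return 1; }
    done
}
```

build_phd.sh would then boil down to run_steps steps followed by the LaTeX compilation, with each computation living in its own file.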

I should probably leave it at that, but there’s one other sensitive topic that needs to be addressed: Microsoft Word. Like a programmer who refuses to learn how to touch type, occasionally you’ll come across an academic who refuses to learn LaTeX. To be clear, LaTeX is a massive pain in the arse, but at least it’s a plain text pain in the arse. With Word, it’s very hard (if not impossible?) to automate your PhD build as described above because you have no way to inject variables or figures into your document. You’re also missing out on working in a good keyboard-oriented text editor. Both LaTeX and good text editors have a steeper learning curve than Word, but you’ll reach the intersection point long before completing your PhD. So, do what you can to convince your supervisors to work with you in LaTeX. If you can’t do that, see if they’re happy to just print things and make handwritten notes. It’s not optimal for you, but often it is for them. You’ll probably find they do this regardless of how they feel about the Word/LaTeX debate.

And now for some disclaimers. My PhD is very much data-driven: take input trace data, perform computations over it and generate results. If you’re doing a PhD in pure mathematics then maybe the workflow above doesn’t make sense for you. But I’m sure you’ve figured that out. Take what’s relevant and leave the rest. Another thing I can see some people having a sook about: the command line. I’ll remind you that the title of this post is “A PhD Workflow for Programmers”. Still, you might have been like me and miraculously made it through an IT degree that technically never required you to touch the command line, if you chose your electives just right. I can tell you from experience, picking up the command line really isn’t that scary, particularly not on the timescale of a PhD. You’ll also come to find that I/O redirection combined with built-in commands often trivially solves routine computation steps, meaning you get to write less code. Even if you’re familiar with none of the tools mentioned in this post (Git, shell scripting, LaTeX, Emacs), think of these as learning opportunities. You’ll walk away with an appreciation of a widely used version control system, a generic scripting tool, a professional-grade typesetter and a text editor that might even one day become your preferred IDE.

Final note: don’t get too caught up in tooling. You want to leave some time for, y’know, research.