In my last blog post, I introduced the book that Brian O'Connor and I recently co-authored, Genomics in the Cloud, published by O'Reilly Media. Today I'd like to give you a more personal perspective on where this book is coming from and what it means to me in relation to my past work with GATK.
Getting started is hard
When I joined the GATK team in 2012 to write documentation and provide tech support to the research community, I had no clue what I was getting myself into. All I knew about genomics revolved around bacterial plasmids — I had no idea what "calling variants" even meant — and the most computationally sophisticated analysis I had ever done involved a bunch of hand-rolled Python spaghetti code that ran on my 2008 MacBook.
My first few months (years?) on the job were absolutely brutal. No fault of my new colleagues on the GATK team; they were all lovely and welcoming. But as a newcomer to the field, I was faced with a double whammy of epic proportions, with new complexities coming at me every day from both the scientific side and the computing side. Not to mention the jargon — oh my goodness, so much jargon.
I could go on, but I'll stop there because I think most of you know exactly what I mean. You're probably going through or have gone through something very similar in your own career development. I've heard the same basic story time and time again on the forum, in workshops and at conferences, from people with a wide variety of backgrounds — getting started in genomics is hard.
Wrangling a mountain of information is hard too
Fast forward to the present day: the combined efforts of many team members have produced a vast collection of knowledge in various forms (traditional doc articles, dictionary entries, hands-on tutorials, blog posts, workshop presentations), all to alleviate that pain and help you use GATK tools effectively to get your work done. Now the challenge has become: How do you tackle that mass of information? How do you traverse it? In what order do you consume its contents? It's a surprisingly hard problem, both for you and for us (when we have to decide how to present and organize documentation). And it only gets harder as the scope of the toolkit grows and the tools themselves keep evolving.
This is where the book comes in
Three years ago, Brian O'Connor from UCSC reached out to the GATK team in search of a co-author to develop a book on the intersection of genomics and cloud computing, at the behest of software publishing giant O'Reilly (as in "the O'Reilly animal books"). There was a brief email thread as the request ping-ponged through the team and ended up in my inbox; I think Brian heard my "HECK YEAH" all the way in California.
In my mind, this was the perfect opportunity, the excuse I needed, to take a big step back and spend some time figuring out how to tell the whole story of "how to GATK" in a more linear and accessible fashion. (At the time, the cloud computing part of the brief felt more like a technical detail. And although my thinking on that point has evolved considerably, I'd say at least two-thirds of the book ended up largely agnostic of computing infrastructure. I plan to discuss that in more detail in a follow-up post.)
In reality I couldn't just drop everything that was going on with the old day job, and neither could Brian, so this became something of a passion project that he and I worked on during nights and weekends, in fits and starts, over the course of the next three years. We went through multiple iterations of the outline — at one point we were going to devote an entire chapter to Spark-based tools — until we gradually aligned on something close to the final table of contents. Then we proceeded to hammer out the contents: chapter after chapter of detailed explanations and hands-on exercises to demonstrate and reinforce key concepts. As that content took form, we shifted pieces around — sometimes rearranging the order of entire chapters — to ensure that the progression was logical and flowed smoothly from one chapter to the next.
I can't speak for Brian, but for me, the guiding light was the memory of the struggles I experienced firsthand when I started working on GATK, and those shared with me by the many researchers and support staff I've interacted with over the years.
It was in many ways a deeply fulfilling, if bittersweet, process to channel years of crowd-sourced confusion, frustration, even anger, into creating the educational resource I wish had been available to me at the time. I hope the result will be especially useful to those of you who are getting started with GATK. The overall scope and audience of the book are quite a bit broader than that, and I will say I'm very proud of the pipelining chapters, which were quite challenging; but ultimately the GATK chapters are where I put the most heart.
To be clear, this book uses a lot of established material from the GATK documentation, including concept explanations and tutorial exercises that were originally developed for workshops, with some key adaptations. For the GATK-focused chapters, a lot of the added value of the book comes from how the content is woven together (and maybe a bit of behind-the-scenes info). We weren't looking to reinvent the wheel, but rather to buff out the kinks, oil the gears, and reassemble the bike so you could ride it right out of the shop.
I look forward to hearing what you think of the ride.