Being able to accurately identify mutations in microbial genomes is an essential piece in the quest to understand drug resistance, immune evasion, and other epidemiological characteristics of infectious disease. We have been aware for some time that many researchers have applied GATK tools to microbial variant discovery, based on the track record of success that our tools have had for human genomic data processing.
However, we have traditionally tested and validated our tools mainly (in some cases, only) on human data, and we freely acknowledge that our default human-focused protocols and parameter recommendations may not produce the best results on other organisms. For example, genomic “trouble spots” (e.g. repetitive loci, regions of high genetic diversity, and translocated or entirely absent regions) are often deprioritized in human datasets because they represent only a small percentage of the overall genome. In contrast, pathogen genomes are much smaller, and these trouble spots make up a proportionally larger share of them. In addition, these same trouble spots often underlie the clinically important phenotypes (e.g. severity, transmissibility, and resistance to antimicrobials) that are the immediate subject of investigation, so they cannot be omitted from analysis.
Big improvements for tiny genomes
A little over a year ago, we received a grant from the Chan Zuckerberg Initiative’s Essential Open Source Software for Science program to work on providing the bacterial research community with more robust variant calling methods. This allowed us to dedicate development effort to improving GATK tooling for calling short variants on bacterial genomic datasets.
First, we developed a workflow that repurposes Mutect2, our somatic short variant caller originally developed for cancer genome analysis, to call microbial variants. We optimized the workflow parameters to handle the low allele frequencies, varying read depths, and sequencing and mapping errors that are typical of microbial data, resulting in much improved sensitivity and precision.
For improved coverage across circular bacterial genomes specifically, we developed a tool that calls variants on reads spanning the artificial breakpoint at which the circular genome sequence is linearized. More generally, we are actively working on other tooling improvements that will extend GATK’s usability to other microbial genomes, such as those of viruses, fungi, and protozoans.
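To make the breakpoint problem concrete, here is a minimal Python sketch. It is not the GATK tool, and the function names (`shift_reference`, `shifted_to_original`, `spans_breakpoint`) are hypothetical names chosen for illustration. It assumes one common way of recovering breakpoint-spanning reads: rotate the linearized reference so the breakpoint lands mid-sequence, then map positions back to the original coordinates. This post does not say whether the GATK tool works this way.

```python
def shift_reference(sequence: str, offset: int) -> str:
    """Rotate a linearized circular sequence so it starts at 0-based `offset`."""
    offset %= len(sequence)
    return sequence[offset:] + sequence[:offset]


def shifted_to_original(pos: int, offset: int, length: int) -> int:
    """Map a 0-based position on the rotated reference back to the original."""
    return (pos + offset) % length


def spans_breakpoint(start: int, read_length: int, genome_length: int) -> bool:
    """True if a read starting at 0-based `start` runs past the linear end."""
    return start + read_length > genome_length


# Toy 10-base "circular genome", linearized at an arbitrary point.
genome = "ACGTTGCAAT"
# A read covering original positions 8, 9, 0, 1 (it wraps around the circle).
read = "ATAC"

# On the linear reference, the read has no contiguous match.
match_on_linear = genome.find(read)

# Rotate by half the genome length so the old breakpoint sits mid-sequence.
offset = len(genome) // 2
shifted = shift_reference(genome, offset)
match_on_shifted = shifted.find(read)

# Convert the hit back to original coordinates.
original_start = shifted_to_original(match_on_shifted, offset, len(genome))
```

In this toy case the read only matches on the rotated copy, and the converted start position tells us it begins two bases before the end of the linear reference. A real pipeline would use an aligner and handle base qualities, mismatches, and variant records rather than exact string matching.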
Best Practices workflow and Terra workspace
One of our goals for this project is to make our methods more readily accessible to the microbial research community. To that end, we implemented our microbial variant calling workflow in WDL (Workflow Description Language), which can be run on any standard computing platform, so that the analysis is scalable and reproducible with minimal effort. We've also made sure the workflow is cost-efficient and runs quickly.
As with all our other Best Practices workflows, the microbial variant calling workflow code is fully open-source and available on GitHub. We also make it available in a pre-configured Terra workspace, with example data and detailed technical documentation, so that anyone can try it out without having to install anything. If you're not familiar with Terra, the Broad's cloud-based analysis platform, check out this previous blog post for a GATK-focused summary and pointers for getting started.
Watch the H3ABioNet webinar
I'm honored to have been invited to present our work on "GATK for Microbes" as part of the H3ABioNet webinar series, on Wednesday 28 April, 3PM CAT (9AM EDT).
In my talk, I'll describe how the GATK has been used for microbial research so far and how we've improved it for the benefit of future research in this field. I'll cover the entire progression of GATK use for microbial variant discovery: how it was applied before the latest improvements, what functionality we've added, how the results compare with other well-known variant callers, and how we've made these tools more accessible using the cloud.