Our teams have worked tirelessly these last two years to make GATK 4.2 one of the most powerful and useful toolkits for your research arsenal. We’ve been growing and adapting to meet the needs of the research community, thanks in large part to valuable input from researchers like you.
Since version 4.1, we have significantly improved GenomicsDB, Mutect2, and our mitochondrial calling pipelines. Funcotator can now support non-human species, and includes our popular FuncotateSegments companion tool that performs functional annotation on segment files. We’ve added WDL generation for GATK/Picard tools so that you can have up-to-date WDLs for all of your tools.
Now, version 4.2 is out (as you may have seen from our recent post), and it’s bursting at the seams with new content and changes that many of you have been asking for, as well as some more you may not have even known you wanted. However, when you follow GATK updates from version to version, it can be difficult to see the forest for the trees.
For that reason, here are some of the highlights that are new to GATK in version 4.2.
- Implementing features from DRAGEN into GATK, with new tools and pipelines.
- Making phasing fixes to HaplotypeCaller.
- Adding new tools for CNV and SV calling.
- Delivering a significant performance boost to GenomicsDB.
- And much more!
We haven't seen the sun in two years, but at least the codebase is mostly bug-free.
Setting the stage for DRAGEN-GATK
This version introduces improvements that we've ported from DRAGEN into GATK. We’ve been working closely with Illumina to produce a unified DRAGEN-GATK pipeline for short-read variant calling, and these tools will make this kind of work more accurate than before. We’ve already written extensively about these (and other) changes, so if you haven’t taken a look, please read our post describing the algorithm improvements.
There’s still a lot more work left to do before we make an official release of the full DRAGEN-GATK pipeline, but in the meantime you can already access and use some of the improvements we’ve implemented, as described below.
Firstly, we are introducing a new tool called DragSTR, which improves indel calling in repetitive regions.
DragSTR is a port of DRAGEN’s STR (Short Tandem Repeats) model. Using a few extra arguments, it adjusts Hidden Markov Model indel priors based on empirical reference contexts. The end result is that you will experience better indel calling, and be able to pass on the results to HaplotypeCaller without having to modify the rest of your pipeline.
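Here is a minimal sketch of what a DragSTR-assisted run might look like on the command line. It assumes the model is exposed through the ComposeSTRTableFile and CalibrateDragstrModel tools and through the --dragstr-params-path argument of HaplotypeCaller, as we recall from the GATK 4.2 tool documentation. All file names are hypothetical placeholders, and you should check each tool's --help output before relying on the exact arguments. The Python helper only assembles the commands; it does not run them.

```python
import shlex


def dragstr_commands(reference, bam, str_table, dragstr_params, output_vcf):
    """Assemble (but do not run) three GATK commands for a DragSTR-assisted run.

    Tool and argument names are assumptions based on the GATK 4.2 tool docs;
    all paths are supplied by the caller.
    """
    return [
        # 1. Build the table of short tandem repeat sites for the reference.
        ["gatk", "ComposeSTRTableFile", "-R", reference, "-O", str_table],
        # 2. Estimate the sample-specific DragSTR parameters from the reads.
        ["gatk", "CalibrateDragstrModel", "-R", reference, "-I", bam,
         "-str", str_table, "-O", dragstr_params],
        # 3. Pass the estimated parameters on to HaplotypeCaller.
        ["gatk", "HaplotypeCaller", "-R", reference, "-I", bam,
         "-O", output_vcf, "--dragstr-params-path", dragstr_params],
    ]


# Hypothetical file names, for illustration only.
for command in dragstr_commands("reference.fasta", "sample.bam",
                                "reference.str.zip", "sample.dragstr",
                                "sample.vcf.gz"):
    print(shlex.join(command))
```

Printing the commands first makes it easy to review them before handing each list to something like subprocess.run(command, check=True).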
We are also introducing two new genotyper error models from DRAGEN:
- The BQD (Base Quality Dropout) model penalizes variants with low average base quality scores and high average sequencing cycle counts. It considers both genotyped reads and reads that were otherwise excluded from the genotyper, in order to model read-context-dependent sequencing errors.
- The FRD (Foreign Read Detection) model penalizes reads that are likely to have originated from somewhere else on the genome or from contamination, using an adjusted mapping quality score, as well as read strandedness information.
Both of these can be used within HaplotypeCaller by activating “DRAGEN mode” using the
Phasing in HaplotypeCaller
We have fixed issues with phasing in HaplotypeCaller that had the potential to affect data. Physical phasing information (PGT/PID/PS attributes) is now available for genotypes with spanning deletion alleles. HaplotypeCaller can also now recover indels located right at the edge of active regions. The tool would previously have missed these indels, but GATK 4.2 catches them before they fall through the cracks.
We’ve also improved HaplotypeCaller’s handling of indels and spanning deletions, making it easier to handle edge cases that arise when mates have mismatching numbers of bases relative to each other at the start or end of the reads.
All of these improvements were based on great bug reports from Nils Homer and his team from Fulcrum Genomics, which stemmed from actual clinical cases where the eventual fix made a difference. There's nothing that motivates us more than to hear that our work is making a difference, and ultimately what we want is to make tools that are useful to researchers and clinicians, so please continue to let us know what you need and tell us when you think the tools are misbehaving.
CNVs and SVs get some new tools
We've been putting a lot of effort into CNVs and SVs, with a full structural variation discovery pipeline coming soon for Illumina short-read whole-genome sequencing data.
As part of this, we've released JointGermlineCNVSegmentation and its associated workflow for gCNV exome joint calling, which allows you to combine gCNV segments and calls across samples.
The tools in this pipeline are typically used as a component of GATK’s structural variation pipeline for WGS data, and we will be introducing it into Terra soon. This version of GATK also incorporates PrintSVEvidence, which is able to translate any of the SV evidence file types (RD, PE, SR, BAF) into a format that is easily accessible to the GATK-SV pipeline. This tool was previously only accessible as a bash script, so having it all in one place should help to streamline your pipelines from start to finish.
Huge performance bump for GenomicsDB
This update positively supercharges GenomicsDBImport. Ordinarily, GenomicsDB will create a separate folder or partition for each contig in a set, which works well enough if you have a moderate number of contigs. However, as those of you dealing with datasets with large numbers of contigs already know, this system eventually leads to abysmally slow import speeds.
As of this release, the new --merge-contigs-into-num-partitions argument lets you merge multiple contigs into fewer GenomicsDB partitions, resulting in several orders of magnitude improvement in performance.
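As a rough illustration, the Python sketch below assembles a GenomicsDBImport command line that uses the --merge-contigs-into-num-partitions argument. The workspace path, sample map, and contig names are hypothetical placeholders. We are also assuming, from our reading of the tool documentation, that the merged intervals should be whole contigs passed with -L. The helper only builds the command; it does not run GATK.

```python
import shlex


def genomicsdb_import_command(workspace, sample_map, contigs, num_partitions):
    """Assemble (but do not run) a GenomicsDBImport command that merges contigs.

    `contigs` are whole-contig names passed as intervals (an assumption about
    how the merge argument expects its input); paths come from the caller.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    command = [
        "gatk", "GenomicsDBImport",
        "--genomicsdb-workspace-path", workspace,
        "--sample-name-map", sample_map,
        "--merge-contigs-into-num-partitions", str(num_partitions),
    ]
    # One -L per contig, so that each interval spans an entire contig.
    for contig in contigs:
        command += ["-L", contig]
    return command


# Hypothetical inputs: 1,000 scaffolds merged into 10 partitions.
scaffolds = [f"scaffold_{i}" for i in range(1, 1001)]
command = genomicsdb_import_command("my_workspace", "cohort.sample_map",
                                    scaffolds, 10)
print(shlex.join(command[:12]), "...")
```

With a very large number of contigs, it may be more practical to write the contig names to an intervals file and pass that file with a single -L instead.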
GATK’s open-source license shifted from BSD-3 to Apache 2.0
Since the release of GATK 4, we have affirmed our strong commitment to adopting an open source license that maximizes the benefits to the research community. We initially chose BSD-3, which is a very liberal license with a lot of freedom in using the software. We are now moving to Apache 2.0, which is almost entirely equivalent, except it includes a patent license granting clause that makes open source software safer to use.
There's still a lot more exciting news on the horizon for GATK in the coming months, but until then, we are thrilled to bring more and ever-improving resources to help you do the research you want to do.
But don’t take my word for it: download GATK 4.2 at the following link and see for yourself which features can benefit you. Or, better yet, try out these new GATK features from within Terra. There are a number of curated sample workspaces covering a wide range of use cases, preloaded and preconfigured with data for you to explore.
There are many more updates expected in the coming weeks and months, so stay tuned as we work to improve GATK even further.