The latest GATK release is out, with changes corresponding to the period of February 18, 2021 - July 30, 2021. As always, we highly recommend updating to the newest version, as it will help to correct bugs that may have an effect on your data.
- New Tools:
LocalAssembler is a new tool for SV Calling that performs local assembly of small regions to discover structural variants.
- DRAGEN-GATK: A sporadic out-of-memory error in
CalibrateDragstrModel has been fixed, as well as the "IllegalArgumentException: Start cannot exceed end" error.
- Mutect2: We've started laying the groundwork for the upcoming
Mutect3 release, which will be more machine learning focused. This update includes a training data mode (accessible through
--training-data-mode) that collects and outputs data on variant- and artifact-supporting read sets for fitting to deep learning filtering models.
- CNV Calling: You've been asking for it, so we delivered —
ModelSegments now supports multi-sample segmentation!
- SV Calling: We've added a new tool,
LocalAssembler, which is able to perform local assembly of small regions to discover structural variants.
- Funcotator: In addition to major speed improvements in this update,
Funcotator will now alert you if you try to annotate a VCF that has already been annotated. If you would prefer to reannotate your file, use the
--reannotate-vcf argument to override this safety check.
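As a rough illustration of the reannotation override, here is a minimal Python sketch that assembles a Funcotator command line. The file paths and the `funcotator_command` helper are hypothetical, the reference version is only an example, and `--reannotate-vcf` is assumed here to be a plain on/off switch; check the Funcotator tool documentation for your GATK version before relying on it.

```python
import shlex


def funcotator_command(vcf, reference, data_sources, output, reannotate=False):
    """Assemble a Funcotator command line as a list of arguments.

    Hypothetical helper for illustration; it only builds the command
    and does not run GATK.
    """
    cmd = [
        "gatk", "Funcotator",
        "--variant", vcf,
        "--reference", reference,
        "--ref-version", "hg38",  # example value, adjust to your reference
        "--data-sources-path", data_sources,
        "--output", output,
        "--output-file-format", "VCF",
    ]
    if reannotate:
        # Assumption: the override is a boolean switch with no value.
        cmd.append("--reannotate-vcf")
    return cmd


# Hypothetical paths, for illustration only.
cmd = funcotator_command(
    vcf="already_annotated.vcf.gz",
    reference="reference.fasta",
    data_sources="funcotator_dataSources",
    output="reannotated.vcf",
    reannotate=True,
)
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so it can be passed directly to `subprocess.run` without shell quoting problems.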
In addition, the following changes have been made:
- GenomicsDB update: GATK has moved to GenomicsDB version 1.4.1, which has native cloud support (previously, it relied on the slower GCS Connector library). Working with cloud storage should now be nearly indistinguishable from using a network file system.
- GKL update: The Intel Genomics Kernel Library (GKL) has been updated to version 0.8.8, with many important fixes and improvements. This should fix most of the errors encountered when running on GKL infrastructure. We've also moved to the latest versions of the ISAL and OTC Zlib libraries, among numerous other quality of life improvements.
- Improved pipelining: We've added a GATK-wide option (
--max-variants-per-shard) to split output VCFs into completely even shards. It works with any GATK tool that outputs VCFs, which is very useful for anyone writing pipelines: you can output multiple VCFs instead of a single monolithic one.
- BCI support: GATK now supports block compressed interval (
.bci) files, which are useful when working with extremely large interval lists. Most users won't need them, but they are particularly useful in the SV Pipeline, which handles extremely large interval lists that might affect memory usage. You can use them within any GATK tool that accepts
- New DD annotation: We've added an AlleleDepthPseudoCounts (DD) genotype annotation. Similar to AD, the DD annotation describes the depth of each allele's supporting evidence or reads. However, DD uses a variational Bayes approach that is more robust in some instances.
To use this new non-standard annotation in HaplotypeCaller, use the
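To make the idea behind `--max-variants-per-shard` concrete, the following Python sketch splits a stream of records into consecutive shards with a fixed maximum size. It is only a conceptual illustration: `shard_records` and the record strings are hypothetical, and GATK's actual sharding logic, remainder handling, and output file naming are not shown here.

```python
from itertools import islice


def shard_records(records, max_per_shard):
    """Yield consecutive lists holding at most max_per_shard records each."""
    if max_per_shard <= 0:
        raise ValueError("max_per_shard must be positive")
    iterator = iter(records)
    while True:
        # Pull the next batch lazily so the full input never has to
        # be held in memory at once.
        shard = list(islice(iterator, max_per_shard))
        if not shard:
            return
        yield shard


# Ten hypothetical variant records, split with a cap of four per shard.
variants = [f"chr1:{pos}" for pos in range(1, 11)]
shards = list(shard_records(variants, 4))
print([len(shard) for shard in shards])  # [4, 4, 2]
```

In this sketch the last shard is simply whatever remains, so it can be smaller than the others. Downstream pipeline steps can then process each shard independently.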
This update also fixes bugs and adds features in a number of areas, including:
- HaplotypeCaller: Rare bugs and edge cases for DRAGEN were fixed. We also fixed the "Padded span must contain active span" error, which was caused by invalid feature file intervals that weren't being checked against the sequence dictionary.
- RankSumTest Annotation fix: Fixed key ordering bugs in the implementations of
CompressedDataList.iterator(). These bugs would cause the RankSumTest annotation value to be wrong in some cases. If you find that your annotation values have changed between releases, then this is the reason, but they should be more accurate now.
- ToolDocs: There have also been a lot of fixes to our tool documentation in this version, for clarity and to repair broken and outdated hyperlinks.
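To see why a key ordering bug can change a rank-based statistic such as the RankSumTest annotation, consider a compressed list that stores each distinct value with its count. The Python sketch below is a simplified illustration of this class of bug, not GATK's actual Java implementation; `expand` and `median_assuming_sorted` are hypothetical names.

```python
from statistics import median


def expand(counts, ordered):
    """Expand a value -> count mapping into a flat list of values.

    With ordered=False the keys are visited in insertion order,
    which mimics an iterator that does not sort its keys.
    """
    keys = sorted(counts) if ordered else list(counts)
    return [value for value in keys for _ in range(counts[value])]


def median_assuming_sorted(values):
    """Return the middle element, trusting that values arrive sorted."""
    return values[len(values) // 2]


# Compressed data: the value 30 was inserted first, so insertion
# order differs from sorted order.
counts = {30: 1, 10: 2, 20: 2}

unsorted_values = expand(counts, ordered=False)  # [30, 10, 10, 20, 20]
sorted_values = expand(counts, ordered=True)     # [10, 10, 20, 20, 30]

print(median_assuming_sorted(unsorted_values))  # 10 (wrong)
print(median_assuming_sorted(sorted_values))    # 20 (correct)
print(median(unsorted_values))                  # 20 (reference value)
```

Any code that consumes the iterator while assuming sorted output silently produces a different result, which is consistent with annotation values shifting once the ordering is corrected.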
These changes, and more, are explained in the full GATK release notes.