The work presented here includes significant contributions from Pankaj Vats and Harry Clifford from NVIDIA.
Factors that affect genomic analysis
When it comes to sequencing experiments, ensuring alignment and variant calling accuracy is crucial to guaranteeing that any downstream analyses and insights are of the highest quality.
Both of these tasks are non-trivial, however. Although many modern methods are highly accurate, there is still a need for rigorously tested best practice frameworks and workflows. Many aligners and variant callers utilize different methods in their approach, and there are many benchmark studies and resources that have been made widely available by the community to compare their accuracy.
In addition to the different underlying methods of the tools available, another crucial factor that affects genomic sequencing accuracy is the reference genome used. The quality of a reference genome (and subsequent read alignment) can have a large effect on downstream analysis tasks — from the variant calling itself, to the genotyping, functional annotation and interpretation of the genetic variation.
That's where masked reference genomes come in. These are alternate versions of genomes that "mask out" regions of high similarity across the genome. But how can a masked reference genome really work better than a standard reference?
Here, we demonstrate how a GRCh38 reference genome with masked alt regions can improve alignment and downstream variant calling in human germline analysis.
Using a masked reference genome for improved germline analysis
A masked reference genome is a version of the reference genome where regions of high similarity to other regions in the reference (such as alt contigs) have been "masked" to prevent reads from being aligned to those regions. Masked nucleotides are replaced with Ns.
This allows for reads that would otherwise have an ambiguous alignment in the reference to be mapped to a unique position, thereby providing better alignment and variant calling overall.
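To make the idea of masking concrete, here is a minimal Python sketch that replaces chosen intervals of a sequence with Ns. The function name `mask_regions`, the toy contig, and the coordinates are all hypothetical and chosen only for illustration. This is not how the Illumina alt-masked reference was built, just the basic operation it relies on.

```python
def mask_regions(sequence, regions):
    """Return a copy of `sequence` with each half-open [start, end) region replaced by Ns."""
    bases = list(sequence)
    for start, end in regions:
        if not 0 <= start <= end <= len(bases):
            raise ValueError(f"region ({start}, {end}) is outside the sequence")
        # Overwrite the region in place; the sequence length never changes,
        # so coordinates outside the masked region stay valid.
        bases[start:end] = "N" * (end - start)
    return "".join(bases)


# Toy contig (hypothetical): positions 4-7 stand in for a stretch that
# duplicates sequence found elsewhere in the reference.
contig = "ACGTTTGACCGA"
masked = mask_regions(contig, [(4, 8)])
print(masked)  # ACGTNNNNCCGA
```

Because the masked stretch now consists only of Ns, an aligner has nothing to match there, so a read that previously fit two places equally well is left with a single candidate position.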
A study from Illumina found that the accuracy of germline variant calling was significantly improved when using the Illumina alt-masked GRCh38 reference genome as compared to using the non-masked GRCh38 reference genome.
We can see this for ourselves by running BWA-MEM and GATK with the Illumina alt-masked reference and comparing the results against the unmasked reference. The same reads are aligned once to the standard GRCh38 reference genome and once to the Illumina alt-masked GRCh38 reference genome, with both runs following GATK Best Practices guidelines.
This experiment serves as a proof-of-concept for germline variant calling. It uses whole genome sequencing datasets from Genome in a Bottle (GIAB): HG001 (30X), HG002 (30X), and HG005 (34X). BWA-MEM was used to perform the sequence alignment, and the resultant BAM files were processed with HaplotypeCaller to identify SNPs and indels. All results were generated with NVIDIA Clara Parabricks, a suite of accelerated software that provides GPU-accelerated genomic analysis applications available on the cloud.
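As a rough sketch of how the two runs differ, the Python snippet below only builds and prints the command lines for one sample against each reference; it does not execute anything. The file names are hypothetical placeholders. The `pbrun fq2bam` and `pbrun haplotypecaller` subcommands and their flags are written as we understand them from the Parabricks documentation, so treat them as assumptions and check them against your installed version. The exact commands used for this post are not shown here.

```python
import shlex


def build_commands(ref_fasta, fastq_r1, fastq_r2, prefix):
    """Build (without running) alignment and variant calling commands for one reference."""
    bam = f"{prefix}.bam"
    vcf = f"{prefix}.vcf"
    # Alignment step: GPU-accelerated BWA-MEM plus sorting (flags are assumptions).
    align = ["pbrun", "fq2bam", "--ref", ref_fasta,
             "--in-fq", fastq_r1, fastq_r2, "--out-bam", bam]
    # Variant calling step: accelerated HaplotypeCaller (flags are assumptions).
    call = ["pbrun", "haplotypecaller", "--ref", ref_fasta,
            "--in-bam", bam, "--out-variants", vcf]
    return [align, call]


# Hypothetical file names; the only thing that changes between runs is the reference.
references = {"unmasked": "GRCh38.fa", "alt_masked": "GRCh38.alt_masked.fa"}
for label, ref in references.items():
    for cmd in build_commands(ref, "HG002_R1.fastq.gz", "HG002_R2.fastq.gz", f"HG002.{label}"):
        print(shlex.join(cmd))
```

Keeping the reads, tools, and parameters identical across both runs means any difference in the resulting calls can be attributed to the reference.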
The results of this experiment are shown in the figure below. The table shows benchmarking results on the GIAB reference samples for SNPs and indels called with HaplotypeCaller. Overall, the alt-masked reference yields a marked improvement in F1 scores and a reduction in false negatives compared with the unmasked reference.
The trend across these three samples shows a consistent reduction in false negatives in SNP and indel calling. For false positives the trend was a consistent reduction in SNPs, but for indels the HG001 and HG002 samples show slightly higher false positive calls while HG005 showed reduced false positive indel calls. Overall, the variant calling was improved while using the alt-masked reference genome.
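For readers less familiar with these metrics, the sketch below shows how precision, recall, and F1 follow from true positive, false positive, and false negative counts. The counts are made-up toy numbers and are not taken from the experiment; the function name `precision_recall_f1` is ours.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw call counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy counts only (NOT results from this post): same false positives,
# but the second run misses fewer true variants.
run_a = precision_recall_f1(tp=9_000, fp=100, fn=1_000)
run_b = precision_recall_f1(tp=9_500, fp=100, fn=500)
print(f"run A F1: {run_a[2]:.4f}")
print(f"run B F1: {run_b[2]:.4f}")
```

Fewer false negatives raise recall, which is why a drop in missed variants can lift F1 even when false positives barely move, as with the indel calls for HG001 and HG002.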
As a result, we can recommend the use of the Illumina alt-masked GRCh38 reference genome for human germline analysis with BWA-MEM and GATK.
In the pursuit of the most accurate alignments and variant calls, using the best tools and references is paramount. However, sometimes less is more - by using masked references that cut out areas of high similarity across the genome, we end up getting higher quality results.
Deploying this analysis in NVIDIA Clara Parabricks
If you are interested in using NVIDIA Clara Parabricks to produce data like those found in this article, these tools and workflows can be deployed on the cloud, running GATK Best Practices workflows that have been integrated with an accelerated version of DeepVariant, all while running much faster on NVIDIA's GPU architecture.
NVIDIA has also released an NVIDIA Clara Parabricks workspace in Terra, with resources and documentation to help you get started with accelerated computing that can cut hours-long analyses down to mere minutes.
2 comments
Dear Derek Caetano-Anolles,
thank you for this useful comparison. I just have a small question to clarify something I could not find in the above text or in the Clara Parabricks website. Were the variants called on the alt contigs for the comparison above, following the procedure that is described here?
Hi Eva (Evander) -- Good question.
The variant calls were not made in the alt contig region, but were based on the Illumina masked reference available in the following DRAGEN-reference in the GCP public data repository.
Masked reference genomes like this one "mask" regions of high similarity to other regions in the reference. Alt contigs are certainly one such example of regions that would be masked in the masked reference.
Here in this blog post we are comparing the impact of masked vs unmasked references when used in a pretty standard variant calling pipeline. However, the data were not passed through any additional post-processing steps as mentioned in the tutorial document you linked to.
I hope that this answers your question!