Factors that affect genomic analysis
That's where masked reference genomes come in. These are alternate versions of genomes that "mask out" regions of high similarity across the genome. But how can a masked reference genome really work better than a standard reference?
Here, we demonstrate how a GRCh38 reference genome with masked alt regions can improve alignment and downstream variant calling in human germline analysis.
Using a masked reference genome for improved germline analysis
Masking allows reads that would otherwise align ambiguously to the reference to be mapped to a unique position, which improves alignment and variant calling overall.
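To make the idea of "masking out" concrete, here is a toy sketch in Python. It assumes hard masking, where the bases in a masked region are replaced with N so that an aligner cannot place reads there. The function name, the toy sequence, and the coordinates are all hypothetical, and this is not a description of how the Illumina alt-masked reference was actually built.

```python
def mask_regions(sequence, regions):
    """Return a copy of `sequence` with each half-open [start, end) region set to 'N'.

    Hypothetical helper for illustration only: real references are masked
    with dedicated tooling, working from a curated list of regions.
    """
    masked = list(sequence)
    for start, end in regions:
        if not 0 <= start <= end <= len(sequence):
            raise ValueError(f"region ({start}, {end}) is outside the sequence")
        # Overwrite every base in the region with N.
        masked[start:end] = "N" * (end - start)
    return "".join(masked)


# Toy contig in which positions 4..7 duplicate sequence found elsewhere.
toy_contig = "ACGTACGTACGT"
masked_contig = mask_regions(toy_contig, [(4, 8)])
print(masked_contig)  # ACGTNNNNACGT
```

Once the duplicated stretch is masked, a read matching that sequence has only one remaining place to align, which is the effect described above.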
A study from Illumina found that the accuracy of germline variant calling was significantly improved when using the Illumina alt-masked GRCh38 reference genome as compared to using the non-masked GRCh38 reference genome.
We can see this for ourselves with an experiment that runs BWA-MEM and GATK against both the Illumina alt-masked reference and the unmasked reference. The same reads are aligned to the GRCh38 reference genome and to the Illumina alt-masked GRCh38 reference genome, following the GATK Best Practices guidelines.
This experiment serves as a proof-of-concept for germline variant calling. It uses Genome in a Bottle (GIAB) whole genome sequencing datasets: HG001 (30X), HG002 (30X), and HG005 (34X). BWA-MEM was used to perform the sequence alignment, and the resulting BAM files were processed with HaplotypeCaller to identify SNPs and indels. All results were generated with NVIDIA Clara Parabricks, a suite of accelerated software that provides GPU-accelerated genomic analysis applications available on the cloud.
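The comparison described above can be sketched as two runs of the same commands that differ only in the reference FASTA. The Python snippet below assembles those command lines without executing them. It uses the open-source `bwa`, `samtools`, and `gatk` command-line tools rather than Parabricks. All file names are hypothetical placeholders, and steps such as reference indexing, duplicate marking, and BAM indexing are left out, so treat it as an outline rather than the exact pipeline behind the published results.

```python
import shlex


def build_commands(reference, fastq_1, fastq_2, sample, threads=8):
    """Build shell command strings for alignment and variant calling.

    All paths are hypothetical placeholders. The reference is assumed to be
    indexed already (bwa index, samtools faidx, sequence dictionary).
    """
    bam = f"{sample}.sorted.bam"
    vcf = f"{sample}.vcf.gz"
    # bwa expects literal "\t" separators in the read group string.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"

    align = ["bwa", "mem", "-t", str(threads), "-R", read_group,
             reference, fastq_1, fastq_2]
    sort = ["samtools", "sort", "-@", str(threads), "-o", bam, "-"]
    call = ["gatk", "HaplotypeCaller", "-R", reference, "-I", bam, "-O", vcf]

    # Alignment output is piped straight into the coordinate sort.
    align_and_sort = shlex.join(align) + " | " + shlex.join(sort)
    return align_and_sort, shlex.join(call)


# Same reads, two references: only the FASTA path changes between runs.
references = {
    "unmasked": "GRCh38.fa",           # hypothetical file name
    "alt_masked": "GRCh38.alt_masked.fa",  # hypothetical file name
}
for label, fasta in references.items():
    commands = build_commands(fasta, "HG002_R1.fastq.gz", "HG002_R2.fastq.gz",
                              sample=f"HG002.{label}")
    for command in commands:
        print(command)
```

Holding every other input fixed means that any difference in the resulting calls can be attributed to the reference.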
The results of this experiment are shown in the figure below. The table reports benchmarking on the GIAB reference samples for SNPs and indels called with HaplotypeCaller. With the alt-masked reference, there is a marked improvement in F1 scores and a reduction in false negatives compared with the unmasked reference.
The trend across these three samples shows a consistent reduction in false negatives in SNP and indel calling. For false positives, SNPs show a consistent reduction, but indels are mixed: the HG001 and HG002 samples show slightly more false positive calls, while HG005 shows fewer. Overall, variant calling improved with the alt-masked reference genome.
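For readers less familiar with the metrics discussed above, the following sketch shows how precision, recall, and F1 follow from true positive, false positive, and false negative counts. The counts in the example are made up for illustration and are not taken from the experiment.

```python
def benchmark_metrics(true_pos, false_pos, false_neg):
    """Return (precision, recall, f1) for a set of variant calls."""
    called = true_pos + false_pos   # everything the caller reported
    truth = true_pos + false_neg    # everything in the truth set
    precision = true_pos / called if called else 0.0
    recall = true_pos / truth if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up counts: the second run recovers 5 variants the first one missed.
baseline = benchmark_metrics(true_pos=90, false_pos=10, false_neg=10)
improved = benchmark_metrics(true_pos=95, false_pos=10, false_neg=5)
print(f"baseline F1 = {baseline[2]:.4f}")
print(f"improved F1 = {improved[2]:.4f}")
```

Because F1 is the harmonic mean of precision and recall, converting false negatives into true positives raises it even when the false positive count is unchanged.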
As a result, we can recommend the use of the Illumina alt-masked GRCh38 reference genome for human germline analysis with BWA-MEM and GATK.
In the pursuit of the most accurate alignments and variant calls, using the best tools and references is paramount. However, sometimes less is more: by using masked references that cut out areas of high similarity across the genome, we end up with higher quality results.
Deploying this analysis in NVIDIA Clara Parabricks
NVIDIA has also released a Clara Parabricks workspace in Terra, with resources and documentation for getting started with accelerated computing that can cut hours-long analyses down to mere minutes.