You may find, after running an analysis with HaplotypeCaller or Mutect2, that your results do not show a variant that you were expecting to find. This is because the default assembly/calling options are optimized for maximizing performance on a large number of human samples, which may not be what your particular dataset requires.
This guide is intended to help explain what tuning options within HaplotypeCaller and Mutect2 might affect your calling of specific variants, as well as ways to help diagnose problems.
1. Look at the data
To start off, try to investigate if anything has been called at your site-of-interest. It could be an alternate representation for complex events, rather than the expected variant.
Next, check to see if there is a repeat region.
If not, investigate if the MAPQ or base quals are low. If they are, then this means that HaplotypeCaller or Mutect2 do not have enough data to call the missing variant.
Finally, look at the GVCF file. If there is a GQ=0 hom-ref call, then the tool was able to identify that something was occurring at the region, but was unable to call it correctly.
2. If the data look good, but the call is still missing
--debugargument and look at the stderr or stdout files. Prior to version 4.1.7, regions with non-unique k-mers of 65 bp would be skipped. In addition, if there are a lot of mismatches and many haplotypes are being generated, it is possible that the tool is not going to pick the right haplotype.
Try re-running the analysis with
Try re-running the analysis with
--recover-all-dangling-branchesenabled. By default, the read-threading assembler does not recover dangling branches that fork after splitting from the reference. This argument will tell the assembly engine to recover all dangling branches.
For HaplotypeCaller on low coverage, or on Mutect2, re-run the analysis with
Try looking at the bamout file, (run with
--bam-output) which will show all of the reads and called haplotypes at the site. If you see nothing overlapping your region, then it might not have been flagged as active, or could have failed to assemble. If it failed to assemble running HaplotypeCaller or Mutect2, then re-running it with
--linked-de-bruijn-graphenabled might fix failures in the assembly engine that lead to no-call sites. However, if there are reads overlapping, then this could indicate that the assembly or genotyping failed — it is possible that the assembly failed to recover variants if they only occur at the ends of the reads, due to how the De Bruijn graph was assembled. In this case, you can try running the assemble with
--recover-all-dangling-branchesenabled to see if this correct the issue. If all else fails (and you still think the issue is assembly-related, try running with
--debug-graph-transformationsenabled, which generates .dot files of all the steps in assembly graph creation. This might help us diagnose what the problem is, if you share it with us on the GATK Support Forum.
3. Understanding ADs and DPs, and the reasons they might not match IGV
"IGV Depth" refers to all of the reads that are aligned to a given site by the aligner.
Meanwhile, the "DP" (dot plot) score refers to the number of reads that were set to the genotyper at a given site.
"AD" is the count of informative reads that support a given haplotype. This does not count reads at the site that were uninformative to any of the alleles. This means that reads that are ambiguous, or don't span the entire repeat at STR sites will likely be excluded from AD, but might still be in the DP. This can often lead to a disagreement with DP.
Knowing this, there are several reasons why VCF ADs and DPs might not match the original bam or bamout in IGV, primarily related to when reads are at a site, but are not sent to the genotyper.
The following are reasons for reads to be excluded before running the HMM:
- Reads that are filtered due to low mapping quality, or have failed validation, will be excluded (thus lowering the DP).
- Reads that lack mates, or have distantly mapped mates will also be excluded (thus lowering the DP).
- Reads that have been excluded due to downsampling.
The following are reasons for reads to be excluded from genotyping after running the HMM:
- Reads that have too many errors compared to their best matching haplotype will get excluded (thus lowering the DP).
- Reads may be removed by the contamination downsampler.
- Reads in GATK are realigned to their best haplotype before genotyping. This can cause reads that appear to overlap a site to move (which either lowers or raises the DP).
Off-Label HaplotypeCaller Cases
It is also possible that your data itself is not optimized for HaplotypeCaller and Mutect2 in the way that they are designed to handle. This does not necessarily mean that the tools won't work on your data, but it is a possible explanation. Consider the following two "off-label" cases:
HaplotypeCaller with Pooled-calling
- Using HaplotypeCaller with high ploidies is going to limit your alternate alleles, in most cases, to basically one. This is because the caller limits the number of PLs reported. Instead, try using Mutect2. This will also take into account the probabilistic fluctuations in the proportional coverage (i.e. not assuming that exactly X% of reads come from each of the 100/X samples).
- HaplotypeCaller and Mutect2 are optimized to expect UMIs (Unique Molecular Identifiers). If your data does not have UMIs, then they will not play nice with the callers.
- Do not use MarkDuplicates or you’ll lose all data. Similarly, turn off positional downsampling for the same reason.