Mutect2 Filtering Discrepancy with Different Versions of gnomAD Germline Data
Hello,
I've been experimenting with Mutect2 for variant calling and filtering using different versions of gnomAD germline data. Recently, I tried comparing the performance of Mutect2 with two different versions: the one from the GATK resource bundle and the latest gnomAD version 4.
Here's the scenario:
When using the gnomAD data obtained from the GATK resource bundle, I noticed that the resulting filtered VCF file has significantly fewer entries (2886) compared to when using gnomAD version 4 (6560). This is despite gnomAD version 4 having more entries, which logically would result in fewer entries in the final filtered VCF file.
For instance, following example is from the VCF produced with gnomAD obtained from the GATK resource bundle:
chr1 1719368 . T C . germline AS_FilterStatus=SITE;AS_SB_TABLE=1,0|94,47;DP=149;ECNT=2;GERMQ=1;MBQ=32,33;MFRL=145,159;MMQ=60,60;MPOS=17;POPAF=0.277;TLOD=509.82 GT:AD:AF:DP:F1R2:F2R1:FAD:SB 0/1:1,141:0.986:142:0,59:1,68:1,135:1,0,94,47
In this example, the variant is assigned as germline, because there is a corresponding entry in the gnomAD file obtained from the GATK resource bundle:
chr1 1719368 rs1137005 T C 14667600 InbreedingCoeff AC=21500;AF=0.529
However, when using gnomAD version 4, a similar variant was not assigned as germline, even though there's a matching entry, like this:
chr1 1719368 . T C . PASS AS_FilterStatus=SITE;AS_SB_TABLE=1,0|94,47;DP=149;ECNT=2;GERMQ=93;MBQ=32,33;MFRL=145,159;MMQ=60,60;MPOS=17;POPAF=7.30;TLOD=509.82 GT:AD:AF:DP:F1R2:F2R1:FAD:SB 0/1:1,141:0.986:142:0,59:1,68:1,135:1,0,94,47
chr1 1719368 rs1137005 T C . PASS AC=121466;AF=0.805317;AN=150830;nhomalt=51467
Upon deeper inspection, I found that there might be additional entries in the gnomAD 4 file that are influencing the filtering process for the same genomic location. For example:
chr1 1719368 . T A . AC0 AC=0;AF=0.0;AN=150760;nhomalt=0
When filtering out such entries tagged with AC0
from the gnomAD 4 VCF file, the resulting entry was reassigned as germline, affecting the overall number of entries in the filtered VCF. Number of entries in filtered VCF decreased from 6560 to 5277.
chr1 1719368 . T C . germline AS_FilterStatus=SITE;AS_SB_TABLE=1,0|94,47;DP=149;ECNT=2;GERMQ=1;MBQ=32,33;MFRL=145,159;MMQ=60,60;MPOS=17;POPAF=0.094;TLOD=509.82 GT:AD:AF:DP:F1R2:F2R1:FAD:SB 0/1:1,141:0.986:142:0,59:1,68:1,135:1,0,94,47
Despite this, the number of entries in the VCF produced with gnomAD version 4 remains significantly higher than the one obtained from the GATK resource bundle.
Has anyone encountered a similar issue or have insights into why Mutect2 behaves differently with different versions of gnomAD germline data? Any suggestions on how to reconcile these discrepancies would be greatly appreciated.
I am using GATK version 4.5.0.0 and running mutect2 in tumor-only mode. I am not using variants for contamination or panel of normals data either.
Thanks in advance.
-
Hi Mert Çelik
Can you share the details of your work such as number of variants before filtering within raw Mutect2 call vcfs?
We also wish to hear how you gathered your germline data from gnomAD 4. Looking at those variant entries I noticed that one entry has a malformed POPAF value which is 7.30. POPAF is supposed to be between 0 and 1 therefore something seems off here. We need to understand how this value is formed in a variant context albeit due to a bug or a malformed resource file.
Regards.
-
Thanks Gökalp Çelik
The number of variants before filtering is 33891. After filtering, the number decreased to 6560. So it seems there is a relatively successful filtering process went on already, but it is not sufficient or as expected. Upon inspecting the mutect2 manual, I found out that POPAF value is negative logarithm of population allele frequency of that variant and pipeline calculates it using the AF data from the gnomAD file.
I obtained gnomAD 4 VCFs from their website, trimmed them so that it only contains AF, AN, AC and nhomalt information. After that, I concatenated all VCFs using bcftools concat. Resulting VCF was sorted and indexed.
-
Sorry for my misinterpretting POPAF value. I mixed up with another tag. I will discuss this issue to our team and get their opinion as well. If they think that this could be a bug then I will escalate the issue.
Regards.
-
Thank you, waiting for the response from your side then. If you have any further questions about how I devised the work, I would be glad to answer.
Best.
-
Hi again.
After our discussion with the team, it looks like changes in the gnomAD 4.0 is causing issues during germline AF filtering step. Mutect2 expects a single line per site/locus therefore using the provided germline resource generates the correct behavior. The one you generated has multiple entries per site which causes the problematic behavior. A fix is on its way for the next release of GATK for this issue however we cannot guarantee any additional changes to occur to gnomAD data for future releases. We suggest users to stick with the resource bundles that we provide.
Regards.
-
Hello,
Thank you for the effort Gökalp Çelik, it is appreciated. I can also give you a humble suggestion: as you might have heard, gnomAD 4.1 has been released recently and it has a quite different structure in terms of info tags compared to previous versions. GATK team might also want to take these differences into consideration while devising the new update.
Best. -
Hi
Thank you for the heads up. A more generic fix is in the below PR
https://github.com/broadinstitute/gatk/pull/8837
and will be merged to the next release of GATK.
Regards.
-
Dear Mert Çelik,
Out of curiosity, did you use the code in mutect2.wdl to do the filtering of AF information form gnomAD 4.1?
grep -v "^#" ${input_vcf} | sed -e 's#\(.*\)\t\(.*\)\t\(.*\)\t\(.*\)\t\(.*\)\t\(.*\)\t\(.*\)\t.*;AF=\([0-9]*\.[e0-9+-]*\).*#\1\t\2\t.\t\4\t\5\t.\t\7\tAF=\8#g' > simplified_body &
I think I am doing similar things as what you have done. I modified and used the mutect_resources.wdl to create the AFonly vcf, and CommonBiallelicSNP. The result looks bit problematic and I am trying to find out why..
Best
-
Hello DS,
In Gnomad 4.1 there is no info tag named "AF". They merged entries of genomes and exomes, then renamed some tags, including the AF. You can try to change the new tag name "AF_joint" to "AF" using sed so that Mutect2 can recognize it. This could be an option to solve your problem.
Best,
Please sign in to leave a comment.
9 comments