FilterMutectCalls, --distance-on-haplotype and variants filtered for haplotype
a) GATK version used: 4.1.4.0
(following best practices closely)
I'm analyzing a set of unpaired WES samples for somatic calls. One sample has two mutations very close together in KRAS (G12D and V9A), which I know from Sanger sequencing.
FilterMutectCalls marks both as filtered, with 'haplotype' as the only reason. This confuses me: from the documentation I understood that at least one of the variants on the same haplotype would need to be filtered for another reason.
Overall, this VCF has only 102 variants filtered for haplotype alone. Most of them share a PID with a variant filtered for other reasons, but for 6 of them this is not the case.
What am I missing?
chr12 25245350 . C G . haplotype CONTQ=93;DP=84;ECNT=2;GERMQ=18;MBQ=20,20;MFRL=142,160;MMQ=60,60;MPOS=39;POPAF=7.30;SEQQ=93;STRANDQ=93;TLOD=155.53 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:42,41:0.494:83:22,16:20,25:0|1:25245350_C_G:25245350:18,24,17,24
chr12 25245359 . A G . haplotype CONTQ=93;DP=83;ECNT=2;GERMQ=18;MBQ=20,20;MFRL=142,160;MMQ=60,60;MPOS=37;POPAF=7.30;SEQQ=93;STRANDQ=93;TLOD=155.53 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:42,41:0.494:83:22,15:19,24:0|1:25245350_C_G:25245350:18,24,17,24
chr21 44601455 . C T . haplotype CONTQ=93;DP=286;ECNT=2;GERMQ=12;MBQ=20,20;MFRL=160,164;MMQ=60,60;MPOS=29;POPAF=7.30;SEQQ=93;STRANDQ=93;TLOD=420.40 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:156,124:0.443:280:78,71:75,53:0|1:44601455_C_T:44601455:74,82,51,73
chr21 44601475 . C G . haplotype CONTQ=93;DP=252;ECNT=2;GERMQ=11;MBQ=20,20;MFRL=160,164;MMQ=60,60;MPOS=34;POPAF=7.30;SEQQ=93;STRANDQ=93;TLOD=393.33 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:142,110:0.437:252:70,61:68,44:0|1:44601455_C_T:44601455:72,70,50,60
chr22 22514169 . T G . haplotype CONTQ=93;DP=52;ECNT=2;GERMQ=16;MBQ=20,20;MFRL=144,177;MMQ=60,60;MPOS=51;POPAF=7.30;SEQQ=93;STRANDQ=93;TLOD=76.45 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:31,21:0.407:52:11,10:19,9:0|1:22514169_T_G:22514169:16,15,10,11
chr22 22514173 . C G . haplotype CONTQ=93;DP=53;ECNT=2;GERMQ=16;MBQ=20,20;MFRL=144,177;MMQ=60,60;MPOS=47;POPAF=7.30;SEQQ=93;STRANDQ=93;TLOD=76.45 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:31,21:0.407:52:11,12:20,9:0|1:22514169_T_G:22514169:16,15,10,11
chr22 22514449 . A G . haplotype CONTQ=93;DP=93;ECNT=2;GERMQ=13;MBQ=20,20;MFRL=175,161;MMQ=60,60;MPOS=41;POPAF=7.30;SEQQ=93;STRANDQ=93;TLOD=208.47 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:37,53:0.587:90:18,35:19,18:0|1:22514449_A_G:22514449:21,16,24,29
chr22 22514452 . A G . haplotype CONTQ=93;DP=93;ECNT=2;GERMQ=13;MBQ=20,20;MFRL=173,161;MMQ=60,60;MPOS=43;POPAF=7.30;SEQQ=93;STRANDQ=93;TLOD=209.86 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:38,53:0.581:91:19,36:18,17:0|1:22514449_A_G:22514449:22,16,25,28
chr22 22514885 . G C . haplotype CONTQ=93;DP=238;ECNT=2;GERMQ=16;MBQ=20,20;MFRL=153,164;MMQ=60,60;MPOS=40;POPAF=7.30;SEQQ=93;STRANDQ=93;TLOD=423.89 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:127,111:0.467:238:74,51:50,58:0|1:22514885_G_C:22514885:69,58,55,56
chr22 22514894 . C T . haplotype CONTQ=93;DP=239;ECNT=2;GERMQ=17;MBQ=20,20;MFRL=152,164;MMQ=60,60;MPOS=39;POPAF=7.30;SEQQ=93;STRANDQ=93;TLOD=427.89 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:125,112:0.473:237:70,53:53,58:0|1:22514885_G_C:22514885:70,55,56,56
chrX 1193297 . T C . haplotype CONTQ=93;DP=50;ECNT=2;GERMQ=16;MBQ=20,20;MFRL=136,155;MMQ=60,60;MPOS=45;POPAF=7.30;SEQQ=93;STRANDQ=93;TLOD=45.85 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:27,22:0.453:49:9,4:17,18:0|1:1193297_T_C:1193297:13,14,11,11
chrX 1193322 . C T . haplotype CONTQ=93;DP=49;ECNT=2;GERMQ=93;MBQ=0,20;MFRL=0,141;MMQ=60,60;MPOS=52;POPAF=7.30;SEQQ=93;STRANDQ=93;TLOD=166.80 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:0,49:0.980:49:0,11:0,38:0|1:1193297_T_C:1193297:0,0,23,26
-
Hi ElenaGrassi,
I am going to move your post into our Community Discussions -> Documentation Questions topic, as the Somatic topic is for reporting bugs and issues with GATK.
You can read more about our forum guidelines and the topics here: Forum Guidelines.
Best,
Genevieve
-
Hi ElenaGrassi,
Yes, you are correct: for a variant to be filtered by the 'haplotype' filter, one of the variants in phase with it needs to be filtered for another reason. Could you zero in on one or two of the PID examples and show all the variants with that PID?
Something could be going on behind the scenes that is not visible yet. There could also be some sort of allele-specific filter.
Let me know what you find,
Genevieve
-
Thanks, sure, one example:
$ zgrep 25245350_C_G sample123.filtered.vcf.gz
chr12 25245350 . C G . haplotype CONTQ=93;DP=84;ECNT=2;GERMQ=18;MBQ=20,20;MFRL=142,160;MMQ=60,60;MPOS=39;POPAF=7.30;SEQQ=93;STRANDQ=93;TLOD=155.53 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:42,41:0.494:83:22,16:20,25:0|1:25245350_C_G:25245350:18,24,17,24
chr12 25245359 . A G . haplotype CONTQ=93;DP=83;ECNT=2;GERMQ=18;MBQ=20,20;MFRL=142,160;MMQ=60,60;MPOS=37;POPAF=7.30;SEQQ=93;STRANDQ=93;TLOD=155.53 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:42,41:0.494:83:22,15:19,24:0|1:25245350_C_G:25245350:18,24,17,24
$ zgrep 25245350_C_G sample123.vcf.gz
chr12 25245350 . C G . . DP=84;ECNT=2;MBQ=20,20;MFRL=142,160;MMQ=60,60;MPOS=39;POPAF=7.30;TLOD=155.53 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:42,41:0.494:83:22,16:20,25:0|1:25245350_C_G:25245350:18,24,17,24
chr12 25245359 . A G . . DP=83;ECNT=2;MBQ=20,20;MFRL=142,160;MMQ=60,60;MPOS=37;POPAF=7.30;TLOD=155.53 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:42,41:0.494:83:22,15:19,24:0|1:25245350_C_G:25245350:18,24,17,24
All the other examples, if my bash-fu is not broken, should be similar.
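To check every phase group at once rather than one PID at a time, here is a minimal Python sketch. It assumes a tab-separated, single-sample VCF with a PID key in FORMAT, and it treats a group as unexplained when at least one record is filtered by 'haplotype' alone and no record in the group carries any other filter. The function names are hypothetical and not part of GATK.

```python
import gzip
from collections import defaultdict


def open_vcf(path):
    """Open a plain or gzip/bgzip-compressed VCF in text mode."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path)


def unexplained_haplotype_groups(lines):
    """Group records by (contig, PID) and return the groups in which at least
    one record is filtered by 'haplotype' alone while no record in the group
    carries any other filter.

    Assumption: tab-separated, single-sample VCF (only the first sample
    column is read).
    """
    groups = defaultdict(list)
    for line in lines:
        if line.startswith("#"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no FORMAT/sample columns, so no phasing information
        sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
        pid = sample.get("PID")
        if pid is None or pid == ".":
            continue  # unphased record
        filters = set(fields[6].split(";"))
        groups[(fields[0], pid)].append(
            (int(fields[1]), fields[3], fields[4], filters)
        )

    unexplained = {}
    for key, records in groups.items():
        haplotype_only = any(flt == {"haplotype"} for *_, flt in records)
        other_filter = any(flt - {"haplotype", "PASS", "."} for *_, flt in records)
        if haplotype_only and not other_filter:
            unexplained[key] = [(pos, ref, alt) for pos, ref, alt, _ in records]
    return unexplained
```

Calling `unexplained_haplotype_groups(open_vcf("sample123.filtered.vcf.gz"))` would return the groups worth inspecting. Records are keyed by contig together with PID because the PID string itself (for example 25245350_C_G) does not include the chromosome.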
My FilterMutectCalls parameters (pretty vanilla):
gatk FilterMutectCalls -V mutect/sample123.vcf.gz -O mutect/sample123.filtered.vcf.gz -R /mnt/cold1/snaketree/task/annotations/dataset/gnomad/GRCh38.d1.vd1.fa --stats mutect/sample123.vcf.gz.stats --contamination-table mutect/sample123.contamination.table --tumor-segmentation=mutect/sample123.tum.seg --filtering-stats mutect/sample123_filtering_stats.tsv 2> mutect/sample123_filtering_stats.tsv.log
Filtering stats:
#<METADATA>Ln prior of deletion of length 10=-20.72326583694641
#<METADATA>Ln prior of deletion of length 9=-17.957778925047954
#<METADATA>Ln prior of deletion of length 8=-20.72326583694641
#<METADATA>Ln prior of deletion of length 7=-20.72326583694641
#<METADATA>Ln prior of deletion of length 6=-16.166019455819896
#<METADATA>Ln prior of deletion of length 5=-16.859166636379843
#<METADATA>Ln prior of deletion of length 4=-16.34834101261385
#<METADATA>Ln prior of deletion of length 3=-15.318721595432693
#<METADATA>Ln prior of deletion of length 2=-15.392829567586416
#<METADATA>Ln prior of deletion of length 1=-14.294217278918307
#<METADATA>Ln prior of SNV=-10.471165611907997
#<METADATA>Ln prior of insertion of length 1=-13.435990347998912
#<METADATA>Ln prior of insertion of length 2=-15.18519020280817
#<METADATA>Ln prior of insertion of length 3=-15.655193832053907
#<METADATA>Ln prior of insertion of length 4=-15.318721595432693
#<METADATA>Ln prior of insertion of length 5=-20.72326583694641
#<METADATA>Ln prior of insertion of length 6=-16.34834101261385
#<METADATA>Ln prior of insertion of length 7=-17.264631744488007
#<METADATA>Ln prior of insertion of length 8=-16.571484563928063
#<METADATA>Ln prior of insertion of length 9=-17.264631744488007
#<METADATA>Ln prior of insertion of length 10=-20.72326583694641
#<METADATA>High-AF beta-binomial cluster=weight = 0.0180, alpha = 10.02, beta = 0.50
#<METADATA>Background beta-binomial cluster=weight = 0.1497, alpha = 1.59, beta = 1.55
#<METADATA>Binomial cluster 1=weight = 0.5171, mean = 0.990
#<METADATA>Binomial cluster 1=weight = 0.2910, mean = 0.503
#<METADATA>Binomial cluster 1=weight = 0.0678, mean = 0.392
#<METADATA>Binomial cluster 1=weight = 0.0538, mean = 0.161
#<METADATA>Binomial cluster 1=weight = 0.0415, mean = 0.275
#<METADATA>Binomial cluster 1=weight = 0.0257, mean = 0.064
#<METADATA>Binomial cluster 1=weight = 0.0023, mean = 0.159
#<METADATA>Binomial cluster 1=weight = 0.0006, mean = 0.333
#<METADATA>threshold=0.519
#<METADATA>fdr=0.042
#<METADATA>sensitivity=0.949
filter FP FDR FN FNR
weak_evidence 22.57 0.01 75.07 0.03
strand_bias 3.62 0.0 0.68 0.0
contamination 0.07 0.0 0.02 0.0
slippage 4.03 0.0 4.39 0.0
haplotype 7.89 0.0 8.82 0.0
germline 80.06 0.03 48.97 0.02
-
Hi ElenaGrassi,
Thank you for these examples. It looks like we are going to need to see your data to determine if this is a bug or if there is something internal going on with the algorithm.
Could you try to reproduce this issue without the contamination table and tumor segmentation inputs to decrease the size of the files you need to upload? We think that these files might not be necessary for investigating this issue.
The files we definitely need to see are the unfiltered VCF and the mutect stats file. Did you use the standard hg38 reference or did you make any modifications?
You can upload all these files in a bug report following these instructions: https://gatk.broadinstitute.org/hc/en-us/articles/360035889671. Please put these all in a zipped folder and let me know when you have uploaded the folder.
Best,
Genevieve
-
Will do today, thanks!
It's standard hg38.
-
Done. The uploaded file is called haplotype.tar.gz; it includes a .sh script with the filtering command, run without the contamination table and tumor segmentation as requested (the variants filtered due to 'haplotype' are still there, as expected).
The reference used is https://api.gdc.cancer.gov/data/254f697d-310d-4d7d-a27b-27fbf767a834
-
ElenaGrassi, I am trying to reproduce the error, and it looks like it might already be fixed in the most recent version of GATK. The 6 pairs you found were all PASS. Could you re-run with version 4.2.5.0 and confirm that you get the same result, or let us know if the issue persists?
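To confirm a re-run like this, the FILTER columns of the two filtered VCFs can be compared record by record. Below is a minimal Python sketch, assuming both inputs are tab-separated VCF text produced from the same unfiltered calls. The function names are hypothetical and not part of GATK.

```python
def read_filters(lines):
    """Map (contig, pos, ref, alt) to the sorted tuple of FILTER names."""
    filters = {}
    for line in lines:
        if line.startswith("#"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # malformed or truncated record
        key = (fields[0], int(fields[1]), fields[3], fields[4])
        filters[key] = tuple(sorted(fields[6].split(";")))
    return filters


def filter_changes(old_lines, new_lines):
    """Return {record key: (old filters, new filters)} for records that are
    present in both inputs and whose FILTER sets differ."""
    old = read_filters(old_lines)
    new = read_filters(new_lines)
    return {
        key: (old[key], new[key])
        for key in sorted(old.keys() & new.keys())
        if old[key] != new[key]
    }
```

Given the lines of the 4.1.4.0 and 4.2.5.0 filtered outputs, a record that was affected and is now passing would show up as a change from `('haplotype',)` to `('PASS',)`. Filter names are sorted before comparison so that a different ordering within FILTER is not reported as a change.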
-
Sorry I completely forgot :/ With 4.2.5.0 the issue is gone!
-
Great! Thanks for the update.