Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

FilterMutectCalls, --distance-on-haplotype and variants filtered for haplotype

Answered

9 comments

  • Genevieve Brandt (she/her)

    Hi ElenaGrassi,

    I am going to move your post into our Community Discussions -> Documentation Questions topic, as the Somatic topic is for reporting bugs and issues with GATK.

    You can read more about our forum guidelines and the topics here: Forum Guidelines.

    Best,

    Genevieve

  • Genevieve Brandt (she/her)

    Hi ElenaGrassi,

    Yes, you are correct: for a variant to be filtered by the 'haplotype' filter, one of the variants in phase with it must have been filtered for another reason. Could you zero in on one or two of the PID examples and show all the variants with that PID?

    There could be something going on behind the scenes that is not visible right now. There could also be some sort of allele-specific filter.

    Let me know what you find,

    Genevieve

  • ElenaGrassi

    Thanks, sure, one example:

     

    $ zgrep 25245350_C_G  sample123.filtered.vcf.gz
    chr12   25245350        .       C       G       .       haplotype       CONTQ=93;DP=84;ECNT=2;GERMQ=18;MBQ=20,20;MFRL=142,160;MMQ=60,60;MPOS=39;POPAF=7.30;SEQQ=93;STRANDQ=93;TLOD=155.53       GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB     0|1:42,41:0.494:83:22,16:20,25:0|1:25245350_C_G:25245350:18,24,17,24
    chr12   25245359        .       A       G       .       haplotype       CONTQ=93;DP=83;ECNT=2;GERMQ=18;MBQ=20,20;MFRL=142,160;MMQ=60,60;MPOS=37;POPAF=7.30;SEQQ=93;STRANDQ=93;TLOD=155.53       GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB     0|1:42,41:0.494:83:22,15:19,24:0|1:25245350_C_G:25245350:18,24,17,24
    $ zgrep 25245350_C_G  sample123.vcf.gz
    chr12   25245350        .       C       G       .       .       DP=84;ECNT=2;MBQ=20,20;MFRL=142,160;MMQ=60,60;MPOS=39;POPAF=7.30;TLOD=155.53    GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB     0|1:42,41:0.494:83:22,16:20,25:0|1:25245350_C_G:25245350:18,24,17,24
    chr12   25245359        .       A       G       .       .       DP=83;ECNT=2;MBQ=20,20;MFRL=142,160;MMQ=60,60;MPOS=37;POPAF=7.30;TLOD=155.53    GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB     0|1:42,41:0.494:83:22,15:19,24:0|1:25245350_C_G:25245350:18,24,17,24

    All the other examples, if my bash-fu is not broken, should be similar.
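To check every phase group rather than one PID at a time, something like the following sketch could be used. It assumes a single-sample, tab-separated VCF on standard input with the sample in column 10; `haplotype_only_pids` is a hypothetical helper name, not a GATK tool. It prints each PID whose records all carry exactly `haplotype` in FILTER, i.e. phase groups in which no member was filtered for another reason.

```shell
# Hypothetical helper (not part of GATK): reads an uncompressed, single-sample
# VCF on stdin and prints every PID whose records are ALL filtered with
# exactly "haplotype".
haplotype_only_pids() {
  awk -F'\t' '
    /^#/ { next }                       # skip header lines
    {
      # Find the PID key in FORMAT (column 9), take its value from column 10.
      n = split($9, keys, ":")
      split($10, vals, ":")
      pid = ""
      for (i = 1; i <= n; i++) if (keys[i] == "PID") pid = vals[i]
      if (pid == "") next               # unphased record: no PID to group on
      total[pid]++
      if ($7 == "haplotype") hap[pid]++ # FILTER is exactly "haplotype"
    }
    END {
      for (p in total) if (hap[p] == total[p]) print p
    }
  ' | sort
}
```

It could be run as `gzip -dc sample123.filtered.vcf.gz | haplotype_only_pids`. A tumor-normal VCF has more than one sample column, so the column index would need adjusting there.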

    My FilterMutectCalls parameters (pretty vanilla):
    gatk FilterMutectCalls \
        -V mutect/sample123.vcf.gz \
        -O mutect/sample123.filtered.vcf.gz \
        -R /mnt/cold1/snaketree/task/annotations/dataset/gnomad/GRCh38.d1.vd1.fa \
        --stats mutect/sample123.vcf.gz.stats \
        --contamination-table mutect/sample123.contamination.table \
        --tumor-segmentation=mutect/sample123.tum.seg \
        --filtering-stats mutect/sample123_filtering_stats.tsv \
        2> mutect/sample123_filtering_stats.tsv.log

    Filtering stats:

    #<METADATA>Ln prior of deletion of length 10=-20.72326583694641
    #<METADATA>Ln prior of deletion of length 9=-17.957778925047954
    #<METADATA>Ln prior of deletion of length 8=-20.72326583694641
    #<METADATA>Ln prior of deletion of length 7=-20.72326583694641
    #<METADATA>Ln prior of deletion of length 6=-16.166019455819896
    #<METADATA>Ln prior of deletion of length 5=-16.859166636379843
    #<METADATA>Ln prior of deletion of length 4=-16.34834101261385
    #<METADATA>Ln prior of deletion of length 3=-15.318721595432693
    #<METADATA>Ln prior of deletion of length 2=-15.392829567586416
    #<METADATA>Ln prior of deletion of length 1=-14.294217278918307
    #<METADATA>Ln prior of SNV=-10.471165611907997
    #<METADATA>Ln prior of insertion of length 1=-13.435990347998912
    #<METADATA>Ln prior of insertion of length 2=-15.18519020280817
    #<METADATA>Ln prior of insertion of length 3=-15.655193832053907
    #<METADATA>Ln prior of insertion of length 4=-15.318721595432693
    #<METADATA>Ln prior of insertion of length 5=-20.72326583694641
    #<METADATA>Ln prior of insertion of length 6=-16.34834101261385
    #<METADATA>Ln prior of insertion of length 7=-17.264631744488007
    #<METADATA>Ln prior of insertion of length 8=-16.571484563928063
    #<METADATA>Ln prior of insertion of length 9=-17.264631744488007
    #<METADATA>Ln prior of insertion of length 10=-20.72326583694641
    #<METADATA>High-AF beta-binomial cluster=weight = 0.0180, alpha = 10.02, beta = 0.50
    #<METADATA>Background beta-binomial cluster=weight = 0.1497, alpha = 1.59, beta = 1.55
    #<METADATA>Binomial cluster 1=weight = 0.5171, mean = 0.990
    #<METADATA>Binomial cluster 1=weight = 0.2910, mean = 0.503
    #<METADATA>Binomial cluster 1=weight = 0.0678, mean = 0.392
    #<METADATA>Binomial cluster 1=weight = 0.0538, mean = 0.161
    #<METADATA>Binomial cluster 1=weight = 0.0415, mean = 0.275
    #<METADATA>Binomial cluster 1=weight = 0.0257, mean = 0.064
    #<METADATA>Binomial cluster 1=weight = 0.0023, mean = 0.159
    #<METADATA>Binomial cluster 1=weight = 0.0006, mean = 0.333
    #<METADATA>threshold=0.519
    #<METADATA>fdr=0.042
    #<METADATA>sensitivity=0.949
    filter  FP      FDR     FN      FNR
    weak_evidence   22.57   0.01    75.07   0.03
    strand_bias     3.62    0.0     0.68    0.0
    contamination   0.07    0.0     0.02    0.0
    slippage        4.03    0.0     4.39    0.0
    haplotype       7.89    0.0     8.82    0.0
    germline        80.06   0.03    48.97   0.02

     

  • Genevieve Brandt (she/her)

    Hi ElenaGrassi,

    Thank you for these examples. It looks like we are going to need to see your data to determine if this is a bug or if there is something internal going on with the algorithm. 

    Could you try to reproduce this issue without the contamination table and tumor segmentation inputs to decrease the size of the files you need to upload? We think that these files might not be necessary for investigating this issue. 

    The files we definitely need to see are the unfiltered VCF and the mutect stats file. Did you use the standard hg38 reference or did you make any modifications?

    You can upload all these files in a bug report following these instructions: https://gatk.broadinstitute.org/hc/en-us/articles/360035889671. Please put these all in a zipped folder and let me know when you have uploaded the folder.

    Best,

    Genevieve

  • ElenaGrassi

    Will do today, thanks!

    It's standard hg38.

  • ElenaGrassi

    Done. The uploaded file is called haplotype.tar.gz. It contains a .sh script with the filtering command, run without the contamination table and tumor segmentation as requested (the variants filtered by 'haplotype' are still there, as expected).

    The reference used is https://api.gdc.cancer.gov/data/254f697d-310d-4d7d-a27b-27fbf767a834

  • David Benjamin

    ElenaGrassi I am trying to reproduce the error and it looks like it might already be fixed in the most recent version of the GATK.  The 6 pairs you found were all PASS.  Could you re-run with version 4.2.5.0 and confirm that you get the same, or let us know if the issue persists?
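When re-running with a newer release, one way to confirm what changed is to compare the FILTER column of the two filtered VCFs site by site. The sketch below assumes both files are uncompressed, tab-separated VCFs; `filter_changes` is a hypothetical helper name, not a GATK tool. For every record present in both files whose FILTER differs, it prints the site, the old FILTER value, and the new one.

```shell
# Hypothetical helper (not part of GATK): compares FILTER values between two
# uncompressed VCFs, e.g. the output of an older and a newer GATK release.
# Usage: filter_changes old.filtered.vcf new.filtered.vcf
filter_changes() {
  awk -F'\t' '
    /^#/ { next }                                  # skip header lines
    { key = $1 ":" $2 ":" $4 ">" $5 }              # chrom:pos:ref>alt
    FILENAME == ARGV[1] { old[key] = $7; next }    # first file: remember FILTER
    (key in old) && old[key] != $7 {               # second file: report changes
      print key "\t" old[key] "\t" $7
    }
  ' "$1" "$2"
}
```

Records that exist in only one of the two files are ignored here, which seems reasonable when both runs start from the same unfiltered VCF.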

  • ElenaGrassi

    Sorry, I completely forgot :/ With 4.2.5.0 the issue is gone!

  • Genevieve Brandt (she/her)

    Great! Thanks for the update.
