Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Mutect2 Filtering Discrepancy with Different Versions of gnomAD Germline Data

9 comments

  • Gökalp Çelik

    Hi Mert Çelik

    Can you share some details of your work, such as the number of variants in the raw Mutect2 VCFs before filtering?

    We would also like to hear how you gathered your germline data from gnomAD 4. Looking at those variant entries, I noticed that one entry has a POPAF value of 7.30, which looks malformed. POPAF is supposed to be between 0 and 1, so something seems off here. We need to understand how this value arises in a variant context, whether from a bug or from a malformed resource file.

    Regards. 

  • Mert Çelik

    Thanks Gökalp Çelik

    The number of variants before filtering is 33891; after filtering, it decreased to 6560. So the filtering is already doing a fair amount of work, but the result is not sufficient or as expected. Upon inspecting the Mutect2 manual, I found that the POPAF value is the negative base-10 logarithm of the variant's population allele frequency, which the pipeline calculates from the AF data in the gnomAD file.
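
For readers following along, the relationship between AF and POPAF can be sketched in a few lines. This is a minimal illustration, assuming the base-10 convention given in the Mutect2 INFO header description for POPAF; the function names are hypothetical and are not part of GATK.

```python
import math

def af_to_popaf(af: float) -> float:
    """Return POPAF, the negative base-10 logarithm of an allele frequency."""
    if not 0.0 < af <= 1.0:
        raise ValueError("allele frequency must be in (0, 1]")
    return -math.log10(af)

def popaf_to_af(popaf: float) -> float:
    """Invert POPAF back to an allele frequency."""
    return 10.0 ** (-popaf)

# The POPAF value of 7.30 discussed above corresponds to a very rare allele.
print(f"{popaf_to_af(7.30):.2e}")   # 5.01e-08
print(f"{af_to_popaf(0.001):.2f}")  # 3.00
```

Read this way, a POPAF above 1 is not malformed in itself: larger values simply mean rarer alleles.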

    I obtained the gnomAD 4 VCFs from their website and trimmed them so that they contain only the AF, AN, AC and nhomalt information. After that, I concatenated all the VCFs using bcftools concat. The resulting VCF was sorted and indexed.
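
The trimming step is not shown in the thread, so the following is only a sketch of what "keep AF, AN, AC and nhomalt" could look like on a single VCF line. The helper name `trim_info` is hypothetical, and the sketch leaves the `##INFO` header definitions of the dropped keys untouched; a dedicated VCF tool would normally handle that too.

```python
KEEP = ("AF", "AN", "AC", "nhomalt")

def trim_info(vcf_line: str, keep=KEEP) -> str:
    """Keep only the selected INFO keys on a data line; pass header lines through."""
    line = vcf_line.rstrip("\n")
    if line.startswith("#"):
        return line
    fields = line.split("\t")
    # INFO is the 8th column: semicolon-separated KEY=VALUE entries or bare flags.
    entries = fields[7].split(";")
    kept = [e for e in entries if e.split("=", 1)[0] in keep]
    fields[7] = ";".join(kept) if kept else "."
    return "\t".join(fields)

example = "chr1\t100\t.\tA\tG\t.\tPASS\tAC=1;AF=0.5;AN=2;nhomalt=0;faf95=0.1"
print(trim_info(example))
```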
     

  • Gökalp Çelik

    Sorry for misinterpreting the POPAF value; I mixed it up with another tag. I will discuss this issue with our team and get their opinion as well. If they think that this could be a bug, I will escalate the issue.

    Regards. 

  • Mert Çelik

    Thank you, I will wait for your response then. If you have any further questions about how I set up the work, I would be glad to answer.

    Best.

  • Gökalp Çelik

    Hi again. 

    After discussing this with the team, it looks like changes in gnomAD 4.0 are causing issues during the germline AF filtering step. Mutect2 expects a single line per site/locus, so the germline resource we provide produces the correct behavior. The one you generated has multiple entries per site, which causes the problematic behavior. A fix for this issue is on its way for the next release of GATK; however, we cannot guarantee anything about additional changes to gnomAD data in future releases. We suggest that users stick with the resource bundles that we provide.

    Regards. 
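
To check whether a custom germline resource has more than one record per site, something like the following could be run over the uncompressed VCF text. It is a sketch that assumes the file is coordinate-sorted (as described earlier in the thread), so that records for the same site are adjacent; the function names are hypothetical.

```python
from itertools import groupby

def site_key(line: str):
    """Return (CHROM, POS) for a VCF data line."""
    chrom, pos = line.split("\t", 2)[:2]
    return chrom, int(pos)

def multi_record_sites(lines):
    """Yield ((CHROM, POS), count) for every site with more than one record."""
    records = (l for l in lines if l.strip() and not l.startswith("#"))
    # groupby only merges adjacent items, hence the sorted-input assumption.
    for site, group in groupby(records, key=site_key):
        n = sum(1 for _ in group)
        if n > 1:
            yield site, n

demo = [
    "##fileformat=VCFv4.2\n",
    "chr1\t100\t.\tA\tG\t.\tPASS\tAF=0.1\n",
    "chr1\t100\t.\tA\tT\t.\tPASS\tAF=0.2\n",
    "chr1\t250\t.\tC\tT\t.\tPASS\tAF=0.3\n",
]
print(list(multi_record_sites(demo)))  # [(('chr1', 100), 2)]
```

Because it streams line by line, the same generator can be fed an open file handle instead of a list.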

  • Mert Çelik

    Hello,

    Thank you for the effort, Gökalp Çelik; it is appreciated. I can also offer a humble suggestion: as you might have heard, gnomAD 4.1 was released recently, and its INFO tags are structured quite differently from previous versions. The GATK team might also want to take these differences into consideration while devising the new update.

    Best.

  • Gökalp Çelik

    Hi 

    Thank you for the heads up. A more generic fix is in the PR below

    https://github.com/broadinstitute/gatk/pull/8837

    and will be included in the next release of GATK.

    Regards. 

  • D S

    Dear Mert Çelik,

    Out of curiosity, did you use the code in mutect2.wdl to filter the AF information from gnomAD 4.1?

     

            grep -v "^#" ${input_vcf} | sed -e 's#\(.*\)\t\(.*\)\t\(.*\)\t\(.*\)\t\(.*\)\t\(.*\)\t\(.*\)\t.*;AF=\([0-9]*\.[e0-9+-]*\).*#\1\t\2\t.\t\4\t\5\t.\t\7\tAF=\8#g' > simplified_body &

     

    I think I am doing something similar to what you did. I modified mutect_resources.wdl and used it to create the AF-only VCF and CommonBiallelicSNP. The result looks a bit problematic and I am trying to find out why.

    Best
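
One thing that may be worth checking with the sed expression quoted above: the substitution only fires when the INFO column contains `;AF=` followed by a number with a decimal point. On any other line, sed prints the input unchanged. The sketch below is a Python translation of that expression for experimenting with individual lines; it assumes GNU sed semantics (where `\t` means a tab), and `simplify` is a hypothetical name.

```python
import re

# Translation of the quoted sed pattern: seven tab-separated columns, then an
# INFO column containing ";AF=<number with a decimal point>".
SED_PATTERN = re.compile(
    r"(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t.*;AF=([0-9]*\.[e0-9+-]*).*"
)
# Same replacement as the sed command: blank out ID and QUAL, keep only AF.
SED_REPLACEMENT = r"\1\t\2\t.\t\4\t\5\t.\t\7\tAF=\8"

def simplify(vcf_line: str) -> str:
    """Apply the substitution; lines that do not match come back unchanged."""
    return SED_PATTERN.sub(SED_REPLACEMENT, vcf_line)

with_af = "chr1\t100\trs1\tA\tG\t50\tPASS\tAC=1;AF=1.5e-05;AN=100"
joint_only = "chr1\t100\t.\tA\tG\t.\tPASS\tAC_joint=1;AF_joint=1.5e-05;AN_joint=100"

print(simplify(with_af))                   # INFO reduced to AF=1.5e-05
print(simplify(joint_only) == joint_only)  # True: no ";AF=" to match
```

If the output VCF still carries full INFO columns, unmatched lines of this kind are one possible cause.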

  • Mert Çelik

    Hello DS, 

    In gnomAD 4.1 there is no INFO tag named "AF". They merged the genome and exome entries and renamed some tags, including AF. You can try changing the new tag name "AF_joint" to "AF" using sed so that Mutect2 can recognize it. This could be an option to solve your problem.

    Best,
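
As an alternative to sed, the suggested rename can be sketched in Python. This is an illustration only: `rename_info_key` is a hypothetical helper, and it assumes the file has no existing INFO key called "AF", since renaming would otherwise create a duplicate key. Matching on the whole key name keeps longer keys that merely begin with "AF_joint" intact.

```python
import re

def rename_info_key(vcf_line: str, old: str = "AF_joint", new: str = "AF") -> str:
    """Rename one INFO key in a VCF header or data line."""
    line = vcf_line.rstrip("\n")
    # Update the matching ##INFO header definition so header and records agree.
    if line.startswith("##INFO=<ID=" + old + ","):
        return line.replace("ID=" + old + ",", "ID=" + new + ",", 1)
    if line.startswith("#"):
        return line
    fields = line.split("\t")
    # Match the key only as a whole entry: at the start or after ';', and
    # followed by '=', ';' or the end of the INFO column.
    pattern = r"(^|;)" + re.escape(old) + r"(?==|;|$)"
    fields[7] = re.sub(pattern, lambda m: m.group(1) + new, fields[7])
    return "\t".join(fields)

record = "chr1\t100\t.\tA\tG\t.\tPASS\tAC_joint=1;AF_joint=0.5;AF_joint_afr=0.1"
print(rename_info_key(record))
```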
