Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

MuTect2 annotations

Answered
0

9 comments

  • Avatar
    Genevieve Brandt (she/her)

    Hi wiki97,

    The GATK support team is focused on resolving questions about GATK tool-specific errors and abnormal results from the tools. For all other questions, such as this one, we are building a backlog to work through when we have the capacity.

    Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools.

    We cannot guarantee a reply, however, we ask other community members to help out if you know the answer.

    For context, check out our support policy.

    0
    Comment actions Permalink
  • Avatar
    David Benjamin

    AS_SB_TABLE gives read counts for the forward and reverse strands.  I think you want the AD (allelic depths) genotype annotation.  However, we strongly urge you not to filter in this way.  Instead, use FilterMutectCalls.

    0
    Comment actions Permalink
  • Avatar
    wiki97

    Dear David Benjamin, thank you very much for your answer!

    After MuTect2 I ended up with 90,000 variant calls. Then I used FilterMutectCalls and took calls with "PASS" in the FILTER column (created by FilterMutectCalls). As I have only one sample (couldn't use VariantRecalibrator/ApplyVQSR) I applied hard filtering (VariantFiltration: FS, MQ, MQRankSum, ReadPosRankSum).

    In the end, I wanted to filter variants basing on the number of reads at the given position and the number of reads in which alternate alleles have been called - is it not recommended to filter in this way?

    0
    Comment actions Permalink
  • Avatar
    David Benjamin

    FilterMutectCalls is the only filtering we recommend with Mutect2 output.  The Mutect2 tutorials describe how to use both tools.

    0
    Comment actions Permalink
  • Avatar
    wiki97

    Dear David Benjamin, thank you very much for your answer!

    Do I understand correctly that FilterMutectCalls already takes into account the number of reads at the given position and the number of reads in which alternate alleles have been called, by looking at the AD value?

    0
    Comment actions Permalink
  • Avatar
    David Benjamin

    FilterMutectCalls accounts for these but in a more sophisticated way than one might expect.  It's described in section 2 of our documentation: https://github.com/broadinstitute/gatk/blob/master/docs/mutect/mutect.pdf

    0
    Comment actions Permalink
  • Avatar
    wiki97

    Dear David Benjamin, thank you very much for your help!

    0
    Comment actions Permalink
  • Avatar
    Brian Wiley

    David Benjamin, do you think it would be possible for Broad to start documenting things like this and most of the tags in your VCF files.  More like simple things not like the algorithm for clustered events.  For instance I have variant from which I can get the depth from the bam file with samtools with a variant called by Mutect at position chr17:60663118 C>T.  The depth of my bam file at this position is 170 (after removing duplicates; 222 before removing duplicates which is what shows in IGV).

    $ samtools depth C484.TCGA-19-2620-10A-01D-1495-08.5_gdc_realn.bam -r chr17:60663117-60663119
    chr17    60663117    172
    chr17    60663118    170
    chr17    60663119    173

    This coincides exactly with IGV (also shows 170 for position 60663118 when remove duplicates is turned on).

    $ samtools view C484.TCGA-19-2620-10A-01D-1495-08.5_gdc_realn.bam chr17:60663118-60663118 | wc -l
    222

    However Mutect has 2 depth DP tags, an INFO/DP and a FMT/DP.  The format DP shows a depth of 163 at this position which makes sense that it's at least lower.  But the INFO/DP shows a depth of DP=318 which makes almost zero sense (that's almost double!!) and in the VCF file it even indicates some reads are filtered...

    ##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">

    Additionally the AS_SB_TABLE shows "164,135|3,2" in which the total is 304 which again makes almost zero sense.  So from a researcher perspective it is absolutely necessary to know how Mutect2 is calculating these numbers else the strand bias filter cannot be trusted at all.  I am using version 4.2.1.0 of GATK and below are the variant and the call information:

    Variant

    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  TCGA-19-2620-10A-01D-1495-08
    chr17   60663118        .       C       T       .       weak_evidence   AS_FilterStatus=weak_evidence;AS_SB_TABLE=164,135|3,2;DP=318;ECNT=1;GERMQ=93;MBQ=33,27;
    MFRL=182,128;MMQ=60,60;MPOS=17;NALOD=1.99;NLOD=28.89;POPAF=6;ROQ=32;TLOD=6.59;AC=1;AN=2;MQ0=0;samtools_DP=170;SAMPLE=TCGA-19-2620-10A-01D-1495-08;PON_RefDepth=
    1532;PON_AltDepth=0;PON_FISHER=0;CSQ=T|stop_gained|HIGH|PPM1D|ENSG00000170836|Transcript|ENST00000305921.8|protein_coding|6/6||ENST00000305921.8:c.1384C>T|ENSP
    00000306682.2:p.Gln462Ter|1606|1384|462|Q/*|Caa/Taa|CM131995&COSV59955543||1||SNV|HGNC|HGNC:9277|YES|NM_003620.4||1|P1|CCDS11625.1|ENSP00000306682|O15297.184|A
    0A0S2Z4M2.32|UPI0000130FE8|O15297-1||Ensembl||C|C||1||||||||||||||||||||||||||0&1|1&1||||||||MAGLYSLGVSVFSDQGGRKYMEDVTQIVVEPEPTAEEKPSPRRSLSQPLPPRPSPAALPGGEVSGK
    GPAVAAREARDPLPDAGASPAPSRCCRRRSSVAFFAVCDGHGGREAAQFAREHLWGFIKKQKGFTSSEPAKVCAAIRKGFLACHLAMWKKLAEWPKTMTGLPSTSGTTASVVIIRGMKMYVAHVGDSGVVLGIQDDPKDDFVRAVEVTQDHKPELPKER
    ERIEGLGGSVMNKSGVNRVVWKRPRLTHNGPVRRSTVIDQIPFLAVARALGDLWSYDFFSGEFVVSPEPDTSVHTLDPQKHKYIILGSDGLWNMIPPQDAISMCQDQEEKKYLMGEHGQSCAKMLVNRALGRWRQRMLRADNTSAIVICISPEVDNQGN
    FTNEDELYLNLTDSPSYNSQETCVMTPSPCSTPPVKSLEEDPWPRVNSKDHIPALVRSNAFSENFLEVSAEIARENVQGVVIPSKDPEPLEENCAKALTLRIHDSLNNSLPIGLVPTNSTNTVMDQKNLKMSTPGQMKAQEIERTPPTNFKRTLEESNS
    GPLMKKHRRNGLSRSSGAQPASLPTTSQRKNSVKLTMRRRLRGQKKIGNPLLHQHRKTVCVC||||||||||||||||||||||||||||||    GT:AD:AF:DP:F1R2:F2R1:SB        0/1:158,5:0.031:163:70,2:79,2:8
    7,71,3,2

    Call (your guys also sequenced this sample for the TCGA since the center is -08 TCGA-19-2620-10A-01D-1495-08):

    NORMAL=$(samtools view -H $normal_bam | /usr/bin/perl -nE 'say $1 if /^\@RG.+\tSM:([ -~]+)/' | head -n 1)
    TUMOR=$(samtools view -H $tumor_bam | /usr/bin/perl -nE 'say $1 if /^\@RG.+\tSM:([ -~]+)/' | head -n 1)

    /gatk/gatk Mutect2 --java-options "-Xmx8g" -O $1 -R $2 -I $3 -tumor "$TUMOR" -I $4 -normal "$NORMAL" -L $5 --f1r2-tar-gz f1r2.tar.gz #Running Mutect2.

    /gatk/gatk LearnReadOrientationModel -I f1r2.tar.gz -O artifact.priors.tar.gz

    /gatk/gatk FilterMutectCalls --java-options "-Xmx8g" -R $2 -V $1 -O $6 -ob-priors artifact.priors.tar.gz #Running FilterMutectCalls on the output vcf.

     

     

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Brian Wiley,

    Thank you for writing into the GATK forum with this question. We took a closer look at the headers and agree that the F1R2 and F2R1 annotations could be more clear to describe how they count reads and fragments. We think the AD and FAD descriptions in the headers currently are sufficient.

    We created a ticket so that we can edit the headers to be more clear: https://github.com/broadinstitute/gatk/issues/7904

    The developer team will update that ticket when they are able to work on this.

    Best,

    Genevieve

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk