MuTect2 annotations
AnsweredCategory for my question: "d) Where do I find (......)?"
Hi, I am new to exome-seq and would be grateful for any suggestions :)
I want to check coverage for variants called with MuTect2 (GATK 4.1.8.1).
I would like to filter variants basing on the number of reads at the given position and number of reads in which alternate alleles have been called.
However, I am not sure which annotation should I analyze. I found "AS_SB_TABLE" annotation, and I am wondering - does it show total read counts for reference and alternate alleles? Or are the reads number somehow normalized for the strand bias tests?
Here is the header for "AS_SB_TABLE":
##INFO=<ID=AS_SB_TABLE,Number=1,Type=String,Description="Allele-specific forward/reverse read counts for strand bias tests. Includes the reference and alleles separated by |.">
-
Hi wiki97,
The GATK support team is focused on resolving questions about GATK tool-specific errors and abnormal results from the tools. For all other questions, such as this one, we are building a backlog to work through when we have the capacity.
Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools.
We cannot guarantee a reply, however, we ask other community members to help out if you know the answer.
For context, check out our support policy.
-
AS_SB_TABLE gives read counts for the forward and reverse strands. I think you want the AD (allelic depths) genotype annotation. However, we strongly urge you not to filter in this way. Instead, use FilterMutectCalls.
-
Dear David Benjamin, thank you very much for your answer!
After MuTect2 I ended up with 90,000 variant calls. Then I used FilterMutectCalls and took calls with "PASS" in the FILTER column (created by FilterMutectCalls). As I have only one sample (couldn't use VariantRecalibrator/ApplyVQSR) I applied hard filtering (VariantFiltration: FS, MQ, MQRankSum, ReadPosRankSum).
In the end, I wanted to filter variants basing on the number of reads at the given position and the number of reads in which alternate alleles have been called - is it not recommended to filter in this way?
-
FilterMutectCalls is the only filtering we recommend with Mutect2 output. The Mutect2 tutorials describe how to use both tools.
-
Dear David Benjamin, thank you very much for your answer!
Do I understand correctly that FilterMutectCalls already takes into account the number of reads at the given position and the number of reads in which alternate alleles have been called, by looking at the AD value?
-
FilterMutectCalls accounts for these but in a more sophisticated way than one might expect. It's described in section 2 of our documentation: https://github.com/broadinstitute/gatk/blob/master/docs/mutect/mutect.pdf
-
Dear David Benjamin, thank you very much for your help!
-
David Benjamin, do you think it would be possible for Broad to start documenting things like this and most of the tags in your VCF files. More like simple things not like the algorithm for clustered events. For instance I have variant from which I can get the depth from the bam file with samtools with a variant called by Mutect at position chr17:60663118 C>T. The depth of my bam file at this position is 170 (after removing duplicates; 222 before removing duplicates which is what shows in IGV).
$ samtools depth C484.TCGA-19-2620-10A-01D-1495-08.5_gdc_realn.bam -r chr17:60663117-60663119
chr17 60663117 172
chr17 60663118 170
chr17 60663119 173This coincides exactly with IGV (also shows 170 for position 60663118 when remove duplicates is turned on).
$ samtools view C484.TCGA-19-2620-10A-01D-1495-08.5_gdc_realn.bam chr17:60663118-60663118 | wc -l
222However Mutect has 2 depth DP tags, an INFO/DP and a FMT/DP. The format DP shows a depth of 163 at this position which makes sense that it's at least lower. But the INFO/DP shows a depth of DP=318 which makes almost zero sense (that's almost double!!) and in the VCF file it even indicates some reads are filtered...
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
Additionally the AS_SB_TABLE shows "164,135|3,2" in which the total is 304 which again makes almost zero sense. So from a researcher perspective it is absolutely necessary to know how Mutect2 is calculating these numbers else the strand bias filter cannot be trusted at all. I am using version 4.2.1.0 of GATK and below are the variant and the call information:
Variant
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TCGA-19-2620-10A-01D-1495-08
chr17 60663118 . C T . weak_evidence AS_FilterStatus=weak_evidence;AS_SB_TABLE=164,135|3,2;DP=318;ECNT=1;GERMQ=93;MBQ=33,27;
MFRL=182,128;MMQ=60,60;MPOS=17;NALOD=1.99;NLOD=28.89;POPAF=6;ROQ=32;TLOD=6.59;AC=1;AN=2;MQ0=0;samtools_DP=170;SAMPLE=TCGA-19-2620-10A-01D-1495-08;PON_RefDepth=
1532;PON_AltDepth=0;PON_FISHER=0;CSQ=T|stop_gained|HIGH|PPM1D|ENSG00000170836|Transcript|ENST00000305921.8|protein_coding|6/6||ENST00000305921.8:c.1384C>T|ENSP
00000306682.2:p.Gln462Ter|1606|1384|462|Q/*|Caa/Taa|CM131995&COSV59955543||1||SNV|HGNC|HGNC:9277|YES|NM_003620.4||1|P1|CCDS11625.1|ENSP00000306682|O15297.184|A
0A0S2Z4M2.32|UPI0000130FE8|O15297-1||Ensembl||C|C||1||||||||||||||||||||||||||0&1|1&1||||||||MAGLYSLGVSVFSDQGGRKYMEDVTQIVVEPEPTAEEKPSPRRSLSQPLPPRPSPAALPGGEVSGK
GPAVAAREARDPLPDAGASPAPSRCCRRRSSVAFFAVCDGHGGREAAQFAREHLWGFIKKQKGFTSSEPAKVCAAIRKGFLACHLAMWKKLAEWPKTMTGLPSTSGTTASVVIIRGMKMYVAHVGDSGVVLGIQDDPKDDFVRAVEVTQDHKPELPKER
ERIEGLGGSVMNKSGVNRVVWKRPRLTHNGPVRRSTVIDQIPFLAVARALGDLWSYDFFSGEFVVSPEPDTSVHTLDPQKHKYIILGSDGLWNMIPPQDAISMCQDQEEKKYLMGEHGQSCAKMLVNRALGRWRQRMLRADNTSAIVICISPEVDNQGN
FTNEDELYLNLTDSPSYNSQETCVMTPSPCSTPPVKSLEEDPWPRVNSKDHIPALVRSNAFSENFLEVSAEIARENVQGVVIPSKDPEPLEENCAKALTLRIHDSLNNSLPIGLVPTNSTNTVMDQKNLKMSTPGQMKAQEIERTPPTNFKRTLEESNS
GPLMKKHRRNGLSRSSGAQPASLPTTSQRKNSVKLTMRRRLRGQKKIGNPLLHQHRKTVCVC|||||||||||||||||||||||||||||| GT:AD:AF:DP:F1R2:F2R1:SB 0/1:158,5:0.031:163:70,2:79,2:8
7,71,3,2Call (your guys also sequenced this sample for the TCGA since the center is -08 TCGA-19-2620-10A-01D-1495-08):
NORMAL=$(samtools view -H $normal_bam | /usr/bin/perl -nE 'say $1 if /^\@RG.+\tSM:([ -~]+)/' | head -n 1)
TUMOR=$(samtools view -H $tumor_bam | /usr/bin/perl -nE 'say $1 if /^\@RG.+\tSM:([ -~]+)/' | head -n 1)
/gatk/gatk Mutect2 --java-options "-Xmx8g" -O $1 -R $2 -I $3 -tumor "$TUMOR" -I $4 -normal "$NORMAL" -L $5 --f1r2-tar-gz f1r2.tar.gz #Running Mutect2.
/gatk/gatk LearnReadOrientationModel -I f1r2.tar.gz -O artifact.priors.tar.gz
/gatk/gatk FilterMutectCalls --java-options "-Xmx8g" -R $2 -V $1 -O $6 -ob-priors artifact.priors.tar.gz #Running FilterMutectCalls on the output vcf. -
Hi Brian Wiley,
Thank you for writing into the GATK forum with this question. We took a closer look at the headers and agree that the F1R2 and F2R1 annotations could be more clear to describe how they count reads and fragments. We think the AD and FAD descriptions in the headers currently are sufficient.
We created a ticket so that we can edit the headers to be more clear: https://github.com/broadinstitute/gatk/issues/7904
The developer team will update that ticket when they are able to work on this.
Best,
Genevieve
Please sign in to leave a comment.
9 comments