documentation for Mutect2 theory
What is the best way to map Mutect2 input command-line arguments and output VCF fields to the variables described in the Mutect2 whitepaper in its GitHub repo? It isn't obvious to me which variables in the algorithm descriptions there correspond to which fields in the inputs and outputs. Also, is that whitepaper the latest and greatest description of Mutect2 theory? I see it hasn't been updated in 10 months on the master branch. If not, where should I be looking for this info?
-
On a related note, the table in the section "B. Hard Filters" doesn't seem to match my Mutect2 output VCFs in GATK v4.1.8.0. Here are some examples:
"fragment_length" is called "fragment" in the VCF -- same thing?
there is no "duplicate_evidence" in my VCF... is it the same as "duplicates" in the whitepaper?
"base_quality" should be "base_qual"?
and so forth.
So I am wondering whether the whitepaper is out of date or whether something is off in my workflow.
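For anyone comparing names the same way, here is a minimal sketch of how the filter IDs declared in a VCF header could be listed and checked against the whitepaper's names. It uses only the Python standard library and is not part of GATK. The function name `parse_meta_lines` and the sample header are hypothetical: the IDs "fragment" and "base_qual" and the three whitepaper names come from this post, and the Description values are placeholders rather than real Mutect2 output.

```python
import re

# Structured VCF meta-information lines look like ##KIND=<key=value,...>
_META_RE = re.compile(r'^##(?P<kind>\w+)=<(?P<body>.*)>\s*$')
# key=value pairs; a value may be double-quoted and contain commas
_PAIR_RE = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|[^,]*)')


def parse_meta_lines(lines, kind):
    """Return {ID: Description} for header lines of one kind (FILTER, INFO, FORMAT)."""
    found = {}
    for line in lines:
        if not line.startswith("##"):
            break  # the #CHROM line or a record: the meta header is over
        match = _META_RE.match(line)
        if match is None or match.group("kind") != kind:
            continue  # unstructured line (e.g. ##fileformat) or another kind
        fields = {key: value.strip('"')
                  for key, value in _PAIR_RE.findall(match.group("body"))}
        if "ID" in fields:
            found[fields["ID"]] = fields.get("Description", "")
    return found


# Hypothetical header: IDs are from this post, descriptions are placeholders.
sample_header = [
    '##fileformat=VCFv4.2',
    '##FILTER=<ID=base_qual,Description="placeholder description">',
    '##FILTER=<ID=fragment,Description="placeholder description">',
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO',
]

vcf_filters = parse_meta_lines(sample_header, "FILTER")

# Filter names as quoted from the whitepaper table in this post.
whitepaper_names = ["fragment_length", "duplicate_evidence", "base_quality"]
unmatched = [name for name in whitepaper_names if name not in vcf_filters]

print(sorted(vcf_filters))  # ['base_qual', 'fragment']
print(unmatched)            # ['fragment_length', 'duplicate_evidence', 'base_quality']
```

To run it on a real file, pass the lines of an uncompressed VCF instead of `sample_header`. Passing "INFO" or "FORMAT" as the second argument lists those fields with their descriptions in the same way.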
-
The whitepaper is the most up-to-date documentation on the theory used for M2. We have not made significant changes to the theory behind M2 in the last 10 months.
Could you be more specific about which variables you are trying to understand? Perhaps we need to make the documentation clearer.
I recommend using this Terra workspace:
https://app.terra.bio/#workspaces/help-gatk/Somatic-SNVs-Indels-GATK4
-
It would be really useful if the whitepaper had a table giving the names of all the INFO and FORMAT fields in the Mutect2 post-filter VCF, cross-referenced to the names of the variables in the whitepaper. The specific fields I am interested in, for Mutect2 in GATK v4.1.8.0, are:
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##INFO=<ID=CONTQ,Number=1,Type=Float,Description="Phred-scaled qualities that alt allele are not due to contamination">
##INFO=<ID=GERMQ,Number=1,Type=Integer,Description="Phred-scaled quality that alt alleles are not germline variants">
##INFO=<ID=ROQ,Number=1,Type=Float,Description="Phred-scaled qualities that alt allele are not due to read orientation artifact">
##INFO=<ID=SEQQ,Number=1,Type=Integer,Description="Phred-scaled quality that alt alleles are not sequencing errors">
##INFO=<ID=STRANDQ,Number=1,Type=Integer,Description="Phred-scaled quality of strand bias artifact">
##INFO=<ID=STRQ,Number=1,Type=Integer,Description="Phred-scaled quality that alt alleles in STRs are not polymerase slippage errors">
##INFO=<ID=TLOD,Number=A,Type=Float,Description="Log 10 likelihood ratio score of variant existing versus not existing">
-
I agree, it seems the whitepaper needs to be updated.
I've created a GitHub ticket at:
https://github.com/broadinstitute/gatk/issues/6965
Please feel free to add to it if I didn't capture everything.