Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data



How to interpret results of GATK Concordance


7 comments

  • Genevieve Brandt

    Hi Ana Marija, the GATK support team is focused on resolving questions about GATK tool-specific errors and abnormal results from the tools. For all other questions, such as this one, we are building a backlog to work through when we have the capacity.

    Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools.

    We cannot guarantee a reply; however, we ask other community members to help out if they know the answer.

    For context, check out our support policy.

     

  • David Benjamin

    Ana Marija, your output:

    type   TP   FP       FN       RECALL  PRECISION
    SNP    285  1876867  2535060  0.0     0.0
    INDEL  0    0        8542     0.0     0.0

    means that there were 285 SNPs found in both new.vcf and old.vcf (true positives), compared to 2535060 SNPs found in old.vcf but missing from new.vcf (false negatives). Because the former is so small relative to the latter, the recall, TP / (TP + FN), is zero to the default three decimal places. Likewise, the very large number of false positive SNPs (1876867 found in new.vcf but missing from old.vcf) gives a precision, TP / (TP + FP), of zero to three decimal places.
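
As a rough check of the arithmetic in this reply, here is a minimal Python sketch. It assumes the usual definitions recall = TP / (TP + FN) and precision = TP / (TP + FP), rounded to three decimal places. The function name `recall_precision` and the choice to report 0.0 when a denominator is zero are illustrative assumptions, not part of GATK.

```python
def recall_precision(tp, fp, fn, digits=3):
    """Return (recall, precision) rounded to `digits` decimal places.

    Assumption: a zero denominator is reported as 0.0, which matches
    the INDEL row above but is not taken from the GATK source.
    """
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return round(recall, digits), round(precision, digits)


# SNP row from the summary above: both values round down to zero.
snp_recall, snp_precision = recall_precision(tp=285, fp=1876867, fn=2535060)
print(snp_recall, snp_precision)

# INDEL row: no true positives at all.
indel_recall, indel_precision = recall_precision(tp=0, fp=0, fn=8542)
print(indel_recall, indel_precision)
```

The unrounded SNP recall is roughly 0.0001, so it only becomes visible with four or more decimal places.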

     

  • Ana Marija

    Hi David,

    Thank you so much! It seems that I had old.vcf on build 36 and new.vcf on build 37. I lifted old.vcf over to build 37 and repeated the analysis:

    gatk Concordance -R human_g1k_v37.fasta -eval new.vcf.gz --truth old.vcf.gz --summary summary.tsv

     

    and I got:

    type   TP      FP      FN       RECALL  PRECISION
    SNP    902899  974253  1631557  0.356   0.481
    INDEL  0       0       8926     0.0     0.0

     

    Can you please tell me if I am correct to interpret this as follows?

    There were 902899 SNPs found in both new.vcf and old.vcf (true positives), compared to 1631557 SNPs found in old.vcf but missing from new.vcf (false negatives), so the recall is 0.356. Likewise, the 974253 false positive SNPs give a precision of 0.481.

     

  • David Benjamin

    Ana Marija, you are correct.

  • Ana Marija

    Hi David,

    Thank you. I would also like to ask what exactly TP (true positive) means in this context. Is it as outlined here: http://broadinstitute.github.io/picard/picard-metric-definitions.html#GenotypeConcordanceContingencyMetrics

    "The list of contingency table values (TP, TN, FP, FN) that are deduced from the truth/call state comparison, given the reference. In general, we are comparing two sets of alleles. Therefore, we can have zero or more contingency table values represented in one comparison. For example, if the truthset is a heterozygous call with both alleles non-reference (HET_VAR1_VAR2), and the callset is a heterozygous call with both alleles non-reference with one of the alternate alleles matching an alternate allele in the callset, we would have a true positive, false positive, and false negative. The true positive is from the matching alternate alleles, the false positive is the alternate allele found in the callset but not found in the truthset, and the false negative is the alternate in the truthset not found in the callset. We also include a true negative in cases where the reference allele is found in both the truthset and callset."

     

  • David Benjamin

    Concordance is much simpler than that because it was designed for somatic validations. As such, the concept of a genotype doesn't really exist, and all it can do is compare the alt alleles present in the truth and evaluation VCFs. More precisely, it assumes that multiallelic truth variants are split into multiple biallelic VCF lines, and it considers a site to be a true positive when the first truth alt allele is present among the evaluation alt alleles. Any other alt alleles present in the evaluation VCF are ignored.
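
The rule described in this reply can be illustrated with a toy Python sketch. This is not GATK's implementation: the function name `classify_sites`, the use of plain dictionaries keyed by (chrom, pos, ref), and the choice to count an allele mismatch at a shared site as both a false negative and a false positive are all assumptions made for illustration.

```python
def classify_sites(truth, evaluation):
    """Count TP/FP/FN for two call sets.

    Both arguments map (chrom, pos, ref) -> list of alt alleles.
    A truth site is a true positive when its FIRST alt allele appears
    among the evaluation alt alleles at the same site; any other
    evaluation alt alleles are ignored.
    """
    counts = {"TP": 0, "FP": 0, "FN": 0}

    # Walk the truth sites: matched -> TP, otherwise -> FN.
    for site, truth_alts in truth.items():
        eval_alts = evaluation.get(site)
        if eval_alts is not None and truth_alts[0] in eval_alts:
            counts["TP"] += 1
        else:
            counts["FN"] += 1

    # Walk the evaluation sites: anything not matched above -> FP.
    # (Assumption: a mismatch at a shared site counts as FP here too.)
    for site, eval_alts in evaluation.items():
        truth_alts = truth.get(site)
        if truth_alts is None or truth_alts[0] not in eval_alts:
            counts["FP"] += 1

    return counts


# Hypothetical example data, not taken from the thread.
truth = {
    ("1", 100, "A"): ["G"],   # matched by the evaluation set
    ("1", 200, "C"): ["T"],   # same site, different alt allele
    ("1", 300, "G"): ["A"],   # missing from the evaluation set
}
evaluation = {
    ("1", 100, "A"): ["G", "T"],  # extra alt allele T is ignored
    ("1", 200, "C"): ["A"],
    ("1", 400, "T"): ["C"],       # not in the truth set
}
print(classify_sites(truth, evaluation))
```

Under these assumptions, the site at position 200 is not a true positive, because the first truth alt allele is absent from the evaluation alt alleles.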

  • Ana Marija

    Thank you! So does that mean that any SNP that has a mismatch falls into TP?

