How to interpret results of GATK Concordance
If you are seeing an error, please provide (REQUIRED):
a) GATK version used:
b) Exact command used:
./gatk Concordance \
-R human_g1k_v37.fasta \
-eval new.vcf \
--truth old.vcf \
--summary summary.tsv
c) Entire error log:
None
If not an error, choose a category for your question(REQUIRED):
b) What does my output mean?
type    TP    FP         FN         RECALL    PRECISION
SNP     285   1876867    2535060    0.0       0.0
INDEL   0     0          8542       0.0       0.0
How can I interpret this result?
-
Hi Ana Marija, the GATK support team is focused on resolving questions about GATK tool-specific errors and abnormal results from the tools. For all other questions, such as this one, we are building a backlog to work through when we have the capacity.
Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools.
We cannot guarantee a reply; however, we ask other community members to help out if they know the answer.
For context, check out our support policy.
-
Ana Marija, your output:
type    TP    FP         FN         RECALL    PRECISION
SNP     285   1876867    2535060    0.0       0.0
INDEL   0     0          8542       0.0       0.0
means that 285 SNPs were found in both new.vcf and old.vcf (true positives), compared with 2535060 SNPs found in old.vcf but missing from new.vcf (false negatives). Because the former is so small relative to the latter, the recall rounds to zero at the three decimal places reported. Likewise, the very large number of false positive SNPs (1876867, found in new.vcf but not in old.vcf) gives a precision that rounds to zero at three decimal places.
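For reference, the two ratios follow the standard definitions recall = TP / (TP + FN) and precision = TP / (TP + FP). The short Python sketch below applies them to the SNP counts above to show why both values print as 0.0. The function name `recall_precision` is illustrative only and not part of GATK, and returning 0.0 for an empty denominator is an assumption chosen to mirror the INDEL row shown.

```python
def recall_precision(tp, fp, fn):
    """Return (recall, precision) using the standard definitions."""
    # Assumption: report 0.0 when a denominator is zero, as the INDEL row does
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# SNP counts from the summary above
recall, precision = recall_precision(tp=285, fp=1876867, fn=2535060)

# Unrounded values are tiny but not exactly zero
print(f"recall={recall:.6f} precision={precision:.6f}")
# Rounded to three decimal places, both become zero
print(f"recall={recall:.3f} precision={precision:.3f}")
```

Both ratios are on the order of 0.0001, so any report limited to three decimal places shows them as zero.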
-
Hi David,
thank you so much! It seems that I had old.vcf on build 36 and new.vcf on build 37. I lifted old.vcf to build 37 and repeated the analysis.
gatk Concordance -R human_g1k_v37.fasta -eval new.vcf.gz --truth old.vcf.gz --summary summary.tsv
and I got:
type    TP        FP        FN         RECALL    PRECISION
SNP     902899    974253    1631557    0.356     0.481
INDEL   0         0         8926       0.0       0.0
Can you please tell me if I am correct to interpret this as:
This means that there were 902899 SNPs found in both new.vcf and old.vcf (true positives), compared to 1631557 SNPs found in old.vcf but missing from new.vcf (false negatives), which gives a recall of 0.356. Likewise, the 974253 false positive SNPs give a precision of 0.481.
-
Ana Marija You are correct.
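As a cross-check, the reported values can be recomputed from the counts. The sketch below is a rough illustration, assuming the summary file is tab-separated with the header shown in this thread. It embeds the rows as a string rather than reading summary.tsv, and `recompute` is a hypothetical helper name.

```python
import csv
import io

# Rows copied from the thread; assumed to be tab-separated as in summary.tsv
SUMMARY = (
    "type\tTP\tFP\tFN\tRECALL\tPRECISION\n"
    "SNP\t902899\t974253\t1631557\t0.356\t0.481\n"
    "INDEL\t0\t0\t8926\t0.0\t0.0\n"
)


def recompute(summary_text):
    """Recompute (recall, precision) per variant type from TP/FP/FN counts."""
    results = {}
    reader = csv.DictReader(io.StringIO(summary_text), delimiter="\t")
    for row in reader:
        tp, fp, fn = int(row["TP"]), int(row["FP"]), int(row["FN"])
        # Assumption: 0.0 when a denominator is zero, matching the INDEL row
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        results[row["type"]] = (round(recall, 3), round(precision, 3))
    return results


results = recompute(SUMMARY)
for variant_type, (recall, precision) in results.items():
    print(variant_type, recall, precision)
```

The recomputed SNP values agree with the RECALL and PRECISION columns. For INDEL, TP and FP are both 0 while FN is 8926, which suggests that no indels from new.vcf were counted at all.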
-
Hi David,
thank you. I would also like to ask what exactly TP (true positive) means in this context. Is it as outlined here: http://broadinstitute.github.io/picard/picard-metric-definitions.html#GenotypeConcordanceContingencyMetrics
"The list of contingency table values (TP, TN, FP, FN) that are deduced from the truth/call state comparison, given the reference. In general, we are comparing two sets of alleles. Therefore, we can have zero or more contingency table values represented in one comparison. For example, if the truthset is a heterozygous call with both alleles non-reference (HET_VAR1_VAR2), and the callset is a heterozygous call with both alleles non-reference with one of the alternate alleles matching an alternate allele in the callset, we would have a true positive, false positive, and false negative. The true positive is from the matching alternate alleles, the false positive is the alternate allele found in the callset but not found in the truthset, and the false negative is the alternate in the truthset not found in the callset. We also include a true negative in cases where the reference allele is found in both the truthset and callset."
-
Concordance is much simpler than that because it was designed for somatic validations. As such, the concept of genotype doesn't really exist, and all it can do is compare the alt alleles present in the truth and evaluation VCFs. More precisely, it assumes that multiallelic truth variants are split into multiple biallelic VCF lines and considers a site to be a true positive when the first truth alt allele is present among the evaluation alt alleles. If other alt alleles are present in the evaluation VCF, they are ignored.
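The rule described above can be sketched in a few lines of Python. This only illustrates the description in this reply and is not GATK's actual code. The function name `is_true_positive` is hypothetical, and the sketch assumes both records have already been matched to the same site.

```python
def is_true_positive(truth_alts, eval_alts):
    """Apply the rule as described: only the first truth alt allele is checked.

    truth_alts: alt alleles of the truth record (assumed biallelic, so one entry)
    eval_alts:  alt alleles of the evaluation record at the same site
    """
    if not truth_alts:
        return False
    # Extra alt alleles in the evaluation record are ignored
    return truth_alts[0] in eval_alts


# Truth alt T, evaluation has T plus an extra alt G: counted as a true positive
print(is_true_positive(["T"], ["T", "G"]))
# Truth alt T, evaluation has only G: not a true positive under this rule
print(is_true_positive(["T"], ["G"]))
```

Under this reading, a site where the evaluation alt alleles do not include the truth alt allele is not a true positive. How such a site is then split between FP and FN is not stated in this thread.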
-
Thank you! So does that mean that any SNP that has a mismatch falls into TP?