Interpreting CrosscheckFingerprints metrics
For my somatic-caller workflow, I am exploring using the CrosscheckFingerprints tool to determine if either sample of a tumor-normal paired group has been swapped (i.e. the fastqs are not actually coming from the same individual). However, despite reading the CrosscheckFingerprints documentation over and over, I'm not understanding the output metrics:
# CrosscheckFingerprints --INPUT combined_merged_preprocessed.bam --OUTPUT crosscheck_fingerprint.bam.crosscheck_metrics --HAPLOTYPE_MAP hapmap_3.3.norm.snp.no_dup.vcf.gz --NUM_THREADS 4 --CROSSCHECK_MODE CHECK_SAME_SAMPLE --LOD_THRESHOLD 0.0 --CROSSCHECK_BY READGROUP --CALCULATE_TUMOR_AWARE_RESULTS true --ALLOW_DUPLICATE_READS false --GENOTYPING_ERROR_RATE 0.01 --OUTPUT_ERRORS_ONLY false --LOSS_OF_HET_RATE 0.5 --EXPECT_ALL_GROUPS_TO_MATCH false --EXIT_CODE_WHEN_MISMATCH 1 --EXIT_CODE_WHEN_NO_VALID_CHECKS 1 --TEST_INPUT_READABILITY true --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
# Started on: Tue Mar 10 15:34:45 MDT 2020
## METRICS CLASS picard.fingerprint.CrosscheckMetric
LEFT_GROUP_VALUE RIGHT_GROUP_VALUE RESULT DATA_TYPE LOD_SCORE LOD_SCORE_TUMOR_NORMAL LOD_SCORE_NORMAL_TUMOR LEFT_RUN_BARCODE LEFT_LANE LEFT_MOLECULAR_BARCODE_SEQUENCE LEFT_LIBRARY LEFT_SAMPLE LEFT_FILE RIGHT_RUN_BARCODE RIGHT_LANE RIGHT_MOLECULAR_BARCODE_SEQUENCE RIGHT_LIBRARY RIGHT_SAMPLE RIGHT_FILE
-1114901620 -1114901620 EXPECTED_MATCH READGROUP 19530.898113 15011.13605 15008.272581 ? -1 ? normal normal file:///combined_merged_preprocessed.bam ? -1 ? normal normal file:///combined_merged_preprocessed.bam
-1114901620 1828819356 UNEXPECTED_MATCH READGROUP 13854.437171 10162.261509 10125.668588 ? -1 ? normal normal file:///combined_merged_preprocessed.bam ? -1 ? tumor tumor file:///combined_merged_preprocessed.bam
1828819356 -1114901620 UNEXPECTED_MATCH READGROUP 13929.749602 10128.707783 10160.202878 ? -1 ? tumor tumor file:///combined_merged_preprocessed.bam ? -1 ? normal normal file:///combined_merged_preprocessed.bam
1828819356 1828819356 EXPECTED_MATCH READGROUP 19169.181061 14638.585781 14636.201766 ? -1 ? tumor tumor file:///combined_merged_preprocessed.bam ? -1 ? tumor tumor file:///combined_merged_preprocessed.bam
These bams come from tumor-normal fastqs that I know have come from the same individual. As for the metrics, are the LEFT_GROUP_VALUE and RIGHT_GROUP_VALUE fields just identifiers? It looks like the first and last rows (EXPECTED_MATCH) are the normal-normal and tumor-tumor comparisons, respectively, while the middle two rows (UNEXPECTED_MATCH) are the normal-tumor and tumor-normal comparisons, respectively. If this is the case, then I should probably be most concerned with the middle two rows, and in particular the LOD_SCORE? According to the documentation, this score is the log-likelihood of coming from the same sample.
Therefore, in the example above, the likelihoods listed for my tumor-normal pairs are 10^13854 and 10^13929, respectively. Am I interpreting these metrics correctly?
There are no descriptions for the fields at http://broadinstitute.github.io/picard/picard-metric-definitions.html#CrosscheckMetric either
Thank you for the feedback and we do plan on fixing the descriptions for the fields at http://broadinstitute.github.io/picard/picard-metric-definitions.html#CrosscheckMetric
With regards to your first question: given that --INPUT combined_merged_preprocessed.bam includes a single file, CrosscheckFingerprints will crosscheck all samples inside the BAM. Since you have two samples in your input (named normal and tumor), the LEFT_ and RIGHT_ fields on each line of the output identify which samples are being crosschecked. The LOD score indicates how likely it is that the samples match, and the tumor-aware LOD scores help assess identity in the presence of severe loss of heterozygosity.
Each comparison is between a "left" group of inputs and a "right" group. For example, a group could be 1) a read group, or 2) all the read groups that belong to a certain sample, or 3) everything from a certain file. This grouping is controlled by the CROSSCHECK_BY argument. Once you know what the groups are, the group value is simply something that identifies that group: for example, 1) a read group is identified by its PU field, 2) a sample by its sample ID, and 3) a file by its path.
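To make the left/right pairing easier to read, here is a minimal sketch of pulling the sample names, result and LOD score out of a metrics file. It assumes the usual Picard metrics layout (tab-delimited, with header lines starting with "#"); the function names are my own, and the example text is an abbreviated version of the output posted above, not a complete file.

```python
import csv
import io


def read_crosscheck_metrics(text):
    """Parse the text of a Picard-style metrics file into a list of dict rows."""
    # Drop comment lines (starting with '#') and blank lines; the first
    # remaining line is treated as the column header.
    lines = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("#")]
    return list(csv.DictReader(io.StringIO("\n".join(lines)), delimiter="\t"))


def summarize(rows):
    """Return (left_sample, right_sample, result, lod) for each comparison."""
    return [
        (r["LEFT_SAMPLE"], r["RIGHT_SAMPLE"], r["RESULT"], float(r["LOD_SCORE"]))
        for r in rows
    ]


# Abbreviated example: only four of the columns, two of the rows.
example = "\n".join([
    "## METRICS CLASS\tpicard.fingerprint.CrosscheckMetric",
    "\t".join(["LEFT_SAMPLE", "RIGHT_SAMPLE", "RESULT", "LOD_SCORE"]),
    "\t".join(["normal", "normal", "EXPECTED_MATCH", "19530.898113"]),
    "\t".join(["normal", "tumor", "UNEXPECTED_MATCH", "13854.437171"]),
])

for left, right, result, lod in summarize(read_crosscheck_metrics(example)):
    print(f"{left} vs {right}: {result} (LOD {lod:.1f})")
```

For a real file you would pass the file's contents, e.g. open("crosscheck_fingerprint.bam.crosscheck_metrics").read(), instead of the example string.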
Thanks for clarifying some of this Bhanu Gandham, though I would like to mention that compared to many of the other GATK tools, I find the documentation of this tool to be quite confusing. I'm not sure how many users use this tool, but I think it would help those who do to either write a forum article for it or improve the documentation, perhaps with a few examples.
I just want to chime in here. I had some of the same concerns. Specifically, how are LOD_SCORE_TUMOR_NORMAL and LOD_SCORE_NORMAL_TUMOR different from LOD_SCORE? Are the first two only looking at homozygous sites or something? Thank you.
Hi Charlie Murphy,
I am looking into the request again for better documentation, thank you for the feedback.
I was able to find out more information about the LOD_SCORE_TUMOR_NORMAL vs the normal LOD_SCORE. The normal LOD_SCORE is the log odds ratio that the identity of the two samples is the same, if they are typical samples. If the two samples that are being compared have the same identity, but one is a tumor sample and one is a normal sample, you can use the LOD_SCORE_TUMOR_NORMAL or the LOD_SCORE_NORMAL_TUMOR. Tumor samples can have a loss of heterozygosity which can make them appear as if they are an "impure" version of the sample. The LOD_SCORE_TUMOR_NORMAL is the log odds ratio that the samples have the same identity, but the first sample is a tumor sample and the second is a normal sample. The LOD_SCORE_NORMAL_TUMOR is the opposite, if the first sample is a normal and the second is a tumor sample.
You can also turn these off if you are not interested in that information.
Hope this helps,
Thanks so much for your reply, it is very helpful. It leads to another question though. How exactly does the software "assume" a sample is a tumor or a normal? I am just trying to see how that fits within the statistical framework presented in the CrossCheck paper published last year. My understanding is that the framework assumes no contamination, so I don't understand how it would assume one sample is an "impure" version of another. Sorry for bothering you about this, and maybe it is too much to ask for typical GATK documentation, but I just want to know more details about the algorithm. Let me know, thanks!
I am not sure what you mean by the software assuming a sample is a tumor or normal. Could you clarify your question?
The LOD_SCORE_TUMOR_NORMAL is the log odds that the identity of two samples is the same but the first sample is a tumor sample and the second is a normal sample. The opposite is true for the LOD_SCORE_NORMAL_TUMOR. This calculation adds the aspect of tumor samples having a loss of heterozygosity from normal samples of the same identity.
I think I understand the question...I'll try my hand at answering.
When a high-purity (I think that this is where the "impure" issue is stemming from, see the note at the end) tumor sample undergoes Loss of Heterozygosity, many heterozygous SNPs will appear to be homozygous. When Crosscheck compares those SNPs to the normal sample it will "think" that the two are different individuals, since many SNPs appear to be different.
When LOD_SCORE_TUMOR_NORMAL/LOD_SCORE_NORMAL_TUMOR are calculated, the code allows for heterozygous SNPs on the normal "side" to appear as homozygous with a small probability (it's an input argument). This normally resolves the issue described above, albeit at the cost of slightly less power (so for normal-normal comparisons the LOD scores are closer to zero).
Note: The reason this is only needed for "pure" tumors is that an impure tumor that underwent LoH will have enough reads from the "other" allele that the code will genotype those SNPs as heterozygous, or at least be rather agnostic between het and hom, and thus it will not affect the regular LOD_SCORE. Only for relatively "pure" tumor samples does the LoH start hurting the LOD_SCORE results.
The mathematical description of this can be found here: https://github.com/broadinstitute/picard/blob/master/docs/fingerprinting/main.pdf
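To give a feel for the effect described above, here is a toy single-SNP sketch. It is not Picard's implementation: it assumes hard genotype calls with a flat error rate, Hardy-Weinberg priors, and a made-up transition model in which a het normal genotype becomes either homozygous genotype in the tumor with total probability loh. The parameters err and loh are only loosely modelled on GENOTYPING_ERROR_RATE and LOSS_OF_HET_RATE, and all function names are hypothetical.

```python
import math

GENOTYPES = ("hom_ref", "het", "hom_var")


def priors(p):
    """Hardy-Weinberg genotype priors for alt-allele frequency p."""
    return {"hom_ref": (1 - p) ** 2, "het": 2 * p * (1 - p), "hom_var": p ** 2}


def p_obs(observed, true, err):
    """Toy error model: the called genotype is wrong with probability err."""
    return 1 - err if observed == true else err / 2


def p_tumor_given_normal(tumor, normal, loh):
    """Toy LoH model: a het normal site turns homozygous with probability loh."""
    if normal == "het":
        return 1 - loh if tumor == "het" else loh / 2
    return 1.0 if tumor == normal else 0.0


def lod(obs_normal, obs_tumor, p=0.5, err=0.01, loh=0.0):
    """log10 of P(calls | same individual) / P(calls | different individuals).

    loh=0.0 gives the ordinary score; loh>0 gives a tumor-aware variant.
    The denominator ignores LoH, which is a simplification of this sketch.
    """
    prior = priors(p)
    same = sum(
        prior[g]
        * p_obs(obs_normal, g, err)
        * sum(p_tumor_given_normal(t, g, loh) * p_obs(obs_tumor, t, err) for t in GENOTYPES)
        for g in GENOTYPES
    )
    marg_normal = sum(prior[g] * p_obs(obs_normal, g, err) for g in GENOTYPES)
    marg_tumor = sum(prior[g] * p_obs(obs_tumor, g, err) for g in GENOTYPES)
    return math.log10(same / (marg_normal * marg_tumor))


# A site that is het in the normal but looks hom-ref in a pure tumor after LoH:
print("ordinary   :", lod("het", "hom_ref"))
print("tumor-aware:", lod("het", "hom_ref", loh=0.5))
# A site that is het in both samples (where the tumor-aware model loses power):
print("ordinary   :", lod("het", "het"))
print("tumor-aware:", lod("het", "het", loh=0.5))
```

In this toy model the het-to-hom site counts strongly against a match under the ordinary score but is close to neutral under the tumor-aware one, while a het-het site contributes less evidence for a match once LoH is allowed. The real scores sum contributions like these over every haplotype block in the map.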
Yossi Farjoun Awesome, that is exactly what I was looking for. That answers my question, thanks! And thank you Genevieve Brandt (she/her) as well!
Are normal sample and tumor sample from the same patient when LOD_SCORE is -5949.581009, LOD_SCORE_TUMOR_NORMAL is -5354.395046, but LOD_SCORE_NORMAL_TUMOR is 4086.050812?
Those are really high LOD scores!
I'm guessing that you're using the large HaplotypeDatabase....or that you have really really deep coverage....
Anyhow, my interpretation of the results is that the "left" sample is the normal and the "right" sample is the tumor, and that they did come from the same individual, but that the tumor is high purity and suffered some serious LoH on several chromosomes.
You can validate this by running CheckFingerprint (instead of Crosscheck), though you'll need a VCF for the "normal" sample, and looking at the individual SNPs in the detailed output. What you are looking for is that when there's a mismatch between the expected and the found genotypes, the expected is always HET and the found is always HOM (either ref or var). Also, these mismatches should happen on a collection of SNPs that are all in the same region (chromosome arm etc.)
Alternatively, you could have a contaminated haploid sample...but that's really exotic....
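The check described above can be sketched roughly as follows. The record layout (chrom, pos, expected, observed) and the genotype labels are hypothetical placeholders for illustration, not the actual columns of the CheckFingerprint detail output, and the example records are invented.

```python
from collections import Counter

HOM = {"HOM_REF", "HOM_VAR"}


def loh_pattern(records):
    """Summarize genotype mismatches between expected and observed calls.

    Returns (all_het_to_hom, mismatches_per_chromosome), where all_het_to_hom
    is True only if there is at least one mismatch and every mismatch is an
    expected HET that was observed as homozygous.
    """
    mismatches = [r for r in records if r["expected"] != r["observed"]]
    all_het_to_hom = bool(mismatches) and all(
        r["expected"] == "HET" and r["observed"] in HOM for r in mismatches
    )
    per_chrom = Counter(r["chrom"] for r in mismatches)
    return all_het_to_hom, per_chrom


# Invented records, for illustration only.
records = [
    {"chrom": "chr9", "pos": 1_000_000, "expected": "HET", "observed": "HOM_REF"},
    {"chrom": "chr9", "pos": 2_500_000, "expected": "HET", "observed": "HOM_VAR"},
    {"chrom": "chr9", "pos": 4_000_000, "expected": "HOM_REF", "observed": "HOM_REF"},
    {"chrom": "chr2", "pos": 7_000_000, "expected": "HET", "observed": "HET"},
]

consistent_with_loh, per_chrom = loh_pattern(records)
print("all mismatches are HET -> HOM:", consistent_with_loh)
print("mismatches per chromosome:", dict(per_chrom))
```

If the mismatches include HOM expected and HET (or the other HOM) found, or are scattered evenly across the genome, a sample swap becomes the more likely explanation than LoH.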
Yes, we are using the large HaplotypeDatabase.
Thanks Yossi Farjoun, you answered my question.