Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Interpreting CrosscheckFingerprints metrics

12 comments

  • Alijah O'Connor

    Thanks for clarifying some of this, Bhanu Gandham. I would like to mention, though, that compared to many of the other GATK tools, I find the documentation of this tool quite confusing. I'm not sure how many users use this tool, but I think it would help those who do to either write a forum article for it or improve the documentation, perhaps with a few examples.

  • Genevieve Brandt (she/her)

    Hi Charlie Murphy,

    I am looking into the request again for better documentation, thank you for the feedback.

    I was able to find out more about LOD_SCORE_TUMOR_NORMAL versus the regular LOD_SCORE. The regular LOD_SCORE is the log odds ratio that the two samples have the same identity, assuming they are typical samples. If the two samples being compared have the same identity, but one is a tumor sample and one is a normal sample, you can use LOD_SCORE_TUMOR_NORMAL or LOD_SCORE_NORMAL_TUMOR. Tumor samples can have a loss of heterozygosity, which can make them appear as if they are an "impure" version of the sample. LOD_SCORE_TUMOR_NORMAL is the log odds ratio that the samples have the same identity when the first sample is a tumor sample and the second is a normal sample. LOD_SCORE_NORMAL_TUMOR is the opposite case: the first sample is a normal and the second is a tumor.

    You can also turn these off if you are not interested in that information. 
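
To make the three columns concrete, here is a minimal Python sketch that pulls them out of a Picard-style metrics table. It assumes the usual Picard layout (comment lines starting with `#`, then a tab-separated header row). The column names LEFT_GROUP_VALUE and RIGHT_GROUP_VALUE and the helper `read_crosscheck_metrics` are assumptions made for illustration, and the numbers are the ones quoted later in this thread.

```python
import csv

# Hypothetical excerpt of a CrosscheckFingerprints metrics file (real files have more columns).
SAMPLE_METRICS = (
    "## METRICS CLASS\tpicard.fingerprint.CrosscheckMetric\n"
    "LEFT_GROUP_VALUE\tRIGHT_GROUP_VALUE\tLOD_SCORE\t"
    "LOD_SCORE_TUMOR_NORMAL\tLOD_SCORE_NORMAL_TUMOR\n"
    "normal\ttumor\t-5949.581009\t-5354.395046\t4086.050812\n"
)

LOD_COLUMNS = ("LOD_SCORE", "LOD_SCORE_TUMOR_NORMAL", "LOD_SCORE_NORMAL_TUMOR")


def read_crosscheck_metrics(text):
    """Return one dict per comparison, with the three LOD columns parsed as floats."""
    # Drop blank lines and '#' comment lines; the first remaining line is the header row.
    lines = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("#")]
    rows = []
    for rec in csv.DictReader(lines, delimiter="\t"):
        row = {"left": rec["LEFT_GROUP_VALUE"], "right": rec["RIGHT_GROUP_VALUE"]}
        for col in LOD_COLUMNS:
            row[col] = float(rec[col])
        rows.append(row)
    return rows


rows = read_crosscheck_metrics(SAMPLE_METRICS)
for row in rows:
    print(row["left"], "vs", row["right"], {c: row[c] for c in LOD_COLUMNS})
```

Reading the three scores side by side for each left/right pair is usually more informative than looking at LOD_SCORE alone.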

    Hope this helps,

    Genevieve

  • Genevieve Brandt (she/her)

    Hi Charlie,

    I am not sure what you mean by the software assuming a sample is a tumor or normal. Could you clarify your question?

    The LOD_SCORE_TUMOR_NORMAL is the log odds that the identity of two samples is the same when the first sample is a tumor sample and the second is a normal sample. The opposite is true for the LOD_SCORE_NORMAL_TUMOR. This calculation accounts for the possibility that a tumor sample has lost heterozygosity relative to a normal sample of the same identity.

    Genevieve

  • Yossi Farjoun

    I think I understand the question...I'll try my hand at answering. 

    When a high-purity tumor sample undergoes loss of heterozygosity (LoH), many heterozygous SNPs will appear to be homozygous. (I think the high purity is where the "impure" issue stems from; see the note at the end.) When Crosscheck compares those SNPs to the normal sample, it will "think" that the two are different individuals, since many SNPs appear to differ.

    When LOD_SCORE_TUMOR_NORMAL/LOD_SCORE_NORMAL_TUMOR are calculated, the code allows heterozygous SNPs on the normal "side" to appear as homozygous with a small probability (it's an input argument). This normally resolves the issue described above, albeit at the cost of slightly less power (so for normal-normal comparisons these LOD scores are closer to zero).
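
The effect described above can be illustrated with a toy single-SNP model in Python. This is only a sketch under stated assumptions, not Picard's actual computation: it assumes Hardy-Weinberg genotype priors, made-up genotype likelihoods, and a hypothetical `p_loh` parameter standing in for the input argument mentioned above. The authoritative math is in the linked fingerprinting PDF.

```python
import math

GENOTYPES = ("HOM_REF", "HET", "HOM_VAR")


def genotype_priors(alt_freq):
    """Hardy-Weinberg genotype priors for one biallelic SNP (an assumption of this toy)."""
    ref = 1.0 - alt_freq
    return {"HOM_REF": ref * ref, "HET": 2 * ref * alt_freq, "HOM_VAR": alt_freq * alt_freq}


def loh_transition(normal_gt, tumor_gt, p_loh):
    """P(tumor genotype | normal genotype): only a HET may collapse to either HOM state."""
    if normal_gt == "HET":
        return 1.0 - p_loh if tumor_gt == "HET" else p_loh / 2.0
    return 1.0 if tumor_gt == normal_gt else 0.0


def site_lod(normal_lik, tumor_lik, alt_freq, p_loh=0.0):
    """log10 of P(data | same individual) / P(data | unrelated); p_loh=0 gives the plain LOD."""
    priors = genotype_priors(alt_freq)
    same = sum(
        priors[g] * normal_lik[g]
        * sum(loh_transition(g, t, p_loh) * tumor_lik[t] for t in GENOTYPES)
        for g in GENOTYPES
    )
    different = (
        sum(priors[g] * normal_lik[g] for g in GENOTYPES)
        * sum(priors[g] * tumor_lik[g] for g in GENOTYPES)
    )
    return math.log10(same / different)


# Made-up genotype likelihoods for a confidently called site.
confident_het = {"HOM_REF": 0.001, "HET": 0.998, "HOM_VAR": 0.001}
confident_hom_ref = {"HOM_REF": 0.998, "HET": 0.001, "HOM_VAR": 0.001}

# Normal is HET, pure tumor lost one allele and looks HOM_REF.
plain_loh = site_lod(confident_het, confident_hom_ref, alt_freq=0.5)
aware_loh = site_lod(confident_het, confident_hom_ref, alt_freq=0.5, p_loh=0.1)

# Both samples look HET (no LoH at this site).
plain_match = site_lod(confident_het, confident_het, alt_freq=0.5)
aware_match = site_lod(confident_het, confident_het, alt_freq=0.5, p_loh=0.1)

print("LoH site:      plain", plain_loh, "tumor-aware", aware_loh)
print("matching site: plain", plain_match, "tumor-aware", aware_match)
```

In this toy, the tumor-aware score penalizes the HET-to-HOM site far less than the plain score does, while at a matching site it is still positive but a little smaller, which mirrors the loss of power mentioned above.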

     

    Note: This is only needed for "pure" tumors because an impure tumor that underwent LoH will have enough reads from the "other" allele that the code will genotype those SNPs as heterozygous, or at least be rather agnostic between het and hom, and so the regular LOD_SCORE is not affected. Only for relatively "pure" tumor samples does LoH start hurting the LOD_SCORE results.

     

    The mathematical description of this can be found here: https://github.com/broadinstitute/picard/blob/master/docs/fingerprinting/main.pdf

     

     

  • Alijah O'Connor

    There are no descriptions for the fields at http://broadinstitute.github.io/picard/picard-metric-definitions.html#CrosscheckMetric either

  • Bhanu Gandham

    Hi Alijah O'Connor

     

    Thank you for the feedback and we do plan on fixing the descriptions for the fields at http://broadinstitute.github.io/picard/picard-metric-definitions.html#CrosscheckMetric

     

    With regards to your first question: since --INPUT combined_merged_preprocessed.bam is a single file, CrosscheckFingerprints will crosscheck all samples inside that BAM. Since your input contains two samples (named normal and tumor), the LEFT_ and RIGHT_ fields on each output line identify which samples are being crosschecked. The LOD score indicates how likely it is that the samples match, and the tumor-aware LOD scores help assess identity in the presence of severe loss of heterozygosity.

    Each comparison is between a "left" group of inputs and a "right" group. For example, a group could be 1) a read group, 2) all the read groups that belong to a certain sample, or 3) everything from a certain file. This grouping is controlled by the CROSSCHECK_BY argument. Once you know what the groups are, the group value is simply something that identifies that group: for example, 1) a read group is identified by its PU field, 2) a sample by its sample ID, and 3) a file by its path.
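
The grouping idea can be sketched in a few lines of Python. This is an illustration only: the read-group records, the `GROUP_KEY` mapping, and the mode names READGROUP, SAMPLE and FILE are assumptions chosen to mirror the three cases above, and `group_read_groups` is a hypothetical helper rather than anything the tool exposes.

```python
from collections import defaultdict

# Hypothetical read groups from a single merged BAM holding two samples.
READ_GROUPS = [
    {"PU": "FC1.1", "SM": "normal", "file": "combined_merged_preprocessed.bam"},
    {"PU": "FC1.2", "SM": "normal", "file": "combined_merged_preprocessed.bam"},
    {"PU": "FC2.1", "SM": "tumor", "file": "combined_merged_preprocessed.bam"},
]

# Which field supplies the group value for each grouping mode (assumed names).
GROUP_KEY = {"READGROUP": "PU", "SAMPLE": "SM", "FILE": "file"}


def group_read_groups(read_groups, crosscheck_by):
    """Map each group value to the PU fields of the read groups it contains."""
    key = GROUP_KEY[crosscheck_by]
    groups = defaultdict(list)
    for rg in read_groups:
        groups[rg[key]].append(rg["PU"])
    return dict(groups)


for mode in GROUP_KEY:
    print(mode, group_read_groups(READ_GROUPS, mode))
```

Each key of the returned dict plays the role of a group value, so grouping by sample yields two groups (normal and tumor) even though everything comes from one file.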
     
  • Charlie Murphy

    I just want to chime in here. I had some of the same concerns. Specifically, how are LOD_SCORE_TUMOR_NORMAL and LOD_SCORE_NORMAL_TUMOR different from LOD_SCORE? Are the first two only looking at homozygous sites or something? Thank you.

  • Charlie Murphy

    Genevieve Brandt (she/her)

    Thanks so much for your reply, it is very helpful. It leads to another question though. How exactly does the software "assume" a sample is a tumor or a normal? I am just trying to see how that fits within the statistical framework presented in the CrossCheck paper published last year. My understanding is that the framework assumes no contamination, so I don't understand how it would assume one sample is an "impure" version of another. Sorry to bother you about this, and maybe it is too much to ask of typical GATK documentation, but I just want to know more of the algorithm details. Let me know, thanks!

  • Charlie Murphy

    Yossi Farjoun, awesome, that is exactly what I was looking for. That answers my question, thanks! And thank you Genevieve Brandt (she/her) as well!

  • 鹰崖

    Are the normal sample and the tumor sample from the same patient when LOD_SCORE is -5949.581009, LOD_SCORE_TUMOR_NORMAL is -5354.395046, but LOD_SCORE_NORMAL_TUMOR is 4086.050812?

  • Yossi Farjoun

    Those are really high LOD scores!

    I'm guessing that you're using the large HaplotypeDatabase....or that you have really really deep coverage....

     

    Anyhow, my interpretation of the results is that the "left" sample is the normal and the "right" sample is the tumor, and that they did come from the same individual, but that the tumor is high purity and suffered some serious LoH on several chromosomes.
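
That reading of the three scores can be written down as a small Python rule of thumb. It is a sketch, not the tool's own decision logic: the function `interpret_pair` is hypothetical, and the default `threshold` of 0.0 is an assumption. In practice the magnitudes matter too, and scores near zero are inconclusive.

```python
def interpret_pair(lod, lod_tumor_normal, lod_normal_tumor, threshold=0.0):
    """Rough reading of one left/right comparison from its three LOD scores."""
    if lod > threshold:
        return "same individual (plain LOD_SCORE supports a match)"
    if lod_normal_tumor > threshold and lod_normal_tumor >= lod_tumor_normal:
        return "same individual; left looks normal, right looks like a tumor with LoH"
    if lod_tumor_normal > threshold:
        return "same individual; left looks like a tumor with LoH, right looks normal"
    return "no support for a shared identity"


# The scores from the question above.
verdict = interpret_pair(-5949.581009, -5354.395046, 4086.050812)
print(verdict)
```

Only the normal-tumor score is positive for those values, which is why the left sample is read as the normal and the right as the tumor.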

     

    You can validate this by running CheckFingerprints (instead of Crosscheck), though you'll need a VCF for the "normal" sample, and looking at the individual SNPs (in the detailed output). What you are looking for is that whenever there's a mismatch between the expected and the found genotypes, the expected is always HET and the found is always HOM (either ref or var). Also, these mismatches should fall on a collection of SNPs that are all in the same region (chromosome arm, etc.).
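
The check described above can be sketched in Python once the per-SNP genotypes are in hand. The record layout here (keys `chrom`, `expected`, `observed`, genotypes as two-letter allele strings) and the example records are hypothetical and made up for illustration. They are not the tool's actual output format, so adapt the field names to whatever your detailed output contains.

```python
from collections import Counter


def is_het(genotype):
    """A two-allele genotype string such as 'AG' is heterozygous if its alleles differ."""
    return len(set(genotype)) == 2


def summarize_mismatches(records):
    """Count mismatching SNPs, test the HET-to-HOM pattern, and tally them per chromosome."""
    # Compare sorted alleles so that 'AG' and 'GA' count as the same genotype.
    mismatches = [r for r in records if sorted(r["expected"]) != sorted(r["observed"])]
    het_to_hom = [
        r for r in mismatches if is_het(r["expected"]) and not is_het(r["observed"])
    ]
    return {
        "mismatches": len(mismatches),
        "all_het_to_hom": bool(mismatches) and len(het_to_hom) == len(mismatches),
        "by_chrom": Counter(r["chrom"] for r in mismatches),
    }


# Made-up records: three HET-to-HOM mismatches on one chromosome, two matching sites.
records = [
    {"chrom": "chr9", "expected": "AG", "observed": "AA"},
    {"chrom": "chr9", "expected": "CT", "observed": "TT"},
    {"chrom": "chr9", "expected": "AC", "observed": "CC"},
    {"chrom": "chr2", "expected": "AG", "observed": "GA"},
    {"chrom": "chr5", "expected": "TT", "observed": "TT"},
]

summary = summarize_mismatches(records)
print(summary)
```

A result where every mismatch is HET-to-HOM and the mismatches pile up on one chromosome or arm is the pattern that points to LoH rather than a sample swap.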

     

    Alternatively, you could have a contaminated haploid sample...but that's really exotic....

     

     

     

  • 鹰崖

    Yes, we are using the large HaplotypeDatabase.

    Thanks Yossi Farjoun, you answered my question.
