Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Should joint-calling be performed for the control and the disease group separately?

Answered
6 comments

  • Philipp Hähnel

    Hi Jiayi,

    If by joint calling you mean multi-sample variant calling with Mutect2, that is done per patient, not per cohort. Multi-sample calling pools evidence for a variant across a patient's samples and therefore has more power to detect variants in that patient.

    Please read the best practices tutorial.

    Best,

    Philipp

  • Jiayi Zhao

    Hi Philipp,

    Thanks for your reply. 

    Actually, I am using HaplotypeCaller, and I am going to try GenotypeGVCFs. Is this a good choice? And should I run joint calling on the disease and control groups separately?

    Best,

    Jiayi

  • Philipp Hähnel

    Hi Jiayi,

    Are you interested in germline variants or somatic variants? For the former, use HaplotypeCaller; for the latter, Mutect2.

    Are the disease and control samples patient-matched? If yes, you can use them as tumor-normal pairs in Mutect2 to filter out germline variants found in the controls.

    Joint calling should only ever be done for multiple samples coming from the same patient. EDIT: this is certainly true for somatic calling. Upon reading the documentation for germline calling again, you can run that in cohort mode on multiple patients. If you are interested in which germline variants may be responsible for the disease, then to maximize power I'd run it in two batches: the case batch and the control batch. Maybe someone from the GATK team who is more familiar with germline calling could elaborate on that?

    Best,

    Philipp

  • Genevieve Brandt (she/her)

    Thanks so much for posting your insight here, Philipp Hähnel! I would recommend that Jiayi Zhao run the 50 disease and 20 control samples together, because joint calling them in a single run gives the workflow more statistical power to make better calls. You will get one joint-called VCF. If you want the calls separated by group, you can split the VCF with SelectVariants.

  • Felix Fisher

    Thanks for the discussion above. What post-analysis would you recommend after joint calling?

    I have 6 cases and 2 controls, and so far the variants have been called individually per sample. Since I am looking for variants underlying a dominant disease, I can simply use an overlap strategy to find variants present only in the cases. If the variants were instead called with cases and controls mixed together, how would I carry out further analysis to find the variants associated with the disease? Thanks in advance!

    Best,

    Felix

  • Laura Gauthier

    Hi Felix Fisher,

    We like to recommend the Genotype Refinement workflow after joint calling: https://gatk.broadinstitute.org/hc/en-us/articles/360035531432-Genotype-Refinement-workflow-for-germline-short-variants If any of your samples are related, that pipeline can use a pedigree file to improve genotype calls. Population priors (e.g. from 1000 Genomes) can be used to reduce false positives, but the flip side is that more alternate reads will be required to support a de novo call.

    Your cohort is on the small side for a rare-variant association study (RVAS), but if you're just looking at each variant separately, you could do a couple of things. With VariantAnnotator, you can use the -A SampleList annotation to list the samples in which the variant is called. Then, after running VariantsToTable on that output, you could use a Python or R script to find variants whose SampleList doesn't include any of the controls. You could also use a JEXL query to filter the VCF in a similar way, but JEXL can be a little difficult to use because it is whitespace-sensitive and very particular about types.

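
For readers who want to see the shape of the route Genevieve Brandt recommends above (joint-call cases and controls together, then split by group with SelectVariants), here is a minimal sketch that only builds the GATK4 command lines and does not run them. It assumes each sample already has a GVCF from HaplotypeCaller in GVCF mode; the file names, sample names, and the helper functions `joint_calling_commands` and `split_by_group_commands` are hypothetical. CombineGVCFs is used here for simplicity; for larger cohorts GATK documents GenomicsDBImport as the consolidation step instead.

```python
import shlex
from typing import Dict, List


def joint_calling_commands(reference: str, gvcfs: List[str],
                           out_prefix: str) -> List[List[str]]:
    """Build (without running) GATK4 commands that joint-genotype per-sample GVCFs."""
    combined = f"{out_prefix}.g.vcf.gz"
    combine = ["gatk", "CombineGVCFs", "-R", reference]
    for gvcf in gvcfs:  # one -V per sample, cases and controls together
        combine += ["-V", gvcf]
    combine += ["-O", combined]
    genotype = ["gatk", "GenotypeGVCFs", "-R", reference,
                "-V", combined, "-O", f"{out_prefix}.vcf.gz"]
    return [combine, genotype]


def split_by_group_commands(reference: str, joint_vcf: str,
                            groups: Dict[str, List[str]]) -> List[List[str]]:
    """Build one SelectVariants command per group, keeping only that group's samples."""
    commands = []
    for group, samples in groups.items():
        cmd = ["gatk", "SelectVariants", "-R", reference, "-V", joint_vcf]
        for sample in samples:  # -sn may be repeated, once per sample to keep
            cmd += ["-sn", sample]
        cmd += ["-O", f"{group}.vcf.gz"]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Hypothetical inputs: two case GVCFs and one control GVCF.
    gvcfs = ["case1.g.vcf.gz", "case2.g.vcf.gz", "ctrl1.g.vcf.gz"]
    groups = {"cases": ["case1", "case2"], "controls": ["ctrl1"]}
    all_cmds = joint_calling_commands("ref.fasta", gvcfs, "cohort")
    all_cmds += split_by_group_commands("ref.fasta", "cohort.vcf.gz", groups)
    for cmd in all_cmds:
        print(shlex.join(cmd))
```

Splitting after joint genotyping keeps the statistical benefit of calling all samples together while still giving per-group files for downstream comparison.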

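The script-based filtering Laura Gauthier describes could look roughly like the sketch below. It reads a tab-separated table of the kind VariantsToTable produces and keeps variants whose carrier list contains no control sample. Several details are assumptions to check against your own data: the column holding the SampleList annotation is taken to be named `Samples`, carriers are taken to be comma-separated, and a missing value is taken to appear as `NA`. The function name `case_only_variants`, the `min_cases` parameter, and the sample names are hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, Set


def case_only_variants(rows: Iterable[Dict[str, str]],
                       cases: Set[str],
                       controls: Set[str],
                       sample_column: str = "Samples",  # assumed column name
                       min_cases: int = 1) -> Iterator[Dict[str, str]]:
    """Yield rows carried by at least `min_cases` cases and by no control."""
    for row in rows:
        raw = row.get(sample_column) or ""
        carriers = {name.strip() for name in raw.split(",")}
        carriers -= {"", "NA"}  # assumed placeholders for "no samples listed"
        if carriers & controls:
            continue  # seen in at least one control: drop
        if len(carriers & cases) >= min_cases:
            yield row


if __name__ == "__main__":
    # Tiny in-memory stand-in for a VariantsToTable output file.
    table = (
        "CHROM\tPOS\tREF\tALT\tSamples\n"
        "chr1\t100\tA\tG\tcase1,case2\n"
        "chr1\t200\tC\tT\tcase1,ctrl1\n"
        "chr2\t300\tG\tA\tctrl2\n"
        "chr2\t400\tT\tC\tNA\n"
    )
    reader = csv.DictReader(io.StringIO(table), delimiter="\t")
    for hit in case_only_variants(reader, {"case1", "case2"}, {"ctrl1", "ctrl2"}):
        print(hit["CHROM"], hit["POS"], hit["Samples"])
```

For a dominant-disease question like Felix Fisher's, setting `min_cases` to the number of cases would restrict the output to variants shared by every case and absent from all controls, which mirrors the overlap strategy on a jointly called VCF.
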