Should joint-calling be performed for the control and the disease group separately?
Dear GATK Team,
I have about 50 disease samples and 30 control samples. Should I do joint-calling on these two groups separately, or should I treat them as one group?
If by joint calling you mean multi-sample variant calling with Mutect2, then this is not done on a cohort basis but on a per-patient basis. Multi-sample calling pools evidence for a variant across samples and therefore has more power to detect variants in a patient.
Please read the best practices tutorial.
Thanks for your reply.
Actually, I am using HaplotypeCaller, and I am going to try GenotypeGVCFs. Is this a good choice? And should I conduct joint-calling on disease and control separately?
Are you interested in obtaining germline variants or somatic variants? For the former, HaplotypeCaller should be used; for the latter, Mutect2.
Are the disease and control samples patient-matched? If yes, you can use them as tumor-normal pairs in Mutect2 to filter germline variants in the controls.
Joint calling should only ever be done for multiple samples coming from the same patient. EDIT: this is certainly true for somatic calling. Upon reading the documentation for germline calling again, you can run that in cohort mode on multiple patients. If you are interested in which germline variants may be responsible for the disease, then in order to maximize power, I'd run it in two batches: the case batch and the control batch. Maybe someone from the GATK team who is more familiar with germline calling could elaborate on that?
Thanks so much for posting your insight here Philipp Hähnel! I would recommend that Jiayi Zhao run the 50 disease and 30 control samples together, because running them through our joint calling workflow gives it more statistical power to make better calls. You will get a single joint-called VCF. If you want the VCF calls separated by group, you can divide the VCF with SelectVariants.
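As a rough illustration of the split by group, here is a minimal Python sketch that only assembles the two SelectVariants command lines rather than running them. The file names and sample names are hypothetical placeholders, and the options used (`-V`, `-sn`, `-O`, `--exclude-non-variants`) should be checked against the SelectVariants documentation for your GATK version before use.

```python
import shlex


def select_variants_cmd(joint_vcf, samples, out_vcf, drop_non_variant=True):
    """Build a `gatk SelectVariants` command that keeps only `samples`.

    All paths and sample names are supplied by the caller; nothing is executed here.
    """
    cmd = ["gatk", "SelectVariants", "-V", joint_vcf]
    for sample in samples:
        # One -sn argument per sample to keep in the output VCF.
        cmd += ["-sn", sample]
    if drop_non_variant:
        # Drop sites that are no longer variant once the other group is removed.
        cmd.append("--exclude-non-variants")
    cmd += ["-O", out_vcf]
    return cmd


# Hypothetical sample names and file names, for illustration only.
cases = ["case01", "case02"]
controls = ["ctrl01"]

case_cmd = select_variants_cmd("joint.vcf.gz", cases, "cases.vcf.gz")
control_cmd = select_variants_cmd("joint.vcf.gz", controls, "controls.vcf.gz")

print(shlex.join(case_cmd))
print(shlex.join(control_cmd))
```

The printed commands can be reviewed and then run in a shell, or passed as lists to `subprocess.run` once the paths point at real files.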
Thanks for the discussion above. What post-analysis would you recommend after joint calling?
I have 6 cases and 2 controls, and so far the variants have been called individually for each sample. Since I am looking for the variants behind a dominant disease, I can simply use an overlap strategy to find the variants that appear only in cases. If the variants were instead called with the cases and controls mixed together, how would I carry out the further analysis to find the variants associated with the disease? Thanks in advance!
Hi Felix Fisher,
We like to recommend the Genotype Refinement workflow for post-joint calling. https://gatk.broadinstitute.org/hc/en-us/articles/360035531432-Genotype-Refinement-workflow-for-germline-short-variants If any of your samples are related, that pipeline can utilize a pedigree file to improve genotype calls. Population priors (e.g. from 1000 Genomes) can be used to reduce false positives, but the flip side is that it will require more alternate reads to support a de novo call.
Your cohort size is on the small side for a rare-variant association study (RVAS), but if you're just looking at each variant separately, you could do a couple of things. With VariantAnnotator, you can use the -A SampleList annotation to list the samples in which the variant is called. Then, after running VariantsToTable on that output, you could use a Python or R script to find variants for which the samples in the SampleList don't include any of the controls. You could also use a JEXL query to filter the VCF in a similar way, but JEXL can be a little difficult to use because it is whitespace sensitive and very particular about types.
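The Python route could look something like the sketch below. It assumes the VariantsToTable output is tab-separated with a header row, that the SampleList annotation ends up in a column named `Samples` holding a comma-separated list of sample names, and that missing values are written as `NA`. The sample names and the inline table are made up for illustration, so check the column name and separators against your own table first.

```python
import csv
import io


def case_only_variants(table, controls, samples_column="Samples"):
    """Return rows whose carrier list contains none of the control samples.

    `table` is any open text file (or file-like object) holding the
    tab-separated table; `samples_column` is an assumed column name.
    """
    controls = set(controls)
    kept = []
    for row in csv.DictReader(table, delimiter="\t"):
        value = row[samples_column]
        if value in ("", "NA"):
            # No carriers listed for this site, so there is nothing to test.
            continue
        carriers = set(value.split(","))
        if carriers.isdisjoint(controls):
            kept.append(row)
    return kept


# Tiny made-up table standing in for real VariantsToTable output.
example = io.StringIO(
    "CHROM\tPOS\tREF\tALT\tSamples\n"
    "chr1\t1000\tA\tG\tcase01,case02\n"
    "chr1\t2000\tC\tT\tcase01,ctrl01\n"
    "chr2\t3000\tG\tA\tNA\n"
)

hits = case_only_variants(example, controls=["ctrl01", "ctrl02"])
for row in hits:
    print(row["CHROM"], row["POS"], row["Samples"])
```

For a real table, replace the `io.StringIO` object with `open("variants.tsv", newline="")` (a hypothetical file name). For a dominant disease you may additionally want to require that every case appears in the carrier list, which is a one-line change using `set(cases) <= carriers`.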