Joint Call WGS with a Large Cohort
I'm currently trying to joint-call variants on a cohort of ~4,000 human WGS samples. I've previously used the WGS calling interval list provided in the resource bundle, but even on our HPC cluster the jobs don't finish before the time limit runs out. Is there a list with smaller chunks, or is there a way I can subdivide these intervals further without losing variants?
-
I'm having a similar issue. I'm working on a similar project, a cohort of ~5,000 human WGS samples, and I want to joint-call all of the CRAM files. Since the cohort is large and the calling becomes computationally hard, I want to divide the CRAMs into chunks. Is there a consensus, established chunk size for splitting the work into multiple smaller pieces and merging the results after calling?
-
Hi Matt Johnson
The provided WGS calling intervals split at runs of Ns in the reference, so there are no boundary effects. However, hg38 has fewer and fewer of those. In our production pipeline we further split the intervals with the GATK SplitIntervals tool (see https://github.com/broadinstitute/warp/blob/develop/tasks/broad/JointGenotypingTasks.wdl for an example command).
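A minimal sketch of such a SplitIntervals invocation, assuming hg38 inputs; the file names and output directory are placeholders, and you should take the exact flags from the warp task linked above:

```shell
# Placeholder paths; substitute your own reference and interval list.
gatk SplitIntervals \
  -R Homo_sapiens_assembly38.fasta \
  -L wgs_calling_regions.hg38.interval_list \
  --scatter-count 10000 \
  -O scattered_intervals/
```

Each resulting interval file in `scattered_intervals/` can then drive one GenomicsDBImport/GenotypeGVCFs shard, so shards stay well under your cluster's job time limit.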
For WGS we use a number of scatters equal to 2.5X the number of samples/input GVCFs. (That's equivalent to setting unbounded_scatter_count = 2.5 in https://github.com/broadinstitute/warp/blob/develop/pipelines/broad/dna_seq/germline/joint_genotyping/JointGenotyping.wdl) So for 4,000 human samples you'd have about 10K scatters. Give that a shot and see if it helps.
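The rule of thumb above is just an arithmetic scaling, so it's easy to compute for any cohort size; a small portable-shell sketch (the 2.5 multiplier comes from the WDL's unbounded_scatter_count):

```shell
# Scatter count = 2.5 x number of input GVCFs, rounded to an integer.
n_samples=4000
scatter_count=$(awk -v n="$n_samples" 'BEGIN { printf "%d", 2.5 * n }')
echo "$scatter_count"   # prints 10000
```

The resulting number is what you'd pass to SplitIntervals as `--scatter-count`.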
-Laura
-
Hey, since the number of samples is high, have you tried ReblockGVCF on the samples? Do look into it; it reduces memory and I/O requirements by quite a lot.
-
Laura Gauthier, thanks! This was immensely helpful, as we had been trying to avoid subdividing the intervals. This certainly gives us a better option that doesn't take multiple months.
-
Great! And do look into ReblockGVCF like kvn95ss suggested. If you're not super concerned about hom-ref GQ precision, it can give a 90% reduction in GVCF size, which speeds up execution since there are fewer ref blocks to process. The most aggressive option is a [0, 20) low-confidence band and a [20, 99] high-confidence band, which is the scheme that was used for gnomAD v3. ReblockGVCF will also drop alleles not used in the most likely genotypes, which can dramatically shrink the PLs at gross indel sites.
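A per-sample sketch of that aggressive two-band reblocking, assuming the [0, 20)/[20, 99] scheme maps to a single GQ band cut point at 20; file names are placeholders, so check the ReblockGVCF tool docs for the exact flags your GATK version accepts:

```shell
# Placeholder file names; run once per input GVCF before joint calling.
gatk ReblockGVCF \
  -R Homo_sapiens_assembly38.fasta \
  -V sample1.g.vcf.gz \
  -GQB 20 \
  -O sample1.reblocked.g.vcf.gz
```

The reblocked GVCFs are what you'd then feed into GenomicsDBImport for the scattered joint-genotyping shards.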