GATK4 - Parallelizing genotypegvcfs (Answered)
First of all, I'm posting this on April 1st 2020, so I hope that you and all you love are healthy and avoiding the worst of this terrible pandemic.
This post is mostly about trying to optimize how to run genotypegvcfs. I'm sorry if this has already been figured out, but I wasn't able to find a post that explicitly tried to deal with the issue that I'll present.
I have been running both genomicsdbimport and genotypegvcfs for 288 specimens genome-wide in a human-sized genome. It is my understanding that neither of these programs can be parallelized with Spark. Would the best way to go about this be to submit multiple runs, each with either a subset of the chromosomes or a subset of the specimens? Is it possible to do so in either of these programs? My feeling is that this is most likely to work with genotypegvcfs.
If I submit multiple runs with genotypegvcfs, one for each chromosome, for example, can I use any program in GATK or in Picard to merge the vcf files together?
Thanks so much for your time.
The GATK support team is focused on resolving questions about GATK tool-specific errors and abnormal results from the tools. For all other questions, such as this one, we are building a backlog to work through when we have the capacity.
Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools.
We cannot guarantee a reply; however, we ask other community members to help out if they know the answer.
For context, check out our support policy.
Fair enough. I understand. Just so that you know, I am running GATK4's genotypegvcfs one chromosome at a time as of now. I have also been able to use MergeVcfs from Picard, which is available through gatk4, to preliminarily merge the smaller chromosomes for which genotypegvcfs has already finished. It feels to me that this is the best way to speed up the process (I was able to cut it from ~15 days on a single-threaded processor to 3 days using multiple processors, one per job, in a cluster).
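For anyone who wants to try the same per-chromosome approach, here is a rough sketch of how the commands could be generated. It only builds and prints the command lines; it does not run them or submit them to a scheduler. The file names (ref.fasta, my_workspace, per_chrom, all.vcf.gz) and the chromosome names are placeholders I made up, and I am assuming the GenomicsDB workspace covers every chromosome listed.

```python
import shlex


def genotype_command(chrom, reference, workspace, out_dir):
    """Build a GenotypeGVCFs command restricted to one chromosome with -L."""
    return [
        "gatk", "GenotypeGVCFs",
        "-R", reference,                    # reference FASTA (placeholder name)
        "-V", f"gendb://{workspace}",       # GenomicsDB workspace (placeholder name)
        "-L", chrom,                        # restrict this job to one chromosome
        "-O", f"{out_dir}/{chrom}.vcf.gz",  # one output VCF per chromosome
    ]


def merge_command(chroms, out_dir, merged):
    """Build a MergeVcfs command that combines the per-chromosome VCFs."""
    cmd = ["gatk", "MergeVcfs"]
    for chrom in chroms:
        cmd += ["-I", f"{out_dir}/{chrom}.vcf.gz"]
    cmd += ["-O", merged]
    return cmd


# Hypothetical chromosome names; use the contig names from your own reference.
chroms = ["chr1", "chr2", "chr3"]
jobs = [genotype_command(c, "ref.fasta", "my_workspace", "per_chrom") for c in chroms]
merge = merge_command(chroms, "per_chrom", "all.vcf.gz")

# Print one command per line; each GenotypeGVCFs line can go into its own cluster job.
for cmd in jobs + [merge]:
    print(shlex.join(cmd))
```

Each GenotypeGVCFs line is independent, so they can run as separate jobs at the same time. The MergeVcfs line should only run once all of the per-chromosome jobs have finished.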
I am having a similar issue where GenotypeGVCFs is taking a long time to finish. Is there a way to get it to use more than a single core? It is also not using all the memory provided to it.
We have a few other forum discussions where we provide ideas for speeding up GenotypeGVCFs; please look for those. In addition, I would recommend using the new GenotypeGVCFs option --genomicsdb-shared-posixfs-optimizations if you are using a shared cluster and a GenomicsDB workspace.
I was also struggling with this, but following suggestions from other threads, I split the full chromosome.interval_list into separate per-chromosome interval_list files (chromosomes 1 to 8) and reran the analysis in parallel, passing each file with -L in its own script. It is now working fine and fast.
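In case it helps others, the splitting step described above could look roughly like this. This is only a sketch: it assumes a Picard-style interval list, where header lines start with "@" and each interval line is tab-separated with the contig name in the first column. The example contents are made up, and writing each piece to its own file is left out.

```python
def split_interval_list(text):
    """Split interval_list text into one interval_list text per contig.

    Header lines (starting with "@") are copied into every piece so that
    each piece is still a complete interval list on its own.
    """
    header = []
    per_contig = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("@"):
            header.append(line)
        else:
            contig = line.split("\t", 1)[0]  # first column is the contig name
            per_contig.setdefault(contig, []).append(line)
    return {c: "\n".join(header + rows) + "\n" for c, rows in per_contig.items()}


# Made-up example input with two contigs.
example = (
    "@HD\tVN:1.6\n"
    "@SQ\tSN:1\tLN:1000\n"
    "@SQ\tSN:2\tLN:2000\n"
    "1\t1\t500\t+\tregion_a\n"
    "1\t501\t1000\t+\tregion_b\n"
    "2\t1\t2000\t+\tregion_c\n"
)

parts = split_interval_list(example)
for contig, piece in parts.items():
    print(f"contig {contig}: {len(piece.splitlines())} lines")
```

Each piece can then be saved to its own file and passed to a separate GenotypeGVCFs run with -L.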
Thanks for posting your solution Vinod Kumar!