First of all, I'm posting this on April 1st 2020, so I hope that you and all you love are healthy and avoiding the worst of this terrible pandemic.
This post is mostly about trying to optimize how to run GenotypeGVCFs. I'm sorry if this has already been figured out, but I wasn't able to find a post that explicitly dealt with the issue I'll present.
I have been running both GenomicsDBImport and GenotypeGVCFs for 288 specimens genome-wide on a human-sized genome. It is my understanding that neither of these tools can be parallelized with Spark. Would the best way around this be to submit multiple runs, each on either a subset of the chromosomes or a subset of the specimens? Is it possible to do so in either of these tools? My feeling is that this is most likely possible with GenotypeGVCFs.
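For what it's worth, a per-chromosome scatter along these lines seemed to work for me. This is only a sketch: the sample map, reference, and output paths are placeholders, and both tools here rely on the `-L` interval argument to restrict each job to one chromosome.

```shell
# Sketch of a per-chromosome scatter (paths and sample map are placeholders).
# Both GenomicsDBImport and GenotypeGVCFs accept -L, so each chromosome can
# be imported and genotyped as an independent job, e.g. one cluster
# submission per chromosome.
for CHR in chr{1..22} chrX chrY; do
    gatk GenomicsDBImport \
        --sample-name-map cohort.sample_map \
        --genomicsdb-workspace-path gdb_${CHR} \
        -L ${CHR}

    gatk GenotypeGVCFs \
        -R reference.fasta \
        -V gendb://gdb_${CHR} \
        -O genotyped.${CHR}.vcf.gz
done
```

In practice each loop iteration would be submitted as a separate scheduler job rather than run serially, since the per-chromosome runs are independent.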
If I submit multiple runs of GenotypeGVCFs, one per chromosome for example, can I use any tool in GATK or Picard to merge the resulting VCF files?
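In case it helps others, my understanding is that Picard's GatherVcfs (bundled with GATK4) concatenates non-overlapping VCFs that are in reference order, which fits the per-chromosome case. The filenames below are placeholders:

```shell
# Concatenate per-chromosome VCFs into one genome-wide file.
# Inputs must be non-overlapping and listed in reference order;
# filenames here are placeholders.
gatk GatherVcfs \
    -I genotyped.chr1.vcf.gz \
    -I genotyped.chr2.vcf.gz \
    -O genotyped.genome.vcf.gz
```

Picard's MergeVcfs is an alternative when the inputs are not already in reference order, since it sorts records during the merge.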
Thanks so much for your time.