I am currently running joint calling with GenotypeGVCFs on ~800 whole-genome samples, with plans to import thousands more over the coming months. The GenomicsDBImport phase is running smoothly and relatively quickly (maybe a few hundred genomes/day) on machines with ~2 TB RAM and 144 CPUs. The process is parallelized using GNU parallel, and we run as many as 25 processes in parallel on one machine, with ~70 GB allocated per process during the imports. We are using the 'Standard Set' intervals referenced here:
So we import from 50 .interval_list files into 50 databases, one database per interval list.
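To make the layout concrete, here is a minimal sketch of how one GenomicsDBImport command per interval list could be generated and handed to GNU parallel. It is an illustration, not our actual script: the helper name `build_import_commands`, the paths, the sample map file, and the use of `-Xmx70g` to express the ~70 GB per process are all assumptions.

```python
import shlex
from pathlib import PurePosixPath


def build_import_commands(interval_files, sample_map, workspace_root, heap_gb=70):
    """Return one GenomicsDBImport argv list per interval file.

    All paths and the heap size are hypothetical placeholders.
    """
    commands = []
    for interval_file in sorted(interval_files):
        # One workspace per interval list, named after the file stem.
        workspace = PurePosixPath(workspace_root) / PurePosixPath(interval_file).stem
        commands.append([
            "gatk", "--java-options", f"-Xmx{heap_gb}g",
            "GenomicsDBImport",
            "--genomicsdb-workspace-path", str(workspace),
            "--sample-name-map", str(sample_map),
            "-L", str(interval_file),
        ])
    return commands


if __name__ == "__main__":
    # Hypothetical file names; print one shell-quoted command per line.
    files = [f"intervals/{i:04d}.interval_list" for i in range(1, 4)]
    for argv in build_import_commands(files, "samples.map", "dbs"):
        print(shlex.join(argv))
```

Printing one command per line means the output can be piped to `parallel -j 25` to cap the number of concurrent processes.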
Allocation and parallelization are similar for the GenotypeGVCFs phase, though the RAM requirements seem lower, so the allocation may be overkill. We also have the option of submitting this part to a cluster using sbatch (though I've found that slower). However, I have noticed that even though there is no obvious resource problem or heavy IO wait, the GenotypeGVCFs process will sometimes appear to stall for a few hours before resuming.
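One way to pin down where the stalls happen is to scan each shard's log for long gaps between consecutive progress lines. The sketch below assumes the log contains lines of the form `HH:MM:SS.mmm INFO  ProgressMeter - <contig>:<pos> ...`; the function name `find_stalls` and the 30-minute threshold are hypothetical. Because this format carries no date, the code assumes that a clock time going backwards means midnight was crossed once, so gaps longer than a day would be underestimated.

```python
import re
from datetime import timedelta

# Assumed log line shape: "HH:MM:SS.mmm INFO  ProgressMeter - chr1:12345 ..."
PROGRESS = re.compile(
    r"^(\d{2}):(\d{2}):(\d{2})\.(\d{3})\s+INFO\s+ProgressMeter\s+-\s+(\S+:\d+)"
)


def find_stalls(log_lines, min_gap=timedelta(minutes=30)):
    """Return (locus_before, locus_after, gap) for each gap >= min_gap."""
    stalls = []
    prev_time = prev_locus = last_raw = None
    day_offset = timedelta(0)
    for line in log_lines:
        match = PROGRESS.match(line)
        if not match:
            continue  # skip headers and non-progress lines
        h, m, s, ms, locus = match.groups()
        raw = timedelta(hours=int(h), minutes=int(m),
                        seconds=int(s), milliseconds=int(ms))
        if last_raw is not None and raw < last_raw:
            day_offset += timedelta(days=1)  # clock wrapped past midnight
        last_raw = raw
        now = raw + day_offset
        if prev_time is not None and now - prev_time >= min_gap:
            stalls.append((prev_locus, locus, now - prev_time))
        prev_time, prev_locus = now, locus
    return stalls
```

Running this over the log of a shard that stalled would show the last locus reported before the pause, which could then be checked against repetitive or otherwise difficult regions.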
Why do I see this delay? We wondered if it might be caused by repetitive regions, which we should perhaps be excluding with the --exclude-intervals option. Or are such regions already excluded from the supplied .interval_list files? In case it's related to how we call the process, here is what the command that runs our GenotypeGVCFs step looks like:
The Genome Analysis Toolkit (GATK) v184.108.40.206
HTSJDK Version: 2.24.0
Picard Version: 2.25.0