Scatter Count recommendations
I'm looking into parallelizing my somatic whole-exome pipeline using the scatter strategy (though I'm recreating it in Bash and Python). I've been looking through the WDL scripts on the GATK GitHub page, and I see the variable scatter_counts declared but never actually given a value.
I was wondering what the considerations are for choosing how many intervals (i.e. the scatter_count value) to parallelize over. I've arbitrarily chosen 10 splits for my pipeline to start, but I'm curious whether there is a better way to pick this number for scaling purposes, and what values some of the Broad teams use in their own analyses.
-
Alijah O'Connor We usually use 10-20 for exomes. For our purposes this is a good compromise between wall clock time and the overhead of starting VMs and delocalizing. If you're not using the cloud you could increase the scatter count, but there's no real need unless you're really in a hurry.
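For anyone recreating the scatter step outside WDL, here is a minimal Python sketch of dividing an interval list into a chosen number of shards. It is an illustration under stated assumptions, not how the GATK workflows do it: the function name `split_intervals`, the `(contig, start, end)` tuple layout with 1-based inclusive coordinates, and the greedy largest-first balancing are all hypothetical choices made for this example.

```python
from typing import List, Tuple

# Assumed layout: (contig, start, end), 1-based inclusive coordinates.
Interval = Tuple[str, int, int]


def interval_length(interval: Interval) -> int:
    """Number of bases covered by one interval."""
    _, start, end = interval
    return end - start + 1


def split_intervals(intervals: List[Interval], scatter_count: int) -> List[List[Interval]]:
    """Assign whole intervals to scatter_count shards with roughly equal base totals.

    Greedy heuristic: take intervals from largest to smallest and put each
    one in the shard that currently covers the fewest bases. Intervals are
    never cut, so one very large interval can still dominate a shard.
    """
    if scatter_count < 1:
        raise ValueError("scatter_count must be at least 1")

    shards: List[List[Tuple[int, Interval]]] = [[] for _ in range(scatter_count)]
    totals = [0] * scatter_count

    # Remember each interval's input position so the original order can be restored.
    indexed = list(enumerate(intervals))
    indexed.sort(key=lambda pair: interval_length(pair[1]), reverse=True)

    for position, interval in indexed:
        smallest = totals.index(min(totals))
        shards[smallest].append((position, interval))
        totals[smallest] += interval_length(interval)

    # Within each shard, return intervals in their original input order.
    return [[interval for _, interval in sorted(shard)] for shard in shards]


if __name__ == "__main__":
    targets = [("chr1", 1, 100), ("chr1", 201, 250), ("chr2", 1, 100), ("chr2", 301, 350)]
    for number, shard in enumerate(split_intervals(targets, scatter_count=2)):
        print(number, shard)
```

Each shard can then be written to its own interval file and passed to a separate job. Because every job pays a fixed startup cost, raising the scatter count past the 10-20 range mentioned above shortens each job but not necessarily the whole run.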