Speeding up GATK4 DepthOfCoverage
- Version: GATK 220.127.116.11 through Docker
- WGS 30x BAM
- 1 CPU + 64 GB RAM
- Runtime: ~3-4 days
gatk DepthOfCoverage \
--java-options '-Xmx48G' \
-I '/path/to/generated.markduped.bam' \
-R '/path/to/Homo_sapiens_assembly38.fasta' \
-O 'NA12878' \
--intervals '/path/to/chr1.bed' ... --intervals '/path/to/chrM.bed'
I'm hoping to reduce the wall time. I currently have one interval file per chromosome, but that can be changed if required or recommended. I asked internally about splitting the analysis up per interval, but merging the reports would take some non-trivial effort and might actually change the QC reports.
I believe the GATK3 version of this tool could take the `-nt` or `-nct` params. Is there a planned Spark implementation, or are there other tricks?
Hi Michael Franklin, unfortunately there is no Spark implementation planned for this tool. It is still in BETA, so its functionality is still being developed. And yes, GATK4 does not have the `-nt` and `-nct` parameters.
Hi Michael Franklin, I have some more information about this tool that may help you improve the runtime. One question to consider: are you running this on a shared-resource cluster with slow disk reads and writes? DepthOfCoverage writes a lot of files, so slow I/O can make this tool expensive to run.
One improvement is to use `--omit-depth-output-at-each-base`. By default, DepthOfCoverage writes a line for every base in the genome, which can greatly increase the runtime. If you do not need this per-base information, using that option will save you a lot of time.
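As a rough sketch, here is the command from the question with that flag added. The paths are the placeholders from the original post, only two of the `--intervals` arguments are shown, and the wrapper function name and the `RUN` switch are hypothetical conveniences, not part of GATK. By default it only prints the command.

```shell
#!/bin/sh
set -eu

# Hypothetical wrapper: prints the command, or runs it when RUN=1.
depth_of_coverage_cmd() {
  # Same arguments as the original command, plus the flag that
  # suppresses the per-base depth output.
  set -- gatk DepthOfCoverage \
    --java-options '-Xmx48G' \
    -I '/path/to/generated.markduped.bam' \
    -R '/path/to/Homo_sapiens_assembly38.fasta' \
    -O 'NA12878' \
    --intervals '/path/to/chr1.bed' \
    --intervals '/path/to/chrM.bed' \
    --omit-depth-output-at-each-base
  # Add one --intervals argument per BED file, as in the original command.
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

depth_of_coverage_cmd
```

Running the script with `RUN=1` in the environment executes the command instead of printing it.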
Also, I found that if you split the analysis into more intervals, the interval statistics can be merged without changing the results. However, at this point we do not provide an easy way to merge the outputs.
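If you do go the per-interval route, one possible sketch is to launch one DepthOfCoverage run per BED file, each with its own output prefix. This assumes more than one core is available and that the memory is divided between the jobs. The `-Xmx8G` heap size, the `NA12878.<name>` output prefixes, the function names, and the `RUN` switch are all illustrative assumptions. Merging the resulting reports is not covered here, since no merge tool is provided.

```shell
#!/bin/sh
set -eu

BAM='/path/to/generated.markduped.bam'
REF='/path/to/Homo_sapiens_assembly38.fasta'

# Run (or, by default, just print) DepthOfCoverage for a single BED file.
doc_one() {
  bed=$1
  name=$(basename "$bed" .bed)   # e.g. chr1
  set -- gatk DepthOfCoverage \
    --java-options '-Xmx8G' \
    -I "$BAM" -R "$REF" \
    -O "NA12878.$name" \
    --intervals "$bed" \
    --omit-depth-output-at-each-base
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# Start one background job per interval file, then wait for all of them.
# Note: a bare `wait` does not report failures of individual jobs,
# so check each job's output files afterwards.
doc_all() {
  for bed in "$@"; do
    doc_one "$bed" &
  done
  wait
}

doc_all /path/to/chr1.bed /path/to/chrM.bed
```

With `RUN=1` set, the jobs run concurrently. Without it, the script only prints the commands, which is a cheap way to check the output prefixes before committing to a multi-day run.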