Splitting BAM files by the hg19 intervals provided by GATK
Hi Everyone,
I'm working with large WGS BAM files (~200GB each) and running Mutect2 on an HPC setup. One performance improvement I've already implemented is using the 80-region hg19 interval list from GATK to parallelize variant calling.
However, I'm concerned about I/O bottlenecks due to all 80 parallel jobs reading from the same large BAM file. Would it make sense to split the BAM file into smaller per-interval BAMs, matching the interval list, to reduce I/O contention?
Has anyone tried this in a similar setup, and did it lead to measurable improvements in performance or runtime efficiency?
Thanks!
-
In my experience, on every kind of hardware, Mutect2's I/O time is very small compared to its CPU time, so I don't think you need to worry about I/O contention. We parallelize WGS BAMs 100 ways all the time without issue.
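For anyone setting up the same kind of scatter, here is a minimal sketch of how the per-interval jobs might be generated without splitting the BAM. It only builds the command lines; it does not run or submit anything. The file names, the `build_mutect2_commands` helper, and the output layout are hypothetical. It also assumes the BAM is coordinate-sorted and indexed, so each job can seek to its own region, and it leaves out matched-normal and other Mutect2 arguments your pipeline may need.

```python
from pathlib import Path


def build_mutect2_commands(bam, reference, interval_files, out_dir):
    """Return one `gatk Mutect2` argument list per interval file.

    Every command points at the same BAM; only -L and -O differ.
    (Hypothetical helper for illustration, not part of GATK.)
    """
    commands = []
    for interval in interval_files:
        shard = Path(interval).stem  # e.g. "0000-scattered"
        out_vcf = str(Path(out_dir) / f"{shard}.vcf.gz")
        commands.append([
            "gatk", "Mutect2",
            "-R", reference,  # reference FASTA
            "-I", bam,        # the single shared, indexed BAM
            "-L", interval,   # restrict this job to one interval list
            "-O", out_vcf,    # per-shard output VCF
        ])
    return commands


if __name__ == "__main__":
    # Hypothetical paths: adjust to wherever your 80 interval files live.
    intervals = [f"intervals/{i:04d}-scattered.interval_list" for i in range(80)]
    for cmd in build_mutect2_commands("tumor.bam", "hg19.fasta", intervals, "shards"):
        print(" ".join(cmd))
```

Each printed line can then be wrapped in whatever submission command your scheduler uses. Since every job passes the same `-I` and a different `-L`, there is no per-interval BAM to create or keep in sync.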