I/O considerations for joint variant calling
Dear GATK community,
I need to perform joint variant calling for 3600 WES samples (GVCFs already generated, no database yet). I have access to a powerful cluster with many CPUs and plenty of memory, but I am limited by I/O usage. I would like to ask for recommendations on how to perform the joint variant calling as fast as possible while limiting I/O. I considered the following options:
- No parallelization: GenomicsDBImport has now been running for 23 days with only 6 of 75 batches loaded so far, so this is too slow.
- Maximum parallelization: run GenomicsDBImport and GenotypeGVCFs on each exon, then combine all raw VCFs and continue with VQSR. Each run is very fast, but I am concerned about over-parallelizing: with 205,000 exons in my BED file, I doubt that combining 205K VCF files is feasible.
- The middle ground: I split my exon BED file into 200 equal-sized BED files and am running GenomicsDBImport and GenotypeGVCFs on each chunk. Due to the I/O restrictions, I can only run about 5 of these jobs in parallel, and a single job already takes more than 2 days.
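For illustration, the chunking in the middle-ground option could be sketched roughly as follows. This is only a minimal sketch, not the script actually used: the function names `parse_bed` and `split_intervals` are hypothetical, and it assumes "equal sized" means roughly equal total bases per chunk, with intervals kept in their original order.

```python
from typing import Iterable, List, Tuple

# BED-style interval: (chrom, start, end), 0-based and half-open.
Interval = Tuple[str, int, int]


def parse_bed(lines: Iterable[str]) -> List[Interval]:
    """Parse the first three columns of BED lines, skipping headers and blanks."""
    intervals = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        intervals.append((chrom, int(start), int(end)))
    return intervals


def split_intervals(intervals: List[Interval], n_chunks: int) -> List[List[Interval]]:
    """Split intervals into at most n_chunks contiguous groups of similar total length."""
    total = sum(end - start for _, start, end in intervals)
    target = total / n_chunks  # desired number of bases per chunk
    chunks: List[List[Interval]] = [[]]
    filled = 0  # bases assigned so far across all chunks
    for interval in intervals:
        chunks[-1].append(interval)
        filled += interval[2] - interval[1]
        # Open a new chunk once the cumulative size passes the next boundary.
        if filled >= target * len(chunks) and len(chunks) < n_chunks:
            chunks.append([])
    return [chunk for chunk in chunks if chunk]


if __name__ == "__main__":
    demo = parse_bed([
        "chr1\t0\t100",
        "chr1\t200\t300",
        "chr2\t0\t100",
        "chr2\t500\t600",
    ])
    for i, chunk in enumerate(split_intervals(demo, 2)):
        print(i, chunk)
```

Each resulting chunk could then be written to its own BED file and used as the interval list for one job. Fewer, larger chunks mean fewer files to combine afterwards, at the cost of longer individual jobs.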
The bottom line is that I understand too little about how the GATK tools read the g.vcf files to determine the best processing strategy when the goal is to minimize I/O. Any advice or thoughts on this matter would be appreciated.
Thanks
Eva
-
Hi Eva König,
If you have many samples and I/O is a limiting factor, you may wish to use our Biggest Practices workflow, which works with reblocked GVCFs and GnarlyGenotyper. Reblocking reduces the clutter and the amount of data that has to be read from the g.vcf files, and GnarlyGenotyper uses the information provided by reblocking to reproduce joint calling at the level of GenotypeGVCFs, with fewer ambiguous calls around heavily repetitive homopolymers.
Below is the link to our Biggest Practices.
Of course, this document is aimed at cohort sizes above tens of thousands of samples, but since I/O limitation is a concern here, we recommend that you take a look at it.
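As a rough illustration of the reblocking step, the per-sample commands could be generated along these lines. This is a minimal sketch under stated assumptions: `reblock_command` is a hypothetical helper, the paths are placeholders, and only the basic `-R`, `-V` and `-O` arguments of `ReblockGVCF` are shown. The workflow document may call for additional arguments, so please check it before running anything.

```python
from pathlib import PurePosixPath
from typing import List


def reblock_command(reference: str, gvcf: str, out_dir: str) -> List[str]:
    """Build (but do not run) a gatk ReblockGVCF command for one sample GVCF."""
    name = PurePosixPath(gvcf).name
    for suffix in (".g.vcf.gz", ".g.vcf"):
        if name.endswith(suffix):
            # Hypothetical naming scheme: sample.g.vcf.gz -> sample.rb.g.vcf.gz
            out_name = name[: -len(suffix)] + ".rb" + suffix
            break
    else:
        raise ValueError(f"not a GVCF file name: {gvcf}")
    return [
        "gatk", "ReblockGVCF",
        "-R", reference,  # reference FASTA (placeholder path)
        "-V", gvcf,       # input per-sample GVCF
        "-O", str(PurePosixPath(out_dir) / out_name),
    ]


if __name__ == "__main__":
    # Placeholder inputs; print the commands so they can be reviewed or submitted as jobs.
    for gvcf in ["gvcfs/sample1.g.vcf.gz", "gvcfs/sample2.g.vcf.gz"]:
        print(" ".join(reblock_command("ref.fasta", gvcf, "reblocked")))
```

Reblocking runs independently per sample, so these commands can be spread across the cluster and throttled to whatever the I/O limit allows.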
Regards.
-
Thank you, Gökalp Çelik, for this tip. I will look into the Biggest Practices workflow.