How to Split 3000 WGS CRAM files into 1Mbp length chunks
Hello, I have 3000 WGS CRAM files and I want to split them into 1 Mbp chunks. I want to split at exact genomic coordinates, e.g. 1 to 1000000 bp, 1000001 to 2000000 bp, 2000001 to 3000000 bp, etc., for all chromosomes, so that each chunk covers the same region in every sample. Is there any way that I can do this?
I need these small chunks for joint calling and to make the data handling computationally efficient.
-
PrintReads is the tool to use to split up CRAMs, but there are a few edge cases to be mindful of, such as read pairs spanning chunk boundaries. At the Broad, we accomplish the same goal by scattering the jobs across intervals and using cloud streaming to randomly access the file. This requires an interval file to scatter across and a centralized repository for the data such as Google Storage, Amazon's S3, Azure's Storage Account, or a localized solution like NFS or FUSE. The post linked below describes different strategies for different computing platforms, along with links to the Broad's cloud-based best practices pipelines:
https://gatk.broadinstitute.org/hc/en-us/sections/360007134212-Computing-Platforms
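As a rough illustration of the interval file mentioned above, here is a minimal Python sketch that builds fixed 1 Mbp windows per contig and formats them as `contig:start-end` strings (1-based, inclusive), one per line. The function names, the toy contig lengths, and the idea of reading contig lengths from the reference's `.fai` index are assumptions made for this example, not something taken from the Broad pipelines.

```python
from typing import Dict, Iterator, Tuple


def read_fai(path: str) -> Dict[str, int]:
    """Read contig names and lengths from a FASTA index (.fai).

    A .fai file is tab-delimited; column 1 is the contig name and
    column 2 is its length in bases.
    """
    lengths: Dict[str, int] = {}
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                lengths[fields[0]] = int(fields[1])
    return lengths


def chunk_intervals(
    contig_lengths: Dict[str, int], chunk_size: int = 1_000_000
) -> Iterator[Tuple[str, int, int]]:
    """Yield (contig, start, end) windows, 1-based and inclusive.

    The last window of each contig is truncated at the contig end.
    """
    for contig, length in contig_lengths.items():
        for start in range(1, length + 1, chunk_size):
            end = min(start + chunk_size - 1, length)
            yield contig, start, end


def format_interval(contig: str, start: int, end: int) -> str:
    """Format a window as a 'contig:start-end' interval string."""
    return f"{contig}:{start}-{end}"


if __name__ == "__main__":
    # Toy contig lengths for illustration only; in practice these would
    # come from read_fai("reference.fasta.fai") (hypothetical path).
    toy_lengths = {"chrA": 2_500_000, "chrB": 1_000_000}
    for contig, start, end in chunk_intervals(toy_lengths):
        print(format_interval(contig, start, end))
```

Because the windows depend only on the reference contig lengths, every sample gets identical chunk boundaries. Each printed line could then be supplied as an interval to `-L`, either one at a time or collected into a list file, with one scattered job per interval.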