Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data


Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

How to split 3000 WGS CRAM files into 1 Mbp chunks

1 comment

  • Giles Hall

    PrintReads is the tool to use to split up CRAMs, but there are a few edge cases to be mindful of, such as read pairs that span chunk boundaries. At the Broad, we accomplish the same goal by scattering jobs across intervals and using cloud streaming for random access to the file. This requires an interval file to scatter across and a centralized repository for the data, such as Google Cloud Storage, Amazon S3, an Azure Storage account, or a local solution like NFS or FUSE. The post linked below describes strategies for different computing platforms, along with links to the Broad's cloud-based best-practices pipelines:

    https://gatk.broadinstitute.org/hc/en-us/sections/360007134212-Computing-Platforms
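
As a rough illustration of the interval scatter described above, here is a minimal Python sketch that cuts each contig into 1 Mbp intervals and builds one PrintReads command per interval. It is not the Broad's pipeline code. The contig names and lengths, the `gs://` path, the reference name, and the output names are all made-up placeholders; in practice the contig lengths would come from the reference's sequence dictionary or `.fai` index. The only GATK arguments used are `-R`, `-I`, `-L`, and `-O`.

```python
from typing import Dict, Iterator, List

CHUNK_SIZE = 1_000_000  # 1 Mbp, as asked in the question


def chunk_intervals(contig_lengths: Dict[str, int],
                    chunk_size: int = CHUNK_SIZE) -> Iterator[str]:
    """Yield 1-based, inclusive intervals written as contig:start-end."""
    for contig, length in contig_lengths.items():
        for start in range(1, length + 1, chunk_size):
            # The last chunk of a contig is usually shorter than chunk_size.
            end = min(start + chunk_size - 1, length)
            yield f"{contig}:{start}-{end}"


def print_reads_command(cram: str, reference: str,
                        interval: str, output: str) -> List[str]:
    """Build (but do not run) a PrintReads command for one interval.

    CRAM input needs the reference it was compressed against, hence -R.
    """
    return ["gatk", "PrintReads",
            "-R", reference,
            "-I", cram,
            "-L", interval,
            "-O", output]


if __name__ == "__main__":
    # Hypothetical contigs, for illustration only.
    contigs = {"chrA": 2_500_000, "chrB": 1_000_000}
    for i, interval in enumerate(chunk_intervals(contigs)):
        cmd = print_reads_command(
            cram="gs://example-bucket/sample.cram",  # placeholder path
            reference="reference.fasta",             # placeholder reference
            interval=interval,
            output=f"sample.{i:05d}.cram",
        )
        print(" ".join(cmd))
```

Each printed command can be submitted as its own job, once per sample. This sketch does nothing about the boundary edge case mentioned above: a read or pair that straddles two intervals may land in both neighboring chunks, or be separated from its mate, so downstream steps should be checked for that.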
