RevertSam running out of disk space

2 comments

  • Lindo Nkambule

    I've tested the same command on a BAM that's ~22GB in size; the total size of the unmapped read-group BAM files it produced is ~24GB. That run finished without hitting the disk space error, but peak storage usage for the job was ~72GB. Is RevertSam writing temporary files? If so, is that avoidable, or is it possible to free up some space periodically?
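
    For reference, the command is essentially a by-read-group revert along these lines (paths here are placeholders, not my actual invocation):

        # One unmapped BAM per read group; sample.bam and the output
        # directory are placeholder paths.
        java -jar picard.jar RevertSam \
            I=sample.bam \
            OUTPUT_BY_READGROUP=true \
            O=reverted_by_readgroup/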

  • Louis Bergelson

    Hi Lindo Nkambule,

    Sorry you're running into problems.  Generally I would expect to need around 3x the size of the bam in disk space to do an operation that involves a sort: 1x to store the original bam, 1x for temp files used during the sort, and 1x for the output bam.  That's roughly what you saw with your test file (3 × ~22GB ≈ 66GB, in line with the ~72GB peak you measured).  You gave it more than 3x, so I would have expected it to work.

    I'm not sure exactly why you need more in this case.  Often the reverted bam is bigger than the original, because aligned bams typically compress better than unaligned bams do.  If your original bam was compressed very highly, the output may be getting written at a lower compression level, resulting in a larger file.

    Picard/GATK can't do the sort operation in place unless you have sufficient RAM to store the entire bam in memory (which is going to be way more expensive than disk space...).  There's no simple way to reduce the number of temp files that Picard generates or to delete them as you go; it needs them in order to perform the sorting operation.  The only real configuration for that is the MAX_RECORDS_IN_RAM argument, which lets you keep more records in memory before spilling to disk.

    The other option would be to increase the COMPRESSION_LEVEL argument to reduce the size of the final bam.  There's a bit of a size reduction available there, but it rapidly reaches diminishing returns.
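
    Concretely, both knobs go on the same command line.  The values below are illustrative, not recommendations:

        # Illustrative values only: keep more records in memory before
        # spilling to temp files, and compress the output bam harder.
        # Raising MAX_RECORDS_IN_RAM above the default (500,000) needs a
        # correspondingly larger Java heap (-Xmx).
        java -Xmx16g -jar picard.jar RevertSam \
            I=sample.bam \
            OUTPUT_BY_READGROUP=true \
            O=reverted_by_readgroup/ \
            MAX_RECORDS_IN_RAM=2000000 \
            COMPRESSION_LEVEL=9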

    My recommendation would be to increase the available disk space a bit; it will probably work then.  If that's really not possible, you could do things like splitting the file into chunks and reverting them separately, though that might leave some mate-pairs in separate files and complicate re-analyzing them.  Converting the input file to cram might reduce your disk footprint as well (although I wouldn't expect writing the final file as cram to help much, since unaligned cram loses a lot of the compression benefits that make cram smaller than bam in general).
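
    If you go the cram route, the input conversion is just samtools plus the reference the bam was aligned to (paths here are placeholders):

        # Convert the aligned bam to cram to shrink the input's footprint.
        # reference.fasta must be the exact reference the bam was aligned against.
        samtools view -C -T reference.fasta -o sample.cram sample.bam

    Picard should then accept the cram as input if you also pass REFERENCE_SEQUENCE, but I'd test that on a small file first.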

     

