When it comes to research, time is money. And that couldn't be any more literal than when you're accessing data from the cloud.
Every additional kilobyte of data that your tools need to download and look through means fractions of a cent that can really add up for larger-scale analyses. Fortunately, we put a lot of effort into optimizing our pipelines for cost, and you will find many tools at your disposal in GATK and in our documentation that let you utilize the power of the cloud on your own terms.
If you're using GATK on external cloud data with a high cost of data access/download (for example, using GATK PrintReads to write reads from a cloud BAM to an output file), then you might be looking for areas where you can drop unnecessary data calls from your pipeline.
As such, here is a little life hack that should help decrease server load, cut cloud access costs, and give you more options for how to analyze your data. And best of all, it can all be done with just a Tweet's worth of content injected into your next script! It's never been easier to be a hacker!
Hang on while I change into my hacking gear, first...
Choosing more speed, but at what cost?
By default, GATK runs an asynchronous prefetcher for cloud data, in which it pre-loads the next 40 MB of a genomic file as it is running an analysis. Prefetching is a standard technique for speeding up computational performance by accessing data before it is actually needed.
If you’ve ever noticed Netflix preloading the next few minutes of your episode as you're watching it, it’s basically the same concept.
The alternative would be forcing you to download all of Season 2 before letting you hit 'Play', which most people might not have the patience for… especially when Season 1 ended on such a cliffhanger!
"What do you mean, there's no Season 2 for The Dark Crystal?"
Prefetching can have many benefits, chief among them a significant increase in performance under normal circumstances. It is great for cases when you are sorting through a large number of small, non-adjacent genomic intervals that are scattered across the genome.
However, if you are looking at adjacent or overlapping genomic regions, then it is possible that GATK will end up prefetching redundant sequence data, and that might not be what you’re looking for if you want to optimize costs beyond what most people need.
I choose lower cost over speed
Fortunately, GATK has an option to turn off prefetching. If you want to minimize data transfer costs, turning it off pulls less overall data from the cloud, which is easier on your pocketbook (but comes at the expense of performance).
To completely deactivate cloud prefetching, run GATK with the following options included:
--cloud-prefetch-buffer 0 --cloud-index-prefetch-buffer 0
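For context, here is a minimal sketch of how those two options might slot into the PrintReads example mentioned earlier. The bucket path, interval, and output file name are made up for illustration, and the RUN_GATK switch is just a convenience of this sketch (not a GATK feature) so that you can review the command before anything touches the cloud.

```shell
#!/bin/sh
# Hypothetical inputs: substitute your own bucket path, interval, and output file.
INPUT_BAM="gs://my-bucket/sample.bam"
INTERVAL="chr20:10000000-10100000"
OUTPUT_BAM="subset.bam"

# Setting both buffers to 0 MB turns cloud prefetching off.
PREFETCH_FLAGS="--cloud-prefetch-buffer 0 --cloud-index-prefetch-buffer 0"

# Assemble the full command as one string (safe here because no value contains spaces).
GATK_CMD="gatk PrintReads -I $INPUT_BAM -L $INTERVAL -O $OUTPUT_BAM $PREFETCH_FLAGS"

# Print the command for review. It only executes when RUN_GATK=1 is set
# and a gatk executable is found on the PATH.
echo "$GATK_CMD"
if [ "${RUN_GATK:-0}" = "1" ] && command -v gatk >/dev/null 2>&1; then
  $GATK_CMD
fi
```

Run it once as-is to check the printed command, then again with RUN_GATK=1 when you are happy with it.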
However, if you only wish to reduce data transfer costs (without completely killing your performance), you can try running GATK with the minimum prefetch buffer of 1 MB rather than the 40 MB that GATK defaults to:
--cloud-prefetch-buffer 1 --cloud-index-prefetch-buffer 1
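If you expect to experiment with different buffer sizes, it may be handy to keep the number in one place. The sketch below is one possible way to do that. The prefetch_flags function and the PREFETCH_MB variable are names invented for this example rather than anything GATK provides, and the bucket path is a placeholder.

```shell
#!/bin/sh
# prefetch_flags is a hypothetical helper, not part of GATK.
# It prints both prefetch options for a buffer size given in MB.
prefetch_flags() {
  mb="$1"
  case "$mb" in
    ''|*[!0-9]*)
      # Reject empty input and anything that is not a non-negative integer.
      echo "buffer size must be a non-negative integer (MB)" >&2
      return 1
      ;;
  esac
  echo "--cloud-prefetch-buffer $mb --cloud-index-prefetch-buffer $mb"
}

# PREFETCH_MB is an assumed environment variable; it defaults to 1 MB here.
FLAGS=$(prefetch_flags "${PREFETCH_MB:-1}") || exit 1

# Print the resulting command (placeholder paths) instead of running it.
echo "gatk PrintReads -I gs://my-bucket/sample.bam -O subset.bam $FLAGS"
```

Setting PREFETCH_MB=0 before running the script would print the fully disabled variant instead.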
Knowing more about these options will hopefully empower you to stretch every grant dollar to its limit, in order to run GATK the way you want to run it.