Version I'm using: GATK/22.214.171.124 (though I'm planning on switching to the most recent version)
Command (the -L value is still a placeholder):
gatk --java-options "-Xmx40g" HaplotypeCaller -pairHMM AVX_LOGLESS_CACHING_OMP --native-pair-hmm-threads 8 -L ?:?-? -R $ref -I $bam -O gVCF_files/$sampleName.g.vcf -ERC GVCF
I'm using GATK HaplotypeCaller for single sample GVCF calling with plant genome data. I was using an old version (version nightly-2017-03-30-g34bd8a3 <- this was installed by my university) to take advantage of the -nct option so the job could finish on my university's cluster. For any given job, I can utilize up to 40 threads/node (ntasks-per-node), but the job can only run for a maximum of 72 hours.
Because of this limitation, only 9 of my 21 samples completed without error. For each sample, I experimented with a range of -nct values (4, 8, 12, and 16); of those, 16 threads seems to work best, and I'm still waiting on runs with higher thread counts to finish.
Because not all of my samples finished in time, I dug around and discovered the -L option. Apparently, restricting each job to a set of intervals can reduce runtime, as discussed in Heldenbrand et al. 2019 (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3169-7). My problem is implementing it: I haven't found examples of this option in use, so I don't understand the format for dividing the genome.
I have many more than 23 contigs/chromosomes in my reference, so I hope that isn't an issue. I'm trying to figure out how to divide the data (in half, for example) and how to represent that on the command line. My reference genome is a MaSuRCA assembly 476,996,396 nt in length. Any guidance on how to use the -L interval option would be greatly appreciated. Thank you!
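To make the question concrete, here is a rough sketch of the kind of split I have in mind. It is not a tested GATK workflow. As far as I understand, -L accepts either a single interval written as contig:start-stop (1-based, inclusive) or a text file ending in .list with one interval or whole contig name per line. The sketch assumes the reference has a samtools-style index ($ref.fai, tab-separated, contig name and length in the first two columns). The function names and the intervals_NN.list file names are my own placeholders.

```python
import heapq
import sys


def parse_fai(lines):
    """Return (contig, length) pairs from the lines of a samtools .fai index."""
    contigs = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            contigs.append((fields[0], int(fields[1])))
    return contigs


def split_contigs(contigs, n_groups):
    """Greedily assign whole contigs to n_groups with roughly equal total length.

    Contigs are taken longest first and each goes to whichever group is
    currently smallest, so no contig is ever cut in the middle.
    """
    heap = [(0, i) for i in range(n_groups)]  # (total length so far, group index)
    heapq.heapify(heap)
    groups = [[] for _ in range(n_groups)]
    for name, length in sorted(contigs, key=lambda c: c[1], reverse=True):
        total, idx = heapq.heappop(heap)
        groups[idx].append(name)
        heapq.heappush(heap, (total + length, idx))
    return groups


def write_interval_lists(groups, prefix="intervals_"):
    """Write one .list file per non-empty group, one contig name per line."""
    paths = []
    for i, names in enumerate(groups):
        if not names:
            continue
        path = f"{prefix}{i:02d}.list"  # hypothetical naming scheme
        with open(path, "w") as out:
            out.write("\n".join(names) + "\n")
        paths.append(path)
    return paths


# Command-line use (script name is a placeholder):
#   python split_fai.py reference.fasta.fai 2
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as fh:
        grouped = split_contigs(parse_fai(fh), int(sys.argv[2]))
    for list_path in write_interval_lists(grouped):
        print(list_path)
```

Each cluster job would then get one of the files, e.g. -L intervals_00.list in place of -L ?:?-?, with a different -O name per group so the outputs don't overwrite each other. The per-group gVCFs for a sample would still have to be combined afterwards, which this sketch does not cover.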