Cannot parallelize GATK
So I am rather confused about what is going on. I am trying to use HaplotypeCaller to genotype ~1100 samples. I ran this exact code on the entire set and got through ~800; while it was actively working on around 64 samples, I got an error. Now whenever I try to restart the code, I get an error that it cannot find the files, and I'm not sure why. I can see it says the file doesn't exist, with %09%20%20 followed by the name of another .bam file. Surely my script is just listing all .bam files in the directory and not relying on an out-of-date file. I'm running on a cluster that doesn't require submission to a scheduler, but that means I cannot easily update to the latest GATK.
Edit: I should add that I removed all the completed files. I've also tried with just one bam and its index in the directory, and it works completely fine. And I've tried with two older bams that previously completed, and they no longer work.
a) GATK version used:
4.2.0.0
b) Exact command used:
ls *.sorted.bam | parallel --eta -j 64 "gatk --java-options "-Xmx8g" HaplotypeCaller -R CascadeUnmasked.fasta -ploidy 2 -I {} -O {.}.g.vcf -ERC GVCF"
c) Entire program log:
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx4g -jar /local/cluster/gatk-4.2.0.0/gatk-package-4.2.0.0-local.jar HaplotypeCaller -R CascadeUnmasked.fasta -ploidy 2 -I AAC5HWMM5_21538.sorted.bam AAC7WMGM5_W1130-068.sorted.bam -O AAC5HWMM5_21538.sorted.bam AAC7WMGM5_W1130-068.sorted.g.vcf -ERC GVCF
16:56:27.082 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/local/cluster/gatk-4.2.0.0/gatk-package-4.2.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Nov 02, 2022 4:56:27 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
16:56:27.225 INFO HaplotypeCaller - ------------------------------------------------------------
16:56:27.226 INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.2.0.0
16:56:27.226 INFO HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
16:56:27.227 INFO HaplotypeCaller - Executing as clareshaun@hoser.cgrb.oregonstate.local on Linux v3.10.0-1062.4.1.el7.x86_64 amd64
16:56:27.227 INFO HaplotypeCaller - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_71-b15
16:56:27.227 INFO HaplotypeCaller - Start Date/Time: November 2, 2022 4:56:27 PM PDT
16:56:27.227 INFO HaplotypeCaller - ------------------------------------------------------------
16:56:27.227 INFO HaplotypeCaller - ------------------------------------------------------------
16:56:27.228 INFO HaplotypeCaller - HTSJDK Version: 2.24.0
16:56:27.228 INFO HaplotypeCaller - Picard Version: 2.25.0
16:56:27.228 INFO HaplotypeCaller - Built for Spark Version: 2.4.5
16:56:27.228 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
16:56:27.228 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
16:56:27.228 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
16:56:27.228 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
16:56:27.228 INFO HaplotypeCaller - Deflater: IntelDeflater
16:56:27.228 INFO HaplotypeCaller - Inflater: IntelInflater
16:56:27.228 INFO HaplotypeCaller - GCS max retries/reopens: 20
16:56:27.228 INFO HaplotypeCaller - Requester pays: disabled
16:56:27.228 INFO HaplotypeCaller - Initializing engine
16:56:27.270 INFO HaplotypeCaller - Shutting down engine
[November 2, 2022 4:56:27 PM PDT] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=1256194048
***********************************************************************
A USER ERROR has occurred: Couldn't read file. Error was: AAC5HWMM5_21538.sorted.bam AAC7WMGM5_W1130-068.sorted.bam with exception: Cannot read non-existent file: file:///nfs4/HORT/Bassil_Lab/HopSex/AAC5HWMM5_21538.sorted.bam%09%20%20AAC7WMGM5_W1130-068.sorted.bam
***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
ETA: 0s Left: 0 AVG: 0.52s local:0/139/100%/0.5s
1361.086u 140.424s 1:16.11 1972.8% 0+0k 0+214168io 0pf+0w
-
Hi Shaun Clare,
Thank you for writing to the GATK forum! I hope that we can help you sort this out.
You are correct that it seems to append another filename to the current one; the %09%20%20 means that there's a TAB character and two spaces between file:///nfs4/HORT/Bassil_Lab/HopSex/AAC5HWMM5_21538.sorted.bam and AAC7WMGM5_W1130-068.sorted.bam. I can only explain this by the command getting a malformed input string, which means that the problem is not GATK, but that something in your piping from ls | parallel … doesn't work as you expect. Can you please verify that GATK gets exactly one file path per call?
Thank you! I look forward to your reply.
Best,
Anthony
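One way to check this is sketched below. It only reproduces the malformed argument from the error message (the two file names are copied from the log; the variable names are made up for illustration) and counts how many names ended up packed into a single argument. It does not call GATK or parallel.

```shell
# Reproduce the argument from the error message: two BAM names joined by
# a TAB (%09) and two spaces (%20%20). Variable names are illustrative only.
arg=$(printf 'AAC5HWMM5_21538.sorted.bam\t  AAC7WMGM5_W1130-068.sorted.bam')

# Count the whitespace-separated names inside that single argument.
# A well-formed argument would contain exactly one name.
names_in_arg=$(printf '%s\n' "$arg" | wc -w | tr -d ' ')

# Count input lines that contain any whitespace; a clean list gives 0.
bad_lines=$(printf '%s\n' "$arg" | grep -c '[[:space:]]' || true)

echo "names in one argument: $names_in_arg, lines with whitespace: $bad_lines"
```

The same grep check can be applied to the real listing by piping the output of ls into it instead of the simulated line. GNU parallel also has a --dry-run option that prints the commands it would run without executing them, which should show directly whether each gatk call receives one path.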
-
You can try writing the BAM list to a file, then use parallel -a to read that list as input.
For example:
ls *.sorted.bam > input_list # Not the best way, to be frank
find "$PWD" -name "*.sorted.bam" > input_list # Recommended, also gives the full path of the bam files
parallel -a input_list ~~~ rest of the command ~~~
We've also noticed that parallel can sometimes get fussy when using multiple repeating {} elements.
You could make a function for HaplotypeCaller, export it, and then call the function with parallel.
# Assuming bash
haplo_call () {
    input="$1"
    output="${input/.sorted.bam/.gvcf.gz}"
    gatk HaplotypeCaller -I "$input" -O "$output" ~~~ your other options ~~~
}
export -f haplo_call
parallel -a input_list haplo_call {1}
-
I managed to fix it by modifying ls to write as one column using:
ls -1 *.sorted.bam | <rest of command>
I'm not sure why it worked in the first place, or maybe I accidentally deleted part of my code at some point. Thank you for your replies. I mainly wanted to do it this way to avoid writing extra files.
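For anyone who wants to avoid depending on how ls formats its output, here is a minimal sketch of an alternative: let the shell expand the glob and print one match per line with printf. The scratch directory and the two sample names are hypothetical and exist only so the snippet is safe to run anywhere; in practice the glob would be run in the directory holding the real BAM files.

```shell
# Scratch directory with two hypothetical BAM names (illustration only).
workdir=$(mktemp -d)
touch "$workdir/sampleA.sorted.bam" "$workdir/sampleB.sorted.bam"

# printf is a shell builtin: it prints each glob match on its own line,
# regardless of any ls alias or column formatting.
printf '%s\n' "$workdir"/*.sorted.bam > "$workdir/input_list"

# Sanity checks: one line per file, and no line containing whitespace.
names=$(wc -l < "$workdir/input_list" | tr -d ' ')
bad=$(grep -c '[[:space:]]' "$workdir/input_list" || true)
echo "names=$names lines_with_whitespace=$bad"
```

The printf output can be piped straight into parallel in place of ls, so no list file is needed. GNU parallel also accepts arguments after ::: on the command line (for example ::: *.sorted.bam), which skips the pipe entirely.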
-
Hi Shaun Clare,
I'm glad that we were able to help solve this issue collectively! Thank you for being a valued contributor to the GATK community.
Please do not hesitate to reach out with any other questions/issues in the future!
kvn95ss Thank you for your contribution to the GATK forum. We greatly value collaboration between other members of the GATK community.
Best,
Anthony