HaplotypeCaller taking too long
REQUIRED for all errors and issues:
a) GATK version used: 4.2.3.0
b) Exact command used: gatk HaplotypeCaller -R /home/databench/GRCz11_RefData/GRCz11.fa -I Routput_sorted_dedup_bqsr_reads.bam -O Routput_raw_variants.vcf
c) Entire program log:
18:25:05.299 INFO ProgressMeter - 14:1992316 8753.4 3960060 452.4
18:25:34.088 INFO ProgressMeter - 14:1996432 8753.9 3960090 452.4
18:25:55.557 INFO ProgressMeter - 14:1997821 8754.2 3960100 452.4
18:26:26.915 INFO ProgressMeter - 14:2000058 8754.8 3960120 452.3
18:26:37.481 INFO ProgressMeter - 14:2001114 8754.9 3960130 452.3
18:27:45.162 INFO ProgressMeter - 14:2002965 8756.1 3960150 452.3
18:28:21.899 INFO ProgressMeter - 14:2004070 8756.7 3960160 452.2
18:28:35.086 INFO ProgressMeter - 14:2006777 8756.9 3960180 452.2
18:28:47.398 INFO ProgressMeter - 14:2032014 8757.1 3960320 452.2
18:29:19.681 INFO ProgressMeter - 14:2036251 8757.6 3960350 452.2
18:29:39.794 INFO ProgressMeter - 14:2041117 8758.0 3960380 452.2
18:30:01.622 INFO ProgressMeter - 14:2051532 8758.3 3960440 452.2
18:30:13.574 INFO ProgressMeter - 14:2055926 8758.5 3960470 452.2
18:30:29.773 INFO ProgressMeter - 14:2059184 8758.8 3960490 452.2
18:30:47.124 INFO ProgressMeter - 14:2076467 8759.1 3960590 452.2
18:30:58.978 INFO ProgressMeter - 14:2080304 8759.3 3960610 452.2
18:31:22.535 INFO ProgressMeter - 14:2085257 8759.7 3960640 452.1
18:32:57.897 INFO ProgressMeter - 14:2088696 8761.3 3960660 452.1
18:33:08.258 INFO ProgressMeter - 14:2094243 8761.4 3960690 452.1
18:34:02.166 INFO ProgressMeter - 14:2101834 8762.3 3960740 452.0
18:34:14.108 INFO ProgressMeter - 14:2105049 8762.5 3960760 452.0
18:34:24.907 INFO ProgressMeter - 14:2145528 8762.7 3960980 452.0
18:34:37.178 INFO ProgressMeter - 14:2162344 8762.9 3961090 452.0
18:34:58.437 INFO ProgressMeter - 14:2186151 8763.3 3961230 452.0
18:35:08.968 INFO ProgressMeter - 14:2187446 8763.5 3961240 452.0
18:35:43.907 INFO ProgressMeter - 14:2190036 8764.0 3961260 452.0
18:36:46.828 INFO ProgressMeter - 14:2191678 8765.1 3961270 451.9
18:36:57.580 INFO ProgressMeter - 14:2192891 8765.3 3961280 451.9
18:37:17.878 INFO ProgressMeter - 14:2193966 8765.6 3961290 451.9
18:37:41.166 INFO ProgressMeter - 14:2195284 8766.0 3961300 451.9
-
The run went on for 7 days and did not finish. Any suggestions on how to make it go faster?
-
Can you share the full log here? HaplotypeCaller runs faster when your system supports the Intel AVX instruction set, which its native PairHMM implementation uses. Does your system support AVX? Can you share your system details as well?
If your system permits, you may be able to speed up HaplotypeCaller by splitting your intervals across separate instances and running them in parallel. You can then merge all the calls into a single file.
Regards.
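As a rough illustration of the split-and-merge idea above, here is a minimal sketch that runs one HaplotypeCaller job per contig and then combines the shards with MergeVcfs. It assumes the FASTA index sits next to the reference as GRCz11.fa.fai; the names OUTDIR, JOBS, FINAL, list_contigs and run_scatter are made up for this example and are not part of GATK. Treat it as a starting point, not a tuned pipeline.

```shell
#!/usr/bin/env bash
# Sketch only: scatter HaplotypeCaller by contig, then merge the shards.
# OUTDIR, JOBS and FINAL are hypothetical names chosen for this example.
REF=${REF:-/home/databench/GRCz11_RefData/GRCz11.fa}
BAM=${BAM:-Routput_sorted_dedup_bqsr_reads.bam}
OUTDIR=${OUTDIR:-hc_scatter}
JOBS=${JOBS:-4}                       # how many contigs to call at once
FINAL=${FINAL:-Routput_raw_variants.vcf}
GATK=${GATK:-gatk}

# Contig names are the first column of the FASTA index (.fai).
list_contigs() {
    cut -f1 "$1"
}

run_scatter() {
    local fai=${1:-$REF.fai}
    mkdir -p "$OUTDIR"

    # One HaplotypeCaller process per contig, at most $JOBS at a time.
    # xargs substitutes {} with the contig name in -L and in the output path.
    list_contigs "$fai" |
        xargs -P "$JOBS" -I{} "$GATK" HaplotypeCaller \
            -R "$REF" -I "$BAM" -L {} -O "$OUTDIR/{}.vcf.gz" || return 1

    # MergeVcfs accepts a file with a .list suffix naming one VCF per line.
    awk -v d="$OUTDIR" '{ print d "/" $1 ".vcf.gz" }' "$fai" > "$OUTDIR/shards.list"
    "$GATK" MergeVcfs -I "$OUTDIR/shards.list" -O "$FINAL"
}
```

Calling run_scatter with no arguments uses the reference and BAM from the original command. Each job still starts its own PairHMM threads and its own JVM, so JOBS probably needs to stay well below the core count and within the available memory. A reference with many small unplaced scaffolds will produce many short jobs, and grouping those into a single interval list may be worth considering.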
-
The log is very long (it went on for 6 days). I stopped the run at the end. Yes, my system supports AVX.
Can you help me split the intervals into separate instances?
-
gatk-4.4.0.0/gatk HaplotypeCaller -R /home/databench/GRCz11_RefData/GRCz11.fa -I Routput_sorted_dedup_bqsr_reads.bam -O Routput_raw_variants.vcf
Using GATK jar /home/databench/gatk-4.4.0.0/gatk-package-4.4.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/databench/gatk-4.4.0.0/gatk-package-4.4.0.0-local.jar HaplotypeCaller -R /home/databench/GRCz11_RefData/GRCz11.fa -I Routput_sorted_dedup_bqsr_reads.bam -O Routput_raw_variants.vcf
22:25:04.962 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/databench/gatk-4.4.0.0/gatk-package-4.4.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
22:25:04.989 INFO HaplotypeCaller - ------------------------------------------------------------
22:25:04.992 INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.4.0.0
22:25:04.992 INFO HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
22:25:04.992 INFO HaplotypeCaller - Executing as lab_anjohnson@anjohnson-pve-linux on Linux v5.10.0-23-amd64 amd64
22:25:04.992 INFO HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v17.0.6+10-Debian-1deb11u1
22:25:04.992 INFO HaplotypeCaller - Start Date/Time: July 10, 2023 at 10:25:04 PM CDT
22:25:04.992 INFO HaplotypeCaller - ------------------------------------------------------------
22:25:04.992 INFO HaplotypeCaller - ------------------------------------------------------------
22:25:04.993 INFO HaplotypeCaller - HTSJDK Version: 3.0.5
22:25:04.993 INFO HaplotypeCaller - Picard Version: 3.0.0
22:25:04.993 INFO HaplotypeCaller - Built for Spark Version: 3.3.1
22:25:04.993 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
22:25:04.994 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
22:25:04.994 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
22:25:04.994 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
22:25:04.994 INFO HaplotypeCaller - Deflater: IntelDeflater
22:25:04.994 INFO HaplotypeCaller - Inflater: IntelInflater
22:25:04.994 INFO HaplotypeCaller - GCS max retries/reopens: 20
22:25:04.994 INFO HaplotypeCaller - Requester pays: disabled
22:25:04.995 INFO HaplotypeCaller - Initializing engine
22:25:05.121 INFO HaplotypeCaller - Done initializing engine
22:25:05.128 INFO HaplotypeCallerEngine - Disabling physical phasing, which is supported only for reference-model confidence output
22:25:05.138 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/home/databench/gatk-4.4.0.0/gatk-package-4.4.0.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
22:25:05.139 INFO NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/home/databench/gatk-4.4.0.0/gatk-package-4.4.0.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
22:25:05.151 INFO IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
22:25:05.151 INFO IntelPairHmm - Available threads: 4
22:25:05.151 INFO IntelPairHmm - Requested threads: 4
22:25:05.151 INFO PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
22:25:05.170 INFO ProgressMeter - Starting traversal
22:25:05.171 INFO ProgressMeter - Current Locus Elapsed Minutes Regions Processed Regions/Minute
22:25:05.452 WARN InbreedingCoeff - InbreedingCoeff will not be calculated at position 1:2036 and possibly subsequent; at least 10 samples must have called genotypes
22:25:16.054 INFO ProgressMeter - 1:20704 0.2 150 827.0
22:25:27.092 INFO ProgressMeter - 1:48916 0.4 370 1012.8
22:25:38.054 INFO ProgressMeter - 1:121562 0.5 850 1551.0
22:25:48.866 INFO ProgressMeter - 1:154324 0.7 1080 1483.0
22:25:59.420 INFO ProgressMeter - 1:190904 0.9 1330 1471.0
22:26:10.782 INFO ProgressMeter - 1:225478 1.1 1560 1426.6
-
Is this running on a virtual machine such as VirtualBox? Those instances can have very limited I/O speeds, and it looks like the number of threads available to your VM is also limited. I would suggest using a native Linux instance, or the Docker version of GATK on your main OS without I/O and thread restrictions; you should see an increase in performance.
-
Yes, this is a virtual machine. I managed to increase the number of threads on it, and it is currently moving 3 times faster. Will this run take a long time to finish?
-
I bet the biggest time loss is in I/O performance. Increasing the PairHMM thread count may speed up PairHMM, but not the other parts where I/O performance is the limit. Let's see if this runs to the end.
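For anyone wanting to check what the VM actually exposes, the sketch below lists the AVX-family CPU flags and passes an explicit PairHMM thread count to HaplotypeCaller through --native-pair-hmm-threads. The function names avx_flags and run_hc_threads are invented for this example; the paths are the ones from the original command. As noted above, more PairHMM threads will not help the parts of the run that are I/O bound.

```shell
#!/usr/bin/env bash
# Sketch only: inspect CPU features, then request more PairHMM threads.
# avx_flags and run_hc_threads are hypothetical helper names.

# Print the distinct AVX-family flags from a cpuinfo-style file
# (defaults to /proc/cpuinfo on Linux); prints an empty line if none.
avx_flags() {
    grep -m1 '^flags' "${1:-/proc/cpuinfo}" |
        tr ' ' '\n' | grep '^avx' | sort -u | paste -sd' ' -
}

# Run HaplotypeCaller with an explicit native PairHMM thread count.
# With no argument, use every core the machine (or VM) reports.
run_hc_threads() {
    local threads=${1:-$(nproc)}
    "${GATK:-gatk}" HaplotypeCaller \
        -R /home/databench/GRCz11_RefData/GRCz11.fa \
        -I Routput_sorted_dedup_bqsr_reads.bam \
        -O Routput_raw_variants.vcf \
        --native-pair-hmm-threads "$threads"
}
```

The "Available threads" and "Requested threads" lines that IntelPairHmm writes at startup show whether the requested count took effect.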
-
Is there a way to tell how long it will take to be done, or when it is close to the end?
-
There was a way in the older GATK 3.x releases, but for 4.x you have to guesstimate based on the lengths of your contigs. Given the size of the zebrafish genome, it should finish sooner than a human genome would, but I cannot be sure about the complexity of the genome, so your mileage may vary.
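The contig-length guesstimate can be sketched as follows: take the contig lengths from the reference's .fai index, add up everything before the current contig plus the current position, and divide by the total. This assumes HaplotypeCaller walks the contigs in index order (no -L given) and that time per base is roughly constant, which it is not in practice, so the result is only a rough indicator. The function names progress_percent and eta_minutes are invented for this example.

```shell
#!/usr/bin/env bash
# Sketch only: estimate progress from a ProgressMeter locus such as 14:2195284.
# progress_percent and eta_minutes are hypothetical helper names.

# progress_percent FAI LOCUS
# Prints the percentage of reference bases lying before LOCUS,
# using column 1 (name) and column 2 (length) of the .fai index.
progress_percent() {
    local fai=$1 locus=$2
    awk -v contig="${locus%%:*}" -v pos="${locus##*:}" '
        { total += $2 }
        !found && $1 == contig { done_bp = total - $2 + pos; found = 1 }
        END {
            if (!found) { print "contig not found in index" > "/dev/stderr"; exit 1 }
            printf "%.1f\n", 100 * done_bp / total
        }' "$fai"
}

# eta_minutes PERCENT_DONE ELAPSED_MINUTES
# Linear extrapolation of the remaining minutes.
eta_minutes() {
    awk -v p="$1" -v e="$2" 'BEGIN { printf "%.0f\n", e * (100 - p) / p }'
}
```

For example, progress_percent /home/databench/GRCz11_RefData/GRCz11.fa.fai 14:2195284 would report how far through the reference that log line is, and feeding the result to eta_minutes together with the "Elapsed Minutes" column gives a crude remaining-time figure.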
-
The current locus is 3 after 238 minutes. I assume there are 26 thousand loci in zebrafish, so I guess I have a long wait! :(
-
Hello Skywarrior! Just wanted to let you know the run is finally over (took 28hrs).
Thank you so much for your response, I really appreciate it.