HaplotypeCaller too long?
Dear GATK team,
I would like to run variant calling with HaplotypeCaller in GVCF mode on a 30X human genome (aligned to hg38).
I use GATK 4.2.0.0 and run the variant calling per interval, with 1 CPU and 5 GB of memory for each interval.
I use the 50 intervals you provide in your bundle.
For the majority of the intervals, variant calling takes between 1 and 3 hours. For some, however, it takes much longer, especially intervals 0003 and 0041, which take about 12 hours each. I have tried multiple samples and still get this problem. Is this normal? How can I improve this?
Command line:
for i in {0000..0049}
do
srun --ntasks=1 gatk --java-options "-Xmx${SLURM_MEM_PER_CPU}M" HaplotypeCaller \
-R ${REF_Genome} \
-L ${Interval_DIR}/${i}.scattered.interval_list \
-I ${BAM_INPUT_DIR}/${BAM_INPUT} \
-O ${GVCF_OUTPUT_DIR}/${GVCF_OUTPUT}.${i} \
-G StandardAnnotation -G AS_StandardAnnotation -G StandardHCAnnotation \
-GQB 10 -GQB 20 -GQB 30 -GQB 40 -GQB 50 -GQB 60 -GQB 70 -GQB 80 -GQB 90 \
-ERC GVCF \
--pcr-indel-model NONE \
--tmp-dir ${TMP_DIR} &
done
-
Quentin Chartreux, is it running at the same speed and updating the whole time, or do these two intervals start out normally and then slow down?
-
Maybe I can show you the regions/minute values from the log file as graphs.
For a "normal" interval I obtained this type of graph (the speed is more or less constant):
For interval 003:
and for 041:
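In case it helps anyone reproduce these graphs, here is a minimal sketch of how the values can be pulled out of a HaplotypeCaller log. It assumes the ProgressMeter line layout that GATK 4.x prints (timestamp, INFO, ProgressMeter, "-", locus, elapsed minutes, regions processed, regions/minute). The function name `progress_table` is made up for this example.

```shell
#!/bin/bash
# progress_table LOGFILE
# Print "elapsed_minutes<TAB>regions_per_minute" for every ProgressMeter
# progress line in a GATK log. Assumes lines shaped like:
#   09:01:21.509 INFO ProgressMeter - 1:168563 0.2 810 4560.4
# The header line, "Starting traversal" and WARN lines are skipped because
# their fifth field is not a contig:position locus.
progress_table() {
    awk '$2 == "INFO" && $3 == "ProgressMeter" && $5 ~ /:[0-9]+$/ {
             # last three fields: elapsed minutes, regions processed, regions/minute
             print $(NF-2) "\t" $NF
         }' "$1"
}
```

The two-column output can then be plotted with any tool, for example by redirecting it to a file with `progress_table my_interval.log > speed.tsv` (the file names here are placeholders).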
-
Thanks for sharing these.
At least for interval 003, a common reason for such a slowdown is that the process runs out of memory and falls back on temporary storage. This results in a lot of file I/O as the tool runs, which is much slower than working in memory. So you probably want to increase the memory for 003.
041 is a bit puzzling to me: it looks like it never reaches a speed comparable to the other intervals you shared. You may be getting a lot of read depth over this interval, or a lot of alternate alleles. If you don't want to change other parameters that could affect the results, you could first try increasing the memory here as well.
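As a rough sketch of what increasing the memory could look like with the command from the question: the helper below derives the Java heap size from the Slurm allocation and leaves some headroom, rather than setting `-Xmx` to the full allocation. The function names `java_heap_mb` and `run_interval` and the 1024 MB default headroom are assumptions made for illustration, not a GATK recommendation. The other variables are the ones from the original script.

```shell
#!/bin/bash
# Hypothetical helper: given the per-CPU allocation in MB (the unit Slurm uses
# for SLURM_MEM_PER_CPU), print a Java heap size in MB that leaves headroom
# for non-heap JVM memory. The 1024 MB default headroom is an assumption.
java_heap_mb() {
    local alloc_mb=$1
    local headroom_mb=${2:-1024}
    if [ "$alloc_mb" -le "$headroom_mb" ]; then
        echo "allocation (${alloc_mb} MB) too small for ${headroom_mb} MB headroom" >&2
        return 1
    fi
    echo $(( alloc_mb - headroom_mb ))
}

# Hypothetical wrapper around the HaplotypeCaller command from the question,
# for one interval number (e.g. 0003). Only the -Xmx value differs.
run_interval() {
    local i=$1
    local heap
    heap=$(java_heap_mb "${SLURM_MEM_PER_CPU}") || return 1
    srun --ntasks=1 gatk --java-options "-Xmx${heap}M" HaplotypeCaller \
        -R "${REF_Genome}" \
        -L "${Interval_DIR}/${i}.scattered.interval_list" \
        -I "${BAM_INPUT_DIR}/${BAM_INPUT}" \
        -O "${GVCF_OUTPUT_DIR}/${GVCF_OUTPUT}.${i}" \
        -G StandardAnnotation -G AS_StandardAnnotation -G StandardHCAnnotation \
        -GQB 10 -GQB 20 -GQB 30 -GQB 40 -GQB 50 -GQB 60 -GQB 70 -GQB 80 -GQB 90 \
        -ERC GVCF \
        --pcr-indel-model NONE \
        --tmp-dir "${TMP_DIR}"
}
```

With a larger request in the job header, for example `#SBATCH --mem-per-cpu=10G` (an illustrative value only), `java_heap_mb 10240` prints 9216, so HaplotypeCaller would run with `-Xmx9216M`.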
-
Thanks a lot,
After increasing the memory and rerunning all 50 intervals, all of them finished in 2h28.
-
Glad to hear! Thanks for the update.
-
Dear GATK team,
This is my first time running GATK, and my GATK 4.3 run has been going for a long time (> 17 hrs) and has still not finished.
1. Is there any way to speed up the process? I have 150 WGS samples ....
2. The 'ChromosomeCounts', 'FisherStrand', 'StrandOddsRatio', and 'QualByDepth' annotations have been disabled. Is this normal for downstream analysis?
3. The program also gives me warning messages, shown in the log file below the command.
#!/bin/bash
#$ -N gatk
#$ -M wonde.ayalew@slu.se
#$ -m seab # this is what notification you want to receive about the job on your mail (start ; end ; error ; killed)
#$ -cwd #Use the directory you're running from
#$ -l h_rt=48:0:0,h_vmem=4G
#Setting running time in hours:min:sec and the memory required for the job per cpus (12*2=24g of RAM)
#$ -j y #Joining the output from standard out and standard error to one file
#$ -pe smp 12 #Setting the number of threads for the job to best fit for the system between 1 and 48.
#$ -e haplo-errAfar0.log #stderr output stored in the log file
#$ -o haplo-outAfar0.log #stdout output stored in the log file
module load conda
source ../../../opt/sw/conda/3/etc/profile.d/conda.sh
module load gatk/4.3
### HaplotypeCaller ####
gatk HaplotypeCaller -R ../wonde/Bos_taurus.ARS-UCD1.2.dna.toplevel.fa -I ../data/Afar_1_dedup.bam -O ../wonde/Afar1.g.vcf.gz -ERC GVCF
Running GATK log file .......
Using GATK jar /export/opt/sw/gatk/4.3/gatk4_env/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /export/opt/sw/gatk/4.3/gatk4_env/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar HaplotypeCaller -R ../Refg/ARS-UCD1.2_Btau5.0.1Y.fa -I ../data/Afar_rD/trim/Afar_11_dedup.bam -O Afar11.g.vcf.gz -ERC GVCF
09:01:10.011 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/export/opt/sw/gatk/4.3/gatk4_env/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
09:01:10.182 INFO HaplotypeCaller - ------------------------------------------------------------
09:01:10.183 INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.3.0.0
09:01:10.183 INFO HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
09:01:10.183 INFO HaplotypeCaller - Executing as wondossen@compute4.c.hgen.slu.se on Linux v3.10.0-693.21.1.el7.x86_64 amd64
09:01:10.183 INFO HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v11.0.13+7-b1751.21
09:01:10.183 INFO HaplotypeCaller - Start Date/Time: 18 February 2023 at 09:01:09 GMT
09:01:10.184 INFO HaplotypeCaller - ------------------------------------------------------------
09:01:10.184 INFO HaplotypeCaller - ------------------------------------------------------------
09:01:10.185 INFO HaplotypeCaller - HTSJDK Version: 3.0.1
09:01:10.185 INFO HaplotypeCaller - Picard Version: 2.27.5
09:01:10.185 INFO HaplotypeCaller - Built for Spark Version: 2.4.5
09:01:10.185 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
09:01:10.185 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
09:01:10.185 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
09:01:10.185 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
09:01:10.185 INFO HaplotypeCaller - Deflater: IntelDeflater
09:01:10.186 INFO HaplotypeCaller - Inflater: IntelInflater
09:01:10.186 INFO HaplotypeCaller - GCS max retries/reopens: 20
09:01:10.186 INFO HaplotypeCaller - Requester pays: disabled
09:01:10.186 INFO HaplotypeCaller - Initializing engine
09:01:10.584 INFO HaplotypeCaller - Done initializing engine
09:01:10.587 INFO HaplotypeCallerEngine - Tool is in reference confidence mode and the annotation, the following changes will be made to any specified annotations: 'StrandBiasBySample' will be enabled. 'ChromosomeCounts', 'FisherStrand', 'StrandOddsRatio' and 'QualByDepth' annotations have been disabled
09:01:10.686 INFO HaplotypeCallerEngine - Standard Emitting and Calling confidence set to -0.0 for reference-model confidence output
09:01:10.686 INFO HaplotypeCallerEngine - All sites annotated with PLs forced to true for reference-model confidence output
09:01:10.712 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/export/opt/sw/gatk/4.3/gatk4_env/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
09:01:10.715 INFO NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/export/opt/sw/gatk/4.3/gatk4_env/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
09:01:10.756 INFO IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
09:01:10.757 INFO IntelPairHmm - Available threads: 48
09:01:10.757 INFO IntelPairHmm - Requested threads: 4
09:01:10.757 INFO PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
09:01:10.852 INFO ProgressMeter - Starting traversal
09:01:10.852 INFO ProgressMeter - Current Locus Elapsed Minutes Regions Processed Regions/Minute
09:01:11.465 WARN InbreedingCoeff - InbreedingCoeff will not be calculated at position 1:25960 and possibly subsequent; at least 10 samples must have called genotypes
09:01:21.509 INFO ProgressMeter - 1:168563 0.2 810 4560.4
09:01:31.520 INFO ProgressMeter - 1:298226 0.3 1730 5022.3
09:01:41.755 INFO ProgressMeter - 1:393775 0.5 2460 4776.2
09:01:52.569 INFO ProgressMeter - 1:423496 0.7 2710 3897.7
09:02:02.817 INFO ProgressMeter - 1:440892 0.9 2860 3302.2
09:02:14.172 INFO ProgressMeter - 1:457041 1.1 2990 2833.2
09:02:24.254 INFO ProgressMeter - 1:475053 1.2 3140 2566.7
09:02:34.377 INFO ProgressMeter - 1:494323 1.4 3240 2327.4
09:02:46.505 INFO ProgressMeter - 1:504580 1.6 3310 2076.3
09:02:57.508 INFO ProgressMeter - 1:514393 1.8 3380 1901.4
09:03:07.753 INFO ProgressMeter - 1:523101 1.9 3440 1765.6
09:03:18.212 INFO ProgressMeter - 1:533070 2.1 3510 1653.6
09:03:28.221 INFO ProgressMeter - 1:551630 2.3 3660 1598.6
09:03:39.305 INFO ProgressMeter - 1:574427 2.5 3860 1560.1
09:03:50.586 INFO ProgressMeter - 1:586123 2.7 3960 1487.5
09:04:02.003 INFO ProgressMeter - 1:605887 2.9 4130 1447.8
09:04:12.463 INFO ProgressMeter - 1:619129 3.0 4240 1400.8
09:04:23.784 INFO ProgressMeter - 1:630362 3.2 4330 1346.6
09:04:33.793 INFO ProgressMeter - 1:646653 3.4 4470 1321.6
09:04:43.964 INFO ProgressMeter - 1:655160 3.6 4530 1275.4
09:04:54.235 INFO ProgressMeter - 1:674183 3.7 4680 1257.0
09:05:04.578 INFO ProgressMeter - 1:689703 3.9 4810 1234.8
09:05:14.858 INFO ProgressMeter - 1:719915 4.1 5070 1246.7
09:05:24.965 INFO ProgressMeter - 1:806192 4.2 5750 1357.7
09:05:35.403 INFO ProgressMeter - 1:856692 4.4 6200 1406.2
09:05:45.427 INFO ProgressMeter - 1:973319 4.6 7130 1558.0
09:05:55.136 WARN DepthPerSampleHC - Annotation will not be calculated at position 1:1085154 and possibly subsequent; genotype for sample Afa1 is not called
09:05:55.137 WARN StrandBiasBySample - Annotation will not be calculated at position 1:1085154 and possibly subsequent; genotype for sample Afa1 is not called
09:05:55.450 INFO ProgressMeter - 1:1087402 4.7 8020 1690.8
09:06:05.494 INFO ProgressMeter - 1:1153066 4.9 8590 1749.2
09:06:15.516 INFO ProgressMeter - 1:1218256 5.1 9130 1798.0
09:06:25.564 INFO ProgressMeter - 1:1319886 5.2 9960 1898.9