GATK GenotypeGVCFs stuck at starting traversal
Hi,
I am trying to joint-call SNPs in 5 Mbp intervals. My HPC only allows jobs of up to 10 days, and larger intervals fail because they run out of time. I have managed to call 104 of the 5 Mbp intervals, but I noticed that a few of them were stuck for DAYS at the "Starting traversal" step. Most finished within 10 days despite the delay, but 2 intervals did not. I reran those two separately, and they got stuck again. My GenomicsDB workspaces are split by chromosome, and these two intervals are on two different chromosomes, so the delay is not caused by joint calling inside the same workspace. I don't know what else to try. This problem is delaying my research substantially, and it is too late to switch to a different SNP caller. I am using GATK version 4.2.6.1. The script is below:
#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 1
#$ -l node_type=ddy
#$ -l h_vmem=60G
#$ -l h_rt=240:0:0
#$ -t 1-2
file="/data/SBCS-BuggsLab-Oak/Oak/Variants/intervals.txt"
CHROM=$(sed -n "${SGE_TASK_ID}p" $file | cut -f1)
SIZE=$(sed -n "${SGE_TASK_ID}p" $file | cut -f2)
OUTPUT=$(sed -n "${SGE_TASK_ID}p" $file | cut -f3)
module load gatk/4.2.6.1
gatk GenotypeGVCFs -R /data/SBCS-BuggsLab-Oak/Romulo/Raw_sequences/Reference/Qrob_PM1N.fa \
-V gendb:///data/SBCS-BuggsLab-Oak/Oak/DBImport/$CHROM \
--intervals $SIZE \
-O /data/SBCS-BuggsLab-Oak/Romulo/Variants/$OUTPUT\.vcf.gz
Output:
10:53:22.680 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.2.6.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
10:53:22.961 INFO GenotypeGVCFs - ------------------------------------------------------------
10:53:22.962 INFO GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.2.6.1
10:53:22.962 INFO GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
10:53:22.962 INFO GenotypeGVCFs - Executing as mpx543@ddy137 on Linux v3.10.0-1160.83.1.el7.x86_64 amd64
10:53:22.962 INFO GenotypeGVCFs - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_242-8u242-b08-0ubuntu3~18.04-b08
10:53:22.962 INFO GenotypeGVCFs - Start Date/Time: July 4, 2023 10:53:22 AM GMT
10:53:22.962 INFO GenotypeGVCFs - ------------------------------------------------------------
10:53:22.962 INFO GenotypeGVCFs - ------------------------------------------------------------
10:53:22.963 INFO GenotypeGVCFs - HTSJDK Version: 2.24.1
10:53:22.963 INFO GenotypeGVCFs - Picard Version: 2.27.1
10:53:22.963 INFO GenotypeGVCFs - Built for Spark Version: 2.4.5
10:53:22.964 INFO GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
10:53:22.964 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
10:53:22.964 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
10:53:22.970 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
10:53:22.970 INFO GenotypeGVCFs - Deflater: IntelDeflater
10:53:22.970 INFO GenotypeGVCFs - Inflater: IntelInflater
10:53:22.970 INFO GenotypeGVCFs - GCS max retries/reopens: 20
10:53:22.971 INFO GenotypeGVCFs - Requester pays: disabled
10:53:22.971 INFO GenotypeGVCFs - Initializing engine
10:53:24.200 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.4.3-6069e4a
11:54:31.382 info NativeGenomicsDB - pid=99240 tid=99241 No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
11:54:31.382 info NativeGenomicsDB - pid=99240 tid=99241 No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
11:54:31.382 info NativeGenomicsDB - pid=99240 tid=99241 No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
10:55:47.089 INFO IntervalArgumentCollection - Processing 5000000 bp from intervals
10:55:47.110 INFO GenotypeGVCFs - Done initializing engine
10:55:47.377 INFO ProgressMeter - Starting traversal
10:55:47.383 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
-
Hi Rômulo Carleial,
When running this tool with a GenomicsDB database as input, it's important to specify appropriate memory limits for Java in order to leave sufficient free memory for GenomicsDB, which is a native library. Failing to do so can cause significant slowdowns. As an example, if each parallel task has 32 GB of physical memory available, you might try limiting Java to 16 GB so as to leave 16 GB free for GenomicsDB to use. You can limit Java memory usage using the -Xmx argument, which you can pass to GATK like so:
gatk --java-options "-Xmx16G" ...<rest of command>
Another argument that can help in a cluster environment is "--genomicsdb-shared-posixfs-optimizations", which improves GenomicsDB performance on shared POSIX filesystems such as NFS and Lustre, which are commonly used in compute clusters.
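As a rough sketch, the two suggestions above could be applied to the script from the question like this. The helper java_heap_gb is hypothetical, and its half-and-half split simply mirrors the 32 GB / 16 GB example; it is an assumption, not a GATK rule, and the right ratio may differ for your data. CHROM, SIZE and OUTPUT are assumed to be set as in the original script. The command is only printed (dry run) until the last line is uncommented.

```shell
#!/bin/bash
# Hypothetical helper: give Java half of the job's memory and leave the
# other half free for the native GenomicsDB library (assumed ratio).
java_heap_gb() {
    local total_gb=$1
    echo $(( total_gb / 2 ))
}

TOTAL_GB=60                          # matches "#$ -l h_vmem=60G" in the job script
HEAP_GB=$(java_heap_gb "$TOTAL_GB")

# Build the command as an array so it can be inspected before it is run.
cmd=(gatk --java-options "-Xmx${HEAP_GB}G" GenotypeGVCFs
     -R /data/SBCS-BuggsLab-Oak/Romulo/Raw_sequences/Reference/Qrob_PM1N.fa
     -V "gendb:///data/SBCS-BuggsLab-Oak/Oak/DBImport/${CHROM}"
     --intervals "${SIZE}"
     --genomicsdb-shared-posixfs-optimizations
     -O "/data/SBCS-BuggsLab-Oak/Romulo/Variants/${OUTPUT}.vcf.gz")

printf '%q ' "${cmd[@]}"; echo       # dry run: print the full command
# "${cmd[@]}"                        # uncomment to actually run GATK
```

With h_vmem=60G this passes -Xmx30G to Java, leaving roughly 30 GB for GenomicsDB.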
Lastly, if these suggestions don't help, you can try splitting up the two problematic intervals into smaller sub-intervals.
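One possible way to generate the sub-intervals is sketched below. It assumes the intervals are written as contig:start-end; split_interval is a hypothetical helper, and the contig name "Chr01" and the 1 Mbp step are made-up values for illustration only.

```shell
#!/bin/bash
# Hypothetical helper: split "contig:start-end" into consecutive,
# non-overlapping sub-intervals of at most STEP bp, one per line.
split_interval() {
    local interval=$1 step=$2
    local chrom=${interval%:*}       # everything before the last ':'
    local range=${interval##*:}      # "start-end"
    local start=${range%-*}
    local end=${range##*-}
    local s e
    for (( s = start; s <= end; s += step )); do
        e=$(( s + step - 1 ))
        if (( e > end )); then e=$end; fi   # clip the last piece
        echo "${chrom}:${s}-${e}"
    done
}

# Example: a 5 Mbp interval cut into 1 Mbp pieces.
split_interval "Chr01:5000001-10000000" 1000000
```

Each printed line can then be used as the --intervals value of its own array task, with its own output file name.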
Regards,
David