GATK GenotypeGVCFs stuck on starting traversal
Answered
I am attempting to perform WGS variant discovery in soybean on ~1,500 genotypes using the HaplotypeCaller -> GenomicsDBImport -> GenotypeGVCFs workflow and have run into an issue with GenotypeGVCFs becoming stuck before starting traversal. I am working on a shared computing cluster and have tried to heavily parallelize both the GenomicsDBImport and GenotypeGVCFs steps, breaking each tool's work into 50 intervals per chromosome, each interval being ~1,200,000 base pairs. I have allowed the jobs to run for up to 12 hours without traversal starting; the one exception is the first interval of a chromosome, which starts successfully after ~15 minutes. I am not sure why all later intervals get stuck, and I have searched through these forums and the best practices/tool guides for help.
For GenomicsDBImport I ran the following example command per 1,200,000 bp interval per chromosome:
#!/bin/sh
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=1
#SBATCH --mem=75gb
#SBATCH --time=60:00:00
#SBATCH --job-name=2:1200001
#SBATCH --error=/work/soybean/mhapp95/PopGen/log/Combine.2.part1200001.err
#SBATCH --output=/work/soybean/mhapp95/PopGen/log/Combine.2.part1200001.out
cd /work/soybean/mhapp95/PopGen/data
endint=1200001
endint=`expr $endint + 1200000`
module load java/1.8
module load gatk4/4.2
gatk --java-options "-Xms15g -Xmx60g" GenomicsDBImport \
-R SNPcalling/reference/genome/W82.genome.v4.fasta \
--genomicsdb-workspace-path SNPcalling/reference/genomicsdbdatabase/chr2/section1200001 \
--sample-name-map AllSampleNames.txt \
--tmp-dir SNPcalling/reference/gatktemp/chr2 \
--batch-size 50 \
--max-num-intervals-to-import-in-parallel 8 \
-L Gm02:1200001-$endint \
--seconds-between-progress-updates 120
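For reference, the start/end arithmetic above can be generalized so a single script covers all 50 sections of a chromosome via a SLURM job array. This is a sketch, not from the original post; the array variable default and the hard-coded chromosome are assumptions:

```shell
#!/bin/sh
# Sketch: compute a section's interval from a 1-based index so one script,
# submitted as `sbatch --array=1-50`, covers every 1.2 Mb section.
# SLURM_ARRAY_TASK_ID (defaulting to 2) and chromosome Gm02 are assumptions.
idx=${SLURM_ARRAY_TASK_ID:-2}
size=1200000
start=$(( (idx - 1) * size + 1 ))   # section 2 -> 1200001
end=$(( start + size ))             # same +1200000 as the post (note this
                                    # gives a 1 bp overlap between sections)
echo "Gm02:${start}-${end}"         # the value passed to -L
```

The same variables could then feed `-L` and the workspace path, so only the array index changes between jobs.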
This step worked well, and after all intervals on all chromosomes finished, I went to genotype with the GenotypeGVCFs tool and ran into the described issue. The most recent command I have tried, incorporating all the advice about settings I could gather from these forums, is -
#!/bin/sh
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=3
#SBATCH --nodes=1
#SBATCH --mem=65gb
#SBATCH --time=75:00:00
#SBATCH --job-name=2:1200001
#SBATCH --error=/work/soybean/mhapp95/PopGen/log/Genotype.2.part1200001.err
#SBATCH --output=/work/soybean/mhapp95/PopGen/log/Genotype.2.part1200001.out
cd /work/soybean/mhapp95/PopGen/data
endint=1200001
endint=`expr $endint + 1200000`
module load java/1.8
module load gatk4/4.2
gatk --java-options "-Xms7g -Xmx53g" GenotypeGVCFs \
-R SNPcalling/reference/genome/W82.genome.v4.fasta \
-V gendb://SNPcalling/reference/genomicsdbdatabase/chr2/section1200001 \
-L Gm02:1200001-$endint \
-stand-call-conf 20 \
--seconds-between-progress-updates 120 \
--only-output-calls-starting-in-intervals \
--genomicsdb-shared-posixfs-optimizations \
-new-qual \
-O SNPcalling/mergedvcf/chr2.section1200001.vcf.gz
On every interval except 1-1200000 I get stuck on the following output regardless of chromosome even after 12 hours running:
-------------------------| Help message for gatk module |-------------------------------
Usage: gatk <program name> <arguments>
You can also pass -Xms or -Xmx arguments to control Java memory allocation.
For example, to increase Java heap space to 10GB, run: gatk --java-options
'-Xms512m -Xmx10g' <program name> <arguments>
------------------------------------------------------------------------------------------
Using GATK jar /util/opt/anaconda/deployed-conda-envs/packages/gatk4/envs/gatk4-4.2.0.0/share/gatk4-4.2.0.0-0/gatk-package-4.2.0.0-local.jar defined in environment variable GATK_LOCAL_JAR
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xms7g -Xmx53g -jar /util/opt/anaconda/deployed-conda-envs/packages/gatk4/envs/gatk4-4.2.0.0/share/gatk4-4.2.0.0-0/gatk-package-4.2.0.0-local.jar GenotypeGVCFs -R SNPcalling/reference/genome/W82.genome.v4.fasta -V gendb://SNPcalling/reference/genomicsdbdatabase/chr2/section1200001 -L Gm02:1200001-2400001 -stand-call-conf 20 --seconds-between-progress-updates 120 --only-output-calls-starting-in-intervals --genomicsdb-shared-posixfs-optimizations -new-qual -O SNPcalling/mergedvcf/chr2.section1.vcf.gz
14:22:45.609 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/util/opt/anaconda/deployed-conda-envs/packages/gatk4/envs/gatk4-4.2.0.0/share/gatk4-4.2.0.0-0/gatk-package-4.2.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Nov 10, 2021 2:22:45 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
14:22:45.927 INFO GenotypeGVCFs - ------------------------------------------------------------
14:22:45.929 INFO GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.2.0.0
14:22:45.930 INFO GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
14:22:45.930 INFO GenotypeGVCFs - Executing as mhapp95@c1409.rhino.hcc.unl.edu on Linux v4.4.213-1.el7.elrepo.x86_64 amd64
14:22:45.931 INFO GenotypeGVCFs - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_265-b11
14:22:45.932 INFO GenotypeGVCFs - Start Date/Time: November 10, 2021 2:22:45 PM CST
14:22:45.933 INFO GenotypeGVCFs - ------------------------------------------------------------
14:22:45.934 INFO GenotypeGVCFs - ------------------------------------------------------------
14:22:45.935 INFO GenotypeGVCFs - HTSJDK Version: 2.24.0
14:22:45.936 INFO GenotypeGVCFs - Picard Version: 2.25.0
14:22:45.937 INFO GenotypeGVCFs - Built for Spark Version: 2.4.5
14:22:45.937 INFO GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
14:22:45.938 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
14:22:45.938 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
14:22:45.939 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
14:22:45.939 INFO GenotypeGVCFs - Deflater: IntelDeflater
14:22:45.940 INFO GenotypeGVCFs - Inflater: IntelInflater
14:22:45.941 INFO GenotypeGVCFs - GCS max retries/reopens: 20
14:22:45.941 INFO GenotypeGVCFs - Requester pays: disabled
14:22:45.942 INFO GenotypeGVCFs - Initializing engine
14:22:47.347 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.3.2-e18fa63
14:22:49.979 info NativeGenomicsDB - pid=26852 tid=26853 No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
14:22:49.979 info NativeGenomicsDB - pid=26852 tid=26853 No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
14:22:49.979 info NativeGenomicsDB - pid=26852 tid=26853 No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
14:22:52.124 INFO IntervalArgumentCollection - Processing 1200001 bp from intervals
14:22:52.142 INFO GenotypeGVCFs - Done initializing engine
14:22:52.279 INFO ProgressMeter - Starting traversal
14:22:52.280 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
However, on the first interval of any chromosome it starts running within fifteen minutes:
-------------------------| Help message for gatk module |-------------------------------
Usage: gatk <program name> <arguments>
You can also pass -Xms or -Xmx arguments to control Java memory allocation.
For example, to increase Java heap space to 10GB, run: gatk --java-options
'-Xms512m -Xmx10g' <program name> <arguments>
------------------------------------------------------------------------------------------
Using GATK jar /util/opt/anaconda/deployed-conda-envs/packages/gatk4/envs/gatk4-4.2.0.0/share/gatk4-4.2.0.0-0/gatk-package-4.2.0.0-local.jar defined in environment variable GATK_LOCAL_JAR
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xms7g -Xmx53g -jar /util/opt/anaconda/deployed-conda-envs/packages/gatk4/envs/gatk4-4.2.0.0/share/gatk4-4.2.0.0-0/gatk-package-4.2.0.0-local.jar GenotypeGVCFs -R SNPcalling/reference/genome/W82.genome.v4.fasta -V gendb://SNPcalling/reference/genomicsdbdatabase/chr2/section1 -stand-call-conf 20 --seconds-between-progress-updates 120 -L Gm02:1-1200001 --only-output-calls-starting-in-intervals -new-qual --genomicsdb-shared-posixfs-optimizations -O SNPcalling/mergedvcf/chr2.section1.vcf.gz
14:16:44.402 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/util/opt/anaconda/deployed-conda-envs/packages/gatk4/envs/gatk4-4.2.0.0/share/gatk4-4.2.0.0-0/gatk-package-4.2.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Nov 10, 2021 2:16:44 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
14:16:44.728 INFO GenotypeGVCFs - ------------------------------------------------------------
14:16:44.729 INFO GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.2.0.0
14:16:44.730 INFO GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
14:16:44.731 INFO GenotypeGVCFs - Executing as mhapp95@c1417.rhino.hcc.unl.edu on Linux v4.4.213-1.el7.elrepo.x86_64 amd64
14:16:44.732 INFO GenotypeGVCFs - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_265-b11
14:16:44.733 INFO GenotypeGVCFs - Start Date/Time: November 10, 2021 2:16:44 PM CST
14:16:44.734 INFO GenotypeGVCFs - ------------------------------------------------------------
14:16:44.735 INFO GenotypeGVCFs - ------------------------------------------------------------
14:16:44.736 INFO GenotypeGVCFs - HTSJDK Version: 2.24.0
14:16:44.737 INFO GenotypeGVCFs - Picard Version: 2.25.0
14:16:44.738 INFO GenotypeGVCFs - Built for Spark Version: 2.4.5
14:16:44.739 INFO GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
14:16:44.739 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
14:16:44.740 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
14:16:44.741 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
14:16:44.742 INFO GenotypeGVCFs - Deflater: IntelDeflater
14:16:44.742 INFO GenotypeGVCFs - Inflater: IntelInflater
14:16:44.743 INFO GenotypeGVCFs - GCS max retries/reopens: 20
14:16:44.744 INFO GenotypeGVCFs - Requester pays: disabled
14:16:44.744 INFO GenotypeGVCFs - Initializing engine
14:16:46.332 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.3.2-e18fa63
14:16:48.916 info NativeGenomicsDB - pid=15051 tid=15053 No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
14:16:48.917 info NativeGenomicsDB - pid=15051 tid=15053 No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
14:16:48.917 info NativeGenomicsDB - pid=15051 tid=15053 No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
14:16:51.562 INFO IntervalArgumentCollection - Processing 1200001 bp from intervals
14:16:51.580 INFO GenotypeGVCFs - Done initializing engine
14:16:51.745 INFO ProgressMeter - Starting traversal
14:16:51.746 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
14:30:07.181 INFO ProgressMeter - Gm02:1148 13.3 1000 75.4
14:33:15.740 INFO ProgressMeter - Gm02:5148 16.4 5000 304.9
14:40:12.999 INFO ProgressMeter - Gm02:9148 23.4 9000 385.4
I am running with Java version 1.8. In my troubleshooting I have tried smaller intervals, limiting alternate alleles to 4, and using GATK version 4.1, all with the exact same results. This happens whether I submit all jobs to our cluster at once or run one of the later intervals alone. I have not moved the database since its creation, as I understand this can cause issues as well.
I looked further into the GenomicsDBImport tool and am considering rebuilding the database using the --consolidate and --genomicsdb-shared-posixfs-optimizations options, as I understand that for my situation this may lead to faster GenotypeGVCFs performance and perhaps avoid this hangup. However, the database build took about 9 days, and I'd like to avoid repeating that if possible since at this point I have no idea whether the rebuild will help. Does anyone have any other insight or things I could try first?
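For clarity, the rebuild under consideration is just the earlier GenomicsDBImport call with the two extra flags appended; a sketch (the surrounding arguments are unchanged and elided with "..."):

```shell
#!/bin/sh
# Sketch: assemble the rebuild command discussed above. Only the two
# appended flags differ from the original GenomicsDBImport invocation;
# all other arguments (elided as "...") stay the same.
extra_flags="--consolidate --genomicsdb-shared-posixfs-optimizations"
echo "gatk --java-options \"-Xms15g -Xmx60g\" GenomicsDBImport ... ${extra_flags}"
```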
-
Small update: I have found that not maxing out the Java -Xms and -Xmx options relative to the memory I request makes GenomicsDBImport run much faster. For now I am creating a new database with the --consolidate and --genomicsdb-shared-posixfs-optimizations options, which should finish in 1-2 days, and then I can retry GenotypeGVCFs.
-
Thanks for the update, Mary Happ! Your thorough post is very helpful for troubleshooting. It seems that you have really read the other posts thoroughly and are following all of our recommendations. Since you have found a way to run GenomicsDBImport faster, that's probably the best step for now. If you can send another update by Monday midmorning, I can ask any remaining questions to my colleagues who are experts in optimizing GenomicsDB workspace usage.
-
Hi Genevieve - A couple updates since Wednesday.
1) Running GenomicsDBImport with the --consolidate flag had some problems for me. After all batches were imported, the program stalled and eventually failed with the error:
[TileDB::utils] Error: (gzip_handle_error) Cannot decompress with GZIP: inflate error: Z_BUF_ERROR
[TileDB::Codec] Error: Could not compress with .
[TileDB::ReadState] Error: Cannot decompress tile.
VariantStorageManagerException exception : Error while consolidating TileDB array.
2) I reran my scripts without the --consolidate flag (still retaining the --genomicsdb-shared-posixfs-optimizations flag) and this afternoon some of the intervals finished. So I tried starting GenotypeGVCFs on those intervals, and I am still stuck at starting traversal after a couple of hours.
15:27:40.133 INFO GenotypeGVCFs - Done initializing engine
15:27:40.605 INFO ProgressMeter - Starting traversal
15:27:40.610 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
-
Hi Genevieve - so over the weekend I played with the memory I was requesting for the GenotypeGVCFs jobs. After increasing the requested memory to 150 GB and requesting about 120 GB of it for Java with 'gatk --java-options "-Xms10g -Xmx110g" GenotypeGVCFs', I got the other intervals to start after ~3 hours. They then run fairly quickly, taking about another 4.5 hours to completion. This still only happens on intervals past the first one, which starts traversal quickly, but at least I should be able to complete the analysis for the time being.
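For later readers, the working configuration boils down to leaving real headroom between the SLURM allocation and the Java heap, since GenomicsDB makes sizeable native (off-heap) allocations. A sketch of that sizing; the 75% fraction is my assumption, not a GATK recommendation:

```shell
#!/bin/sh
# Sketch: size -Xmx well below the SLURM --mem request so native
# (off-heap) GenomicsDB allocations have room. The 75% figure is an
# assumption chosen to approximate the -Xmx110g-of-150gb that worked here.
slurm_mem_gb=150
heap_gb=$(( slurm_mem_gb * 75 / 100 ))
echo "-Xms10g -Xmx${heap_gb}g"
```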
-
Hi Mary,
I spoke with our GenomicsDB team about your issue and have a few recommendations if you still want to try them out.
- First, we recommend that you update to our newest GATK version, 4.2.3.0. It contains a new GenomicsDB version and we think it could help with the problems you are seeing in GenotypeGVCFs. You wouldn't need to re-create your workspaces; you can just run GenotypeGVCFs with the new version. We also think this new version will make the --consolidate flag succeed without the TileDB error.
- Are your runs all contained in different GenomicsDB workspaces? If you are running multiple GenotypeGVCF jobs on the same workspace, there could be issues.
- Do you have many small chromosomes? This could lead to a slow down upon start up. We think that the new GATK version could help with this.
Let me know how it goes and if you have any further questions.
Best,
Genevieve
-
Ok, I've updated to the newest version and I will let you know how things go there. As for your other two points - Yes, all my runs are contained to different workspaces (separate folders for each interval nested within a folder for each chromosome). My genome has 20 chromosomes with sizes ranging from 38-58 million bp so I don't think they are particularly small but I guess that might be contextual!
-
Ok, let me know if you see any improvement! I don't think that your chromosomes would be causing the issue with startup. We usually see that problem when people have thousands of chromosomes (like doing exome analysis).
-
Hi Genevieve. This is an old thread, but GATK is giving me a lot of trouble with the same issue described above. Two 5-million-bp intervals are stuck at the traversal step. Last time it took GATK more than 6 days to initialize the SNP calling, but my HPC only allows jobs of up to 10 days, so it ran out of time. I am trying to re-run the scripts but am facing the same issue.
This is very strange, as most 5-million-bp intervals worked fine bar these two and a few others, which also took days to initialize but managed to finish in time. These two intervals are in different workspaces, and I am using GATK 4.2.6.1.
-
Hi Rômulo Carleial,
Are you working with an organism that has high ploidy? The two things that slow down GenotypeGVCFs the most are high ploidy and high numbers of alternate alleles. I believe that you can most efficiently control the number of alternate alleles with `--genomicsdb-max-alternate-alleles`. Set that to maybe 2 or 3 more than the largest number you want to see in your output VCF. If a site has more than that number of alt alleles (including the non-ref), then it will get skipped, which should save a lot of compute composing the PLs. I believe the site will get output in the final VCF, but the genotypes will all be no-calls.
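As a concrete illustration of that advice (the values are illustrative, not a recommendation): with a desired maximum of 3 alt alleles in the output VCF, the cap might be derived like this:

```shell
#!/bin/sh
# Sketch: derive the GenomicsDB alt-allele cap from the maximum number of
# alt alleles desired in the output VCF, plus headroom for <NON_REF> --
# "2 or 3 more than the largest number you want to see", per the advice.
want_max=3                 # illustrative target for the output VCF
cap=$(( want_max + 3 ))
echo "--genomicsdb-max-alternate-alleles ${cap}"
```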