GenotypeGVCFs won't use more than 2G no matter how much memory I give it...
Hi there,
I am using GATK version 4.0.11.0 (kind of outdated, I know...) but I'm experiencing a weird issue I'm not sure is version specific.
No matter what I do, I can't seem to get GenotypeGVCFs to use more than 2 - 2.5G of memory at any time, even though I've tried giving it up to 32G. The issue doesn't seem to be specific to GenotypeGVCFs, though... none of the tools seem to use the resources I give them. I am also passing '-Xmx32g' through --java-options, so I'm really unsure what to do here.
Is there something wrong with our installation? We use SLURM for our scheduler. Here is an example of one of my commands:
#!/bin/bash --login
#SBATCH -J genotype_gvcfs
#SBATCH --mail-user=cgoeckeritz@hudsonalpha.org
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=32g
#SBATCH --time=144:00:00
#SBATCH -o /cluster/home/cgoeckeritz/bwa_full/moms/gatk/output_files/genotype_gvcfs_Madstringens_588898_%j
#SBATCH --export=INFILE=/cluster/home/cgoeckeritz/bwa_full/moms/gatk/intervals.list
#SBATCH -a 1-17
module purge
module load cluster/gatk
CHR=`/bin/sed -n ${SLURM_ARRAY_TASK_ID}p ${INFILE}`
echo ${CHR}
GENO=Madstringens_588898
REFERENCE=/cluster/home/cgoeckeritz/honeycrisp_full_genome/Malus_x_domestica_Honeycrisp_HAP1_v1.1.a1_scaffolded.fasta
cd /cluster/home/cgoeckeritz/bwa_full/moms/gatk/
gatk --java-options "-Xmx32g" GenotypeGVCFs \
-R ${REFERENCE} \
-V gendb:///cluster/home/cgoeckeritz/bwa_full/moms/gatk/${GENO}_DB/${GENO}_database_${CHR} \
-O /cluster/home/cgoeckeritz/bwa_full/moms/gatk/${GENO}_DB/${GENO}_combined_${CHR}.vcf.gz
Here is the program log after a minute or two of running:
chr1A
/cluster/software/gatk-4.0.11.0/gatk:80: SyntaxWarning: "is" with a literal. Did you mean "=="?
if len(args) is 0 or (len(args) is 1 and (args[0] == "--help" or args[0] == "-h")):
/cluster/software/gatk-4.0.11.0/gatk:80: SyntaxWarning: "is" with a literal. Did you mean "=="?
if len(args) is 0 or (len(args) is 1 and (args[0] == "--help" or args[0] == "-h")):
/cluster/software/gatk-4.0.11.0/gatk:117: SyntaxWarning: "is" with a literal. Did you mean "=="?
if len(args) is 1 and args[0] == "--list":
/cluster/software/gatk-4.0.11.0/gatk:301: SyntaxWarning: "is" with a literal. Did you mean "=="?
if call(["gsutil", "-q", "stat", gcsjar]) is 0:
/cluster/software/gatk-4.0.11.0/gatk:305: SyntaxWarning: "is" with a literal. Did you mean "=="?
if call(["gsutil", "cp", jar, gcsjar]) is 0:
/cluster/software/gatk-4.0.11.0/gatk:458: SyntaxWarning: "is not" with a literal. Did you mean "!="?
if not len(properties) is 0:
/cluster/software/gatk-4.0.11.0/gatk:462: SyntaxWarning: "is not" with a literal. Did you mean "!="?
if not len(filesToAdd) is 0:
Using GATK jar /cluster/software/gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx32g -jar /cluster/software/gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar GenotypeGVCFs -R /cluster/home/cgoeckeritz/honeycrisp_full_genome/Malus_x_domestica_Honeycrisp_HAP1_v1.1.a1_scaffolded.fasta -V gendb:///cluster/home/cgoeckeritz/bwa_full/moms/gatk/Madstringens_588898_DB/Madstringens_588898_database_chr1A -O /cluster/home/cgoeckeritz/bwa_full/moms/gatk/Madstringens_588898_DB/Madstringens_588898_combined_chr1A.vcf.gz
10:37:39.728 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/cluster/software/gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
10:37:41.477 INFO GenotypeGVCFs - ------------------------------------------------------------
10:37:41.478 INFO GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.0.11.0
10:37:41.478 INFO GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
10:37:41.478 INFO GenotypeGVCFs - Executing as cgoeckeritz@hpc0016 on Linux v4.18.0-477.21.1.el8_8.x86_64 amd64
10:37:41.479 INFO GenotypeGVCFs - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_332-b09
10:37:41.479 INFO GenotypeGVCFs - Start Date/Time: August 26, 2024 10:37:39 AM CDT
10:37:41.479 INFO GenotypeGVCFs - ------------------------------------------------------------
10:37:41.479 INFO GenotypeGVCFs - ------------------------------------------------------------
10:37:41.479 INFO GenotypeGVCFs - HTSJDK Version: 2.16.1
10:37:41.479 INFO GenotypeGVCFs - Picard Version: 2.18.13
10:37:41.480 INFO GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
10:37:41.480 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
10:37:41.480 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
10:37:41.480 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
10:37:41.480 INFO GenotypeGVCFs - Deflater: IntelDeflater
10:37:41.480 INFO GenotypeGVCFs - Inflater: IntelInflater
10:37:41.480 INFO GenotypeGVCFs - GCS max retries/reopens: 20
10:37:41.480 INFO GenotypeGVCFs - Requester pays: disabled
10:37:41.480 INFO GenotypeGVCFs - Initializing engine
Does anything stand out to you as being a problem?
Thanks so much for your time,
Charity
P.S. - we did install version 4.6.0.0 but I am seeing the same restriction on what memory is actually being used.
-
Does the process continue and import variants properly? Do you observe any error messages stating that the process did not complete?
GenomicsDBImport does not need much heap memory. In fact, the less heap you give it the better: the GenomicsDB library is written in C/C++ and allocates memory outside the Java heap, so that memory is not bounded by the Java VM's -Xmx setting.
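To make the heap-versus-native split concrete, here is a minimal sketch of sizing -Xmx below the job's allocation so the rest stays free for the native GenomicsDB library. The helper name heap_mb and the 50% reservation are assumptions for illustration only, not GATK recommendations; tune the percentage for your own data.

```shell
# Hypothetical helper: choose a Java heap (in MB) that leaves part of the
# job's memory free for native (off-heap) allocations.
#   $1 = total memory granted to the job, in MB
#   $2 = percentage to hold back for native memory (assumed value, tune it)
heap_mb() {
    local total_mb=$1 native_pct=$2
    echo $(( total_mb * (100 - native_pct) / 100 ))
}

# Example: a 32 GB (32768 MB) allocation with 50% held back for native memory.
echo "-Xmx$(heap_mb 32768 50)m"
```

Inside a job script, the result could be passed along as gatk --java-options "-Xmx$(heap_mb "${SLURM_MEM_PER_NODE}" 50)m" ..., assuming your SLURM setup exports SLURM_MEM_PER_NODE in megabytes when --mem is given.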
I hope this helps.
Regards.
-
Hi Gökalp,
Thanks for your quick response! This issue is for GenotypeGVCFs - is that what you mean? Either way, my problem still stands. GenomicsDBImport gave me the same issue of not using the resources given to it that GenotypeGVCFs is giving me now. However, I was able to get GenomicsDBImport to complete after giving it ~5 days of wall time; no such luck with GenotypeGVCFs. So I really do need to figure out how to get GenotypeGVCFs to use its allocated memory. Otherwise it will probably need to run for more than a week, despite only genotyping 30ish samples per vcf against a reference genome that is about 620 Mb. As far as I can tell, the log doesn't show any errors or issues aside from the syntax warnings posted in the opening comment. It just runs until it has no more wall time. If I could figure out why the program is not using the allocated memory, I could probably get it to finish much more quickly, which is why I opened the issue in the first place.
There is one other WARN message though; it just says 'WARN InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples', which is odd (I have more than 10 individuals...) but I don't need this calculated anyway.
Based on your comment about the java VM, I tried leaving out --java-options "-Xmx32G" from my command but GenotypeGVCFs still won't use more than 3.5G.
Any other ideas on what might be going on would be greatly appreciated! Thanks so much.
Kindly,
Charity
-
Hi again.
Sorry for our late response. What is the ploidy of your samples? The import and genotyping steps take longer as the number of expected alleles per locus grows. If the whole process seems slow, we recommend dividing your calls into multiple shards, genotyping them in parallel, and later combining them into a single call set. More memory won't make the process faster, but reducing the number of alleles per locus may.
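As a rough illustration of the sharding idea, the sketch below cuts one contig into fixed-size windows written as contig:start-end, the form accepted by the -L argument. The helper name make_windows, the window size, and the contig length are all made up for the example; real contig lengths could be read from the second column of the reference's .fai index.

```shell
# Hypothetical helper: print fixed-size windows covering one contig as
# "contig:start-end" (1-based, inclusive), one per line.
#   $1 = contig name, $2 = contig length in bp, $3 = window size in bp
make_windows() {
    local contig=$1 length=$2 size=$3
    local start=1 end
    while [ "$start" -le "$length" ]; do
        end=$(( start + size - 1 ))
        # The last window is clipped to the end of the contig.
        if [ "$end" -gt "$length" ]; then
            end=$length
        fi
        echo "${contig}:${start}-${end}"
        start=$(( end + 1 ))
    done
}

# Example: a 2.5 Mb contig cut into 1 Mb windows (the length is invented).
make_windows chr1A 2500000 1000000
```

Each printed window could then be handed to its own array task by adding -L with that window to the GenotypeGVCFs command, and the per-window VCFs concatenated in genomic order afterwards, for example with GatherVcfs. How small the windows should be is a judgment call that depends on the data.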
I hope this helps.