CombineGVCFs, java heap space, and ploidy
Hi,
I am using GATK version 4.0.0.0 and am running into an error with CombineGVCFs. I have 24 samples to call SNPs from a very small reference genome (~2.5 mil bp). They are pooled samples (DNA was pooled prior to library preparation/sequencing), and I am comparing using ploidy = 2 and ploidy = 10. All of my pipelines have been successful with ploidy = 2.
With ploidy = 10, I am running into an issue with CombineGVCFs. My g.vcf files were successfully made with HaplotypeCaller using "--sample-ploidy 10" and "--emit-ref-confidence GVCF" mode. Now I am trying to combine all the g.vcf files before the genotyping, selecting variants, and filtering steps.
This was successful with my samples which had a ploidy of 2 specified in the HaplotypeCaller step (example of sample files):
gatk CombineGVCFs --reference REF.fasta --output combined_samples.g.vcf --variant sample1.g.vcf --variant sample2.g.vcf --variant ......
However, it was unsuccessful with my dataset in which I have specified ploidy = 10 in the HaplotypeCaller step. I noticed a java memory error:
Runtime.totalMemory()=40543191040
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
And the progress meter stops after about 20-30 minutes, no matter what --java-options I change:
13:08:26.239 INFO ProgressMeter - 52283_0_000006:2650 0.2 88000 472779.4
13:38:39.879 INFO CombineGVCFs - Shutting down engine
So I added the --java-options based on some other posts (variations of -Xmx and -Xms; from 4g to 160g):
gatk CombineGVCFs --java-options "-Xmx40g -Xms40g" --reference REF.fasta --output combined_samples.g.vcf --variant sample1.g.vcf --variant sample2.g.vcf --variant ......
I have tried moving the scripts to a higher-memory partition on the remote server available to me, but I still get the same error with every combination of --java-options, scratch space, etc.
I have read that I do not need to specify the ploidy on CombineGVCFs, but I am wondering if that is where it is getting snagged.
Thank you for your help.
-
Hi Alix Matthews,
It is true that CombineGVCFs should be able to handle any ploidy, but there have been significant improvements to the tool since version 4.0.0.0. Could you try running this with the latest version of GATK?
Kind regards,
Pamela
-
Thank you for your quick reply. I will request an update to GATK 4.2.0 on the server I am using and try again. I will let you all know if the problem resolves or doesn't resolve with the update. Thank you again.
-
Hello! I'm using version 4.4.0.0.
I ran HaplotypeCaller on my 18 samples, and it worked. Now I'm using CombineGVCFs to put the samples together in a cohort. However, I got this error:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at htsjdk.tribble.readers.TabixReader.readLong(TabixReader.java:195)
at htsjdk.tribble.readers.TabixReader.readIndex(TabixReader.java:267)
at htsjdk.tribble.readers.TabixReader.readIndex(TabixReader.java:287)
at htsjdk.tribble.readers.TabixReader.<init>(TabixReader.java:165)
at htsjdk.tribble.readers.TabixReader.<init>(TabixReader.java:129)
at htsjdk.tribble.TabixFeatureReader.<init>(TabixFeatureReader.java:80)
at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:117)
at org.broadinstitute.hellbender.engine.FeatureDataSource.getTribbleFeatureReader(FeatureDataSource.java:433)
at org.broadinstitute.hellbender.engine.FeatureDataSource.getFeatureReader(FeatureDataSource.java:377)
at org.broadinstitute.hellbender.engine.FeatureDataSource.<init>(FeatureDataSource.java:319)
at org.broadinstitute.hellbender.engine.FeatureDataSource.<init>(FeatureDataSource.java:291)
at org.broadinstitute.hellbender.engine.FeatureManager.addToFeatureSources(FeatureManager.java:225)
at org.broadinstitute.hellbender.engine.MultiVariantWalker.lambda$initializeDrivingVariants$0(MultiVariantWalker.java:86)
at org.broadinstitute.hellbender.engine.MultiVariantWalker$$Lambda$196/0x00002ad3f85a89f0.accept(Unknown Source)
at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625)
at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:762)
at org.broadinstitute.hellbender.engine.MultiVariantWalker.initializeDrivingVariants(MultiVariantWalker.java:76)
at org.broadinstitute.hellbender.engine.VariantWalkerBase.initializeFeatures(VariantWalkerBase.java:67)
at org.broadinstitute.hellbender.engine.GATKTool.onStartup(GATKTool.java:726)
at org.broadinstitute.hellbender.engine.MultiVariantWalker.onStartup(MultiVariantWalker.java:49)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:147)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:217)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
-
Do I have to do the aggregation step and then do the joint genotyping with GenotypeGVCFs, or can I move straight to the GenotypeGVCFs step?
-
GenotypeGVCFs requires a single input for genotyping; therefore, the GVCF files must be combined or imported into GenomicsDB before genotyping. Can you increase the heap size by using the parameter below?
--java-options "-Xmx8G"
This parameter sets the maximum heap size to 8 GB. You can increase the number if needed, but do not assign more than 80-90% of your total memory, as that will cause additional issues.
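As a rough sketch of the sizing advice above, the snippet below derives an -Xmx value from the machine's total memory and prints (without running) a CombineGVCFs command that uses it. The helper names heap_gb and combine_cmd, the 80% fraction, and the file names are illustrative assumptions, and reading /proc/meminfo assumes Linux.

```shell
#!/bin/sh
# Sketch only: helper names, file names, and the 80% fraction are assumptions.

# Print a heap size in whole GB: PERCENT of TOTAL_KB (kB, as in /proc/meminfo).
# Integer division rounds down; never returns less than 1.
heap_gb() {
    local total_kb=$1 percent=$2
    local gb=$(( total_kb * percent / 100 / 1024 / 1024 ))
    [ "$gb" -lt 1 ] && gb=1
    echo "$gb"
}

# Build (but do not run) a CombineGVCFs command line for a heap size in GB
# followed by any number of per-sample GVCF paths.
combine_cmd() {
    local heap=$1
    shift
    local cmd="gatk CombineGVCFs --java-options \"-Xmx${heap}g\" --reference REF.fasta --output combined_samples.g.vcf"
    local gvcf
    for gvcf in "$@"; do
        cmd="$cmd --variant $gvcf"
    done
    echo "$cmd"
}

# Linux only: read total physical memory in kB and print the suggested command.
if [ -r /proc/meminfo ]; then
    total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    combine_cmd "$(heap_gb "$total_kb" 80)" sample1.g.vcf sample2.g.vcf
fi
```

On a shared cluster, the memory granted to the job may be lower than the node total, so the scheduler's allocation would be the safer basis for the calculation. The combined file would then be passed to GenotypeGVCFs as its single --variant input.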
-
It worked! Thank you!!