Issues with Java heap space in CollectAllelicCounts
I am using the Copy Number Variation (CNV) tutorial to analyze whole genome sequencing samples (along with matched normal controls). When I run the GATK tool CollectAllelicCounts, I get a memory error. I am running GATK version 4.1.7.0. Here is my command line:
GATK="<path to GATK directory>/gatk-4.1.7.0"
ALIGNMENT_RUN="SJOS001101_M1"
REF="<path to GATK indexes>/hg38_osteo/Homo_sapiens_assembly38.fasta"
SNPs_ONLY="<path to gnomAD resources>/af-only-gnomad.hg38_biallelicSNPs_only_sorted.vcf.gz"
INPUT_DIR="<path to sequencing data>/"$ALIGNMENT_RUN
OUTPUT_DIR="<path to data directory>/"$ALIGNMENT_RUN
TEMP_DIR="<path to temp directory>"
srun $GATK/gatk --java-options "-Xmx80g" CollectAllelicCounts \
-L $SNPs_ONLY \
-I $INPUT_DIR/recal_reads.cram \
-R $REF \
-O $OUTPUT_DIR/T_clean.allelicCounts.tsv \
--tmp-dir $TEMP_DIR
I get the following error:
Using GATK jar /home/exacloud/lustre1/jjacobs/programs/gatk-4.1.7.0/gatk-package-4.1.7.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=fa$
09:52:59.243 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/exacloud/lustre1/jjacobs/programs/gatk-4$
Aug 18, 2020 9:53:01 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
09:53:01.614 INFO CollectAllelicCounts - ------------------------------------------------------------
09:53:01.615 INFO CollectAllelicCounts - The Genome Analysis Toolkit (GATK) v4.1.7.0
09:53:01.615 INFO CollectAllelicCounts - For support and documentation go to https://software.broadinstitute.org/gatk/
09:53:01.616 INFO CollectAllelicCounts - Executing as jacojam@exanode-3-6 on Linux v3.10.0-1062.18.1.el7.x86_64 amd64
09:53:01.617 INFO CollectAllelicCounts - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_242-b08
09:53:01.617 INFO CollectAllelicCounts - Start Date/Time: August 18, 2020 9:52:58 AM PDT
09:53:01.617 INFO CollectAllelicCounts - ------------------------------------------------------------
09:53:01.617 INFO CollectAllelicCounts - ------------------------------------------------------------
09:53:01.618 INFO CollectAllelicCounts - HTSJDK Version: 2.21.2
09:53:01.618 INFO CollectAllelicCounts - Picard Version: 2.21.9
09:53:01.618 INFO CollectAllelicCounts - HTSJDK Defaults.COMPRESSION_LEVEL : 2
09:53:01.619 INFO CollectAllelicCounts - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
09:53:01.619 INFO CollectAllelicCounts - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
09:53:01.619 INFO CollectAllelicCounts - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
09:53:01.620 INFO CollectAllelicCounts - Deflater: IntelDeflater
09:53:01.620 INFO CollectAllelicCounts - Inflater: IntelInflater
09:53:01.620 INFO CollectAllelicCounts - GCS max retries/reopens: 20
09:53:01.620 INFO CollectAllelicCounts - Requester pays: disabled
09:53:01.620 INFO CollectAllelicCounts - Initializing engine
09:54:37.623 INFO FeatureManager - Using codec VCFCodec to read file file:///home/exacloud/lustre1/jjacobs/gnomAD_files/af-only-gn$
10:05:20.613 INFO IntervalArgumentCollection - Processing 231935613 bp from intervals
10:05:58.581 INFO CollectAllelicCounts - Done initializing engine
10:05:58.611 INFO CollectAllelicCounts - Collecting allelic counts...
10:05:58.612 INFO ProgressMeter - Starting traversal
10:05:58.612 INFO ProgressMeter - Current Locus Elapsed Minutes Loci Processed Loci/Minute
14:31:34.707 INFO CollectAllelicCounts - Shutting down engine
[August 18, 2020 2:31:58 PM PDT] org.broadinstitute.hellbender.tools.copynumber.CollectAllelicCounts done. Elapsed time: 279.00 min$
Runtime.totalMemory()=83036078080
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3181)
at java.util.ArrayList.grow(ArrayList.java:265)
at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:239)
at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:231)
at java.util.ArrayList.add(ArrayList.java:462)
at htsjdk.samtools.BinningIndexContent.getChunksOverlapping(BinningIndexContent.java:131)
at htsjdk.samtools.CachingBAMFileIndex.getSpanOverlapping(CachingBAMFileIndex.java:75)
at htsjdk.samtools.CRAMFileReader.lambda$coordinatesFromQueryIntervals$0(CRAMFileReader.java:485)
at htsjdk.samtools.CRAMFileReader$$Lambda$105/696031899.accept(Unknown Source)
at java.util.Arrays$ArrayList.forEach(Arrays.java:3880)
at htsjdk.samtools.CRAMFileReader.coordinatesFromQueryIntervals(CRAMFileReader.java:485)
at htsjdk.samtools.CRAMFileReader.access$300(CRAMFileReader.java:48)
at htsjdk.samtools.CRAMFileReader$CRAMIntervalIterator.<init>(CRAMFileReader.java:582)
at htsjdk.samtools.CRAMFileReader.query(CRAMFileReader.java:454)
at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.query(SamReader.java:533)
at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.queryOverlapping(SamReader.java:405)
at org.broadinstitute.hellbender.utils.iterators.SamReaderQueryingIterator.loadNextIterator(SamReaderQueryingIterator.java:$
at org.broadinstitute.hellbender.utils.iterators.SamReaderQueryingIterator.<init>(SamReaderQueryingIterator.java:66)
at org.broadinstitute.hellbender.engine.ReadsDataSource.prepareIteratorsForTraversal(ReadsDataSource.java:416)
at org.broadinstitute.hellbender.engine.ReadsDataSource.iterator(ReadsDataSource.java:342)
at java.lang.Iterable.spliterator(Iterable.java:101)
at org.broadinstitute.hellbender.utils.Utils.stream(Utils.java:1101)
at org.broadinstitute.hellbender.engine.GATKTool.getTransformedReadStream(GATKTool.java:377)
at org.broadinstitute.hellbender.engine.LocusWalker.getAlignmentContextIterator(LocusWalker.java:182)
at org.broadinstitute.hellbender.engine.LocusWalker.traverse(LocusWalker.java:157)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1048)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206)
at org.broadinstitute.hellbender.Main.main(Main.java:292)
srun: error: exanode-3-6: task 0: Exited with exit code 1
I just keep adding memory via -Xmx in --java-options, but it doesn't seem to make any difference. Is there something I'm doing wrong here?
-
Hi jejacobs23, have you tried running this without the CRAM input? Using a CRAM as input can lead to performance issues, so I think it would be worthwhile to convert your file to a BAM and see if the tool works then. https://gatk.broadinstitute.org/hc/en-us/articles/360035890791-SAM-or-BAM-or-CRAM-Mapped-sequence-data-formats
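For example, a conversion along these lines should work (a sketch reusing the file names from your command above; decoding a CRAM requires the same reference it was written with):
# Convert the CRAM to BAM (the reference is needed to decode the CRAM), then index it
samtools view -b -T Homo_sapiens_assembly38.fasta \
    -o recal_reads.bam \
    recal_reads.cram
samtools index recal_reads.bam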
-
Thank you Genevieve. I converted the CRAM input to a BAM and created an index file (using Samtools). Unfortunately, I'm still getting the same error with the java heap space running out of memory. I tried as much as 100G in the "-Xmx" option.
-
Hi jejacobs23, thanks for the update, I'll try to get you some more information and update you if I am able to find anything.
-
Thank you so much Genevieve. Here is some additional information:
The SNPs-only file used in the -L option is a .vcf.gz file and is 2.5 GB
The .bam file is 177 GB
-
Hi jejacobs23, here are some ideas for troubleshooting:
- One major problem could be incredibly high peak coverage. You can use the GATK tool DepthOfCoverage to find the max coverage and see if there is a large spike. If you cannot get that tool to run, you can look at samtools idxstats and see if there is a huge outlier of reads in one region (see the sketch after this list).
- Using a VCF for the -L option can be very slow, and it gets worse with many separate intervals. Could you try using the interval_list format to see if that helps your run time? Decreasing the number of intervals could also help this tool.
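If you go the idxstats route, something like this should surface any contig with a wildly disproportionate share of reads (a rough sketch, assuming the BAM and index from your earlier conversion; column 3 of the output is the mapped-read count):
# Per-contig mapped-read counts from the BAM index, largest first
samtools idxstats recal_reads.bam | sort -k3,3nr | head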
Hope this can help you out.
-
Thanks again Genevieve for all your help with this issue. I'm running the DepthOfCoverage tool, but as of yet it hasn't finished (>24 hours). As for converting the .vcf.gz into a Picard-style interval_list file, I'm using the Picard tool IntervalListTools. Here is the command line:
PICARD_DIR=<path to picard directory>
INPUT_FILE=<path to gnomAD resources>"/af-only-gnomad.hg38_biallelicSNPs_only_sorted.vcf.gz"
OUTPUT_DIR=<path to gnomAD resources>
srun /usr/bin/java -Xmx60g -jar $PICARD_DIR/picard.jar IntervalListTools \
ACTION=CONCAT \
SORT=true \
UNIQUE=true \
I=$INPUT_FILE \
O=$OUTPUT_DIR/af-only-gnomad.hg38_biallelicSNPs_only_sorted.interval_list
Unfortunately, I keep getting the same java heap-space error:
INFO 2020-08-28 01:07:32 IntervalListTools
********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
********** IntervalListTools -ACTION CONCAT -SORT true -UNIQUE true -I /home/exacloud/lustre1/jjacobs/gnomAD_files/af-only-gnoma$
**********
01:07:33.278 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/exacloud/lustre1/jjacobs/programs/picard$
[Fri Aug 28 01:07:33 PDT 2020] IntervalListTools INPUT=[/home/exacloud/lustre1/jjacobs/gnomAD_files/af-only-gnomad.hg38_biallelicSN$
[Fri Aug 28 01:07:33 PDT 2020] Executing as jacojam@exanode-3-4 on Linux 3.10.0-1062.18.1.el7.x86_64 amd64; OpenJDK 64-Bit Server V$
[Fri Aug 28 03:00:43 PDT 2020] picard.util.IntervalListTools done. Elapsed time: 113.17 minutes.
Runtime.totalMemory()=62277025792
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
at java.lang.StringBuilder.append(StringBuilder.java:136)
at htsjdk.samtools.util.StringUtil.join(StringUtil.java:54)
at htsjdk.samtools.util.IntervalList.merge(IntervalList.java:405)
at htsjdk.samtools.util.IntervalList.getUniqueIntervals(IntervalList.java:291)
at htsjdk.samtools.util.IntervalList.getUniqueIntervals(IntervalList.java:255)
at htsjdk.samtools.util.IntervalList.uniqued(IntervalList.java:189)
at htsjdk.samtools.util.IntervalList.uniqued(IntervalList.java:180)
at picard.util.IntervalListTools.doWork(IntervalListTools.java:384)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)
srun: error: exanode-3-4: task 0: Exited with exit code 1
Is there a different tool that you would recommend for this (or different settings)?
Thanks,
James.
-
Hi jejacobs23, because DepthOfCoverage is getting this issue, the next place to look is your VCF. Are there any really long lines in the file? DepthOfCoverage reads one line at a time; I am wondering if there is a very long individual record. This could be a very long annotation, possibly a PL annotation.
-
I'm not sure I understand what you mean. I'm running GATK DepthOfCoverage on my input .bam file, not the .vcf file.
-
jejacobs23 sorry, my mistake! I meant IntervalListTools. Your input for that is a VCF with your intervals, correct? And with this issue:
java.lang.OutOfMemoryError: Java heap space
It indicates you are running out of heap space. Could you check whether there are any giant lines in that VCF intervals file?
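A quick way to check is something like this (a rough sketch, assuming the intervals file is still the bgzipped VCF):
# Print the length, in characters, of the longest line in the VCF
zcat af-only-gnomad.hg38_biallelicSNPs_only_sorted.vcf.gz \
    | awk '{ if (length($0) > max) max = length($0) } END { print max }'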
Any updates from DepthOfCoverage?
-
Thanks for your patience, Genevieve. As for your questions: DepthOfCoverage never did finish, and the .vcf file did not have any unusually long lines in it. I went ahead and tried a few different things:
1) I used the Picard tool VcfToIntervalList to convert my .vcf file into a Picard-style interval_list. This worked, but when I tried to use the interval_list for the -L option in CollectAllelicCounts, it again errored out with the java heap space issue.
2) I wrote a short Python script that filters the .vcf.gz file to only include variants with AF > 0.1. This cutoff was somewhat arbitrary and I'm not sure it's the best choice. I gzipped the resulting .vcf file and created an index. I then used this .vcf.gz file for the -L option in CollectAllelicCounts and IT WORKED!!!
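For reference, the filtering step amounts to something like the following (a bcftools sketch of the equivalent command rather than my actual Python script; it assumes the AF INFO field in the gnomAD VCF, and the output name is just illustrative):
# Keep only records with allele frequency above 0.1, then index the result
bcftools view -i 'INFO/AF>0.1' -O z \
    -o af-only-gnomad.hg38_biallelicSNPs_AF10.vcf.gz \
    af-only-gnomad.hg38_biallelicSNPs_only_sorted.vcf.gz
tabix -p vcf af-only-gnomad.hg38_biallelicSNPs_AF10.vcf.gz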
Does this seem like a reasonable solution? Do you think an AF of 10% is a good cutoff to use?
Thanks,
James.
-
Hi jejacobs23, I am not sure about that specific cutoff. I noticed that in the tutorial you are following, the -L option is an interval list for only one chromosome. You may be able to break up your SNPs_only resource to get around this issue.
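For example, a per-chromosome split could look something like this (a sketch, assuming the resource is bgzipped and indexed so region queries work; each piece can then be passed to -L in its own CollectAllelicCounts run):
# Split the SNPs-only resource into one indexed VCF per chromosome
for CHR in chr{1..22} chrX chrY; do
    bcftools view -r $CHR -O z \
        -o af-only-gnomad.hg38_biallelicSNPs_only_sorted.$CHR.vcf.gz \
        af-only-gnomad.hg38_biallelicSNPs_only_sorted.vcf.gz
    tabix -p vcf af-only-gnomad.hg38_biallelicSNPs_only_sorted.$CHR.vcf.gz
done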
-
Hi jejacobs23,
That cutoff should be fine. If you search around a bit, you'll find that we've variously used cutoffs of ~10% and ~2% in the common SNP lists we distribute in the resource package. You can use a different cutoff if a different tradeoff between file size and the final number of hets found is acceptable to you.
-
Hi Genevieve,
Sorry to dig into a 2-year-old thread, but I am also getting this heap memory GC overhead error while running GATK. I did see a very long line in the VCF file that is the final output, but it only partially finished because of the heap memory error. I am wondering how to resolve this issue? You mentioned this could be due to a PL annotation; which one exactly did you mean?
-
@rotten_tomato, it seems like you might be running into a memory issue that is unrelated to the CollectAllelicCounts tool with which this thread is concerned. If so, could you start a new thread and provide more details about your issue?