Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data


Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Issues with Java heap space in CollectAllelicCounts

14 comments

  • Genevieve Brandt (she/her)

    Hi jejacobs23, have you tried running this without the CRAM input? Using a CRAM as input can lead to performance issues so I think it would be worthwhile to convert your file to a BAM file and see if this tool works then. https://gatk.broadinstitute.org/hc/en-us/articles/360035890791-SAM-or-BAM-or-CRAM-Mapped-sequence-data-formats
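
    A minimal sketch of that conversion with samtools (file names here are placeholders; CRAM decoding needs the original reference FASTA):

    # Convert the CRAM to a BAM, then index the result.
    samtools view -b -T ref.fa -o sample.bam sample.cram
    samtools index sample.bam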

  • jejacobs23

    Thank you Genevieve. I converted the CRAM input to a BAM and created an index file (using Samtools). Unfortunately, I'm still getting the same Java heap space out-of-memory error. I tried as much as 100G in the "-Xmx" option.
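
    For reference, a minimal sketch of how the heap size is usually passed through to a GATK4 tool (file names are placeholders):

    gatk --java-options "-Xmx100g" CollectAllelicCounts \
        -I sample.bam \
        -R ref.fa \
        -L snps.vcf.gz \
        -O sample.allelicCounts.tsv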

  • Genevieve Brandt (she/her)

    Hi jejacobs23, thanks for the update, I'll try to get you some more information and update you if I am able to find anything.

  • jejacobs23

    Thank you so much Genevieve.  Here is some additional information:

    The SNPs-only file used in the "-L" option is a .vcf.gz file and is 2.5 GB.

    The .bam file is 177 GB.

  • Genevieve Brandt (she/her)

    Hi jejacobs23, here are some ideas for troubleshooting:

    • One major problem could be extremely high peak coverage. You can use the GATK tool DepthOfCoverage to find the maximum coverage and see if there is a large spike. If you cannot get that tool to run, you can look at the samtools idxstats output and see if there is a huge outlier of reads in one region (see the sketch below this list).
    • Using a VCF for the -L option can be very slow, and it gets worse with many separate intervals. Could you try the Picard interval_list format to see if that helps your run time? Decreasing the number of intervals could also help this tool.

    Hope this can help you out.
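
    A quick hedged sketch of the idxstats check mentioned above (the file name is a placeholder; column 3 of idxstats output is the mapped-read count per contig):

    # List contigs with the most mapped reads first to spot a huge outlier.
    samtools idxstats sample.bam | sort -k3,3nr | head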

  • jejacobs23

    Thanks again Genevieve for all your help with this issue. I'm running the DepthOfCoverage tool, but as of yet it hasn't finished (>24 hours). As for converting the .vcf.gz into a Picard-style interval_list file, I'm using the Picard tool IntervalListTools. Here is the command line:

    PICARD_DIR=<path to picard directory>

    INPUT_FILE=<path to gnomAD resources>"/af-only-gnomad.hg38_biallelicSNPs_only_sorted.vcf.gz"
    OUTPUT_DIR=<path to gnomAD resources>

    srun /usr/bin/java -Xmx60g -jar $PICARD_DIR/picard.jar IntervalListTools \
    ACTION=CONCAT \
    SORT=true \
    UNIQUE=true \
    I=$INPUT_FILE \
    O=$OUTPUT_DIR/af-only-gnomad.hg38_biallelicSNPs_only_sorted.interval_list

    Unfortunately, I keep getting the same java heap-space error:

    INFO 2020-08-28 01:07:32 IntervalListTools

    ********** NOTE: Picard's command line syntax is changing.
    **********
    ********** For more information, please see:
    ********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
    **********
    ********** The command line looks like this in the new syntax:
    **********
    ********** IntervalListTools -ACTION CONCAT -SORT true -UNIQUE true -I /home/exacloud/lustre1/jjacobs/gnomAD_files/af-only-gnoma$
    **********


    01:07:33.278 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/exacloud/lustre1/jjacobs/programs/picard$
    [Fri Aug 28 01:07:33 PDT 2020] IntervalListTools INPUT=[/home/exacloud/lustre1/jjacobs/gnomAD_files/af-only-gnomad.hg38_biallelicSN$
    [Fri Aug 28 01:07:33 PDT 2020] Executing as jacojam@exanode-3-4 on Linux 3.10.0-1062.18.1.el7.x86_64 amd64; OpenJDK 64-Bit Server V$
    [Fri Aug 28 03:00:43 PDT 2020] picard.util.IntervalListTools done. Elapsed time: 113.17 minutes.
    Runtime.totalMemory()=62277025792
    To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3332)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
    at java.lang.StringBuilder.append(StringBuilder.java:136)
    at htsjdk.samtools.util.StringUtil.join(StringUtil.java:54)
    at htsjdk.samtools.util.IntervalList.merge(IntervalList.java:405)
    at htsjdk.samtools.util.IntervalList.getUniqueIntervals(IntervalList.java:291)
    at htsjdk.samtools.util.IntervalList.getUniqueIntervals(IntervalList.java:255)
    at htsjdk.samtools.util.IntervalList.uniqued(IntervalList.java:189)
    at htsjdk.samtools.util.IntervalList.uniqued(IntervalList.java:180)
    at picard.util.IntervalListTools.doWork(IntervalListTools.java:384)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)
    srun: error: exanode-3-4: task 0: Exited with exit code 1

    Is there a different tool that you would recommend for this (or different settings)?

    Thanks,

    James. 

  • Genevieve Brandt (she/her)

    Hi jejacobs23, because DepthOfCoverage is getting this issue, the next place to look is your VCF. Are there any really long lines in the file? DepthOfCoverage reads one line at a time, and I am wondering if there is a very long individual record. This could be a very long annotation, possibly a PL annotation.

  • jejacobs23

    I'm not sure I understand what you mean.  I'm running GATK DepthOfCoverage on my input .bam file, not the .vcf file. 

  • Genevieve Brandt (she/her)

    jejacobs23 sorry, my mistake! I meant the IntervalListTools. Your input for that is a VCF with your intervals, correct? And with this issue:

    java.lang.OutOfMemoryError: Java heap space

    It indicates you are running out of space. Could you check to see if there are any giant lines in that VCF intervals file?

    Any updates from DepthOfCoverage?
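
    One quick way to check for oversized records (a sketch; the file name is a placeholder):

    # Print the length and line number of the longest line in the VCF.
    zcat snps.vcf.gz | awk 'length($0) > max { max = length($0); ln = NR } END { print max, "characters at line", ln }'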

  • jejacobs23

    Thanks for your patience, Genevieve. As for your questions: the DepthOfCoverage run never did finish, and the .vcf file did not have any unusually long lines in it. I went ahead and tried a few different things:

    1) I used the Picard tool VcfToIntervalList to convert my .vcf file into a Picard-style interval_list. This worked, but when I tried to use the interval_list for the -L option in CollectAllelicCounts, it again errored out with the java heap space issue.

    2) I wrote a short python script that filters the .vcf.gz file to only include variants with AF > 0.1 (a sketch of this kind of filter follows below). This cutoff was somewhat arbitrary and I'm not sure if it's the best choice. I gzipped the resulting .vcf file and created an index. I then used this .vcf.gz file for the -L option in CollectAllelicCounts and IT WORKED!!!
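
    A minimal sketch of the kind of AF filter described in (2), assuming a single AF entry per INFO field (file names are placeholders):

    import gzip

    # Keep VCF header lines and any record whose INFO AF value exceeds 0.1.
    with gzip.open("af-only-gnomad.vcf.gz", "rt") as fin, \
            open("af-only-gnomad.AF_gt_0.1.vcf", "w") as fout:
        for line in fin:
            if line.startswith("#"):
                fout.write(line)  # pass the header through unchanged
                continue
            info = line.rstrip("\n").split("\t")[7]
            fields = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
            af = fields.get("AF")
            # Multi-allelic records: only the first AF value is checked here.
            if af is not None and float(af.split(",")[0]) > 0.1:
                fout.write(line)

    The filtered file would then be compressed with bgzip and indexed with tabix -p vcf before use with -L, matching the indexing step described above.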

    Does this seem like a reasonable solution?  Do you think AF of 10% is a good cutoff to use?

    Thanks,

    James. 

  • Genevieve Brandt (she/her)

    Hi jejacobs23, I am not sure about that specific cutoff. I noticed in the tutorial you are following that the -L option is an interval list for only one chromosome. You may be able to break up your SNPs_only resource to get around this issue.
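
    One way to split a bgzipped, indexed VCF by chromosome, as suggested above (a sketch; names are placeholders):

    # Extract the header plus one chromosome, recompress, and index the result.
    tabix -h snps.vcf.gz chr1 | bgzip > snps.chr1.vcf.gz
    tabix -p vcf snps.chr1.vcf.gz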

  • Samuel Lee

    Hi jejacobs23,

    That cutoff should be fine. If you search around a bit, you'll find that we've variously used cutoffs of ~10% and ~2% in the common SNP lists we distribute in the resource package. You can use a different cutoff if a different tradeoff between file size and the final number of hets found is acceptable to you.

  • rotten_tomato

    Hi Genevieve,

    Sorry to dig into a two-year-old thread, but I am also getting heap memory GC overhead errors while running GATK. I did see a very long line in the output VCF file, which only partially finished because of the heap memory error. I am wondering how to resolve this issue? You mentioned this could be due to a PL annotation; which one exactly did you mean?

  • Samuel Lee

    @rotten_tomato, it seems like you might be running into a memory issue that is unrelated to the CollectAllelicCounts tool with which this thread is concerned. If so, could you start a new thread and provide more details about your issue?
