Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data


Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

PathseqPipelineSpark stops without error message

6 comments

  • Bhanu Gandham

    Hi vitor heidrich

    The most common user error with PathSeq is missing read groups in the BAM, which causes all reads to be filtered out. Can you please check whether that is the issue here and/or post the header from one of the BAMs?

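The read-group check suggested above can be sketched in a few lines of Python. This is a minimal illustration, not part of GATK: it assumes the header has already been dumped to text (for example with `samtools view -H`), and the function name `read_group_ids` and the example header are hypothetical.

```python
def read_group_ids(header_text: str) -> list:
    """Return the ID of every @RG line found in a SAM/BAM text header."""
    ids = []
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # After the record type, header fields are tab-separated TAG:VALUE pairs.
        fields = dict(
            field.split(":", 1) for field in line.split("\t")[1:] if ":" in field
        )
        ids.append(fields.get("ID", ""))
    return ids


# Hypothetical header, for illustration only.
example_header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:248956422\n"
    "@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA\n"
)

groups = read_group_ids(example_header)
if groups:
    print("read groups found:", groups)
else:
    print("no @RG lines: reads may all be filtered out")
```

An empty list would point to the missing-read-group problem described in the comment above.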
  • vitor heidrich

    Hi Bhanu Gandham

    Thank you for your reply.

    Apparently this is not the case: as I said in the post, I can see the filter-metrics output, and for every sample a non-zero number of reads remains after the filtering stage.

    For example:

    ## METRICS CLASS org.broadinstitute.hellbender.tools.spark.pathseq.loggers.PSFilterMetrics
    PRIMARY_READS 149937214
    READS_AFTER_PREALIGNED_HOST_FILTER 149937214
    READS_AFTER_QUALITY_AND_COMPLEXITY_FILTER 65196700
    READS_AFTER_HOST_FILTER 1672161
    READS_AFTER_DEDUPLICATION 1433205
    FINAL_PAIRED_READS 624720
    FINAL_UNPAIRED_READS 808485
    FINAL_TOTAL_READS 1433205
    LOW_QUALITY_OR_LOW_COMPLEXITY_READS_FILTERED 84740514
    HOST_READS_FILTERED 63524539
    DUPLICATE_READS_FILTERED 238956

    Thanks again

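As a sanity check on the metrics quoted above, the counts can be tested for internal consistency. The sketch below assumes that each "AFTER" count equals the previous stage minus the corresponding "FILTERED" count; that relationship is inferred from the numbers themselves, not taken from GATK documentation, and the helper name `failed_checks` is hypothetical.

```python
# Values copied from the filter-metrics output quoted in the thread.
metrics = {
    "PRIMARY_READS": 149937214,
    "READS_AFTER_PREALIGNED_HOST_FILTER": 149937214,
    "READS_AFTER_QUALITY_AND_COMPLEXITY_FILTER": 65196700,
    "READS_AFTER_HOST_FILTER": 1672161,
    "READS_AFTER_DEDUPLICATION": 1433205,
    "FINAL_PAIRED_READS": 624720,
    "FINAL_UNPAIRED_READS": 808485,
    "FINAL_TOTAL_READS": 1433205,
    "LOW_QUALITY_OR_LOW_COMPLEXITY_READS_FILTERED": 84740514,
    "HOST_READS_FILTERED": 63524539,
    "DUPLICATE_READS_FILTERED": 238956,
}


def failed_checks(m):
    """Return the names of the assumed bookkeeping identities that do not hold."""
    checks = {
        "quality/complexity filter": (
            m["READS_AFTER_PREALIGNED_HOST_FILTER"]
            - m["LOW_QUALITY_OR_LOW_COMPLEXITY_READS_FILTERED"]
            == m["READS_AFTER_QUALITY_AND_COMPLEXITY_FILTER"]
        ),
        "host filter": (
            m["READS_AFTER_QUALITY_AND_COMPLEXITY_FILTER"] - m["HOST_READS_FILTERED"]
            == m["READS_AFTER_HOST_FILTER"]
        ),
        "deduplication": (
            m["READS_AFTER_HOST_FILTER"] - m["DUPLICATE_READS_FILTERED"]
            == m["READS_AFTER_DEDUPLICATION"]
        ),
        "paired + unpaired": (
            m["FINAL_PAIRED_READS"] + m["FINAL_UNPAIRED_READS"]
            == m["FINAL_TOTAL_READS"]
        ),
    }
    return [name for name, ok in checks.items() if not ok]


surviving_fraction = metrics["FINAL_TOTAL_READS"] / metrics["PRIMARY_READS"]
print("failed checks:", failed_checks(metrics))
print(f"fraction of reads surviving all filters: {surviving_fraction:.4%}")
```

For this sample all four identities hold and a little under 1% of the primary reads reach the alignment stage, which supports the point that reads are not all being filtered out.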
  • Bhanu Gandham

    Can you please post the header from one of the BAMs?

  • Mark Walker

    Hello vitor,

    Thank you for clarifying that. There are a few things I am wondering about:

    1) Double-check the logs for any useful messages. What was the last message generated by GATK? What you are seeing at the end is from Spark and usually does not contain useful information.

    2) The process may be running out of memory. If the JVM tries to allocate too much memory, it will be killed by the OS. Is there a return code for the process?

    3) What environment are you running in? Is it a single machine/VM or a Spark cluster? If the latter, the error may be occurring on one of the workers, in which case it would not appear in the executor log (although I would expect it to exit gracefully if this were the case).

    4) It may be easier to troubleshoot this if you run the pipeline piecemeal with PathSeqFilterSpark, PathSeqBwaSpark, and PathSeqScoreSpark rather than PathseqPipelineSpark.

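The return-code question in point 2 can be made concrete. On Linux, a process killed by the kernel's out-of-memory handler receives SIGKILL (signal 9); a shell reports that as exit status 137 (128 + 9), while Python's `subprocess` reports it as -9. The sketch below decodes either convention. The function name `describe_exit_status` is hypothetical, and a status above 128 is only probably a signal, since a program is free to exit with such a code on its own.

```python
import signal


def describe_exit_status(returncode: int) -> str:
    """Give a rough, human-readable reading of a process exit status."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # subprocess convention: -N means "killed by signal N".
        signum = -returncode
    elif returncode > 128:
        # Shell convention: 128 + N usually means "killed by signal N".
        signum = returncode - 128
    else:
        return f"exited with error code {returncode}"
    try:
        name = signal.Signals(signum).name
    except ValueError:
        # Not a signal number known on this platform.
        name = f"signal {signum}"
    return f"probably terminated by {name}"


for status in (0, 1, 137, -9):
    print(status, "->", describe_exit_status(status))
```

A SIGKILL with no accompanying Java stack trace is consistent with the OS killing the process, which would also explain why no error message is printed.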
  • vitor heidrich

    Hi Mark Walker

    Thank you for the suggested alternatives.

    I just ran a test using the lightest BAM file (5 GB) in my dataset and it worked perfectly, so the process is probably running out of memory on the heavier files (10 GB+). How does the input size influence the memory used by the process? I am currently setting "-Xmx64G" for these tests. Any tips on how much more memory I should allocate for 10 GB+ input files?

  • Mark Walker

    Memory bottlenecks usually occur during alignment to the microbe reference. At this stage, PathSeq needs memory for both the reference image and the reads. The reference is loaded outside of the JVM, however, so the OS must have at least as much memory available as the size of the reference, in addition to the JVM heap. The JVM itself must have enough memory to store the non-host reads that remain after filtering. Memory usage therefore does not scale with input BAM size per se, but rather with the number of microbial reads present.

    I would recommend doubling the memory to 128 GB to accommodate the microbial reference.

    1
    Comment actions Permalink
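The memory reasoning in the last comment amounts to simple budgeting: the reference image lives outside the JVM heap, so the heap limit and the image size must both fit in the machine's memory. The sketch below illustrates that arithmetic only. The function name `max_heap_gb`, the 4 GB reserve for the OS, and the 40 GB image size are all hypothetical placeholders, not PathSeq defaults or measured values.

```python
def max_heap_gb(machine_gb: float, reference_image_gb: float,
                os_reserve_gb: float = 4.0) -> float:
    """Largest JVM heap (-Xmx) that still leaves room for the off-heap reference image.

    os_reserve_gb is an assumed safety margin for the OS and other processes.
    """
    heap = machine_gb - reference_image_gb - os_reserve_gb
    if heap <= 0:
        raise ValueError(
            "machine memory is too small to hold the reference image at all"
        )
    return heap


# Hypothetical numbers: a 128 GB machine and a 40 GB microbe reference image.
heap = max_heap_gb(machine_gb=128, reference_image_gb=40)
print(f"-Xmx should not exceed roughly {heap:.0f}G on this machine")
```

Setting the heap limit equal to the whole machine's memory leaves nothing for the reference image, which is one way to end up with the OS killing the process.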
