Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GenotypeGVCFs performance on large datasets

Answered
8 comments

  • Genevieve Brandt (she/her)

    Hi Rohan Abraham,

    Have you seen this article on GenomicsDBImport usage and performance guidelines? Even though you are running GenotypeGVCFs, since you are using a GenomicsDB, many of the arguments and recommendations will still apply. 

    Best,

    Genevieve

  • Rohan Abraham

    Hi Genevieve,

    Thanks for the reply! I had read through that page, though I'm sure there are things I've missed. Since I'm using the standard interval set, would running with the --merge-input-intervals flag likely improve performance at both stages? That's the option that jumps out at me most, since I don't have it active anywhere.

    I also noticed --merge-contigs-into-num-partitions, though I hadn't thought there were a large number of contigs in the interval_lists I'm using that would make it necessary/useful.

    I also haven't been importing in batches because I have enough memory to just open everything together (so far), so I haven't been making use of --batch-size or --consolidate as of yet, though that may be something that's needed later.

    Best wishes,

    Rohan
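
A minimal sketch of how the GenomicsDBImport options discussed above could be assembled from a pipeline script. The function name, the file paths, and the batch size of 50 are hypothetical; `--genomicsdb-workspace-path`, `--sample-name-map` and `-L` are assumed here as the way inputs are supplied, since the thread does not show the actual command. The code only builds and prints the command, it does not run GATK.

```python
import shlex


def genomicsdb_import_cmd(workspace, sample_map, interval_list,
                          batch_size=None, consolidate=False,
                          merge_input_intervals=False,
                          merge_contigs_into=None):
    """Build (not run) a GenomicsDBImport command as an argv list."""
    cmd = [
        "gatk", "GenomicsDBImport",
        "--genomicsdb-workspace-path", str(workspace),
        "--sample-name-map", str(sample_map),
        "-L", str(interval_list),
    ]
    if batch_size is not None:
        # Import samples in batches instead of opening every GVCF at once.
        cmd += ["--batch-size", str(batch_size)]
    if consolidate:
        # Only relevant when importing in batches.
        cmd.append("--consolidate")
    if merge_input_intervals:
        cmd.append("--merge-input-intervals")
    if merge_contigs_into is not None:
        cmd += ["--merge-contigs-into-num-partitions", str(merge_contigs_into)]
    return cmd


# Hypothetical paths; print the command for review instead of running it.
print(shlex.join(genomicsdb_import_cmd(
    "my_workspace", "samples.sample_map", "0001-scattered.interval_list",
    batch_size=50)))
```

Returning a list rather than a string lets the same command be passed to `subprocess.run(cmd, check=True)` without shell quoting issues.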

  • Genevieve Brandt (she/her)

    Hi Rohan Abraham,

    I don't think you need to merge your input intervals, since you are using the standard set. I'm wondering if you could get around the GenotypeGVCFs stalling problem by having a smaller batch size. I'll follow up with my colleagues to take a closer look and get back to you once I have more information.

    Best,

    Genevieve

  • Rohan Abraham

    Hi Genevieve,

    Sounds great, thanks! A follow-up question, also relating to potential scaling/speedup: if we want to use more intervals, are there larger lists provided anywhere, or do we need to generate them ourselves?

    I assume for instance that I could chop up the standard set if I wished into many more than the 50 files provided by just splitting the intervals among more interval_list files. But if there's a better way to handle that let me know.

    Best wishes,

    Rohan
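
One way the manual approach described above (spreading existing intervals across more interval_list files) might look, as a rough sketch. It assumes Picard-style interval_list text: `@` header lines followed by tab-separated records of contig, start, end, strand and name, with 1-based inclusive coordinates. The function name and the greedy balancing strategy are illustrative choices, not how any GATK tool does it.

```python
def split_interval_list(text, n_parts):
    """Split Picard-style interval_list text into n_parts balanced texts.

    Whole intervals are assigned greedily (largest first) to the part that
    currently holds the fewest bases; no interval is cut in two.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = [ln for ln in lines if ln.startswith("@")]
    records = [ln for ln in lines if not ln.startswith("@")]

    def n_bases(record):
        # Columns: contig, start, end, strand, name (1-based, inclusive).
        _, start, end = record.split("\t")[:3]
        return int(end) - int(start) + 1

    parts = [[] for _ in range(n_parts)]  # original record indices per part
    totals = [0] * n_parts                # bases assigned to each part so far
    order = sorted(range(len(records)),
                   key=lambda i: n_bases(records[i]), reverse=True)
    for i in order:
        target = totals.index(min(totals))
        parts[target].append(i)
        totals[target] += n_bases(records[i])

    # Restore the original record order inside each part and repeat the
    # header so that every output is a complete interval_list on its own.
    return ["\n".join(header + [records[i] for i in sorted(p)]) + "\n"
            for p in parts]
```

Each returned string can be written to its own `.interval_list` file. Because intervals are never subdivided, a single very large interval still limits how evenly the work can be balanced.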

  • Genevieve Brandt (she/her)

    You can use the GATK tool SplitIntervals to break up your interval files further. I'm not specifically aware of other interval lists available, but you can check out our Resource Bundle. I'm still waiting on more information from our GenomicsDB experts about the stalling, so I'll get back to you next week with that.
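
A minimal sketch of a SplitIntervals invocation built from a script. The function name, file names, and the scatter count of 200 are hypothetical placeholders; the command is printed for review rather than executed.

```python
import shlex


def split_intervals_cmd(reference, interval_list, scatter_count, out_dir):
    """Build (not run) a SplitIntervals command as an argv list."""
    return [
        "gatk", "SplitIntervals",
        "-R", str(reference),
        "-L", str(interval_list),
        "--scatter-count", str(scatter_count),  # number of output files
        "-O", str(out_dir),                     # output directory
    ]


# Hypothetical inputs: split one interval list into 200 smaller ones.
print(shlex.join(split_intervals_cmd(
    "ref.fasta", "calling_regions.interval_list", 200, "scattered_intervals")))
```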

  • Genevieve Brandt (she/her)

    Hi Rohan Abraham,

    I got more information about optimizing your case. The batch size and consolidation parameters would not help speed up GenotypeGVCFs; they only apply when doing the import with GenomicsDBImport.

    When GenotypeGVCFs is stalling, do you know this only because the progress meter has stalled, or did you notice other signs? It would be strange for GenotypeGVCFs to truly hang and then continue running. It is most likely just very slow at a certain spot. If you run jstack with the process ID while GenotypeGVCFs is stalled, the thread dump will show what the process is doing, so we can work out why it is stalled.

    Also note that temp storage I/O speed is especially important for GenotypeGVCFs, so fast temp storage can help with performance issues.

    Let me know if you find more about what is going on when the process is stalled and I'll continue to look into it.

    Best,

    Genevieve
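
A rough sketch of the jstack suggestion: take a few thread dumps some seconds apart while the progress meter is stalled, then compare them. If the same stack frames sit at the top of the busy threads in every dump, that is probably where the time is going. The function name, the number of dumps, and the delay are assumptions; the process ID has to be looked up separately (for example with `jps -l`).

```python
import subprocess
import time


def jstack_snapshots(pid, count=3, delay_s=10.0, run=subprocess.run):
    """Collect `count` thread dumps from JVM `pid`, `delay_s` seconds apart.

    `run` is injectable so the function can be exercised without a live JVM.
    """
    dumps = []
    for i in range(count):
        result = run(["jstack", str(pid)],
                     capture_output=True, text=True, check=True)
        dumps.append(result.stdout)
        if i < count - 1:
            time.sleep(delay_s)  # wait before taking the next dump
    return dumps
```

A typical use would be `dumps = jstack_snapshots(12345)` with the real process ID, followed by saving each dump to a file so it can be posted back to the thread.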

  • Rohan Abraham

    Hi Genevieve,

    Thanks for the follow-up. We noticed just because of the progress-meter stalling, yes. We have since updated some settings with the caching/temp storage on our servers so hopefully that has an impact, though there weren't any obvious IO issues that I recall.

    I will monitor with jstack as you suggest if it comes up again and post back here. Thanks for your help!

    Best wishes,

    Rohan  

  • Genevieve Brandt (she/her)

    I see! If these sites have many genotypes or alternate alleles and that is slowing down your processes too much, you can decrease --max-genotype-count or --max-alternate-alleles. These arguments will change your results, so treat them as advanced options.
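
A minimal sketch of a GenotypeGVCFs command that reads from a GenomicsDB workspace and optionally lowers the two limits mentioned above. The function name, paths, and the values 4 and 512 are placeholders for illustration, not recommendations; `--tmp-dir` is included on the assumption that it is how faster temp storage would be selected. The command is printed, not run.

```python
import shlex


def genotype_gvcfs_cmd(reference, workspace, interval_list, output_vcf,
                       tmp_dir=None, max_alternate_alleles=None,
                       max_genotype_count=None):
    """Build (not run) a GenotypeGVCFs command as an argv list."""
    cmd = [
        "gatk", "GenotypeGVCFs",
        "-R", str(reference),
        "-V", f"gendb://{workspace}",  # read from a GenomicsDB workspace
        "-L", str(interval_list),
        "-O", str(output_vcf),
    ]
    if tmp_dir is not None:
        cmd += ["--tmp-dir", str(tmp_dir)]
    # Lowering either limit changes the output, so both are opt-in here.
    if max_alternate_alleles is not None:
        cmd += ["--max-alternate-alleles", str(max_alternate_alleles)]
    if max_genotype_count is not None:
        cmd += ["--max-genotype-count", str(max_genotype_count)]
    return cmd


# Hypothetical paths and limits, for illustration only.
print(shlex.join(genotype_gvcfs_cmd(
    "ref.fasta", "my_workspace", "0001-scattered.interval_list",
    "out.vcf.gz", tmp_dir="/fast/tmp",
    max_alternate_alleles=4, max_genotype_count=512)))
```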

