Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

GermlineCNVCaller - requirements

5 comments

  • Bhanu Gandham

Hi,

    The GATK support team is focused on resolving questions about GATK tool-specific errors and abnormal results from the tools. For all other questions, such as this one, we are building a backlog to work through when we have the capacity.

    Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools.

We cannot guarantee a reply; however, we encourage other community members to help out if they know the answer.

    For context, check out our support policy.

     

  • Douglas Craig

With regard to your first question, I can give you some feedback.

    I'm currently running GermlineCNVCaller (4.1.7.0) in cohort mode to build a model for a case.

    WES for 100 samples: 192 Mbp across 212K intervals, broken into 37 shards.

    So far the problem doesn't seem to be memory. I'm running one shard at a time on a Google Cloud VM with 32 CPUs and 128 GB of memory, and each run reports Runtime.totalMemory around 3.3 GB, but all 32 CPUs are at 100% and it's taking an incredible 7 hours per shard!

I interpreted the documentation as implying something closer to 30 min per shard on a laptop for an analysis of comparable size. The algorithms appear to be taking a very long time to converge (> 40 epochs). I'm not sure if this is data dependent, but I've followed the tutorial closely with default settings.

    Very curious what others are seeing for performance. 
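    For anyone trying to reproduce this kind of setup, the sharded cohort run described above can be sketched roughly as follows. This is a sketch rather than Douglas's exact command line: the file names (`cohort.filtered.interval_list`, `counts/*.counts.hdf5`, `ploidy-calls`) are placeholders, and the flags follow the GATK 4.1.x germline CNV tutorial.

    ```shell
    # Scatter the filtered interval list into shards of ~5,000 intervals
    # each (for 212K intervals this yields roughly the 37 shards above).
    gatk IntervalListTools \
        --INPUT cohort.filtered.interval_list \
        --SUBDIVISION_MODE INTERVAL_COUNT \
        --SCATTER_CONTENT 5000 \
        --OUTPUT scatter

    # Run the cohort model on one shard; repeat for each shard directory.
    gatk GermlineCNVCaller \
        --run-mode COHORT \
        -L scatter/temp_0001_of_37/scattered.interval_list \
        --contig-ploidy-calls ploidy-calls \
        --interval-merging-rule OVERLAPPING_ONLY \
        $(for s in counts/*.counts.hdf5; do echo "-I $s"; done) \
        --output cohort-model \
        --output-prefix shard_0001
    ```

    If slow convergence (> 40 epochs) is the bottleneck, GermlineCNVCaller also exposes training and convergence arguments (for example `--max-training-epochs`) that can be tightened, at some cost to model quality.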

     

  • Bhanu Gandham

Thank you for jumping in and sharing your experience, Douglas Craig!

  • SkyWarrior

Hi,

    I'm seeing this thread quite late, but I'd like to add my own observations.

    30 whole-genome samples at 30X coverage on hg38 from NIST, using only the main contigs with non-overlapping 1000-base intervals and masked regions excluded (about 2.8 million intervals), 4 explicit ploidy levels, and 128 threads dedicated to the Docker instance, uses only about 200 GB of RAM on a single 1 TB node.

    Quite amazing. I can compile a PoN of 100 WGS 30X samples without a hitch using this node alone.
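    The WGS binning and ploidy setup described above can be sketched along these lines. The reference, exclusion list, and priors paths are placeholders, and the flags come from the standard GATK germline CNV workflow rather than SkyWarrior's exact invocation.

    ```shell
    # Bin the main contigs into non-overlapping 1000 bp intervals
    # (padding 0 so bins do not overlap), excluding masked regions.
    gatk PreprocessIntervals \
        -R hg38.fasta \
        -L main_contigs.interval_list \
        -XL masked_regions.interval_list \
        --bin-length 1000 \
        --padding 0 \
        --interval-merging-rule OVERLAPPING_ONLY \
        -O wgs.1000bp.interval_list

    # Determine contig ploidy with an explicit priors table
    # (the TSV would enumerate the 4 allowed ploidy states).
    gatk DetermineGermlineContigPloidy \
        -L wgs.1000bp.interval_list \
        --interval-merging-rule OVERLAPPING_ONLY \
        $(for s in counts/*.counts.hdf5; do echo "-I $s"; done) \
        --contig-ploidy-priors contig-ploidy-priors.tsv \
        --output ploidy \
        --output-prefix ploidy
    ```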

     

  • Genevieve Brandt (she/her)

Thank you for your insight, SkyWarrior!

