Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

GenotypeGVCFs stalls while using --all-sites

18 comments

  • Genevieve Brandt

    Rachael Bay have you tried increasing your memory allocation (--java-options "-Xmx8G")? If there is an issue with memory allocation, another trick is to specify a temporary directory with the option --tmp-dir.
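
For reference, the two options mentioned above might be combined as in the sketch below. It only assembles and prints a command line. The file names, workspace name and temporary directory are placeholders, and `--include-non-variant-sites` is assumed to be the long form of the flag this thread calls --all-sites.

```python
import shlex


def genotype_gvcfs_cmd(reference, workspace, output, heap_gb=8, tmp_dir=None):
    """Assemble a GenotypeGVCFs argument list; all paths are placeholders."""
    cmd = [
        "gatk", "--java-options", f"-Xmx{heap_gb}G",  # Java heap size
        "GenotypeGVCFs",
        "-R", reference,
        "-V", f"gendb://{workspace}",  # read from a GenomicsDB workspace
        "-O", output,
        # Assumed long form of the flag the thread calls --all-sites.
        "--include-non-variant-sites",
    ]
    if tmp_dir:
        cmd += ["--tmp-dir", tmp_dir]  # optional temporary directory
    return cmd


# Print the command instead of running it, so it can be inspected first.
print(shlex.join(genotype_gvcfs_cmd("ref.fasta", "my_workspace", "out.vcf",
                                    tmp_dir="/scratch/tmp")))
```

Printing the command rather than executing it makes it easy to paste into a job script once the placeholders have been replaced.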

  • Rachael Bay

    Hi Genevieve Brandt - thanks for the response! We've tried with up to 96G for java and both with and without the --tmp-dir option. Any other ideas?

  • Genevieve Brandt

    Rachael Bay since it stalls at the same location each time, are there many alleles at that location? It might be useful to run DepthOfCoverage or another tool and look at that area of the genome.

    Another thing you can try is to update your GATK version since we continually improve the tools.

  • Rachael Bay

    Hi Genevieve Brandt, I don't think it's the location, since we stall at the exact same position (200000) when we try a different chromosome. I've also tried updating my GATK version and hit the same issue. We're really stuck here, so any help would be great!

  • Genevieve Brandt

    Rachael Bay have you had any successful runs using --all-sites with any of your chromosomes? Or do they all fail at position 200000?

    Does the job fail, or just stall? And for how long?

  • Rachael Bay

    Genevieve Brandt I have only tried the first 2 chromosomes, but they both stalled at position 200000. The jobs do not actually fail, they just stall. They reach 200000 quickly (an hour or so) and I let them run for 24 hours.

  • Genevieve Brandt

    Rachael Bay thank you for this info. I am continuing to look into solutions and will let you know when I get more information.

  • Genevieve Brandt

    Hi Rachael Bay

    Sometimes early positions in a chromosome have no data, so we are wondering whether the stall comes from writing an enormous amount of information once the tool starts outputting data.

    • Could you try writing to a .gz VCF file? It should have better performance.
    • How big is your genomicsDB, how many samples are you working with?
    • Could you give more information about your machine that you are running this on? How much memory is available to the machine and is it the exact amount that you are giving for the java memory? The machine will need slightly more memory than you give in the java options to account for the C++ memory.
    • Could you start in the middle of the chromosome and see if it runs? Perhaps from position 500,000 to 1,000,000
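
The memory and interval suggestions in the list above can be sketched as follows. The 128G/32G split is only an illustration (the thread says the machine needs "slightly more" than the Java heap, without giving a figure), and the contig name, file names and `--include-non-variant-sites` spelling are assumptions.

```python
def heap_gb_with_headroom(machine_gb, reserve_gb):
    """Java heap (GB) that leaves reserve_gb of the machine for non-heap memory."""
    if not 0 < reserve_gb < machine_gb:
        raise ValueError("reserve_gb must be positive and smaller than machine_gb")
    return machine_gb - reserve_gb


def region(contig, start, end):
    """Format a 1-based inclusive interval string such as Chr01:500000-1000000."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return f"{contig}:{start}-{end}"


# Illustrative numbers: a 128G allocation with 32G left outside the Java heap.
heap = heap_gb_with_headroom(machine_gb=128, reserve_gb=32)

cmd = [
    "gatk", "--java-options", f"-Xmx{heap}G",
    "GenotypeGVCFs",
    "-R", "ref.fasta",                           # placeholder reference
    "-V", "gendb://my_workspace",                # placeholder workspace
    "-L", region("Chr01", 500_000, 1_000_000),   # middle of the chromosome
    "--include-non-variant-sites",               # assumed long form of --all-sites
    "-O", "Chr01.mid.vcf.gz",                    # .gz output, as suggested
]
print(" ".join(cmd))
```

Keeping the heap calculation separate from the command makes it simple to try a different reserve without touching the rest of the invocation.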

  • Rachael Bay

    Hi Genevieve Brandt,

    • I tried writing to a .gz file, with the same result.

    • The GenomicsDB directory is large: it looks like 793G. We have 471 individuals.

    • I am running on the Bridges cluster through the NSF XSEDE program. I have tried both regular memory and large memory. On large memory I requested 128G from the machine while asking for 96G in the Java options, so there should be extra memory there.

    • Does the GenomicsDB have to be remade to start in the middle of the chromosome? I tried specifying the interval you suggested above with HaplotypeCaller, but it JUST stalled (it never got to the progress meter showing the position).

  • Genevieve Brandt

    Hi Rachael Bay, thank you for those updates. I have some follow-up questions to try to get to the bottom of this.

    • We want to determine if this is an issue with the genomicsDB or GenotypeGVCFs with --all-sites. Could you run SelectVariants at position 200k-201k and see if it gets any results?
    • What species are you working with? Is there anything weird happening at the reference at position 200k?
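
The SelectVariants check on the 200k-201k region could look roughly like the sketch below. The timeout helper is an addition that was not suggested in the thread: it is one way to tell "stalled" from "finished" without waiting a full day. Reference, workspace and output names are placeholders.

```python
import subprocess
import sys


def select_variants_cmd(reference, workspace, region, output):
    """Assemble a SelectVariants call that reads straight from a GenomicsDB workspace."""
    return [
        "gatk", "SelectVariants",
        "-R", reference,
        "-V", f"gendb://{workspace}",
        "-L", region,
        "-O", output,
    ]


def finishes_within(cmd, seconds):
    """Return True if cmd exits (with any status) within `seconds`, else False.

    On timeout the child process is killed, so a stalled run does not linger.
    """
    try:
        subprocess.run(cmd, timeout=seconds, check=False)
    except subprocess.TimeoutExpired:
        return False
    return True


cmd = select_variants_cmd("ref.fasta", "my_workspace",
                          "Chr01:200000-201000", "Chr01.200k.vcf.gz")
print(" ".join(cmd))

# Harmless self-check that needs no GATK install: a process that exits at once.
print(finishes_within([sys.executable, "-c", "pass"], seconds=30))
```

To use it for real, pass `cmd` to `finishes_within` with a generous limit. A False result means the query itself hangs, which would point at the GenomicsDB workspace rather than at GenotypeGVCFs.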

  • Rachael Bay

    Hi Genevieve Brandt. When I look at that region, it actually looks like the last position is 1999988. This is exactly the same across multiple chromosomes (I have tried three of six). I am working with Zostera marina, but I don't really see why there would be something weird happening at that position in multiple chromosomes.

  • Genevieve Brandt

    Rachael Bay 

    Did you run SelectVariants yet?

    I am confused about what you mean by the last position is 1999988, since we are looking into position 200K? Am I misunderstanding, or did you possibly make a typo?

     

  • Rachael Bay

    Genevieve Brandt Sorry, it was a typo sort of. Just looking right now. The last position recorded for Chr01 is 199988 and for Chr03 is 199998, so not EXACTLY the same position. Let me know if it would be at all useful for me to try this on the other chromosomes. I'm really at a loss for what else to try!

  • Rachael Bay

    Also, I guess I'm confused about how running SelectVariants would see anything different since it would just return a subset of the raw vcf file, right?

  • Genevieve Brandt

    Rachael Bay thanks for clarifying! 

    Running SelectVariants, though not helpful for your final result, helps us in our troubleshooting of this issue. We are trying to determine if this runtime issue is from the GenomicsDB or if it is from GenotypeGVCFs. 

  • Rachael Bay

    Genevieve Brandt thanks, that makes sense. When I run SelectVariants for Chr01:200000-201000 it stalls with no variants output.

  • Genevieve Brandt

    Rachael Bay thanks for the update. I'll bring this up with my team and let you know when I have more information.

  • Genevieve Brandt

    Hi Rachael Bay, here are a few more things that you could check that would give us more information:

    1. While GenotypeGVCFs is running and stalled, could you inspect the Java process to see what it is doing? The command is jstack followed by the process ID (https://docs.oracle.com/en/java/javase/13/docs/specs/man/jstack.html). Take the jstack output a few different times to work out what is happening while it is stalled.
    2. Could you give more information about how you ran the import to your GenomicsDB? Was it in batches?
    3. Could you create a new GenomicsDB workspace with all the individuals on a small interval, for example 195,000-200,000 and see if the same issue persists?
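
The repeated jstack inspection in step 1 can be sketched as below. It assumes `jstack` is on the PATH, that you know the Java process ID, and that the interesting work happens on the thread named "main"; the helper names are made up for this example.

```python
import subprocess
import time


def jstack_cmd(pid):
    """Command line for one thread dump of the given Java process."""
    return ["jstack", str(pid)]


def thread_frames(dump, thread_name="main", depth=5):
    """Return the first `depth` stack frames of the named thread from jstack text."""
    frames, in_thread = [], False
    for line in dump.splitlines():
        if line.startswith('"'):
            # A quoted name at column 0 starts a new thread section.
            in_thread = line.startswith(f'"{thread_name}"')
        elif in_thread and line.strip().startswith("at "):
            frames.append(line.strip()[3:])
            if len(frames) == depth:
                break
    return frames


def sample_stacks(pid, samples=3, pause_s=60):
    """Take several dumps spaced pause_s seconds apart and keep the top frames of each."""
    seen = []
    for i in range(samples):
        dump = subprocess.run(jstack_cmd(pid), capture_output=True,
                              text=True, check=True).stdout
        seen.append(thread_frames(dump))
        if i < samples - 1:
            time.sleep(pause_s)
    return seen
```

If every sample returned by `sample_stacks` shows the same top frames, the process is probably stuck, or very slow, in that call. Frames that change between samples suggest it is still making progress.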

    And something to try for a possible improvement:

    1. What version of GATK did you use to create the GenomicsDB? We have been improving GenomicsDB quite a bit, and our most up-to-date version is 4.1.9.0. If you re-run the import you may see better performance, and it may help the GenotypeGVCFs process. (You will need to use the option --genomicsdb-shared-posixfs-optimizations for your cluster.)
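
A re-import along these lines might be assembled as in the sketch below, here restricted to the small test interval mentioned earlier (195,000-200,000). The workspace path, sample map file and batch size are placeholders, not values from this thread.

```python
def genomicsdb_import_cmd(workspace, sample_map, region,
                          batch_size=None, tmp_dir=None):
    """Assemble a GenomicsDBImport call for a single interval."""
    cmd = [
        "gatk", "GenomicsDBImport",
        "--genomicsdb-workspace-path", workspace,  # new workspace directory
        "--sample-name-map", sample_map,           # sample name / GVCF path pairs
        "-L", region,
        # Option recommended above for shared POSIX filesystems on clusters.
        "--genomicsdb-shared-posixfs-optimizations",
    ]
    if batch_size is not None:
        cmd += ["--batch-size", str(batch_size)]   # import samples in batches
    if tmp_dir:
        cmd += ["--tmp-dir", tmp_dir]
    return cmd


# Placeholder paths; a 5 kb window just before the position where runs stall.
cmd = genomicsdb_import_cmd("test_workspace", "samples.map",
                            "Chr01:195000-200000", batch_size=50)
print(" ".join(cmd))
```

Running the same GenotypeGVCFs command against this small workspace would show whether the stall follows the data or the original workspace.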