Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data



CombineGVCFs errors and questions: inconsistent references & running CombineGVCFs twice


16 comments

  • Genevieve Brandt (she/her)

    Hi Mingzhou,

    Thanks for writing in with all this information. This is very helpful in figuring out the best path forward. For your case, I would recommend running CombineGVCFs for all samples, but separating the runs by chromosome or intervals. You can then merge the per-interval GVCFs with MergeVCFs.
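
    As a rough sketch of that per-interval approach (the reference, sample paths, and interval names below are just placeholders), each run could look something like:

        # combine all samples for one interval; repeat per chromosome/interval
        gatk CombineGVCFs \
            -R reference.fasta \
            -V sample1.g.vcf.gz -V sample2.g.vcf.gz \
            -L chr1 \
            -O combined.chr1.g.vcf.gz

        # then merge the per-interval GVCFs back into a single file
        gatk MergeVcfs \
            -I combined.chr1.g.vcf.gz -I combined.chr2.g.vcf.gz \
            -O combined.all.g.vcf.gz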

    The sites you found that were unique to the GVCFs that went through CombineGVCFs only once (*,<NON_REF> or <NON_REF>) are reference-block sites that are helpful when running GenotypeGVCFs and doing joint calling. The method I outlined above is the one most similar to our current best practices. Our production pipeline uses MergeVCFs after running GenomicsDBImport on separate intervals.

    Although I am not totally clear on why CombineGVCFs produces different results when run twice, I don't think you should run it that way, because it is quite different from our recommendations and those sites you found could definitely make a difference.

    Hope this helps with your analysis!

    Genevieve

  • Mingzhou Fu

    Hi Genevieve,

    Thank you very much for your quick reply! Yes, I definitely wanted to follow the method you suggested (which I did for some chromosomes), but this method was not feasible given the per-user job and memory limits on our HPC. (I did argue with our HPC team, but they refused to raise the limits...)

    Since I should not run CombineGVCFs twice, I'm wondering if it is okay to switch back to GATK3 and run GenotypeGVCFs with the three batch g.vcf files? I'm not sure if it's valid to feed data generated by GATK4 into GATK3.

    Thank you!

    Mingzhou

  • Genevieve Brandt (she/her)

    Hi Mingzhou,

    I see. I generally wouldn't recommend using data from GATK4 with GATK3. I'll reach out to people on my team who might know more about this, but it might take some time to get an answer since this is outside our best practices.

    Best,

    Genevieve

  • Mingzhou Fu

    That would be great, Genevieve! Thank you so much for the help so far!

  • Mingzhou Fu

    Hi Genevieve,

    Since the time- and memory-limiting step of CombineGVCFs/GenomicsDBImport for my ~3000 g.vcf files is that the ~2000 WGS g.vcf files are too large to read in, I'm wondering if I can use SelectVariants to trim the WGS g.vcf files down to the target regions used by my WES data? In general, can I pre-filter the g.vcf files of my WGS data and keep only the sites within specified target regions?

    Thank you!
  • Genevieve Brandt (she/her)

    Yes, you could do that. You can run SelectVariants with the intervals (-L) option to keep only the sites within your WES intervals.
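
    For example, a minimal sketch (the file and interval-list names are placeholders) could look like:

        # restrict a WGS gVCF to the WES capture targets
        gatk SelectVariants \
            -R reference.fasta \
            -V wgs_sample.g.vcf.gz \
            -L wes_targets.interval_list \
            -O wgs_sample.wes_targets.g.vcf.gz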

  • Genevieve Brandt (she/her)

    Hi Mingzhou Fu,

    I got some more feedback from our dev team to share:

    • We would not recommend going back to GATK3 for GenotypeGVCFs; the annotations will not be compatible and your results will not be accurate.
    • A better solution would probably be running WGS and WES separately at this point, since you have not found another solution.
    • When you are running your jobs with -Xmx120G, how much physical memory is available for the job? You will want quite a bit available since 120G is already a lot (see the sketch after this list).
    • You might want to think about taking this research to the cloud, because 3000 samples is a very large cohort. We also offer Terra, which is a great service and has excellent support! It wouldn't necessarily be faster, but you would have more flexibility with the jobs.
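
    As a rough sketch of how the Java heap relates to the job's memory request (the tool, paths, and numbers here are only illustrative, and the exact scheduler request depends on your HPC):

        # ask the scheduler for more physical memory than the Java heap
        # (e.g. ~150G for the job when the heap is capped at 120G),
        # so the process has headroom for non-heap/native memory
        gatk --java-options "-Xmx120g" GenomicsDBImport \
            --genomicsdb-workspace-path my_database \
            --sample-name-map cohort.sample_map \
            -L chr1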

    Let me know if I can be of any more help.

    Best,

    Genevieve

  • Mingzhou Fu

    Hi Genevieve,

    Thank you so much for the reply! 

    If I run WGS and WES separately, how can I put them all together for GenotypeGVCFs? I want the 1KG WGS data to be joint-called together with my samples to improve variant detection.

    I requested 150G of physical memory for -Xmx120G. I found on the forum that the Java heap should be about 80%-90% of the physical memory.

    Yes, I did talk about it with my PI, since I'm not the only one in the lab who got stuck...

    Thank you!

    Mingzhou

  • Mingzhou Fu

    I got this WARN when using CombineGVCFs. I guess this is probably because my WES samples were captured with different capture kits (and thus different target regions), and I used the union of the target regions (as regions of interest) in the -L option for CombineGVCFs. I just wonder whether this WARN would affect any later analysis/results, given that I only care about the variants within my regions of interest?

    WARN  CombineGVCFs - You have asked for an interval that cuts in the middle of one or more gVCF blocks. Please note that this will cause you to lose records that don't end within your interval.

  • Genevieve Brandt (she/her)

    Mingzhou Fu, we don't have a way for you to combine the samples as you are describing; you would have to run GenotypeGVCFs separately.
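
    As a rough sketch (the file names are placeholders), each separate run would look something like:

        # genotype the WES cohort on its own; repeat analogously for the WGS cohort
        gatk GenotypeGVCFs \
            -R reference.fasta \
            -V combined.wes.g.vcf.gz \
            -O genotyped.wes.vcf.gz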

    You could lose records that span the end of an interval. If these are long records and you lose the entire record, you could potentially be losing data that would have been helpful.

  • Mingzhou Fu

    Thank you, Genevieve! Regarding the lost records, is there a quick way to figure out which records were lost? Should I compare the sites/positions in each input gVCF vs. the sites/positions in the combined gVCF?
  • Genevieve Brandt (she/her)

    Yes, you could investigate it that way. Start positions missing from your combined GVCF will have been lost.
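
    For example, a quick way to list start positions that are present in one of your input gVCFs but missing from the combined gVCF (the file names are placeholders; this assumes bgzipped files and standard Unix tools):

        # extract CHROM and POS from each file, then report positions present only in the input
        zcat sample1.g.vcf.gz  | grep -v '^#' | awk '{print $1"\t"$2}' | sort -u > input_sites.txt
        zcat combined.g.vcf.gz | grep -v '^#' | awk '{print $1"\t"$2}' | sort -u > combined_sites.txt
        comm -23 input_sites.txt combined_sites.txt > lost_start_positions.txt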

  • Mingzhou Fu

    Gotcha! Thank you, Genevieve! Last question regarding your previous suggestion - "I think for your case I would recommend running CombineGVCFs for all samples, but separating the runs by chromosome or intervals. You can then merge all the GVCFs for each interval with MergeVCFs" - why do we use MergeVCFs instead of GatherVcfs? Both documentation pages say they put together VCF files from the same samples. I just wonder what the difference is between the two?

  • Genevieve Brandt (she/her)

    They are very similar. I am not totally sure of the differences, but I believe GatherVcfs was written specifically for our WDL production pipelines. For your case I would probably stick with MergeVCFs.

  • Mingzhou Fu

    I see. Thank you very much for the explanation :)

  • Genevieve Brandt (she/her)

    You're welcome!

