Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Convert CombineGVCFs result to GenomicsDB [Repost]

Answered
14 comments

  • Bhanu Gandham

    Hi Jing Yu

    I am not sure it is a good idea to convert CombineGVCFs output to GenomicsDB format. As mentioned in this doc, there is a way to incrementally add new GVCFs to an existing GenomicsDB database, but we do not recommend adding a multi-sample GVCF created by CombineGVCFs to a GenomicsDB database. I will, however, confirm with my team and get back to you.

  • Jing Yu

    Thank you Bhanu.

    The reason I want to do this conversion is a bit unusual. I have some multi-sample GVCFs from many years ago; their BAMs are hard to trace and their single-sample GVCFs are nowhere to be found.

    By the way, is there a way to merge two GenomicsDBs?

  • Jing Yu

    Hello,

    Any update on this?

  • Bhanu Gandham

    Hi Jing Yu

    I confirmed with the team: there isn't a good way to add a new GVCF to a multi-sample GVCF (created by CombineGVCFs) using GenomicsDBImport. Merging two GenomicsDBs is not possible either.

    You can, however, incrementally add new GVCFs to an existing GenomicsDB database as described here: https://gatk.broadinstitute.org/hc/en-us/articles/360035891051-GenomicsDB
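
The incremental route mentioned in this reply can be sketched as follows. This is a minimal Python sketch that only assembles the argument list and does not run GATK. The workspace path and GVCF file names are hypothetical, and it assumes a GATK release in which GenomicsDBImport supports `--genomicsdb-update-workspace-path`; the linked article is the authority on supported versions and options.

```python
from typing import List


def build_incremental_import_cmd(workspace: str, new_gvcfs: List[str]) -> List[str]:
    """Assemble a GenomicsDBImport call that appends samples to an existing workspace.

    Each input is expected to be a single-sample GVCF, as discussed in this thread.
    """
    if not new_gvcfs:
        raise ValueError("at least one single-sample GVCF is required")
    cmd = ["gatk", "GenomicsDBImport"]
    for gvcf in new_gvcfs:
        cmd += ["-V", gvcf]
    # Point at the existing workspace instead of creating a new one.
    cmd += ["--genomicsdb-update-workspace-path", workspace]
    return cmd


# Hypothetical paths, for illustration only.
cmd = build_incremental_import_cmd(
    "my_database", ["sample_A.g.vcf.gz", "sample_B.g.vcf.gz"]
)
print(" ".join(cmd))
```

Backing up the workspace before an update is a sensible precaution, since a failed import could leave it in an inconsistent state.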

  • Jing Yu

    Hi Bhanu Gandham

    Thank you for your reply. What I'm trying to ask is slightly different:

    Is there a way to convert a multi-sample gvcf (created by CombineGVCFs) to a GenomicsDB?

  • Bhanu Gandham

    Hi Jing Yu

    There isn't a way to convert a multi-sample gvcf (created by CombineGVCFs) to a GenomicsDB.

  • JI YUE

    Hi Bhanu Gandham,

    I am facing a similar problem, since GenotypeGVCFs only permits a single input.

    I have several multi-sample GVCFs, each containing 250 samples. The single-sample GVCF files were created a long time ago, and it is really difficult to trace them back.

    I tried CombineGVCFs, but since the total sample count is over 2000 and all of it is WGS data, CombineGVCFs ran for a week and gave me a truncated file that cannot be read.

    Meanwhile, I cannot use GenomicsDB, since it only accepts single-sample GVCFs as input and each of my GVCF files contains 250 samples.

    I am looking forward to any suggestions for my situation.

    YUE

  • Bhanu Gandham

    Hi JI YUE

    You can use GenomicsDB to combine your several multi-sample GVCFs and then feed that to GenotypeGVCFs.

    Take a look at this doc: https://gatk.broadinstitute.org/hc/en-us/articles/360035889971--How-to-Consolidate-GVCFs-for-joint-calling-with-GenotypeGVCFs

  • Jing Yu

    Hi Bhanu Gandham,

    I don't think GenomicsDBImport can take multi-sample GVCFs. The article you cited used GenomicsDBImport to import single-sample GVCFs.

  • Bhanu Gandham

    Hi Jing Yu and JI YUE

    Jing Yu is absolutely right: GenomicsDBImport cannot take multi-sample GVCFs. My mistake. JI YUE, it seems to me that CombineGVCFs is the only obvious solution here. However, I am investigating other options and will get back to you if I find one.

  • Bhanu Gandham

    JI YUE

    1. You could convert the multi-sample gVCFs into single-sample gVCFs using SelectVariants, as long as you know (or can get) all the sample names.
    2. You could also try running CombineGVCFs hierarchically: split the 2000 samples into 20 groups of 100, run CombineGVCFs on each group so that all samples end up in 20 gVCFs, then run CombineGVCFs on those 20 to get a single combined gVCF.
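
The two options above can be sketched as command plans. This is a minimal Python sketch under stated assumptions: it only builds argument lists (nothing is executed), the reference and file names are hypothetical, and the `round<N>_group<NNN>` output naming is invented for illustration. Note that a later reply in this thread reports problems with option 1.

```python
from typing import List


def select_sample_cmd(reference: str, multi_gvcf: str, sample: str, out: str) -> List[str]:
    """Option 1: pull one sample out of a multi-sample gVCF with SelectVariants."""
    return ["gatk", "SelectVariants", "-R", reference, "-V", multi_gvcf,
            "--sample-name", sample, "-O", out]


def combine_cmd(reference: str, inputs: List[str], out: str) -> List[str]:
    """One CombineGVCFs call merging `inputs` into `out`."""
    cmd = ["gatk", "CombineGVCFs", "-R", reference]
    for path in inputs:
        cmd += ["-V", path]
    return cmd + ["-O", out]


def plan_hierarchical_combine(reference: str, inputs: List[str],
                              group_size: int, prefix: str = "round") -> List[List[str]]:
    """Option 2: combine in rounds of at most `group_size` inputs until one file is left."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    commands: List[List[str]] = []
    current = list(inputs)
    level = 0
    while len(current) > 1:
        level += 1
        next_level = []
        for start in range(0, len(current), group_size):
            group = current[start:start + group_size]
            if len(group) == 1:
                next_level.append(group[0])  # nothing to merge; carry it forward
                continue
            out = f"{prefix}{level}_group{start // group_size:03d}.g.vcf.gz"
            commands.append(combine_cmd(reference, group, out))
            next_level.append(out)
        current = next_level
    return commands


# 2000 hypothetical inputs in groups of 100: 20 first-round merges plus 1 final merge.
inputs = [f"sample_{i:04d}.g.vcf.gz" for i in range(2000)]
plan = plan_hierarchical_combine("ref.fasta", inputs, group_size=100)
print(len(plan))
```

The same planner applies unchanged when the inputs are already multi-sample gVCFs; only the number of files per group changes.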

    CombineGVCFs running for a week and returning a truncated file is strange. Can you please confirm whether this issue is reproducible? Also, please provide the memory specs of the machine you are using.
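
One quick way to check whether an output file was cut short is sketched below. It assumes the CombineGVCFs output was written block-compressed (for example as `.g.vcf.gz`), which the thread does not state. A complete BGZF file ends with a fixed 28-byte empty block, so a missing marker points to truncation; a present marker does not by itself prove the content is complete.

```python
import os

# The 28-byte empty BGZF block that terminates a complete bgzip-compressed file.
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00"
    " 1b 00 03 00 00 00 00 00 00 00 00 00"
)


def has_bgzf_eof(path: str) -> bool:
    """Return True if the file ends with the BGZF end-of-file marker."""
    if os.path.getsize(path) < len(BGZF_EOF):
        return False
    with open(path, "rb") as handle:
        handle.seek(-len(BGZF_EOF), os.SEEK_END)
        return handle.read() == BGZF_EOF
```

For example, `has_bgzf_eof("combined.g.vcf.gz")` (a hypothetical file name) returning `False` would be consistent with the tool having been killed before it finished writing.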

  • Jing Yu

    Hi Bhanu Gandham,

    I've tried the first approach and it was error-prone. My first post illustrated this. Basically, some INFO fields in the subsetted single-sample gVCF, such as MQ, inherit their values from the multi-sample gVCF. This may raise a 'lack of variance' error in the downstream VQSR step, since all single-sample gVCFs subsetted from a multi-sample gVCF will have exactly the same MQ value per site.

    I'm wondering if variance is recorded somewhere in the multi-sample gVCF file?
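
The inheritance described in this comment can be illustrated with a toy model. The sketch below is not SelectVariants and makes no claim about its implementation; it simply keeps one sample column of a made-up VCF record while copying the INFO column verbatim, which is the behaviour the comment reports for MQ. The record, sample names, and values are invented.

```python
from typing import List, Optional


def subset_record(record: str, samples: List[str], keep: str) -> str:
    """Keep the 9 fixed VCF columns plus one sample column; INFO is copied as-is."""
    fields = record.rstrip("\n").split("\t")
    column = 9 + samples.index(keep)
    return "\t".join(fields[:9] + [fields[column]])


def info_value(record: str, key: str) -> Optional[str]:
    """Read one key from the INFO column (the 8th tab-separated field)."""
    for item in record.split("\t")[7].split(";"):
        name, _, value = item.partition("=")
        if name == key:
            return value
    return None


# Made-up three-sample record with a site-level MQ annotation.
samples = ["S1", "S2", "S3"]
record = "\t".join([
    "chr1", "1000", ".", "A", "G,<NON_REF>", ".", ".",
    "DP=90;MQ=58.20", "GT:DP", "0/1:30", "0/0:35", "0/1:25",
])

mq_values = {s: info_value(subset_record(record, samples, s), "MQ") for s in samples}
print(mq_values)
```

Every per-sample record carries the same site-level MQ, even though the per-sample genotype fields differ.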

  • Bhanu Gandham

    Hi Jing Yu

    You are right, the first approach is error-prone. We gave this a lot of thought and, unfortunately, I don't think it is possible to resolve this issue with GenomicsDBImport. We are also learning the limitations of GenomicsDBImport as we go, and it seems to us that we need to enable GenomicsDBImport to accept multi-sample GVCFs. We have opened a ticket for it: https://github.com/broadinstitute/gatk/issues/6530

    However, that might take a few months to accomplish. In the meantime, our best bet is to use CombineGVCFs to combine your multi-sample GVCFs. Yes, it will take a long time since you have ~2000 samples, but it is still your best option. So let's try to resolve the issue you are facing with CombineGVCFs.

    You mentioned that CombineGVCFs generates a truncated file; we think this might be due to the tool running out of memory. Can you please provide the log output from your CombineGVCFs run? It will help us troubleshoot.
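
Capturing the log for such a run could look like the sketch below. It is a minimal Python wrapper, not an official procedure: the heap size, file names, and log path are hypothetical and must be adapted to the machine. It passes an explicit JVM heap limit through the `gatk` wrapper's `--java-options` and writes both stdout and stderr to one file.

```python
import subprocess
from typing import List


def with_java_heap(tool_args: List[str], heap: str) -> List[str]:
    """Prefix a GATK tool invocation with an explicit JVM heap limit (e.g. '32g')."""
    return ["gatk", "--java-options", f"-Xmx{heap}"] + tool_args


def run_and_log(cmd: List[str], log_path: str) -> int:
    """Run `cmd`, send stdout and stderr to `log_path`, and return the exit code."""
    with open(log_path, "w") as log:
        return subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT).returncode


# Hypothetical invocation; uncomment and adapt the paths and heap size before use.
# cmd = with_java_heap(
#     ["CombineGVCFs", "-R", "ref.fasta",
#      "-V", "batch1.g.vcf.gz", "-V", "batch2.g.vcf.gz",
#      "-O", "combined.g.vcf.gz"],
#     heap="32g",
# )
# exit_code = run_and_log(cmd, "combine_gvcfs.log")
```

A non-zero exit code together with a `java.lang.OutOfMemoryError` in the log would support the out-of-memory hypothesis.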

  • Jing Yu

    Hi Bhanu Gandham,

    Thank you for your answer and also for the ticket. Looking forward to seeing this addressed.
