Convert CombineGVCFs result to GenomicsDB [Repost]
The link to the previous post is here:
[gatk 4.1.4.1]
Sorry that I didn't follow up on the previous post, which is now closed.
Tiffany Miller suggested using GenomicsDB to merge a new single-sample gvcf with the multi-sample gvcf produced by CombineGVCFs. I tried GenomicsDBImport, but it raised an error saying that it only accepts single-sample gvcfs as input.
I then tried using SelectVariants to subset the multi-sample gvcf into single-sample gvcfs and feeding those to GenomicsDBImport. This approach turned out to be issue-prone: many INFO fields, such as MQ, differ after running GenotypeGVCFs (compared with importing the original single-sample gvcfs into GenomicsDB directly), and the subsequent VQSR step may complain about a lack of variance in MQ in the vcf file produced this way (CombineGVCFs -> SelectVariants -> GenomicsDBImport).
Is there a way to convert a multi-sample gvcf produced by CombineGVCFs directly to GenomicsDB?
Thank you!
[old post]
====
Hello,
I've posted this question in the old forum with no response, so I'm posting here again.
I have legacy multi-sample VCF files produced by CombineGVCFs a long time ago, and the single-sample gVCF files for each individual are no longer available. I'm wondering if there is a way to convert the CombineGVCFs output to GenomicsDB format?
Thank you.
Jing
====
-
Hi Jing Yu
I am not sure it is a good idea to convert CombineGVCFs output to GenomicsDB format. As mentioned in this doc, there is a way to incrementally add new GVCFs to an existing GenomicsDB database, but we do not recommend adding a multi-sample gvcf created by CombineGVCFs to a GenomicsDB database. I will, however, confirm with my team and get back to you.
-
Thank you Bhanu.
The reason I want to do such a conversion is a bit exceptional. I have some multi-sample GVCFs from many years ago, with their BAMs hard to trace and their single-sample GVCFs nowhere to be found.
By the way, is there a way to merge two GenomicsDBs?
-
Hello,
Any update on this?
-
Hi Jing Yu
I confirmed with the team, and there isn't a good way to add a new gvcf to a multi-sample gvcf (created by CombineGVCFs) using GenomicsDBImport. Merging two GenomicsDBs is not possible either.
You can, however, incrementally add new GVCFs to an existing GenomicsDB database as described here: https://gatk.broadinstitute.org/hc/en-us/articles/360035891051-GenomicsDB
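For illustration only, here is a minimal Python sketch of what such an incremental import command might look like. It only builds the argument list and does not run GATK. The workspace name `my_database` and the file name `sample_new.g.vcf.gz` are hypothetical, and I am assuming the `--genomicsdb-update-workspace-path` option is available in your GATK version; please check `gatk GenomicsDBImport --help` before relying on it.

```python
from typing import List


def genomicsdb_update_cmd(workspace: str, new_gvcfs: List[str]) -> List[str]:
    """Build (but do not run) a GenomicsDBImport command that adds
    single-sample GVCFs to an existing workspace.

    Assumption: the installed GATK supports
    --genomicsdb-update-workspace-path (incremental import).
    """
    if not new_gvcfs:
        raise ValueError("need at least one single-sample GVCF to add")
    cmd = ["gatk", "GenomicsDBImport",
           "--genomicsdb-update-workspace-path", workspace]
    for gvcf in new_gvcfs:
        # Each input is still expected to be a single-sample GVCF.
        cmd += ["-V", gvcf]
    return cmd


if __name__ == "__main__":
    # Hypothetical names, printed for inspection only.
    print(" ".join(genomicsdb_update_cmd("my_database", ["sample_new.g.vcf.gz"])))
```

Note that this does not help with multi-sample gvcfs: every `-V` input still has to be a single-sample GVCF.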
-
Thank you for your reply. What I'm trying to ask is slightly different:
Is there a way to convert a multi-sample gvcf (created by CombineGVCFs) to a GenomicsDB?
-
Hi Jing Yu
There isn't a way to convert a multi-sample gvcf (created by CombineGVCFs) to a GenomicsDB.
-
Hi Bhanu Gandham,
I am facing a similar problem now, since GenotypeGVCFs only permits a single input.
I have several multi-sample gvcfs, each containing 250 samples. The single-sample gvcf files were created a long time ago, and it is really difficult to trace them back.
I tried CombineGVCFs, but since the total sample size is over 2000 and all of them are WGS data, CombineGVCFs took one week and returned a truncated file that is inaccessible.
Meanwhile, I cannot use GenomicsDB, since it only accepts single-sample gvcfs as input and each of my gvcf files contains 250 samples.
I am looking forward to any suggestions for my situation.
YUE
-
Hi JI YUE
You can use GenomicsDB to combine your several multi-sample gvcfs and then feed that to GenotypeGVCFs.
Take a look at this doc: https://gatk.broadinstitute.org/hc/en-us/articles/360035889971--How-to-Consolidate-GVCFs-for-joint-calling-with-GenotypeGVCFs
-
Hi Bhanu Gandham,
I don't think GenomicsDBImport can take multi-sample GVCFs. The article you cited used GenomicsDBImport to import single-sample GVCFs.
-
-
- You could convert the multi-sample gVCFs into single-sample gVCFs using SelectVariants, as long as you know (or can get) all the sample names.
- You could also try running CombineGVCFs hierarchically: split the 2000 samples into 20 groups of 100, run CombineGVCFs on each group so that all samples end up in 20 gVCFs, then run CombineGVCFs on those 20 to get a single combined gVCF.
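To make the hierarchical idea concrete, here is a rough Python sketch that only plans the commands (it does not run anything). The file names, the `group_NN.g.vcf.gz` naming scheme, and the helper functions are all hypothetical; the inputs can be whatever gVCFs you have on hand.

```python
from typing import List, Sequence


def chunk(items: Sequence[str], size: int) -> List[List[str]]:
    """Split items into consecutive groups of at most `size` elements."""
    if size < 1:
        raise ValueError("group size must be at least 1")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def combine_cmd(reference: str, inputs: Sequence[str], output: str) -> List[str]:
    """Build (but do not run) one CombineGVCFs command."""
    cmd = ["gatk", "CombineGVCFs", "-R", reference]
    for path in inputs:
        cmd += ["-V", path]
    cmd += ["-O", output]
    return cmd


def hierarchical_plan(reference: str, gvcfs: Sequence[str],
                      group_size: int, final_output: str) -> List[List[str]]:
    """Return the first-round commands (one per group) followed by the
    final command that merges the intermediate group gVCFs."""
    first_round = []
    intermediates = []
    for idx, group in enumerate(chunk(gvcfs, group_size)):
        out = f"group_{idx:02d}.g.vcf.gz"  # hypothetical naming scheme
        intermediates.append(out)
        first_round.append(combine_cmd(reference, group, out))
    return first_round + [combine_cmd(reference, intermediates, final_output)]
```

With 2000 inputs and a group size of 100, this yields 20 first-round commands plus one final merge. The first-round commands are independent of each other, so they could be run in parallel if memory allows.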
It is odd that CombineGVCFs ran for one week and returned a truncated file. Can you please confirm whether this issue is reproducible? Also, please provide the memory specs of the machine you are using.
-
Hi Bhanu Gandham,
I've tried the first approach and it was error-prone, as my first post illustrated. Basically, some of the INFO fields in the subsetted single-sample gVCFs, such as MQ, inherit their values from the multi-sample gVCF. This may raise a 'lack of variance' error in the downstream VQSR step, since all single-sample gVCFs subsetted from a multi-sample gVCF will have exactly the same MQ value per site.
I'm wondering if the variance is recorded somewhere in the multi-sample gVCF file?
-
Hi Jing Yu
You are right, the first approach is error-prone. We gave this a lot of thought and, unfortunately, I don't think it is possible to resolve this issue with GenomicsDBImport. We are also learning the limitations of GenomicsDBImport as we go, and it seems to us that we need to enable GenomicsDBImport to accept multi-sample gvcfs. We have opened a ticket for it: https://github.com/broadinstitute/gatk/issues/6530
However, that might take a few months to accomplish. In the meantime, our best bet is to use CombineGVCFs to combine your multi-sample gvcfs. Yes, it will take a long time with CombineGVCFs since you have ~2000 samples, but this is still your best option. So let's try to resolve the issue you are facing with CombineGVCFs.
You mentioned that CombineGVCFs generates a truncated file; we think this might be due to the tool running out of memory. Can you please provide the log output from your CombineGVCFs run? It will help us troubleshoot.
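If it helps, below is a small Python sketch of one way to rerun CombineGVCFs with a larger Java heap while capturing the full log to a file. The heap size, file names, and helper names are hypothetical, and whether more heap actually fixes the truncation is only a guess on our side at this point.

```python
import subprocess
from typing import List, Sequence


def combine_with_heap(reference: str, inputs: Sequence[str],
                      output: str, heap_gb: int) -> List[str]:
    """Build a CombineGVCFs command with an explicit JVM heap limit,
    passed through the gatk wrapper's --java-options."""
    cmd = ["gatk", "--java-options", f"-Xmx{heap_gb}g",
           "CombineGVCFs", "-R", reference]
    for path in inputs:
        cmd += ["-V", path]
    cmd += ["-O", output]
    return cmd


def run_and_log(cmd: Sequence[str], log_path: str) -> int:
    """Run the command, writing stdout and stderr to log_path.
    Returns the exit code; a non-zero value means the run failed."""
    with open(log_path, "w") as log:
        result = subprocess.run(list(cmd), stdout=log, stderr=subprocess.STDOUT)
    return result.returncode
```

A non-zero exit code together with an OutOfMemoryError in the log would support the memory explanation; the log file is also what we would like to see attached here.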
-
Hi Bhanu Gandham,
Thank you for your answer and also for the ticket. Looking forward to seeing this addressed.