Merge different individual VCF
Hello!
I'm new to GATK and I have 3 vcf files from 3 different individuals: mother.vcf, father.vcf, and child.vcf. I want to obtain a single vcf file that would have all the variants of each individual in order to analyze the trio. After looking online, I learned that GATK has CombineVariants and MergeVcfs that are supposed to combine/merge the vcf files. However, I do not understand which tool is better for my task.
GATK version used: gatk-4.1.8.0
Thank you in advance!
-
Hi Linda Do,
First of all, there is no CombineVariants in GATK4. I think you were going for CombineGVCFs, which is used to combine GVCFs from different samples.
GATK offers two Picard tools for what you want and they have different prerequisites:
MergeVcfs merges complete (whole genome) VCFs from different individuals into one. <- I think this is the one you want.
GatherVcfs merges VCFs that contain all individuals but each cover a different part of the genome (for example, parts 1/3, 2/3 and 3/3 of the genome into a whole genome).
Best,
-
Hi ABours,
Thank you so much for your input!
-
I face a similar problem. I want to merge (or combine) two VCF files, each containing several different samples. From the tool index I thought it must be GatherVcfs or MergeVcfs. But both say that the sample list has to be identical:
- MergeVcfs (Picard):
"If there are samples, those must be the same across all input files."
- GatherVcfs (Picard):
"Simple little class that combines multiple VCFs that have exactly the same set of samples and totally discrete sets of loci."
Also I got error messages ("...has sample entries that don't match the other files...") about the sample list with both of the tools when I gave it a try.
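For readers who hit the same error, the sample lists of two files can be compared before choosing a tool. The following is only a minimal sketch in plain Python: the function names and the in-memory example headers (samples S1 to S3) are hypothetical, and it compares sample names as sets without reproducing the exact check that MergeVcfs or GatherVcfs performs.

```python
import io


def read_sample_names(handle):
    """Return the sample names from the #CHROM header line of a VCF stream."""
    for line in handle:
        if line.startswith("#CHROM"):
            fields = line.rstrip("\n").split("\t")
            # Columns 1-8 are fixed, column 9 is FORMAT, samples start at column 10.
            return fields[9:]
    raise ValueError("no #CHROM header line found")


def compare_samples(names_a, names_b):
    """Classify two sample lists as identical, disjoint or overlapping."""
    a, b = set(names_a), set(names_b)
    if a == b:
        return "identical"
    if a.isdisjoint(b):
        return "disjoint"
    return "overlapping"


# Hypothetical headers standing in for two real VCF files.
HEADER = "##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"
vcf_a = io.StringIO(HEADER + "\tS1\tS2\n")
vcf_b = io.StringIO(HEADER + "\tS3\n")

relation = compare_samples(read_sample_names(vcf_a), read_sample_names(vcf_b))
print(relation)
```

For real files, an opened file handle (or `gzip.open(path, "rt")` for compressed input) can be passed in place of the `io.StringIO` objects.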
So it seems neither MergeVcfs nor GatherVcfs is suitable. Is there any tool to merge VCF files with disjoint sample lists? Note also that both files have been annotated with SnpEff.
I'm using GATK 4.1.7.
Thanks for reading!
-
You are looking for either GenomicsDBImport or CombineGVCFs. They both do what you are looking for, but GenomicsDBImport is a newer tool and more optimized. CombineGVCFs can be slow. However, to view the combined output of GenomicsDBImport you will need to use SelectVariants.
-
Thanks for the reply Genevieve!
As far as I understand the documentation, both of these tools (GenomicsDBImport & CombineGVCFs) are supposed to use _G_VCF files as produced by HaplotypeCaller. Actually I've been using CombineGVCFs for this purpose.
Right now I need a tool to combine VCF files, as I don't have _G_VCF files. I got an already evaluated VCF dataset (all genotyping and filtering of false positives already done) from another group and want to add my own VCF data to this to get a larger data basis.
It might even be the case that there isn't an application for this task, but I'm not sure. So any confirmation in one direction or the other would be very helpful. Again, to be precise on my side: I need a tool to merge two disjoint VCF datasets (totally different individuals) that are already genotyped and filtered for false positives. This is not about _G_VCF files.
Thanks for your support!
-
Have you gotten any errors with these tools? They should accept VCF files.
-
Thanks for your reply Genevieve!
I'm confused. The documentation for CombineGVCFs states it must be a file produced by HaplotypeCaller:
Overview
Combine per-sample gVCF files produced by HaplotypeCaller into a multi-sample gVCF file.
...
Input
Two or more HaplotypeCaller GVCFs to combine.
I also got an error message confirming that a file produced by HaplotypeCaller is needed:
A USER ERROR has occurred: The list of input alleles must contain <NON_REF> as an allele but that is not the case at position 1364; please use the Haplotype Caller with gVCF output to generate appropriate records
Likewise GenomicsDBImport explicitly says it has to be a single sample GVCF file produced by HaplotypeCaller:
Overview
Import single-sample GVCFs into GenomicsDB before joint genotyping.
...
Input
One or more GVCFs produced by in HaplotypeCaller with the `-ERC GVCF` or `-ERC BP_RESOLUTION` settings, containing the samples to joint-genotype.
The error message I got from GenomicsDBImport confirmed that it has to be a single-sample GVCF file:
A USER ERROR has occurred: Input GVCF: path/to/vcf_file.vcf.gz was expected to contain a single sample but actually contained 81 samples.
Seems like there is no chance to use either of those tools.
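Both error messages point at properties that can be checked on the file itself: the number of samples, and whether each record lists <NON_REF> among its ALT alleles. Here is a rough sketch of such a check in plain Python; the function name and the two example records are made up for illustration, and it only mirrors the conditions named in the error messages, not the tools' full validation.

```python
def summarize_vcf(lines):
    """Count samples, records, and records whose ALT column lacks <NON_REF>."""
    n_samples = None
    n_records = 0
    missing_non_ref = 0
    for line in lines:
        if line.startswith("##"):
            continue  # meta-information lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            # 8 fixed columns plus FORMAT come before the sample columns.
            n_samples = max(len(fields) - 9, 0)
            continue
        n_records += 1
        alt_alleles = fields[4].split(",")
        if "<NON_REF>" not in alt_alleles:
            missing_non_ref += 1
    return {
        "samples": n_samples,
        "records": n_records,
        "missing_non_ref": missing_non_ref,
    }


# Hypothetical two-sample file: the first record looks like a plain VCF record,
# the second carries the <NON_REF> allele that HaplotypeCaller writes in GVCF mode.
demo = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n",
    "1\t1364\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t0/0\n",
    "1\t2000\t.\tC\tT,<NON_REF>\t50\t.\t.\tGT\t0/1\t0/0\n",
]

summary = summarize_vcf(demo)
print(summary)
```

A file with more than one sample, or with any record missing <NON_REF>, would be expected to trigger the errors quoted above.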
-
Hi Artamir, I apologize, I was wrong in my earlier comment. I confirmed with my team that these tools do not accept VCF files. In the future we may support it through GenomicsDBImport, however.
The options available that we could think of would be bcftools merge or CombineVariants. CombineVariants is a tool in GATK3 which we no longer support and bcftools merge is not in GATK. You may be able to find information about CombineVariants at our legacy forum page: https://sites.google.com/a/broadinstitute.org/legacy-gatk-forum-discussions/
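As a starting point for the bcftools route, here is a small Python sketch that assembles a `bcftools merge` call. It assumes bcftools is installed and on the PATH and that the inputs are bgzip-compressed and indexed, which `bcftools merge` expects; the file names and helper names are hypothetical, and nothing is executed unless `run_merge` is called.

```python
import subprocess


def build_index_command(path):
    """Command that writes a tabix (.tbi) index for a bgzip-compressed VCF."""
    return ["bcftools", "index", "-t", path]


def build_merge_command(inputs, output):
    """Command that merges several VCFs into one bgzip-compressed VCF."""
    if len(inputs) < 2:
        raise ValueError("bcftools merge needs at least two input files")
    # -O z selects compressed VCF output, -o names the output file.
    return ["bcftools", "merge", "-O", "z", "-o", output, *inputs]


def run_merge(inputs, output):
    """Index every input, then merge them (requires bcftools to be installed)."""
    for path in inputs:
        subprocess.run(build_index_command(path), check=True)
    subprocess.run(build_merge_command(inputs, output), check=True)


# Hypothetical file names, only used to show the resulting command line.
command = build_merge_command(["own.vcf.gz", "other_group.vcf.gz"], "merged.vcf.gz")
print(" ".join(command))
```

Samples that have no record at a given site are written as missing genotypes in the merged file, so the output is worth checking before any downstream analysis.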
-
Hi Genevieve,
thank you so much for your reply! Knowing that GATK 4 won't merge those files already helps a lot. Even more helpful is your hint about bcftools merge, which I just tried successfully!
Finally only the partial SnpEff annotation needs to be removed to perform a clean reannotation. Python or awk will easily remove the old annotation.
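The Python route could look roughly like the sketch below. It assumes the old annotation lives in the INFO keys ANN, LOF and NMD (or EFF for older SnpEff versions); that key list, the function name and the example record are assumptions to adapt to the actual files.

```python
SNPEFF_KEYS = ("ANN", "LOF", "NMD", "EFF")  # assumed SnpEff INFO keys


def strip_info_keys(line, keys=SNPEFF_KEYS):
    """Return the line without the given INFO keys, or None to drop the line."""
    if line.startswith("##INFO=<ID="):
        key = line[len("##INFO=<ID="):].split(",", 1)[0]
        # Drop the header definition of a removed key.
        return None if key in keys else line
    if line.startswith("#"):
        return line  # other header lines pass through unchanged
    fields = line.rstrip("\n").split("\t")
    kept = [entry for entry in fields[7].split(";")
            if entry.split("=", 1)[0] not in keys]
    # An empty INFO column is written as "." in VCF.
    fields[7] = ";".join(kept) if kept else "."
    return "\t".join(fields) + "\n"


# Hypothetical annotated record.
record = "1\t100\t.\tA\tG\t50\tPASS\tDP=10;ANN=G|missense_variant|MODERATE;AF=0.5\tGT\t0/1\n"
cleaned = strip_info_keys(record)
print(cleaned, end="")
```

Applied line by line to the merged file, skipping lines for which the function returns None, this leaves the genotypes untouched and removes only the listed INFO entries.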
Again thank you so much!
-
Hi Artamir thank you for the update and posting your solution for other users! Glad you found something that worked.
-
Hello Artamir
I am facing the same problem that you faced 3 years ago.
I tried to merge my query.vcf into reference.vcf using the bcftools merge command, but I was unable to obtain the desired results (Screenshot_attached). I am new to this field. Could you please provide some troubleshooting advice, or share the command or script that you used for merging two VCF files with disjoint lists of samples?
I would greatly appreciate it if you could review and provide your insights on how to resolve. Any suggestions or guidance you can offer to help me successfully merge the data while maintaining the sample integrity would be invaluable.
Thanks and Regards
Rashi
-
Hi Rashi verma,
Is there a variant record for this position (1:793643) present in both VCFs? If it's only present in one of the two VCFs, I would expect the resulting merged variant record to look like the one you posted, with no-calls for the samples from one of the VCFs.
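To make the expected behavior concrete, here is a toy model in Python of merging two call sets with disjoint samples. It is a simplified illustration of the outcome described above, not how bcftools implements the merge; the sample names, alleles and genotypes are invented, and only the position 1:793643 comes from your post.

```python
def merge_genotypes(sites_a, samples_a, sites_b, samples_b):
    """Union the sites of two call sets, filling absent calls with './.'."""
    merged = {}
    for key in sorted(set(sites_a) | set(sites_b)):
        row = {}
        for samples, sites in ((samples_a, sites_a), (samples_b, sites_b)):
            calls = sites.get(key)
            for index, sample in enumerate(samples):
                # A file without a record at this site contributes no-calls.
                row[sample] = calls[index] if calls is not None else "./."
        merged[key] = row
    return merged


# Hypothetical data: the site is present in the query file only.
query_samples = ["Q1", "Q2"]
query_sites = {("1", 793643, "A", "G"): ["0/1", "1/1"]}
reference_samples = ["R1"]
reference_sites = {("1", 800000, "C", "T"): ["0/0"]}

merged = merge_genotypes(query_sites, query_samples, reference_sites, reference_samples)
print(merged[("1", 793643, "A", "G")])
```

In this model the reference sample R1 gets "./." at 1:793643 because the reference file has no record there, which matches the kind of merged record described above.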
Regards,
David