Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Merge different individual VCF

  • ABours

    Hi Linda Do,

    First of all, there is no CombineVariants in GATK4. I think you meant CombineGVCFs, which combines GVCFs from different samples.

    GATK offers two Picard tools for what you want and they have different prerequisites:

    MergeVcfs merges complete (whole-genome) VCFs from different individuals into one. <- I think this is the one you want.

    GatherVcfs merges VCFs that contain all individuals but each cover a different part of the genome (for example, parts 1/3, 2/3 and 3/3 of the genome into a whole genome).

    Best,

  • Bhanu Gandham

    Hi ABours,

    Thank you so much for your input!

  • Artamir

    I'm facing a similar problem. I want to merge (or combine) two VCF files, each containing several different samples. From the tool index I thought the right tool must be GatherVcfs or MergeVcfs, but both say that the sample lists have to be identical:

    • MergeVcfs (Picard):
      "If there are samples, those must be the same across all input files."
    • GatherVcfs (Picard):
      "Simple little class that combines multiple VCFs that have exactly the same set of samples and totally discrete sets of loci."

    I also got error messages about the sample list ("...has sample entries that don't match the other files...") when I tried both tools.

    So it seems neither MergeVcfs nor GatherVcfs is suitable. Is there any tool to merge VCF files with disjoint sample lists? Note also that both files have been annotated with SnpEff.

    I'm using GATK 4.1.7.

    Thanks for reading!
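
The sample-list requirement quoted above can be checked before running either Picard tool. The following is a minimal sketch, not part of GATK or Picard: the function names and the two in-memory headers are hypothetical, and it only assumes the standard VCF layout in which sample names follow the nine fixed columns of the `#CHROM` header line.

```python
import io


def read_sample_names(handle):
    """Return the sample names from the #CHROM header line of a VCF stream."""
    for line in handle:
        if line.startswith("#CHROM"):
            columns = line.rstrip("\n").split("\t")
            # Columns 0-8 are the fixed fields (CHROM ... FORMAT); samples follow.
            return columns[9:]
    raise ValueError("no #CHROM header line found")


def compare_samples(samples_a, samples_b):
    """Classify two sample lists as 'identical', 'disjoint' or 'overlapping'."""
    set_a, set_b = set(samples_a), set(samples_b)
    if set_a == set_b:
        return "identical"
    if set_a.isdisjoint(set_b):
        return "disjoint"
    return "overlapping"


# Hypothetical headers standing in for two real files.
FIXED = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"
header_a = "##fileformat=VCFv4.2\n" + FIXED + "\tS1\tS2\n"
header_b = "##fileformat=VCFv4.2\n" + FIXED + "\tS3\tS4\n"

relation = compare_samples(
    read_sample_names(io.StringIO(header_a)),
    read_sample_names(io.StringIO(header_b)),
)
print(relation)
```

For real files, the same functions can be fed an open file handle, for example `gzip.open(path, "rt")` for a compressed VCF. A result other than "identical" suggests that MergeVcfs and GatherVcfs will reject the inputs, as described in the comment above.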

  • Genevieve Brandt (she/her)

    You are looking for either GenomicsDBImport or CombineGVCFs. They both do what you are looking for, but GenomicsDBImport is a newer tool and more optimized. CombineGVCFs can be slow. However, to view the combined output of GenomicsDBImport you will need to use SelectVariants.

  • Artamir

    Thanks for the reply Genevieve!

    As far as I understand the documentation, both of these tools (GenomicsDBImport and CombineGVCFs) expect _G_VCF files as produced by HaplotypeCaller. In fact, I've been using CombineGVCFs for exactly that purpose.

    Right now I need a tool to combine VCF files, as I don't have _G_VCF files. I received an already processed VCF dataset (all genotyping and filtering of false positives already done) from another group and want to add my own VCF data to it to get a larger dataset.

    It may be that there simply isn't a tool for this task, but I'm not sure, so confirmation either way would be very helpful. To be precise: I need a tool to merge two disjoint VCF datasets (entirely different individuals) that are already genotyped and filtered for false positives. This is not about _G_VCF files.

    Thanks for your support!

  • Genevieve Brandt (she/her)

    Have you gotten any errors with these tools? They should accept VCF files.

  • Artamir

    Thanks for your reply Genevieve!

    I'm confused. The documentation for CombineGVCFs states it must be a file produced by HaplotypeCaller:

    Overview
    Combine per-sample gVCF files produced by HaplotypeCaller into a multi-sample gVCF file.
    ...
    Input
    Two or more HaplotypeCaller GVCFs to combine.

    I also got an error message confirming that a file produced by HaplotypeCaller is needed:

    A USER ERROR has occurred: The list of input alleles must contain <NON_REF> as an allele but that is not the case at position 1364; please use the Haplotype Caller with gVCF output to generate appropriate records

     

    Likewise GenomicsDBImport explicitly says it has to be a single sample GVCF file produced by HaplotypeCaller:

    Overview
    Import single-sample GVCFs into GenomicsDB before joint genotyping.
    ...
    Input
    One or more GVCFs produced by in HaplotypeCaller with the `-ERC GVCF` or `-ERC BP_RESOLUTION` settings, containing the samples to joint-genotype.

    The error message I got from GenomicsDBImport confirmed that it has to be a single-sample GVCF file:

    A USER ERROR has occurred: Input GVCF: path/to/vcf_file.vcf.gz was expected to contain a single sample but actually contained 81 samples.

    Seems like there is no chance to use either of those tools.
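
Both error messages above point at properties that can be inspected directly in the file: the number of sample columns, and whether every record lists `<NON_REF>` among its ALT alleles. The sketch below is an illustrative pre-check under those assumptions only. The function names and the toy records are hypothetical, and it does not reproduce the validation that CombineGVCFs or GenomicsDBImport actually perform.

```python
def summarize_vcf(lines):
    """Count samples, records, and records lacking a <NON_REF> ALT allele."""
    n_samples = None
    n_records = 0
    n_without_non_ref = 0
    for line in lines:
        if line.startswith("##") or not line.strip():
            continue  # skip meta-information and blank lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            # Nine fixed columns precede the sample columns.
            n_samples = max(len(fields) - 9, 0)
            continue
        n_records += 1
        # ALT is the fifth column; multiple alleles are comma separated.
        alt_alleles = fields[4].split(",") if len(fields) > 4 else []
        if "<NON_REF>" not in alt_alleles:
            n_without_non_ref += 1
    return {
        "samples": n_samples,
        "records": n_records,
        "records_without_non_ref": n_without_non_ref,
    }


def looks_like_single_sample_gvcf(summary):
    """True if the summary matches what the two error messages ask for."""
    return (
        summary["samples"] == 1
        and summary["records"] > 0
        and summary["records_without_non_ref"] == 0
    )


# Toy inputs (hypothetical): a two-sample VCF and a one-sample GVCF-style file.
FIXED = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"
plain_vcf = [
    "##fileformat=VCFv4.2\n",
    FIXED + "\tS1\tS2\n",
    "1\t1364\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t1/1\n",
]
gvcf_like = [
    "##fileformat=VCFv4.2\n",
    FIXED + "\tS1\n",
    "1\t1364\t.\tA\tG,<NON_REF>\t50\t.\t.\tGT\t0/1\n",
    "1\t1365\t.\tC\t<NON_REF>\t.\t.\tEND=1400\tGT\t0/0\n",
]

print(looks_like_single_sample_gvcf(summarize_vcf(plain_vcf)))
print(looks_like_single_sample_gvcf(summarize_vcf(gvcf_like)))
```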

  • Genevieve Brandt (she/her)

    Hi Artamir, I apologize, I was wrong in my earlier comment. I confirmed with my team that these tools do not accept VCF files. In the future we may support it through GenomicsDBImport, however.

    The options we could think of are bcftools merge or CombineVariants. CombineVariants is a GATK3 tool that we no longer support, and bcftools merge is not part of GATK. You may be able to find information about CombineVariants at our legacy forum page: https://sites.google.com/a/broadinstitute.org/legacy-gatk-forum-discussions/
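
For readers who want to try the bcftools merge route, here is a minimal sketch of how the command line might be assembled. The thread does not show the command that was eventually run, so the file names and the helper function are hypothetical. The sketch also assumes the inputs are already bgzip-compressed and indexed, which bcftools merge expects.

```python
import shlex


def build_bcftools_merge(inputs, output):
    """Build an argument list for merging VCFs that carry different samples."""
    if len(inputs) < 2:
        raise ValueError("bcftools merge needs at least two input files")
    # --output-type z writes a bgzip-compressed VCF; --output names the result.
    return ["bcftools", "merge", "--output-type", "z", "--output", output, *inputs]


# Hypothetical file names; replace with the real, indexed .vcf.gz inputs.
cmd = build_bcftools_merge(["cohort_a.vcf.gz", "cohort_b.vcf.gz"], "merged.vcf.gz")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where bcftools is installed. Building the list separately keeps the command easy to inspect before anything is executed.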

  • Artamir

    Hi Genevieve,

    Thank you so much for your reply! Knowing that GATK 4 won't merge those files already helps a lot. Even more helpful is your hint about bcftools merge, which I just tried successfully!

    Finally, only the now-partial SnpEff annotation needs to be removed so that the merged file can be cleanly re-annotated. Python or awk will easily remove the old annotation.

    Again thank you so much!
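
The annotation clean-up mentioned above could look roughly like the following sketch. It is not the script used in this thread. It assumes the SnpEff annotations live in the INFO keys `ANN` (or `EFF` in older SnpEff versions), `LOF` and `NMD`, and the function name is hypothetical.

```python
SNPEFF_KEYS = ("ANN", "EFF", "LOF", "NMD")  # assumed SnpEff INFO keys


def strip_info_keys(line, keys=SNPEFF_KEYS):
    """Remove the given INFO keys from one VCF line.

    Returns None for ##INFO header lines that define a removed key, so the
    caller can drop them; other lines are returned, records with a trailing
    newline.
    """
    prefix = "##INFO=<ID="
    if line.startswith(prefix):
        info_id = line[len(prefix):].split(",", 1)[0]
        return None if info_id in keys else line
    if line.startswith("#"):
        return line  # all other header lines pass through unchanged
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        return line  # not a complete record; leave untouched
    kept = [
        entry for entry in fields[7].split(";")
        if entry.split("=", 1)[0] not in keys
    ]
    # An empty INFO column is written as "." in VCF.
    fields[7] = ";".join(kept) if kept else "."
    return "\t".join(fields) + "\n"


# Hypothetical record with one non-SnpEff key (AC) and two SnpEff keys.
record = "1\t100\t.\tA\tG\t50\tPASS\tAC=2;ANN=G|missense_variant;LOF=(x)\tGT\t0/1\n"
print(strip_info_keys(record), end="")
```

Applied line by line to the merged file (skipping lines for which the function returns None), this leaves all other INFO fields and the genotype columns untouched, so the file can then be passed through SnpEff again.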

  • Genevieve Brandt (she/her)

    Hi Artamir, thank you for the update and for posting your solution for other users! Glad you found something that worked.

  • ethansmith

    If you have to merge VCF files, try Softaken VCF Merge Software. It merges multiple VCF files into one VCF file without any data loss. This software supports all Microsoft Windows operating systems. Download the free demo version of the software.

    Visit at  : https://www.osttopstapp.com/merge-vcf.html

  • Rashi verma

    Hello Artamir,

    I am facing the same problem that you faced 3 years ago.

    I tried to merge my query.vcf into reference.vcf using the bcftools merge command, but I was unable to obtain the desired results (screenshot attached). I am new to this field. Could you please provide some troubleshooting advice, or share the command or script that you used to merge two VCF files with disjoint lists of samples?

    I would greatly appreciate it if you could review this and share your insights on how to resolve it. Any suggestions or guidance that help me merge the data successfully while maintaining sample integrity would be invaluable.

    Thanks and Regards

    Rashi

  • David Roazen

    Hi Rashi verma,

    Is there a variant record for this position (1:793643) present in both VCFs? If it's only present in one of the two VCFs, I would expect the resulting merged variant record to look like the one you posted, with no-calls for the samples from one of the VCFs.

    Regards,
    David
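
The behaviour described above can be illustrated with a toy model. This is only a sketch of the idea, not how bcftools merge is implemented: the function, the sample names, the genotypes and the second site are all hypothetical. It shows why a site present in only one input ends up with no-calls ("./.") for the samples of the other input.

```python
def merge_genotypes(sites_a, samples_a, sites_b, samples_b):
    """Merge per-site genotype tables from two call sets with different samples.

    sites_x maps (chrom, pos) to {sample: genotype}. A sample whose call set
    has no record at a site receives the no-call genotype "./.".
    """
    merged = {}
    for site in sorted(set(sites_a) | set(sites_b)):
        row = {}
        for samples, sites in ((samples_a, sites_a), (samples_b, sites_b)):
            calls = sites.get(site, {})
            for sample in samples:
                row[sample] = calls.get(sample, "./.")
        merged[site] = row
    return merged


# Hypothetical data: the site 1:793643 exists only in the "query" call set.
query = {("1", 793643): {"Q1": "0/1"}, ("1", 800000): {"Q1": "1/1"}}
reference = {("1", 800000): {"R1": "0/0", "R2": "0/1"}}

merged = merge_genotypes(query, ["Q1"], reference, ["R1", "R2"])
for site, row in merged.items():
    print(site, row)
```

In this toy output the reference samples carry "./." at 1:793643 because the reference call set has no record there, which matches the explanation given in the comment above.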
