Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Genomics DB Datastore

4 comments

  • David Roazen

    Hi Arman Seuylemezian,

    Yes, the callset.json file contains the list of samples currently in the datastore.

    When incrementally adding more samples to an existing GenomicsDB, it's important to keep the intervals the same; otherwise you'll run into problems. Can you please confirm that the intervals for the new samples match the intervals for the old samples?

    Regards,

    David

  • Arman Seuylemezian

    Got it. I was able to confirm that the callset.json file in my instance accurately lists all the samples in the datastore. However, there is a discrepancy with the vcfheader.vcf file, which is not being updated with the new samples.

    I can also confirm that the exact same intervals were used when adding the new samples; in fact, the very same BED file was supplied.

  • David Roazen

    Hi Arman Seuylemezian,

    The vcfheader.vcf file is just used internally for bookkeeping purposes by GenomicsDB to store VCF header metadata, and does not contain any sample information.

    When you use a tool like GATK's SelectVariants to extract sites/samples from the GenomicsDB to a VCF, you should see all of your samples appear in the final VCF. Is this not the case?

    Regards,

    David

  • Arman Seuylemezian

    Got it. Yes, when I select variants to extract sites/samples, the resulting VCF does contain all of my samples. For pipelining purposes, though, I would like to extract the list of samples currently in my datastore so that I know which new samples to add. For that, working with the callset.json file seems like the best path forward: I'd rather not extract sites/samples to a VCF just to find out which samples are in the datastore unless it is absolutely necessary.

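For anyone wanting to script the approach discussed above, here is a minimal Python sketch that reads the sample names out of callset.json and reports which candidate samples are not yet in the datastore. The layout of callset.json is an assumption, not something confirmed in this thread: the sketch expects a top-level "callsets" key holding either a list of objects with a "sample_name" field or an object keyed by sample name. Check your own file before relying on it. The function names and the example path are hypothetical.

```python
import json


def samples_in_callset(callset_path):
    """Return the sorted sample names recorded in a GenomicsDB callset.json.

    Assumes a top-level "callsets" key (not confirmed in this thread).
    """
    with open(callset_path) as handle:
        data = json.load(handle)

    callsets = data["callsets"]
    if isinstance(callsets, dict):
        # Assumed layout A: {"callsets": {"sampleA": {...}, ...}}
        names = callsets.keys()
    else:
        # Assumed layout B: {"callsets": [{"sample_name": "sampleA", ...}, ...]}
        names = (entry["sample_name"] for entry in callsets)
    return sorted(names)


def samples_to_add(candidates, callset_path):
    """Return the candidate sample names that are not yet in the datastore."""
    existing = set(samples_in_callset(callset_path))
    return sorted(set(candidates) - existing)


# Hypothetical usage (path and sample names are placeholders):
#   new = samples_to_add(["sampleA", "sampleB"], "my_database/callset.json")
```

This only reads the JSON file, so it avoids running SelectVariants just to recover the sample list. It does not verify that the datastore itself is consistent with callset.json.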
