Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data


Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

GenomicsDB

5 comments

  • Daniel J McGoldrick

    Hi, I was running a cohort of over 1000 VCF files, and the second joint-calling step (run as one call) was taking days, even weeks; doing that on a cloud is not cost-effective.

    The article mentions:

    "You can use any of the usual SelectVariants modifiers to extract e.g. only a subset of samples, a subset of genomic intervals, and so on."

    but it would be nice if you expanded on larger use cases, e.g. cohorts larger than one trio, up to much larger cohorts. Do we address highly mixed or structured, diverged populations, such as Finnish and Ashkenazi Jewish plus AMR? Can we skip joint genotyping?

    Locally on the HPC we split this into batches of 10 and by chromosome (a rough sketch follows below), and it completed in a few hours instead of days or weeks. To make this useful, larger datasets and how to effectively parallelize the joint calling of samples in step 2 need much more coverage. Maybe provide more documentation on using GenomicsDB for thousands or tens of thousands of samples? As written, the article leads the user to believe they can run one call on the GenomicsDB workspace, and I just don't think that is feasible. Thank you for helping us get to a data format other than VCF! I would also like to know more about GATK with Hail, and about alternatives for using that format such as Apache Spark. Why did you choose this implementation over another?
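    Roughly what we run, sketched with placeholder paths, a placeholder batch size, and only chr1 shown; the same pair of commands is repeated per chromosome and submitted as independent HPC jobs:

    # Build one GenomicsDB workspace per chromosome (paths and batch size are placeholders).
    gatk GenomicsDBImport \
        --genomicsdb-workspace-path workspaces/chr1_db \
        --sample-name-map cohort.sample_map \
        --batch-size 10 \
        -L chr1

    # Joint-genotype each per-chromosome workspace as a separate, parallel job.
    gatk GenotypeGVCFs \
        -R reference/hg38.fasta \
        -V gendb://workspaces/chr1_db \
        -L chr1 \
        -O joint_calls/chr1.vcf.gz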

  • Shahryar Alavi

    The incremental-update argument efficiently speeds up refreshing the DB:

    gatk GenomicsDBImport --genomicsdb-update-workspace-path path/to/DB --sample-name-map new_samples.map --batch-size 45

    But is it possible to add a similar argument to joint genotyping? For example:

    gatk GenotypeGVCFs --vcf-update path/to/vcf -V gendb://path/to/DB -R reference/hg38.fasta

    Joint genotyping is the bottleneck for scaling to larger cohorts.
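    For context, a rough sketch of what the workflow currently looks like after the incremental import above: joint genotyping re-processes the whole workspace (the output name is a placeholder), and that is the step which does not scale.

    # Full re-genotyping of the updated workspace; there is currently no
    # incremental counterpart to --genomicsdb-update-workspace-path here.
    gatk GenotypeGVCFs \
        -R reference/hg38.fasta \
        -V gendb://path/to/DB \
        -O joint_calls.vcf.gz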

  • Kshama Aswath

    Hi all,

    I made a GenomicsDB workspace some time ago with specific intervals (using BED files for the regions I need from the Agilent exome padded regions of interest), and now we have decided to explore another region that is not among the intervals I initially fed into GenomicsDB. Based on what I have read, it looks like we can incrementally add samples to the GenomicsDB but cannot incrementally add genomic intervals.

    So I think I will have to make another GenomicsDB with the new interval, using the same sample list as before. My question is whether remaking the GenomicsDB with the new interval is the only solution, or whether there is a shortcut to add the new interval to the existing GenomicsDB.

    Thank you!
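    If rebuilding is the only option, the workaround I have in mind is roughly the following (the paths, sample map, and BED file are placeholders; it assumes a separate workspace covering only the new region, reusing the original sample list):

    # Second workspace for the new region only, built from the same samples.
    gatk GenomicsDBImport \
        --genomicsdb-workspace-path path/to/new_region_db \
        --sample-name-map same_samples.map \
        -L new_region.bed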

  • Jonathan Klonowski

    It is not clear from this article what the file extension of the GenomicsDB object is, where it is stored, how to find it, or how to identify it.

  • Yun Gyeong Lee

    Hi, I am going to build a GenomicsDB with 25K samples and do joint calling. Is it possible to do that with that many samples? If so, what is the most efficient way?

    Thank you.

