Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

GenomicsDB

3 comments

  • Daniel J McGoldrick

    Hi, I was running a cohort of over 1000 VCF files, and the second step, joint calling (as a single call), was taking days, even weeks. Doing that on a cloud is not cost-effective. FAIL

    The documentation mentions:

    "You can use any of the usual SelectVariants modifiers to extract e.g. only a subset of samples, a subset of genomic intervals, and so on."

    but it would be nice if you expanded on larger use cases, e.g. cohorts much larger than one trio. Do we address highly mixed or structured, diverged populations (Finnish, Ashkenazi Jewish, AMR)? Can we skip joint genotyping?

    Locally on the HPC we split this into batches of 10 and by chromosome, and it completed in a few hours, versus days or weeks. To make this useful for larger datasets, the documentation needs a lot more content on how to effectively parallelize the joint calling of samples in step 2. Maybe provide more documentation on the use of GenomicsDB for thousands or tens of thousands of samples? As is, this leads the user to believe that they can run one call on the GenomicsDB, and I just don't think that is feasible. Thank you for helping us get to a data format other than VCF! I would also like to know more about GATK using Hail, and about alternatives for using that format, like Apache Spark. Why did you choose this implementation over another?
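The per-chromosome split described in this comment can be sketched roughly as follows. This is a minimal illustration, not the commenter's actual script: the chromosome list, the `joint_calls` output directory, and the `run` helper are hypothetical, the reference and workspace paths are borrowed from the example commands elsewhere in this thread, and the sketch assumes the GenomicsDB workspace was imported over the listed chromosomes. With `DRY_RUN=1` (the default here) the commands are only printed, not executed.

```shell
#!/usr/bin/env bash
# Sketch: scatter joint genotyping by chromosome. All paths are placeholders.
REF="reference/hg38.fasta"   # reference path as used elsewhere in this thread
DB="path/to/DB"              # GenomicsDB workspace (assumed to cover the chromosomes below)
OUT_DIR="joint_calls"        # hypothetical output directory
DRY_RUN="${DRY_RUN:-1}"      # 1 = print commands only; 0 = actually run gatk

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# Joint-genotype one chromosome from the shared workspace.
genotype_chrom() {
  local chrom="$1"
  run gatk GenotypeGVCFs -R "$REF" -V "gendb://$DB" -L "$chrom" -O "$OUT_DIR/$chrom.vcf.gz"
}

[ "$DRY_RUN" = "1" ] || mkdir -p "$OUT_DIR"

# Each chromosome is independent, so the calls can run concurrently
# (here as background jobs; on an HPC, typically one scheduler job each).
for chrom in chr20 chr21 chr22; do
  genotype_chrom "$chrom" &
done
wait
```

The per-chromosome outputs would still need to be combined afterwards, and the sample batching the commenter mentions is not covered by this sketch.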

  • Shahryar Alavi

    The incremental-import argument efficiently speeds up updating the DB:

    gatk GenomicsDBImport --genomicsdb-update-workspace-path path/to/DB --sample-name-map new_samples.map --batch-size 45

    But is it possible to add a similar argument to joint genotyping? E.g.:

    gatk GenotypeGVCFs --vcf-update path/to/vcf -V gendb://path/to/DB -R reference/hg38.fasta

    Joint genotyping is the bottleneck when scaling up a cohort.
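The incremental import shown in this comment reads its new samples from a sample name map, which is a tab-separated file with one sample name and one gVCF path per line. A minimal sketch of building such a map is given below. It assumes (hypothetically) that the new gVCFs sit in one directory, are named `<sample>.g.vcf.gz`, and that the file name prefix is the desired sample name; the `write_sample_map` helper and the `new_gvcfs` directory are illustrative names, not part of GATK.

```shell
#!/usr/bin/env bash
# Sketch: build a sample name map for GenomicsDBImport from a directory of gVCFs.
# Assumption: files are named <sample>.g.vcf.gz and the prefix is the sample name.

# write_sample_map DIR: print "sample<TAB>path" for every *.g.vcf.gz in DIR.
write_sample_map() {
  local dir="$1" f
  for f in "$dir"/*.g.vcf.gz; do
    [ -e "$f" ] || continue   # skip the unexpanded pattern when nothing matches
    printf '%s\t%s\n' "$(basename "$f" .g.vcf.gz)" "$f"
  done
}

# Example use (hypothetical directory name), followed by the update command
# quoted in the comment above:
#   write_sample_map new_gvcfs > new_samples.map
#   gatk GenomicsDBImport --genomicsdb-update-workspace-path path/to/DB \
#        --sample-name-map new_samples.map --batch-size 45
```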

  • Kshama Aswath

    I am working with the GATK Best Practices pipeline and am at the point of making a joint call on VCFs across all the samples in my cohort. I intend to do a germline analysis.

    I may be overthinking here, but I had these questions:

    1. I want to screen variants in certain genes only. Do I still need to joint call VCFs that contain not only variants in the genes of interest but also variants in other regions? What I mean is, can I just extract variants from the chromosomal locations that span my genes of interest and use those VCFs for the joint call? That would probably be quicker.

    OR

    2. Do I need to go ahead and do a joint call on the entire VCF across my cohort, take it through VQSR, and then extract the variants in my region of interest? This seems like overkill, but I am not sure whether I would miss anything if I strictly do the selective extraction described above.

    Why spend computation time and resources if it is not necessary? Any guidance is greatly appreciated!

    Thank you in advance for your time. I appreciate your intent to help!
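For reference, the two options in this question could be expressed on the command line roughly as below. This is only a sketch of how each option would be invoked, not a recommendation of one over the other: the BED file `genes_of_interest.bed`, the output file names, the input `cohort.filtered.vcf.gz`, and the `run` helper are all hypothetical, and the reference and workspace paths are borrowed from the example commands elsewhere in this thread. With `DRY_RUN=1` (the default here) the commands are only printed.

```shell
#!/usr/bin/env bash
# Sketch: the two options from the question, restricted to genes of interest.
REF="reference/hg38.fasta"       # reference path as used elsewhere in this thread
DB="path/to/DB"                  # GenomicsDB workspace path as used elsewhere in this thread
GENES="genes_of_interest.bed"    # hypothetical BED file of gene intervals
DRY_RUN="${DRY_RUN:-1}"          # 1 = print commands only; 0 = actually run gatk

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# Option 1: joint-genotype only the intervals that span the genes of interest.
genotype_genes_only() {
  run gatk GenotypeGVCFs -R "$REF" -V "gendb://$DB" -L "$GENES" -O genes_only.vcf.gz
}

# Option 2: take an already joint-called and filtered cohort VCF ($1)
# and extract the gene intervals from it afterwards.
subset_full_callset() {
  run gatk SelectVariants -R "$REF" -V "$1" -L "$GENES" -O genes_subset.vcf.gz
}

genotype_genes_only
subset_full_callset cohort.filtered.vcf.gz
```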
