Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

GenomicsDB

2 comments

  • Daniel J McGoldrick

    Hi, I was running a set of over 1,000 VCF files, and the second step, joint calling (run as a single call), was taking days, even weeks. Doing that on a cloud is not cost-effective. FAIL

    The article mentions:

    "You can use any of the usual SelectVariants modifiers to extract e.g. only a subset of samples, a subset of genomic intervals, and so on."

    but it would be nice if you expanded on larger use cases, e.g. cohorts much larger than one trio. Do we address highly mixed or structured, diverged populations (Finnish, Ashkenazi Jewish, AMR)? Can we skip joint genotyping?

    Locally, on the HPC, we split this into batches of 10 and by chromosome, and it completed in a few hours instead of days or weeks. To make this useful for larger datasets, the documentation needs a lot more content on how to effectively parallelize the joint calling of samples in step 2. Maybe provide more documentation on using GenomicsDB for thousands or tens of thousands of samples? As is, this leads users to believe that they can run one call on the GenomicsDB, and I just don't think that is feasible.

    Thank you for helping us get to a data format other than VCF! I would also like to know more about GATK using Hail, and about alternatives for working with that format, such as Apache Spark. Why did you choose this implementation over another?
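
The per-chromosome part of the split described in the comment above can be sketched roughly as follows. This is only a dry run that prints one GenotypeGVCFs command per chromosome so each can be submitted as a separate job. It does not cover the batches-of-10 sample split. The paths, the `joint_calls` output directory, the `genotype_cmd` helper and the hg38-style contig names are assumptions for illustration, and it assumes the GenomicsDB workspace was imported over intervals that cover each chromosome.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print one GenotypeGVCFs command per chromosome.
# WORKSPACE, REF and OUTDIR are placeholder paths (assumptions).
set -euo pipefail

WORKSPACE="path/to/DB"
REF="reference/hg38.fasta"
OUTDIR="joint_calls"

# Emit the command for a single interval; nothing is executed here.
genotype_cmd() {
  local interval="$1"
  printf 'gatk GenotypeGVCFs -R %s -V gendb://%s -L %s -O %s/%s.vcf.gz\n' \
    "$REF" "$WORKSPACE" "$interval" "$OUTDIR" "$interval"
}

# hg38-style primary contigs (assumed naming): chr1..chr22, chrX, chrY.
for chrom in $(seq 1 22) X Y; do
  genotype_cmd "chr${chrom}"
done
```

Each printed line could then be handed to a scheduler or a tool such as `xargs`, and the per-chromosome VCFs combined afterwards, for instance with `gatk GatherVcfs`. Whether that fits a given cluster setup is something to check locally.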

  • Shahryar Alavi

    The incremental-update argument efficiently speeds up updating the DB:

    gatk GenomicsDBImport --genomicsdb-update-workspace-path path/to/DB --sample-name-map new_samples.map --batch-size 45

    But is it possible to add a similar argument to joint genotyping? For example:

    gatk GenotypeGVCFs --vcf-update path/to/vcf -V gendb://path/to/DB -R reference/hg38.fasta

    I ask because joint genotyping is the bottleneck in cohort scaling.
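
The incremental import command quoted in the comment above reads a sample name map, a tab-separated file with one sample name and one GVCF path per line. A minimal sketch of preparing such a file is shown below. The `write_sample_map` helper, the `gvcfs/` paths and the rule of deriving sample names from file names are assumptions for illustration, and the final line only prints the import command rather than running it.

```shell
#!/usr/bin/env bash
# Sketch: build a sample name map for newly added GVCFs.
set -euo pipefail

# Write "sample_name<TAB>path" lines. Taking the sample name from the
# file name is an assumption; adjust if your naming differs.
write_sample_map() {
  local map_file="$1"; shift
  : > "$map_file"
  local gvcf
  for gvcf in "$@"; do
    printf '%s\t%s\n' "$(basename "$gvcf" .g.vcf.gz)" "$gvcf" >> "$map_file"
  done
}

# Hypothetical new samples to append to the existing workspace.
write_sample_map new_samples.map gvcfs/sampleA.g.vcf.gz gvcfs/sampleB.g.vcf.gz

# Dry run: print the update command from the comment instead of executing it.
echo "gatk GenomicsDBImport --genomicsdb-update-workspace-path path/to/DB --sample-name-map new_samples.map --batch-size 45"
```

This only covers the import side. The `--vcf-update` option in the comment is a feature request, not an existing flag, so the joint genotyping step itself is not sketched here.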
