Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Germline short variant discovery (SNPs + Indels)


  • Avatar

    I'm not really clear from the above text if VariantRecalibrator and ApplyRecalibration are used in conjunction with CNNScoreVariants or if these are alternatives to one another.  The text appears to say the neural network based approach is experimental but the schematic for the current best practices doesn't even have VariantRecalibrator and ApplyRecalibration anywhere.  And I just noticed a slide in the recent Costa Rica 2020 workshop which shows GATK CNN consistently beating the two VQSR based approaches on that slide.

  • Avatar

    Does the Broad/GATK-team only provide the best practices as WDL scripts now? At one point, there used to be step-by-step tutorials on how to use and apply these tools, but these don't seem to be available anymore.

    If you are no longer providing step-by-step tutorials, I wish that you would either A) make that clear on the site, or B) revisit/reconsider that decision (which would be my personal preference).

    While it's great that WDL scripts are available, if that's all that you provide now, it essentially means that people either have to adopt Cromwell and WDL in order to use the GATK or have to figure out how to reverse-engineer those scripts if that isn't an option or preference, e.g. some groups prefer CWL to WDL.

  • Avatar
    Brian Simison

    These pipelines are a great resource and have been essential to developing our own pipelines. For the Germline Cohort pipeline, I am unclear on the consolidate gVCFs bit. I have 100s of genomes with 25 chromosomes each. Does the pipeline result in 25 separate databases (one per chromosome), or can one generate a single database with all 25 chromosomes?

    Would the following command be a potential approach to generating a single database?

    gatk --java-options "-Xmx200g -Xms200g" \
    GenomicsDBImport \
    --genomicsdb-workspace-path ../Tse_ChrAll_database \
    -L Chr_1 \
    -L Chr_2 \
    -L Chr_3 \
    -L Chr_4 \
    -L Chr_5 \
    --sample-name-map ../gVCFs/Chr1_sample_names_map.txt \
    --sample-name-map ../gVCFs/Chr2_sample_names_map.txt \
    --sample-name-map ../gVCFs/Chr3_sample_names_map.txt \
    --sample-name-map ../gVCFs/Chr4_sample_names_map.txt \
    --sample-name-map ../gVCFs/Chr5_sample_names_map.txt \
    --tmp-dir /tmp
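
A minimal sketch of the single-workspace idea, under assumptions that are not confirmed anywhere in this thread: each sample has one gVCF covering every chromosome, and all samples are listed in a single tab-delimited map file (the function name and file names here are hypothetical). It passes one `--sample-name-map` and repeats `-L` once per chromosome. Setting `GATK=echo` prints the command instead of running it.

```shell
# Build one GenomicsDB workspace spanning several intervals.
# Assumption: a single sample map covers all chromosomes for every sample.
import_all_chromosomes() {
  workspace=$1
  sample_map=$2
  shift 2
  # Turn each remaining argument (a chromosome name) into a "-L <name>" pair.
  n=$#
  while [ "$n" -gt 0 ]; do
    set -- "$@" -L "$1"
    shift
    n=$((n - 1))
  done
  "${GATK:-gatk}" GenomicsDBImport \
    --genomicsdb-workspace-path "$workspace" \
    --sample-name-map "$sample_map" \
    "$@" \
    --tmp-dir /tmp
}

# Example (hypothetical map file name):
# import_all_chromosomes ../Tse_ChrAll_database ../gVCFs/all_samples_map.txt Chr_1 Chr_2 Chr_3
```

If the gVCFs really are split per chromosome, as the per-chromosome map files in the question suggest, this sketch does not apply as written; one workspace per chromosome may be the simpler route in that case.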

  • Avatar
    Ed Ryder

    This guide is only really useful for people who already have in-depth knowledge of all the tools. A step-by-step guide with examples would be greatly appreciated.

  • Avatar
    Matt Snyder

    This is super useful. However, none of the WDL workflows linked at the top actually use VariantRecalibrator or ApplyVQSR. It would be nice if there was an example workflow somewhere using these tools.
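
For readers looking for the kind of example requested here, the following is a rough sketch of SNP-mode VQSR with `VariantRecalibrator` followed by `ApplyVQSR`. It is not taken from any of the linked workflows. The function name, file names, the choice of resources, the prior values, the annotation list, and the 99.7 sensitivity level are all illustrative assumptions that would need tuning for a real callset. Setting `GATK=echo` prints the commands instead of running them.

```shell
# Recalibrate SNPs and apply the resulting filter (illustrative values only).
vqsr_snps() {
  reference=$1; input_vcf=$2; hapmap=$3; dbsnp=$4; prefix=$5
  gatk_cmd=${GATK:-gatk}
  # Step 1: build the recalibration model from the training resources.
  "$gatk_cmd" VariantRecalibrator \
    -R "$reference" -V "$input_vcf" \
    --resource:hapmap,known=false,training=true,truth=true,prior=15.0 "$hapmap" \
    --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 "$dbsnp" \
    -an QD -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
    -mode SNP \
    -O "${prefix}.snps.recal" \
    --tranches-file "${prefix}.snps.tranches" &&
  # Step 2: apply the model at the chosen truth-sensitivity level.
  "$gatk_cmd" ApplyVQSR \
    -R "$reference" -V "$input_vcf" \
    --recal-file "${prefix}.snps.recal" \
    --tranches-file "${prefix}.snps.tranches" \
    --truth-sensitivity-filter-level 99.7 \
    -mode SNP \
    -O "${prefix}.snps.vqsr.vcf.gz"
}

# Example (hypothetical file names):
# vqsr_snps ref.fasta cohort.vcf.gz hapmap.vcf.gz dbsnp.vcf.gz cohort
```

Indels would need a second pass with `-mode INDEL` and indel-appropriate resources.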


  • Avatar

    Hi, I'm new to GATK. I'm trying to use GATK to call germline mutations from my tumor data. I see that there are two modules, per-sample and joint genotyping. Is the per-sample module compatible with the hg19 reference genome? The attached picture shows that the per-sample module only supports hg38. I hope to hear from you!

  • Avatar
    Layne Sadler

    Reference for HaplotypeCaller commands:

    Prior to that, the GATK subtools are found via `gatk -h`

    • SortSam (Picard) = Sorts a SAM, BAM or CRAM file
    • multiple commands for finding and downsampling duplicates =\ 
    • ApplyBQSR = Apply base quality score recalibration
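
The tools listed above can be chained roughly as follows. This is a sketch under assumptions: `MarkDuplicates` stands in for the unnamed duplicate-handling commands, `BaseRecalibrator` is added because `ApplyBQSR` needs a recalibration table, and the function name and file names are hypothetical. Setting `GATK=echo` prints the commands instead of running them.

```shell
# Sort, mark duplicates, and recalibrate base qualities for one BAM.
preprocess_bam() {
  input_bam=$1; reference=$2; known_sites=$3; prefix=$4
  gatk_cmd=${GATK:-gatk}
  # Coordinate-sort the reads (Picard tool bundled with GATK4).
  "$gatk_cmd" SortSam -I "$input_bam" -O "${prefix}.sorted.bam" --SORT_ORDER coordinate &&
  # Flag duplicate reads; assumed choice for the duplicate-handling step.
  "$gatk_cmd" MarkDuplicates -I "${prefix}.sorted.bam" -O "${prefix}.dedup.bam" -M "${prefix}.dup_metrics.txt" --CREATE_INDEX true &&
  # Model systematic base-quality errors against known variant sites.
  "$gatk_cmd" BaseRecalibrator -R "$reference" -I "${prefix}.dedup.bam" --known-sites "$known_sites" -O "${prefix}.recal.table" &&
  # Write a new BAM with recalibrated base qualities.
  "$gatk_cmd" ApplyBQSR -R "$reference" -I "${prefix}.dedup.bam" --bqsr-recal-file "${prefix}.recal.table" -O "${prefix}.recal.bam"
}

# Example (hypothetical file names):
# preprocess_bam sample1.bam ref.fasta dbsnp.vcf.gz sample1
```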
  • Avatar

    Hi, thank you for this best practices documentation. For single-sample variant calling for a non-model organism, is it correct that one can use HaplotypeCaller to generate a gVCF and skip the GenotypeGVCFs step? Or do you still need to run GenotypeGVCFs for this single sample?

    Thank you for clarifying. 
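
As a sketch of the two-step route this question describes: a gVCF produced with `-ERC GVCF` is an intermediate file, and `GenotypeGVCFs` turns it into a final VCF, which works for a single sample as well as for a cohort. The function name and file names below are hypothetical. Setting `GATK=echo` prints the commands instead of running them.

```shell
# Call one sample in GVCF mode, then genotype that single gVCF.
call_single_sample() {
  reference=$1; input_bam=$2; prefix=$3
  gatk_cmd=${GATK:-gatk}
  # Step 1: per-sample calling, emitting reference-confidence records.
  "$gatk_cmd" HaplotypeCaller -R "$reference" -I "$input_bam" -O "${prefix}.g.vcf.gz" -ERC GVCF &&
  # Step 2: convert the intermediate gVCF into final genotype calls.
  "$gatk_cmd" GenotypeGVCFs -R "$reference" -V "${prefix}.g.vcf.gz" -O "${prefix}.vcf.gz"
}

# Example (hypothetical file names):
# call_single_sample ref.fasta sample1.recal.bam sample1
```

Running `HaplotypeCaller` without `-ERC GVCF` writes an ordinary VCF directly, in which case the second step is not needed.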


  • Avatar
    Burair Alsaihati

    Hello Derek,

    I came to update my GATK4 pipeline using the latest best practices and was surprised by the following:

    1. Both universal per-sample and joint genotyping pipeline links point to the same page (per-sample calling).

    2. No mention of the mark duplicates step (is it no longer needed?)

    3. No mention of the BQSR step (is it no longer needed?)

    4. Though VQSR is mentioned in this page, the universal pipeline page lacks this step.

    Please advise regarding these notes, as I am trying to follow the most recent best practices and this article seems to have confused me instead of helping me.

  • Avatar
    Christopher Fields

    Wow, I'm pretty disappointed to see that the really great step-by-step descriptions for GATK's best practices are now gone, and what's in their place now boils down to 'just use Terra'.  Searching for useful information on using command-line GATK is pretty terrible; I don't always need an end-to-end WDL workflow when a single step will do, so now the process is pretty inflexible.

