Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data


Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Germline short variant discovery (SNPs + Indels)

3 comments

  • WVNicholson

    I'm not really clear from the above text whether VariantRecalibrator and ApplyRecalibration are used in conjunction with CNNScoreVariants or whether these are alternatives to one another. The text appears to say the neural-network-based approach is experimental, but the schematic for the current best practices doesn't have VariantRecalibrator or ApplyRecalibration anywhere. And I just noticed a slide in the recent Costa Rica 2020 workshop that shows GATK CNN consistently beating the two VQSR-based approaches.

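    As far as the published workflows go, VQSR (VariantRecalibrator followed by ApplyVQSR, the GATK4 name for ApplyRecalibration) and the CNN route (CNNScoreVariants followed by FilterVariantTranches) are alternative filtering paths rather than steps used together, with VQSR aimed at joint-called cohorts and the CNN tools at single samples. A rough sketch of each path follows; the file names (reference.fasta, cohort.vcf.gz, hapmap.vcf.gz, etc.) are placeholders, and version-dependent details such as the VariantRecalibrator --resource syntax are omitted.

    # Cohort path: VQSR (the hapmap/omni/1000G/dbsnp --resource arguments are
    # omitted here; their exact syntax depends on the GATK version)
    gatk VariantRecalibrator \
    -R reference.fasta \
    -V cohort.vcf.gz \
    -an QD -an MQ -an FS -an SOR -an MQRankSum -an ReadPosRankSum \
    -mode SNP \
    --tranches-file cohort_snps.tranches \
    -O cohort_snps.recal

    gatk ApplyVQSR \
    -R reference.fasta \
    -V cohort.vcf.gz \
    --recal-file cohort_snps.recal \
    --tranches-file cohort_snps.tranches \
    --truth-sensitivity-filter-level 99.7 \
    -mode SNP \
    -O cohort_snps.vqsr.vcf.gz

    # Single-sample path: CNN scoring followed by tranche filtering
    gatk CNNScoreVariants \
    -R reference.fasta \
    -V sample.vcf.gz \
    -O sample.cnn.vcf.gz

    gatk FilterVariantTranches \
    -V sample.cnn.vcf.gz \
    --resource hapmap.vcf.gz \
    --resource mills.vcf.gz \
    --info-key CNN_1D \
    --snp-tranche 99.95 \
    --indel-tranche 99.4 \
    -O sample.cnn.filtered.vcf.gz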
  • pwaltman

    Does the Broad/GATK-team only provide the best practices as WDL scripts now? At one point, there used to be step-by-step tutorials on how to use and apply these tools, but these don't seem to be available anymore.

    If you are no longer providing step-by-step tutorials, I wish that you would either A) make that clear on the site, or B) revisit/reconsider that decision (which would be my personal preference).

    While it's great that WDL scripts are available, if that's all you provide now, it essentially means that people either have to adopt Cromwell and WDL in order to use the GATK, or have to figure out how to reverse-engineer those scripts if that isn't an option or preference, e.g. some groups prefer CWL to WDL.

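    As a rough illustration of what adopting Cromwell entails, a published best-practices WDL can be run locally with something like the following; the jar version, workflow file, and inputs JSON are placeholders.

    # Hypothetical local Cromwell run; file names are placeholders
    java -jar cromwell-XX.jar run workflow.wdl --inputs inputs.json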
  • Brian Simison

    These pipelines are a great resource and have been essential to developing our own pipelines. For the Germline Cohort pipeline, I am unclear on the consolidate-gVCFs step. I have hundreds of genomes with 25 chromosomes each. Does the pipeline result in 25 separate databases (one per chromosome), or can one generate a single database with all 25 chromosomes?

    Would the following command be a potential approach to generating a single database?

    gatk --java-options "-Xmx200g -Xms200g" \
    GenomicsDBImport \
    --genomicsdb-workspace-path ../Tse_ChrAll_database \
    -L Chr_1 \
    -L Chr_2 \
    -L Chr_3 \
    -L Chr_4 \
    -L Chr_5 \
    --sample-name-map ../gVCFs/Chr1_sample_names_map.txt \
    --sample-name-map ../gVCFs/Chr2_sample_names_map.txt \
    --sample-name-map ../gVCFs/Chr3_sample_names_map.txt \
    --sample-name-map ../gVCFs/Chr4_sample_names_map.txt \
    --sample-name-map ../gVCFs/Chr5_sample_names_map.txt \
    --tmp-dir /tmp

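    On the single-database question: as far as I know, GenomicsDBImport takes only one --sample-name-map argument (repeating the flag does not combine several maps) but accepts multiple -L intervals, so a single workspace can cover several chromosomes provided each sample's gVCF spans all of them. A minimal sketch under that assumption follows, where all_samples_map.txt is a hypothetical combined map listing one genome-wide gVCF per sample.

    # A minimal sketch, assuming one genome-wide gVCF per sample and a single
    # combined sample map (all_samples_map.txt is a placeholder name with lines
    # like: sample1<TAB>/path/to/sample1.g.vcf.gz)
    gatk --java-options "-Xmx200g -Xms200g" \
    GenomicsDBImport \
    --genomicsdb-workspace-path ../Tse_ChrAll_database \
    -L Chr_1 -L Chr_2 -L Chr_3 -L Chr_4 -L Chr_5 \
    --sample-name-map ../gVCFs/all_samples_map.txt \
    --tmp-dir /tmp

    GenotypeGVCFs could then read the resulting workspace via -V gendb://../Tse_ChrAll_database.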
