Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

DRAGEN-GATK Webinar - Call for questions!

10 comments

  • Official comment
    Genevieve Brandt (she/her)

    DRAGEN-GATK webinar recording: https://www.youtube.com/watch?v=zpw1TIFTjUI. For more information on DRAGEN-GATK, please see this page: https://gatk.broadinstitute.org/hc/en-us/articles/360045944831.

  • John Didion

    Please provide details of how the benchmark in the blog post was performed. Which reference genome was used? Was post-alt processing run on BWA alignments prior to variant calling?

  • Mar Gonzàlez-Porta

    Congratulations on this big milestone!

    Here are a few questions from our team:

    • Does DRAGEN-GATK support low-coverage WGS (5X-1X) / has that been tested?
    • DRAGEN provides a forced genotyping setting to ensure that a user-provided set of SNPs are always included in the VCF output. Is this possible here?
    • Are there plans to expand DRAGEN-GATK to large variants?

    Thanks in advance.

  • Quentin Chartreux

    First of all, congratulations on all of this work. I have tested the pipeline (not strictly as a pipeline, but with the same commands) and the results are clearly excellent!
    My main questions are:

    • How do I do joint genotyping from GVCFs produced by DRAGEN-GATK? Just run GenomicsDBImport and GenotypeGVCFs as usual?
    • Can the GVCFs produced by this pipeline be used directly by DRAGEN for joint genotyping?
    • Have you measured the number of CPU-hours needed to process a 30X human genome? What is the cost difference between DRAGEN-GATK and Illumina DRAGEN? For example, on AWS, Illumina DRAGEN plus the AWS cost is $18.40/hour, so around $9/genome. Alignment with DRAGMAP alone seems to take about 200 CPU-hours, so is it really more economical?

  • Genevieve Brandt (she/her)

    John Didion, here is the answer to your question from the webinar, "How was the benchmarking for DRAGEN-GATK performed? Which reference was used and was post-alt processing run on BWA samples prior to variant calling?"

    "The data was processed and benchmarked with the masked version of the hg38 reference. The masking is dealing with some mapping improvements in the alt contigs and decoy contigs of the hg38 reference genome. Illumina is planning to provide some more documentation about this." - Michael Gatzen, Broad

    "There is a blog post coming soon. We will share it with the Broad so that it can be distributed once it is available." - Heidi Norton, Illumina

  • Genevieve Brandt (she/her)

    Quentin Chartreux, here is the answer to your question from the webinar, "How do I do joint calling with GVCFs produced by DRAGEN-GATK? Just run GenomicsDBImport and GenotypeGVCFs as usual?"

    "Right now we have not fully tested joint genotyping between DRAGEN-GATK and DRAGEN. We know that some of the annotations do not match yet. The annotations are not guaranteed to be 100% compatible or functionally equivalent to each other. The calls, however, are. If you are just interested in the calls, you can absolutely combine DRAGEN-GATK and DRAGEN with the configurations we have shown. We are working on getting those annotations aligned in the future." - Michael Gatzen, Broad

    "Joint genotyping for DRAGEN and DRAGEN-GATK results is not something we have tested. We know about the annotation differences, those are not super important. This is a use at your own risk approach. Though given that they are functionally equivalent and the results are very similar, it would probably work. Barring some issues with annotations." - James Emery, Broad

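For reference, the standard GATK joint-genotyping workflow that the question names (GenomicsDBImport followed by GenotypeGVCFs) can be sketched as below. This is a dry run that only prints the commands; it is not a tested DRAGEN-GATK recipe, and per the answers above, combining DRAGEN and DRAGEN-GATK GVCFs is untested. The file names, workspace path, and interval are hypothetical placeholders.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the GATK commands instead of executing them.
# All paths, sample names, and the interval below are hypothetical.
set -euo pipefail

REF="reference.fasta"                          # hypothetical reference FASTA
WORKSPACE="joint_db"                           # hypothetical GenomicsDB workspace
INTERVAL="chr20"                               # hypothetical interval to import
GVCFS=("sampleA.g.vcf.gz" "sampleB.g.vcf.gz")  # hypothetical per-sample GVCFs

# Step 1: consolidate per-sample GVCFs, adding one "-V <gvcf>" pair per sample.
build_import_cmd() {
  local cmd=(gatk GenomicsDBImport --genomicsdb-workspace-path "$WORKSPACE" -L "$INTERVAL")
  local g
  for g in "${GVCFS[@]}"; do
    cmd+=(-V "$g")
  done
  echo "${cmd[*]}"
}

# Step 2: joint-genotype from the consolidated workspace.
build_genotype_cmd() {
  echo "gatk GenotypeGVCFs -R $REF -V gendb://$WORKSPACE -O joint.vcf.gz"
}

build_import_cmd
build_genotype_cmd
```

Printing the commands first makes it easy to review the sample list before committing compute time; removing the `echo` wrappers would run the tools for real.
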
    "How much processing power do these pipelines need? If we are actually going to be running DRAGEN-GATK or DRAGEN on a 30X human genome, what's the price going to be like? Is it going to be more economical to use DRAGEN or DRAGEN-GATK?"

    "I am not 100% sure the costs of running DRAGEN. For a 30X genome on the DRAGEN-GATK side, running the whole pipeline is between $5-$8 per sample, on the cloud." - Michael Gatzen, Broad

    "On our Illumina connected analytics platform, it costs $5 per hour to run DRAGEN. It's a little slower on the cloud than on premise. It takes about an hour to do that fully featured genome with all the callers. It's around $5-$7 on our Illumina Connected Analytics Platform." -Heidi Norton, Illumina

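The figures quoted in this thread can be put side by side with a little arithmetic. The sketch below uses only numbers from the comments ($18.40/hour and about $9/genome for DRAGEN on AWS, $5-$8 per sample for DRAGEN-GATK) and treats the commenter's alignment figure as roughly 200 CPU-hours, which is an interpretation of the original wording. The function names are hypothetical, and the results are illustrative, not measured.

```shell
#!/usr/bin/env bash
# Back-of-the-envelope cost arithmetic using figures quoted in the thread.
set -euo pipefail
export LC_ALL=C   # force "." as the decimal separator in awk output

# Hours per genome implied by a per-genome cost and an hourly rate.
implied_hours() {          # usage: implied_hours <cost_per_genome> <rate_per_hour>
  awk -v c="$1" -v r="$2" 'BEGIN { printf "%.2f\n", c / r }'
}

# Price per CPU-hour at which a CPU-only run hits a target per-genome cost.
break_even_cpu_price() {   # usage: break_even_cpu_price <target_cost> <cpu_hours>
  awk -v c="$1" -v h="$2" 'BEGIN { printf "%.3f\n", c / h }'
}

implied_hours 9 18.40        # DRAGEN on AWS: about $9 at $18.40/hour
break_even_cpu_price 8 200   # $8 spread over an assumed 200 CPU-hours
break_even_cpu_price 5 200   # $5 spread over an assumed 200 CPU-hours
```

Under these assumptions, 200 CPU-hours would have to cost between $0.025 and $0.040 per CPU-hour to fit in a $5-$8 budget, and the quoted DRAGEN price implies roughly half an hour of instance time per genome.
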
  • Genevieve Brandt (she/her)

    Mar Gonzàlez-Porta, thank you! Here are the answers to your questions:

    "Does DRAGEN-GATK work for low coverage data, 1X-5X?"

    "There shouldn't be any special considerations for DRAGEN-GATK. This hasn't been in our testing framework, looking at very very low coverage data like that. For DRAGEN-GATK, it's about the same as running regular GATK for low coverage data. There's nothing uniquely worse or better about the algorithms for DRAGEN that should affect it. It might even have overall better scientific results, which is obviously what we expected, so that's why we went through this process. That is something you'll have to evaluate for yourself." - James Emery, Broad

    "DRAGEN provides a forced genotyping setting to ensure that a user provided set of SNPs are always provided in the VCF output. Is that possible with DRAGEN-GATK?"

    "Yes. There's a feature in GATK, in both HaplotypeCaller and DRAGEN-GATK, called genotype given alleles (--alleles). It lets you input a VCF file and it will force that VCF file to be genotyped at all the sites where it is relevant. That is a feature and it should work just fine." -James Emery, Broad

    "Is that described in the technical documentation?"

    "It's certainly in our technical documentation for GATK HaplotypeCaller." -James Emery, Broad *Note: check out tool documentation for HaplotypeCaller.*

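As a minimal sketch of the forced-genotyping option described above, the command below passes a VCF of sites to HaplotypeCaller through --alleles. It is a dry run that only prints the command. All file names are hypothetical placeholders, and the additional options a full DRAGEN-GATK run would use are deliberately left out, so check the HaplotypeCaller tool documentation for your GATK version before relying on it.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints a HaplotypeCaller command that force-genotypes
# the alleles listed in a user-supplied VCF. All paths are hypothetical.
set -euo pipefail

REF="reference.fasta"            # hypothetical reference FASTA
BAM="sample.bam"                 # hypothetical aligned reads
ALLELES="forced_sites.vcf.gz"    # hypothetical VCF of sites to force-genotype
OUT="sample.forced.vcf.gz"       # hypothetical output VCF

# --alleles is the option named in the answer above.
force_genotype_cmd() {
  echo "gatk HaplotypeCaller -R $REF -I $BAM --alleles $ALLELES -O $OUT"
}

force_genotype_cmd
```
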
    "Are there plans to expand DRAGEN-GATK to large variants?"

    "My question for this user is what they mean by large variants in this case? Do they mean structural events? In which case, there are tools that work on structural events and there might be news in the future about collaborations stemming from this about that. But we don't have anything to say on that now. There are other tools that are better tools for structural variants than DRAGEN-GATK. In terms of DRAGEN-GATK on slightly longer events, insertions, deletions, in the range of 100+ bases, there are options in GATK and in DRAGEN-GATK that might improve performance there. Those events are fairly rare and we haven't spent a lot of time evaluating how we perform on those. That is sort of tinkering that would have to happen on the side for your own application." -James Emery, Broad. *Note: For structural variants, on the GATK side we recommend GATK-SV.*

  • Genevieve Brandt (she/her)

    The blog post from Illumina regarding the hg38 masked reference is live. You can take a look here: https://www.illumina.com/science/genomics-research/articles/dragen-demystifying-reference-genomes.html

  • Archana Verma

    Hi,

    I am using DRAGEN-GATK for CNV calling on WGS data from S. pombe. However, S. pombe has a haploid genome, so I am getting an error.

    I used these commands:

    dragen --build-hash-table true --output-directory reference/ --ht-reference reference/Schizosaccharomyces_pombe.ASM294v2.dna.toplevel.fa --enable-sv true --enable-cnv true

    dragen -r reference/

    dragen -r reference/ -1 sample1_S176_R1.fastq.gz -2 sample1_S176_R2.fastq.gz --output-directory output_sample1/ --output-file-prefix sample1_S176 --enable-map-align-output true --enable-bam-indexing true --RGID S176 --RGLB SANGER --RGPL ILLUMINA --RGSM sample1_S176 --enable-duplicate-marking true --enable-variant-caller true  --enable-sv true --enable-cnv true

    but I get this error:

    Binner Configuration
    ==================================================================
      predicted_output_size :  200GB
      max_bin_size          :  900MB
      total_system_memory   :  251GB
      total_bin_memory      :  20GB
      sort_buffer_size      :  1024kB
      predicted_num_bins    :  228
      max_num_bins          :  8192
      num_bins              :  228
      partition key         :  POSITION
      generate_SA           :  1
      sort_threads          :  4
      dedup_threads         :  4
      dbam2bam_threads      :  8

    Duplicate Marking Initialization Threads: 45
    DRAGEN registers saved to /var/log/dragen/dragen_info_1647411833477_48312.log
    Hang diagnostic saved to /var/log/dragen/hang_diag_1647411833477_48312.txt
    Dividing sort intermediate data into 228 partitions for mapped, and an equal number for unmapped, records. 
    pstack saved to /var/log/dragen/pstack_1647411833817_48312.log

    ==================================================================
    Writing out target counts
    ==================================================================
    WARNING: Zero reads were counted - check that the correct reference was used for 
    Generating raw counts to output_sample1/sample1_S176.target.counts.gz

    ==================================================================
    Sex Genotyper
    ==================================================================
    Predicted sex of sample
      sample1_S176: UNDETERMINED        0

    ==================================================================
    Performing GC bias correction
    ==================================================================
    Initial crash reason: Assertion failed in ../src/host/cnv/calculate_target_counts.cpp line 305 -- m_targets->GetNumTargets() != 0 -- Number of target intervals cannot be 0
    Assertion failed in ../src/host/cnv/gc_bias.cpp line 99 -- rowNum > 0 -- 
    Dumping diagnostics....

    Fatal error: Assertion failed in ../src/host/cnv/gc_bias.cpp line 99 -- rowNum > 0 -- 

    ***************************************************************************************
    Please run sosreport to collect diagnostic and configuration information:

       sudo sosreport --batch

    This requires root privileges and may take several minutes to execute.  When completed,
    sosreport generates a compressed file in /tmp or /var/tmp.  The location of this file
    is given in the script output.  For example:

      Your sosreport has been generated and saved in:
        /tmp/sosreport-hostname.companyname.com-20160526151939.tar.xz

    Please send this report to your Illumina support representative.
    ***************************************************************************************

    Aborting the application - it may take several minutes to dump the core file
    FATAL: Caught signal Aborted (6)
    Dumping diagnostics....

    Resetting the DRAGEN Bio-IT processor and stopping all software threads
    Aborted

     

    Can anyone help with how to change the setting to haploid and how to resolve this error?

    Thank you!

     

  • Genevieve Brandt (she/her)

    Archana Verma, please open a new post for your question.

Post is closed for comments.
