DRAGEN-GATK Webinar - Call for questions!
Please post your questions in this GATK Forum topic for the upcoming DRAGEN-GATK Webinar on December 2nd at 10 am EST. We will use your topics of interest to plan what we cover during our Q&A portion of the talk.
Join the webinar at this ZOOM LINK. Please see the DRAGEN-GATK landing page for more details regarding DRAGEN-GATK.
DRAGEN-GATK webinar recording: https://www.youtube.com/watch?v=zpw1TIFTjUI. For more information on DRAGEN-GATK, please see this page: https://gatk.broadinstitute.org/hc/en-us/articles/360045944831.
Please provide details of how the benchmark in the blog post was performed. Which reference genome was used? Was post-alt processing run on BWA alignments prior to variant calling?
Congratulations on this big milestone!
Here are a few questions from our team:
- Does DRAGEN-GATK support low-coverage WGS (5X-1X) / has that been tested?
- DRAGEN provides a forced genotyping setting to ensure that a user-provided set of SNPs are always included in the VCF output. Is this possible here?
- Are there plans to expand DRAGEN-GATK to large variants?
Thanks in advance.
First of all, congratulations on all of this work. I have tested the pipeline (actually not really as a pipeline, but the commands are the same) and it is clear that the results are excellent!
My main questions are:
How do I do joint genotyping from GVCFs produced by DRAGEN-GATK? Just run GenomicsDBImport and GenotypeGVCFs as usual?
Can the GVCFs produced by this pipeline be used directly by DRAGEN for joint genotyping?
Have you measured the number of CPU-hours needed to process a 30X human genome? What is the cost difference between using DRAGEN-GATK and Illumina DRAGEN? For example, on AWS, Illumina DRAGEN plus the AWS cost is $18.40/hour, so around $9/genome. Alignment with DRAGMAP alone seems to take 200 CPU-hours, so is it really more economical?
John Didion, here is the answer to your question from the webinar, "How was the benchmarking for DRAGEN-GATK performed? Which reference was used and was post-alt processing run on BWA samples prior to variant calling?"
"The data was processed and benchmarked with the masked version of the hg38 reference. The masking is dealing with some mapping improvements in the alt contigs and decoy contigs of the hg38 reference genome. Illumina is planning to provide some more documentation about this." - Michael Gatzen, Broad
"There is a blog post coming soon. We will share it with the Broad so that it can be distributed once it is available." - Heidi Norton, Illumina
Quentin Chartreux, here is the answer to your question from the webinar, "How do I do joint calling with GVCFs produced by DRAGEN-GATK? Just run GenomicsDBImport and GenotypeGVCFs as usual?"
"Right now we have not fully tested joint genotyping between DRAGEN-GATK and DRAGEN. We know that some of the annotations do not match yet. The annotations are not guaranteed to be 100% compatible or functionally equivalent to each other. The calls, however, are. If you are just interested in the calls, you can absolutely combine DRAGEN-GATK and DRAGEN with the configurations we have shown. We are working on getting those annotations aligned in the future." - Michael Gatzen, Broad
"Joint genotyping for DRAGEN and DRAGEN-GATK results is not something we have tested. We know about the annotation differences; those are not super important. This is a use-at-your-own-risk approach. Though given that they are functionally equivalent and the results are very similar, it would probably work, barring some issues with annotations." - James Emery, Broad
"How much processing power do these pipelines need? If we are actually going to be running DRAGEN-GATK or DRAGEN on a 30X human genome, what's the price going to be like? Is it going to be more economical to use DRAGEN or DRAGEN-GATK?"
"I am not 100% sure the costs of running DRAGEN. For a 30X genome on the DRAGEN-GATK side, running the whole pipeline is between $5-$8 per sample, on the cloud." - Michael Gatzen, Broad
"On our Illumina connected analytics platform, it costs $5 per hour to run DRAGEN. It's a little slower on the cloud than on premise. It takes about an hour to do that fully featured genome with all the callers. It's around $5-$7 on our Illumina Connected Analytics Platform." -Heidi Norton, Illumina
Mar Gonzàlez-Porta, thank you! Here are the answers to your questions:
"Does DRAGEN-GATK work for low coverage data, 1X-5X?"
"There shouldn't be any special considerations for DRAGEN-GATK. This hasn't been in our testing framework, looking at very very low coverage data like that. For DRAGEN-GATK, it's about the same as running regular GATK for low coverage data. There's nothing uniquely worse or better about the algorithms for DRAGEN that should affect it. It might even have overall better scientific results, which is obviously what we expected, so that's why we went through this process. That is something you'll have to evaluate for yourself." - James Emery, Broad
"DRAGEN provides a forced genotyping setting to ensure that a user provided set of SNPs are always provided in the VCF output. Is that possible with DRAGEN-GATK?"
"Yes. There's a feature in GATK, in both HaplotypeCaller and DRAGEN-GATK, called genotype given alleles (--alleles). It lets you input a VCF file and it will force that VCF file to be genotyped at all the sites where it is relevant. That is a feature and it should work just fine." -James Emery, Broad
"Is that described in the technical documentation?"
"It's certainly in our technical documentation for GATK HaplotypeCaller." -James Emery, Broad *Note: check out tool documentation for HaplotypeCaller.*
"Are there plans to expand DRAGEN-GATK to large variants?"
"My question for this user is what they mean by large variants in this case? Do they mean structural events? In which case, there are tools that work on structural events and there might be news in the future about collaborations stemming from this about that. But we don't have anything to say on that now. There are other tools that are better tools for structural variants than DRAGEN-GATK. In terms of DRAGEN-GATK on slightly longer events, insertions, deletions, in the range of 100+ bases, there are options in GATK and in DRAGEN-GATK that might improve performance there. Those events are fairly rare and we haven't spent a lot of time evaluating how we perform on those. That is sort of tinkering that would have to happen on the side for your own application." -James Emery, Broad. *Note: For structural variants, on the GATK side we recommend GATK-SV.*
The blog post from Illumina regarding the hg38 masked reference is live. You can take a look here: https://www.illumina.com/science/genomics-research/articles/dragen-demystifying-reference-genomes.html
I am using DRAGEN-GATK for CNV calling on WGS data from S. pombe. However, S. pombe has a haploid genome, and I am getting an error.
I used these commands:
dragen --build-hash-table true --output-directory reference/ --ht-reference reference/Schizosaccharomyces_pombe.ASM294v2.dna.toplevel.fa --enable-sv true --enable-cnv true
dragen -r reference/
dragen -r reference/ -1 sample1_S176_R1.fastq.gz -2 sample1_S176_R2.fastq.gz --output-directory output_sample1/ --output-file-prefix sample1_S176 --enable-map-align-output true --enable-bam-indexing true --RGID S176 --RGLB SANGER --RGPL ILLUMINA --RGSM sample1_S176 --enable-duplicate-marking true --enable-variant-caller true --enable-sv true --enable-cnv true
but I get this error:
predicted_output_size : 200GB
max_bin_size : 900MB
total_system_memory : 251GB
total_bin_memory : 20GB
sort_buffer_size : 1024kB
predicted_num_bins : 228
max_num_bins : 8192
num_bins : 228
partition key : POSITION
generate_SA : 1
sort_threads : 4
dedup_threads : 4
dbam2bam_threads : 8
Duplicate Marking Initialization Threads: 45
DRAGEN registers saved to /var/log/dragen/dragen_info_1647411833477_48312.log
Hang diagnostic saved to /var/log/dragen/hang_diag_1647411833477_48312.txt
Dividing sort intermediate data into 228 partitions for mapped, and an equal number for unmapped, records.
pstack saved to /var/log/dragen/pstack_1647411833817_48312.log
Writing out target counts
WARNING: Zero reads were counted - check that the correct reference was used for
Generating raw counts to output_sample1/sample1_S176.target.counts.gz
Predicted sex of sample
sample1_S176: UNDETERMINED 0
Performing GC bias correction
Initial crash reason: Assertion failed in ../src/host/cnv/calculate_target_counts.cpp line 305 -- m_targets->GetNumTargets() != 0 -- Number of target intervals cannot be 0
Assertion failed in ../src/host/cnv/gc_bias.cpp line 99 -- rowNum > 0 --
Fatal error: Assertion failed in ../src/host/cnv/gc_bias.cpp line 99 -- rowNum > 0 --
Please run sosreport to collect diagnostic and configuration information:
sudo sosreport --batch
This requires root privileges and may take several minutes to execute. When completed,
sosreport generates a compressed file in /tmp or /var/tmp. The location of this file
is given in the script output. For example:
Your sosreport has been generated and saved in:
Please send this report to your Illumina support representative.
Aborting the application - it may take several minutes to dump the core file
FATAL: Caught signal Aborted (6)
Resetting the DRAGEN Bio-IT processor and stopping all software threads
Can anyone help with how to change the setting to haploid and how to resolve this error?
Archana Verma, please open a new post for your question.