Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data


Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

DRAGEN-GATK Update: Let's get more specific

22 comments

  • Kurt Hetrick

    Is this single sample calling only? Is there still a two-step process involved (creating a GVCF-like file and then joint calling)?

  • Geraldine Van der Auwera

    Hi Kurt, this applies to both the single sample case and the multisample (joint calling) cases. Here we show the evaluation results from the single sample case because they're more straightforward to interpret, but the modifications we describe will benefit joint calling as well. So if you're calling variants on cohorts, you will still do the 2-step process of creating a GVCF for each sample separately then joint calling all your samples together. 
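    The two-step process described above can be sketched as follows (a sketch with placeholder file names; adjust references, intervals, and sample lists to your own setup):

    ```shell
    # Step 1: per-sample calling in GVCF mode (run once per sample)
    gatk HaplotypeCaller \
        -R reference.fasta \
        -I sampleA.bam \
        -O sampleA.g.vcf.gz \
        -ERC GVCF

    # Step 2: consolidate the per-sample GVCFs, then joint-call the cohort
    gatk GenomicsDBImport \
        -V sampleA.g.vcf.gz \
        -V sampleB.g.vcf.gz \
        -L intervals.list \
        --genomicsdb-workspace-path cohort_db

    gatk GenotypeGVCFs \
        -R reference.fasta \
        -V gendb://cohort_db \
        -O cohort.vcf.gz
    ```
    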

  • MICHAEL MCMANUS

    Is there a plan for improving and merging Mutect for both the Dragen accelerated and non-accelerated pipelines? 

  • Shahryar Alavi

    What about duplicated genes?

    To my knowledge, GATK does not call variants if reads are mapped to more than one locus on the genome; e.g. we never see variants of SMN1 and SMN2 genes in our annotated variants list.

    Does the new DRAGEN-GATK pipeline have a solution to this challenge?

  • Geraldine Van der Auwera

    MICHAEL MCMANUS Mutect2 and the somatic short variants pipeline are on the list of use cases we want to work on together, but we haven't yet decided which will be next after the germline short variants. We'd love to hear from you all on what would be most valuable to the research community, so don't hesitate to comment. 

    Shahryar Alavi You're correct that GATK will only take into account uniquely mapped reads; dealing with duplicated genes is more a matter of mapping strategy. The updated DRAGEN-GATK pipeline does not change anything compared to the current behavior in that respect.
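    This point can be gauged directly on a BAM: reads that map equally well to a duplicated gene pair like SMN1/SMN2 are typically assigned MAPQ 0 by the aligner, and GATK's read filters exclude them. A quick sketch for estimating how much of a region is affected (assumes samtools and an indexed BAM; the region argument is an illustrative placeholder, not exact SMN coordinates):

    ```shell
    # Fraction of reads in a region that are ambiguously mapped (MAPQ 0)
    total=$(samtools view -c sample.bam chr5:70000000-71000000)
    unique=$(samtools view -c -q 1 sample.bam chr5:70000000-71000000)   # keep MAPQ >= 1
    awk -v t="$total" -v u="$unique" \
        'BEGIN { printf "multi-mapped fraction: %.3f\n", (t - u) / t }'
    ```
    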

  • Geraldine Van der Auwera

    FYI our colleagues in the Genomics Platform (who do the sequencing at the Broad Institute) are presenting a poster at the AGBT meeting this week about their testing of the DRAGEN-GATK pipeline. For the lucky ducks over there, it's poster 1010 and will be presented on Tuesday Feb 25th from 4:40 pm to 6:10pm. For the rest of the world, here is the PDF: https://drive.google.com/open?id=1fa0RG9RjoPOBlqvPoH6CXL121NldqAph

  • Niko Balanis

    Hello Geraldine, 

    We have been playing with DRAGEN on an FPGA AWS instance a little. For the figure above, we would be curious about your hard-filter settings for DRAGEN (based on my reading of this blog, it sounds like it is just quality-score based) and for GATK (this might have more parameters owing to the CNN) so that we can recreate it. This would help us benchmark and give us a baseline when we test the two on our own whole genomes.

    DRAGEN is screaming fast, by the way; coupled with the FPGA it absolutely tears through large WGS samples. This feels like a big step toward making the processing of thousands of whole genomes more routine, and it probably also opens the door to processing data for truly massive studies. Personally, I am excited for the applications to cancer, GWAS, etc. that this will enable. Thanks again.

    Niko

  • Geraldine Van der Auwera

    Hi Niko Balanis, glad to hear you're enjoying that blazing speed :) Let me loop in Bhanu Gandham to find out if we have the evaluation parameters in a shareable format. I'm pretty sure it's in the public github somewhere but hopefully we can get you a direct pointer so you don't have to go digging for it. When the full pipeline is ready for release we'll have a public workspace on Terra that will exercise the whole pipeline on test data to make it fully reproducible -- but I assume you'd prefer to start benchmarking sooner if possible. 

  • Bhanu Gandham

    Hi Niko Balanis

     

    Very glad and excited to hear that you are already getting started with this! Here is the script that was used to filter the GATK variants: https://github.com/broadinstitute/gatk/blob/master/scripts/cnn_variant_wdl/cram2filtered.wdl . You are right: for DRAGEN, the variants were filtered based on QUAL scores, and the default threshold used there was 10. Please note that the GATK and DRAGEN versions used in this blog are older (GATK v4.1.1.0 and DRAGEN v3.4); since then, improvements have been made to both. I would love to hear your thoughts once you are done with benchmarking on your end.
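    As a concrete illustration of a QUAL-based hard filter at threshold 10, something like VariantFiltration could be used (a sketch with placeholder file names; not necessarily how the evaluation itself was run):

    ```shell
    # Flag sites with QUAL below 10 as LowQual (sites are marked, not removed)
    gatk VariantFiltration \
        -V dragen.vcf.gz \
        --filter-expression "QUAL < 10.0" \
        --filter-name "LowQual" \
        -O dragen.filtered.vcf.gz
    ```
    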

    And as Geraldine said, we will soon have a whole pipeline and test data made publicly available. So stay tuned! :)

  • Ury Alon

    Hi Geraldine Van der Auwera,

    Now that v4.2.0.0 is released, is there any news regarding the DRAGEN-GATK somatic short variants pipeline (Mutect2) implementation? I am using the software-only version extensively, and the only way I've found to reduce the long processing time from ~24 hours to ~20 minutes is using "a lot" of intervals and running "a lot" of Mutect2 commands in parallel, which is quite expensive. DRAGEN-GATK might be an excellent cost-effective alternative.

  • Geraldine Van der Auwera

    Hi Ury Alon, the next pipeline/use case we are tackling as part of the DRAGEN-GATK collaboration is germline SV (structural variation), which is a big area of focus for us right now. We don't currently have plans to tackle somatic short variants with the DRAGEN team. Sorry to disappoint! 

    Are you currently using our WDL workflows to operationalize your analyses, and are you running on cloud? We have some optimizations in place to reduce cost that may help if you're not already using those. 

  • Ury Alon

    Thanks for the prompt response, Geraldine Van der Auwera.

    We are not using actual WDLs, as we use AWS Step Functions and Batch for pipeline orchestration and execution. We have done significant optimization work, but I would love to read about the methods you suggest.

  • Geraldine Van der Auwera

    Ah I see. Unfortunately we don't yet have systematic documentation on the pipeline optimizations, but off the top of my head, the top three are (1) using preemptible VMs on GCP (analogous to Spot instances on AWS) which saves us a ton of money (cost is ~20% of full price, barring preemptions); (2) auto-sizing disk allocations (calculating based on input sizes and output expectations); and (3) streaming data from buckets instead of localizing entire files. You can see the WDL scripts and the summaries of cost per sample in the Mutect2 workspace here: https://app.terra.bio/#workspaces/help-gatk/Somatic-SNVs-Indels-GATK4 (it's public after sign-in, which is free/open to all).
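    The auto-sizing in (2) boils down to computing the allocation from the input sizes at submission time rather than hard-coding a worst case. A minimal shell sketch (the 2x multiplier and 20 GB padding are illustrative assumptions, not the pipeline's actual values):

    ```shell
    # disk_gb: derive a disk request (in GB) from the total size of the input files,
    # leaving headroom for outputs plus fixed padding
    disk_gb() {
        local bytes
        bytes=$(stat -c%s "$@" | awk '{ s += $1 } END { print s }')
        echo $(( (bytes / 1024 / 1024 / 1024 + 1) * 2 + 20 ))
    }

    # e.g.: disk_gb input.bam reference.fasta
    ```

    In WDL-based pipelines the same calculation is typically expressed with size() on the task inputs and passed to the runtime disk attribute.
    
    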

    The Terra knowledge base has docs on how to use these optimizations (which are not specific to Terra, to be clear), let me know if you're interested in specific pointers. Also there's a general cost control article here that covers a range of considerations; some of them are Google-specific but some are fairly cross-platform. 

     

  • Ury Alon

    Thanks Geraldine Van der Auwera, I was actually reading this link earlier today :)

    I don't know if this is the place to get more specific, but I'll give it a try:

    In terms of optimizations, we've gone well beyond what you mentioned (happy to share): we use Spot instances exclusively, we use piping wherever possible (which unfortunately is not all cases), and we parallelize both by using the Spark variants of tools and by processing only intervals (in the case of Mutect2, for example).

    I think we can further reduce processing time and resource consumption by splitting the reference we're using (hg38) into chromosomes and intervals (in Mutect2, once again), but when testing it, I saw the results are not the same as when the full reference file is used. Is using only the "relevant" FASTA reference segment (for example, just chr1 when mutecting tumor chr1 and normal chr1) as a reference to Mutect2 something you would recommend?

  • Geraldine Van der Auwera

    Oh yes I didn't mention processing only intervals since you already mentioned doing that, but it is definitely key. I'm not entirely sure I understand what you mean by "using only the "relevant" FASTA reference segment" though -- are you feeding in different reference files corresponding to subsets? On that front, we use the whole (uncut) reference file for everything, though I think we recently started using persistent disks for the reference files to speed up the process and avoid having to copy it all to each VM. I haven't used that yet myself so I can't speak to the details, but there was a feature update on the Terra blog that mentioned this (because it's now used by default when using our built-in references on Terra): https://terra.bio/coming-soon-faster-cheaper-workflows/

    The other thing we do wrt intervals is that we make "lists of interval lists". We can then parallelize execution into multiple GATK tasks, each of which runs on its own list of intervals, and we balance the lists so they all take roughly the same amount of time, which avoids eg waiting an extra hour just because one interval list has a lot more ground to cover. We've done quite a lot of tuning to figure out how widely to parallelize to get the balance of cost vs speed that is right for us (recognizing that when you parallelize more widely, you pay more for the overhead of setting up VMs etc) but it's not necessarily one size fits all. We tend to prefer cheap >> super fast for routine processing, but will occasionally dial things up if we need faster turnover at the expense of higher cost. 
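    The interval balancing described above is close to what the SplitIntervals tool automates: it cuts a genome or interval list into N pieces covering roughly equal territory for scatter-gather (a sketch; the scatter count of 20 and the file names are arbitrary examples):

    ```shell
    # Produce 20 roughly balanced interval lists under scattered/ for parallel GATK tasks
    gatk SplitIntervals \
        -R reference.fasta \
        -L calling_regions.interval_list \
        --scatter-count 20 \
        -O scattered/
    ```
    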

  • Ury Alon

    Great, thanks for clarifying.  In our case, processing time is usually prioritized over costs, but we try to have both :)

    What I meant is, should there be a difference (in terms of identifying mutations) between executing 

    gatk Mutect2 \
    --input ${PATIENT_ID}.${TUMOR_SAMPLE_ID}.${CHROMOSOME}.bam \
    --input ${PATIENT_ID}.${NORMAL_SAMPLE_ID}.${CHROMOSOME}.bam \
    --output ${PATIENT_ID}.mutect2.${CHROMOSOME}.${INTERVAL}.vcf.gz \
    --tumor-sample ${TUMOR_SAMPLE_ID} \
    --normal-sample ${NORMAL_SAMPLE_ID} \
    -L ${INTERVAL}.interval_list \
    --reference hg38.fa

    and

    gatk Mutect2 \
    --input ${PATIENT_ID}.${TUMOR_SAMPLE_ID}.${CHROMOSOME}.bam \
    --input ${PATIENT_ID}.${NORMAL_SAMPLE_ID}.${CHROMOSOME}.bam \
    --output ${PATIENT_ID}.mutect2.${CHROMOSOME}.${INTERVAL}.vcf.gz \
    --tumor-sample ${TUMOR_SAMPLE_ID} \
    --normal-sample ${NORMAL_SAMPLE_ID} \
    -L ${INTERVAL}.interval_list \
    --reference hg38.${CHROMOSOME}.fa

    Note that the tumor and normal samples are already split into chromosomes in both cases, and that the interval list further limits the extent of processing.

    After all Mutect2 tasks are completed, we concatenate the resulting VCFs into one.
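    For reference, that concatenation step could be done with GatherVcfs, which expects the shards in genomic order (a sketch; the shard names are placeholders). If FilterMutectCalls is run afterwards, the per-shard Mutect2 .stats files also need merging, which is what MergeMutectStats is for:

    ```shell
    # Concatenate per-interval shards (inputs must be ordered by genomic position)
    gatk GatherVcfs \
        -I patient1.mutect2.chr1.0001.vcf.gz \
        -I patient1.mutect2.chr1.0002.vcf.gz \
        -O patient1.mutect2.vcf.gz
    ```
    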

  • Geraldine Van der Auwera

    Got it — that should not make any difference. The only case where things could get weird is if you're splitting up chromosomes internally, because then you could run into edge effects. But if you're simply separating entire chromosomes, it should be fine, so if you do observe differences, I would recommend posting the details in the support forum so the support team can take a look. 

  • Ury Alon

    Awesome, thank you very much Geraldine Van der Auwera.  I'll get to it :)

     

  • Jonathan Klonowski

    Are there / were there any advances on getting GenomicsDBImport to run in parallel, or is GenomicsDBImport still limited to a single core?

  • Yangyxt

    Hi Jonathan Klonowski,

    Actually, you can use GNU parallel to wrap GenomicsDBImport and run it in parallel.
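    For example, one workspace per chromosome can be imported concurrently (a sketch; sample_map.txt and the chromosome list are placeholders, and GenotypeGVCFs must then be run per workspace and the results gathered):

    ```shell
    # Import each chromosome into its own GenomicsDB workspace, four jobs at a time
    parallel -j 4 \
        gatk GenomicsDBImport \
            --sample-name-map sample_map.txt \
            -L {} \
            --genomicsdb-workspace-path "db_{}" \
        ::: chr1 chr2 chr3 chr4
    ```
    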

  • Yangyxt

    Hi Geraldine Van der Auwera,

    Since HaplotypeCaller now has a DRAGEN mode (which I suppose should be used on DRAGMAP-generated BAM files), can we mix GVCFs generated by the traditional HC and by the new DRAGEN-mode HC in GenomicsDBImport and GenotypeGVCFs?

  • Quentin Chartreux

    Hi,
    Do you have any idea when a GATK-DRAGEN pipeline will be released? Or at least the steps and parameters to use to get results at least as good as with the classic GATK pipeline?
    In particular, does it work if we use BWA-MEM2? I tried DRAGMAP and so far it's way too slow.
    Moreover, after having generated the GVCF with HaplotypeCaller in DRAGEN mode, can we do joint genotyping with GenomicsDBImport followed by GenotypeGVCFs?
    I read that with DRAGEN-GATK a hard filter was enough, but with what value? If it is just QUAL < 10, are there no additional steps after GenotypeGVCFs (which already eliminates variants below 10)?
    As you can see, I really want to use these new tools ;-)
