DRAGEN-GATK is an open-source, GATK-based pipeline that aims to produce results that have functional equivalence to the results produced by Illumina's hardware-based DRAGEN pipeline, which essentially means that the results can be considered interchangeable for statistical purposes. The specifics of how to determine functional equivalence are a bit more in-depth than just that, so article will explains what we mean by "functional equivalence" in more detail, and will discuss several options and alignment approaches available in the DRAGEN-GATK and DRAGEN pipelines that can affect both functional equivalence and the quality of the output.
For the initial release of DRAGEN-GATK, we've targeted functional equivalence against DRAGEN version 3.4.12. Future DRAGEN-GATK releases will target later releases of the DRAGEN pipeline.
What is "functional equivalence?"
Since both DRAGEN-GATK and DRAGEN are separate codebases under active development, it would be impractical to require them to produce bit-identical output. Even if it were feasible to do so, there is a certain amount of unavoidable "background noise" that is inherent to the process of high-throughput sequencing — due to natural inconsistencies in steps of the sequencing protocol (ie. library preparation), there will be small differences in the variant calling results when the same sample is sequenced by two different sequencing centers, even when using identical variant calling software and parameters. These small differences have been shown not to produce batch effects, and are expected by anyone using variant calling pipelines. Therefore, they can be considered part of the natural "background noise" of the sequencing process.
If two pipelines have fewer differences than the inherent variability that results when the same sample is sequenced by different centers, we can say that for all practical purposes they are interchangeable. This is how we define functional equivalence.
This definition is the standard that we have adopted for comparing DRAGEN-GATK and DRAGEN pipelines. To determine whether two versions of DRAGEN-GATK and DRAGEN are functionally equivalent, we perform the following steps:
- The baseline level of noise is computed by running multiple independent sequencing replicates of the same individuals at different sequencing centers, processed by the same pipeline (data replicates, see below), and identifying the differences. (Note that, for practical reasons, we generally calculate baseline noise by using technical replicates sequenced at the same center, rather than multiple centers. This is a stricter standard than evaluating replicates sequenced at different centers, since there will be less noise.)
- The differences between the DRAGEN hardware and DRAGEN-GATK software implementations are computed by running them on the exact same sample and comparing them to each other.
- The DRAGEN-GATK and DRAGEN implementations are considered functionally equivalent if there are fewer differences between outputs than what is seen in the baseline noise.
Which DRAGEN-GATK and DRAGEN arguments affect functional equivalence?
For the initial release of DRAGEN-GATK, we ported the major improvements introduced in the hardware DRAGEN pipeline, including DragSTR, BQD, and FRD. However, there are still features implemented in DRAGEN but not in GATK, and vice-versa, as well as a different alignment approach (see below), that can affect functional equivalence. Users of the DRAGEN-GATK pipeline should be aware of these differences when deciding which options to use when running the pipeline, which will differ depending on whether the goal is to maximize functional equivalence or to maximize the quality of the output.
Features that only exist in DRAGEN
- Column-wise detection,
Features that only exist in GATK
- Spanning event genotyping
Turning on each of these features will improve the quality of the output. However, in order to achieve functionally equivalent results, they need to be turned off while we work to implement the features in both pipelines.
This means that, in order to maximize either quality or functional equivalence, we need to pass the following arguments to DRAGEN or GATK HaplotypeCaller, respectively:
|Tool||Argument||Functional Equivalence||Maximum Quality|
The complete command line to run DRAGEN is outlined below.
In order to achieve functional equivalence on the GATK side, we recommend running the WARP pipeline. There are two pipeline-level preset arguments that will automatically apply the desired configuration for the DRAGEN-GATK pipeline:
WARP Pipeline-level arguments
There is also an individual
use_spanning_event_genotyping argument that can be used to toggle the GATK's spanning event genotyping on and off. However, we recommend using one of the mode arguments above (
dragen_maximum_quality_mode), since those represent the configurations we evaluated.
The WARP pipeline WDL also contains an argument to run BWA-MEM (
use_bwa_mem) for the alignment step. Using with the masked reference (see below) will produce very similar, yet not strictly functionally equivalent results. Users may choose to run BWA-MEM instead of DRAGMAP in the alignment step for cost reasons if functional equivalence is not required., however this is completely untested by us and use at your own risk.
In the interests of completeness, there is also an option in the WARP pipeline WDL to turn BQSR on or off, however we recommend leaving it off when running the pipeline in DRAGEN mode. Both of the mode arguments above (
dragen_maximum_quality_mode) turn BQSR off.
Masked reference approach
In addition to the variant caller features mentioned above, DRAGMAP (Illumina’s open source version of their DRAGEN aligner) uses a masked version of the hg38 reference. Consequently, in order to produce functionally equivalent results, DRAGEN users will have to use the same masked reference. Using the standard hg38 DRAGEN reference will not produce functionally equivalent results to DRAGEN-GATK. Furthermore, no other references than hg38 can be used to produce functionally equivalent calls at this moment.
Information about the FASTA file for this masked reference is accessible on Illumina's Support Resources page, and the hash table can be generated on DRAGEN using the following command:
dragen \ -–output-directory [output directory] \ --output-file-prefix [reference prefix] \ --ht-reference [path to Homo_sapiens_assembly38_masked.fasta] \ --build-hash-table=true \ --ht-build-rna-hashtable=true --ht-alt-aware-validate=False \ --ht-num-threads=40
More information about DRAGEN and GRCh38/hg38 reference genomes is available on Illumina's Genomic Research Hub blogpost on DRAGEN reference genome compatibility.
How to evaluate functional equivalence
For evaluating functional equivalence, we use a workflow that is available on Github (and has a lot of documentation there!). For a brief overview of the concept of functional equivalence, feel free to take a look at this recorded conference talk.
As discussed above, we define two pipelines as "functionally equivalent" if the difference in results between them (in this case between the accelerated hardware DRAGEN pipeline and the open-source DRAGEN-GATK pipeline) is smaller than the "background noise" that is inherent to genomic data analysis. We quantify the differences by taking multiple technical replicates of the same sample, sequence these replicates, and analyze them with both pipelines. We can measure the background noise by looking at the differences between the replicates called with the same pipeline (“intra-pipeline difference”), while we evaluate the differences between the pipelines by looking at the output of the two pipelines run on the same replicate (“inter-pipeline difference”).
We still need to define the notion of “differences” in this context. We look at two metrics to evaluate functional equivalence: concordance (visualized by FE plots) and quality variability (visualized by F1 plots).
FE plots: Concordance
Functional Equivalence (FE) plots visualize the intra and inter pipeline concordance between samples by running vcfeval with one sample arbitrarily defined as truth and the other one as the evaluation sample. Subsequently, a Jaccard score is calculated as
J = TP / (TP + FP + FN)
where TP refers to calls that are in both sample 1 and sample 2, and FP and FN refer to calls that are in one sample but not the other. A higher score relates to a high concordance or more similar calls. Note that the terms true positive, false positive and false negative only represent the output from vcfeval and do not reflect any relation to actual true calls, since no actual truth VCF is provided for this comparison.
Here is an example of an FE plot:
The left bars represent the intra-pipeline concordance when comparing different replicates called with pipeline A and the right bars represent the intra pipeline concordance between samples called with pipeline B. The middle bar represents the inter-pipeline concordance (the difference between pipelines A and B).
The points in the bars refer to the mean Jaccard score among the comparisons with multiple replicates and the error bars represent the standard deviation. If the middle bars are above the outer bars then the inter-pipeline difference is smaller than the intra-pipeline difference, and the two pipelines are functionally equivalent.
F1 plot: Quality variability
While the concordance metric is important to quantify the similarity between the calls, we require a different metric to ensure that the quality of the results produced by the two pipelines are in fact comparable. To illustrate this, consider the following illustration:
This plot is meant to illustrate a scenario where two pipelines are functionally equivalent, but there is a significant difference in the quality of the calls between the two pipelines. In this (exaggerated) illustration, green points represent true variants while red points represent false variants, and the ellipses represent the calls that are present in each input file. Concordance measures the overlap between the ellipses. In this case, the overlap between pipeline A rep1 and pipeline B rep1 is greater than the overlap between the two pipeline A replicates, which means that the concordance plot would show a smaller difference between the pipelines than between the replicates, which would in turn suggest functional equivalence. It is apparent, however, that there is a significant difference in the quality of the calls between pipeline A and pipeline B.
In order to address this scenario, we take a look at the F1 scores produced by each individual output compared to a known truth dataset.
Here is an example of an F1 plot:
In this plot, the absolute difference in F1 score vs. truth for a given quality threshold is visualized. In the above plot, we can expect a mean intra-pipeline difference between replicates called with pipeline A of 0.002 when filtering with a quality threshold of 10. That means, if the F1 score hard-filtered with that quality threshold of replicate 1 is, for example, 0.9 then the F1 score of replicate 2 could be, for example, 0.902. This is the background noise that is expected when looking at different replicates of the same individual. The absolute inter-pipeline difference between pipelines (green line) is very small and well below the intra-pipeline differences.
The area around the line represents the standard deviation when the comparison is carried out with multiple replicates.
If the green line is below the blue and orange lines, then the inter-pipeline difference is smaller than the intra-pipeline difference, and the pipelines are functionally equivalent.
DRAGEN version 3.4.12 for functionally equivalent configuration:
dragen \ -r [masked reference directory] \ --bam-input [unmapped BAM input] \ --intermediate-results-dir [temp directory] \ --output-directory [output directory] \ --output-file-prefix [output prefix] \ --enable-duplicate-marking true \ --enable-bam-indexing true \ --enable-map-align-output true \ --enable-vcf-compression true \ --enable-variant-caller true \ --vc-enable-profile-stats true \ --enable-save-bed-file true \ --vc-enable-dragstr true \ --generate-sa-tags true \ --vc-enable-frd true \ --vc-enable-bqd true \ --vc-emit-ref-confidence NONE \ --dump-map-align-registers=true \ --output-format=bam \ --enable-sort=true \ --enable-map-align=true \ --ht-alt-aware-validate=false \ --vc-enable-columnwise-detection false \ --vc-use-pdhmm false
Functional equivalence plots for different configurations of the DRAGEN-GATK pipeline
Plot showing "max FE" mode (columnwise detection + pdhmm OFF for DRAGEN, spanning deletion genotyping OFF for GATK)
Plot showing "max quality" mode (columnwise detection + pdhmm ON for DRAGEN, spanning deletion genotyping ON for GATK)