Purpose
Identify germline short variants (SNPs and Indels) in one or more individuals to produce a joint callset in VCF format.
Reference Implementations
Pipeline | Summary | Notes | Github | Terra
---|---|---|---|---
Prod* germline short variant per-sample calling | uBAM to GVCF | optimized for GCP | yes | hg38
Prod* germline short variant joint genotyping | GVCFs to cohort VCF | optimized for GCP | yes | hg38
Generic germline short variant per-sample calling | analysis-ready BAM to GVCF | universal | yes | hg38
Generic germline short variant joint genotyping | GVCFs to cohort VCF | universal | yes | hg38 & b37
Intel germline short variant per-sample calling | uBAM to GVCF | Intel-optimized for local architectures | yes | NA
- "Prod" refers to the Broad Institute's Data Sciences Platform production pipelines, which are used to process sequence data produced by the Broad's Genomic Sequencing Platform facility.
Expected input
This workflow is designed to operate on a set of samples constituting a study cohort; specifically, a set of per-sample BAM files that have been pre-processed as described in the GATK Best Practices for data pre-processing.
Main steps for Germline Cohort Data
We begin by calling variants per sample in order to produce a file in GVCF format. Next, we consolidate GVCFs from multiple samples into a GenomicsDB datastore. We then perform joint genotyping, and finally, apply VQSR filtering to produce the final multisample callset with the desired balance of precision and sensitivity.
Additional steps such as Genotype Refinement and Variant Annotation may be included depending on experimental design; those are not documented here.
Call variants per-sample
Tools involved: HaplotypeCaller (in GVCF mode)
In the past, variant callers specialized in either SNPs or Indels, or (like the GATK's own UnifiedGenotyper) could call both but had to do so using separate models of variation. The HaplotypeCaller is capable of calling SNPs and indels simultaneously via local de-novo assembly of haplotypes in an active region. In other words, whenever the program encounters a region showing signs of variation, it discards the existing mapping information and completely reassembles the reads in that region. This allows the HaplotypeCaller to be more accurate when calling regions that are traditionally difficult to call, for example when they contain different types of variants close to each other. It also makes the HaplotypeCaller much better at calling indels than position-based callers like UnifiedGenotyper.
In the GVCF mode used for scalable variant calling in DNA sequence data, HaplotypeCaller runs per-sample to generate an intermediate file called a GVCF, which can then be used for joint genotyping of multiple samples in a very efficient way. This enables rapid incremental processing of samples as they roll off the sequencer, as well as scaling to very large cohort sizes.
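As a minimal sketch (file names here are placeholders, not part of the official pipeline; `-ERC GVCF` is what puts HaplotypeCaller in GVCF mode), a per-sample run might look like:

```
# Hypothetical per-sample run of HaplotypeCaller in GVCF mode.
# reference.fasta and sample1.bam are placeholder inputs.
gatk HaplotypeCaller \
    -R reference.fasta \
    -I sample1.bam \
    -O sample1.g.vcf.gz \
    -ERC GVCF
```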
In practice, this step can be appended to the pre-processing section to form a single pipeline applied per-sample, going from the original unmapped BAM containing raw sequence all the way to the GVCF for each sample. This is the implementation used in production at the Broad Institute.
Consolidate GVCFs
Tools involved: GenomicsDBImport
This step consists of consolidating the contents of GVCF files across multiple samples in order to improve scalability and speed up the next step, joint genotyping. Note that this is NOT equivalent to the joint genotyping step; variants in the resulting merged GVCF cannot be considered to have been called jointly.
Prior to GATK4 this was done through hierarchical merges with a tool called CombineGVCFs. This tool is included in GATK4 for legacy purposes, but performance is far superior when using GenomicsDBImport, which produces a datastore instead of a GVCF file.
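As a rough sketch (the workspace path, interval, and sample map file below are placeholders), consolidating a cohort's GVCFs into a GenomicsDB workspace might look like:

```
# Hypothetical import of per-sample GVCFs into a GenomicsDB datastore.
# cohort.sample_map is a placeholder tab-delimited file mapping sample names to GVCF paths.
gatk GenomicsDBImport \
    --genomicsdb-workspace-path my_database \
    --sample-name-map cohort.sample_map \
    -L chr20
```

Note that GenomicsDBImport requires one or more intervals via `-L`; in practice the import is typically scattered across intervals, with one workspace per interval.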
Joint-Call Cohort
Tools involved: GenotypeGVCFs
At this step, we gather all the per-sample GVCFs (or combined GVCFs if we are working with large numbers of samples) and pass them all together to the joint genotyping tool, GenotypeGVCFs. This produces a set of joint-called SNP and indel calls ready for filtering. This cohort-wide analysis empowers sensitive detection of variants even at difficult sites, and produces a squared-off matrix of genotypes that provides information about all sites of interest in all samples considered, which is important for many downstream analyses.
This step runs quite quickly and can be rerun at any point when samples are added to the cohort, thereby solving the so-called N+1 problem.
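A minimal sketch of this step, assuming the GenomicsDB workspace from the previous step (paths are placeholders; the `gendb://` prefix tells the tool to read from a GenomicsDB workspace rather than a GVCF):

```
# Hypothetical joint genotyping across the cohort from the GenomicsDB workspace.
gatk GenotypeGVCFs \
    -R reference.fasta \
    -V gendb://my_database \
    -O cohort.vcf.gz
```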
Filter Variants by Variant (Quality Score) Recalibration
Tools involved: VariantRecalibrator, ApplyVQSR
The GATK's variant calling tools are designed to be very lenient in order to achieve a high degree of sensitivity. This is good because it minimizes the chance of missing real variants, but it does mean that we need to filter the raw callset they produce in order to reduce the number of false positives, which can be quite large.
The established way to filter the raw variant callset is to use variant quality score recalibration (VQSR), which uses machine learning to identify annotation profiles of variants that are likely to be real, and assigns a VQSLOD score to each variant that is much more reliable than the QUAL score calculated by the caller. In the first step of this two-step process, the program builds a model based on training variants, then applies that model to the data to assign a well-calibrated probability to each variant call. We can then use this variant quality score in the second step to filter the raw call set, thus producing a subset of calls with our desired level of quality, fine-tuned to balance specificity and sensitivity.
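To make the two steps concrete, here is a hedged sketch for the SNP pass only (resource files, annotations, and the sensitivity level are illustrative placeholders; a real run needs truth/training resources appropriate to your organism, and indels are handled in a separate `-mode INDEL` pass):

```
# Step 1 (sketch): build the recalibration model from truth/training resources.
gatk VariantRecalibrator \
    -R reference.fasta \
    -V cohort.vcf.gz \
    --resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf.gz \
    -an QD -an FS -an MQ \
    -mode SNP \
    -O cohort_snps.recal \
    --tranches-file cohort_snps.tranches

# Step 2 (sketch): apply the model, filtering at a chosen truth sensitivity level.
gatk ApplyVQSR \
    -R reference.fasta \
    -V cohort.vcf.gz \
    --recal-file cohort_snps.recal \
    --tranches-file cohort_snps.tranches \
    --truth-sensitivity-filter-level 99.7 \
    -mode SNP \
    -O cohort_snps_vqsr.vcf.gz
```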
The downside of how variant recalibration works is that the algorithm requires high-quality sets of known variants to use as training and truth resources, which for many organisms are not yet available. It also requires quite a lot of data in order to learn the profiles of good vs. bad variants, so it can be difficult or even impossible to use on small datasets that involve only one or a few samples, on targeted sequencing data, on RNAseq, and on non-model organisms. If for any of these reasons you find that you cannot perform variant recalibration on your data (after having tried the workarounds that we recommend, where applicable), you will need to use hard-filtering instead. This consists of setting flat thresholds for specific annotations and applying them to all variants equally. See the methods articles and FAQs for more details on how to do this.
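If you do fall back to hard-filtering, a minimal sketch with VariantFiltration looks like the following (file names are placeholders, and the thresholds shown are only commonly cited starting points for SNPs; see the hard-filtering articles for full recommendations):

```
# Hypothetical hard filter on a SNP-only callset; thresholds are starting points, not tuned values.
gatk VariantFiltration \
    -V cohort_snps.vcf.gz \
    --filter-expression "QD < 2.0" --filter-name "QD2" \
    --filter-expression "FS > 60.0" --filter-name "FS60" \
    -O cohort_snps_hardfiltered.vcf.gz
```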
We are currently experimenting with neural network-based approaches with the goal of eventually replacing VQSR with a more powerful and flexible filtering process.
Main steps for Germline Single-Sample Data
Single-sample variant discovery uses HaplotypeCaller in its default single-sample mode to call variants in an analysis-ready BAM file. The VCF that HaplotypeCaller emits errs on the side of sensitivity, so some filtering is often desired. To filter variants, first run the CNNScoreVariants tool, which annotates each variant with a score representing the model's prediction of that variant's quality. Then, to apply filters based on those scores, run the FilterVariantTranches tool with SNP and INDEL sensitivity tranches appropriate for your task.
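A hedged sketch of that two-step filtering (file names and tranche values are placeholders; the `--resource` arguments should point to validated call sets appropriate to your reference):

```
# Step 1 (sketch): annotate each variant with a CNN score (1D model by default).
gatk CNNScoreVariants \
    -R reference.fasta \
    -V sample1.vcf.gz \
    -O sample1_scored.vcf.gz

# Step 2 (sketch): filter on that score using sensitivity tranches and truth resources.
gatk FilterVariantTranches \
    -V sample1_scored.vcf.gz \
    --resource hapmap.vcf.gz \
    --resource mills.vcf.gz \
    --info-key CNN_1D \
    --snp-tranche 99.95 \
    --indel-tranche 99.4 \
    -O sample1_filtered.vcf.gz
```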
Notes on methodology
The central tenet that governs the variant discovery part of the workflow is that the accuracy and sensitivity of the germline variant discovery algorithm are significantly increased when it is provided with data from many samples at the same time. Specifically, the variant calling program needs to be able to construct a squared-off matrix of genotypes representing all potentially variant genomic positions, across all samples in the cohort. Note that this is distinct from the primitive approach of combining variant calls generated separately per-sample, which lack information about the confidence of homozygous-reference or other uncalled genotypes.
In earlier versions of the variant discovery phase, multiple per-sample BAM files were presented directly to the variant calling program for joint analysis. However, that scaled very poorly with the number of samples, posing unacceptable limits on the size of the study cohorts that could be analyzed in that way. In addition, it was not possible to add samples incrementally to a study; all variant calling work had to be redone when new samples were introduced.
Starting with GATK version 3.x, a new approach was introduced, which decoupled the two internal processes that previously composed variant calling: (1) the initial per-sample collection of variant context statistics and calculation of all possible genotype likelihoods given each sample by itself, which requires access to the original BAM file reads and is computationally expensive, and (2) the calculation of genotype posterior probabilities per-sample given the genotype likelihoods across all samples in the cohort, which is computationally cheap. These were made into the separate steps described above, enabling incremental growth of cohorts as well as scaling to large cohort sizes.
11 comments
I'm not really clear from the above text if VariantRecalibrator and ApplyRecalibration are used in conjunction with CNNScoreVariants or if these are alternatives to one another. The text appears to say the neural network based approach is experimental but the schematic for the current best practices doesn't even have VariantRecalibrator and ApplyRecalibration anywhere. And I just noticed a slide in the recent Costa Rica 2020 workshop which shows GATK CNN consistently beating the two VQSR based approaches on that slide.
Does the Broad/GATK-team only provide the best practices as WDL scripts now? At one point, there used to be step-by-step tutorials on how to use and apply these tools, but these don't seem to be available anymore.
If you are no longer providing step-by-step tutorials, I wish that you would either A) make that clear on the site, or B) revisit/reconsider that decision (which would be my personal preference).
While it's great that WDL scripts are available, if that's all you provide now, it essentially means that people either have to adopt Cromwell and WDL in order to use the GATK, or have to figure out how to reverse engineer those scripts if that isn't an option or preference, e.g. some groups prefer CWL to WDL.
These pipelines are a great resource and have been essential to developing our own pipelines. For the Germline Cohort pipeline, I am unclear on the consolidate gVCFs bit. I have 100s of genomes with 25 chromosomes each. Does the pipeline result in 25 separate databases (one per chromosome), or can one generate a single database with all 25 chromosomes?
Would the following command be a potential approach to generating a single database?
gatk --java-options "-Xmx200g -Xms200g" \
GenomicsDBImport \
--genomicsdb-workspace-path ../Tse_ChrAll_database \
-L Chr_1 \
-L Chr_2 \
-L Chr_3 \
-L Chr_4 \
-L Chr_5 \
--sample-name-map ../gVCFs/Chr1_sample_names_map.txt \
--sample-name-map ../gVCFs/Chr2_sample_names_map.txt \
--sample-name-map ../gVCFs/Chr3_sample_names_map.txt \
--sample-name-map ../gVCFs/Chr4_sample_names_map.txt \
--sample-name-map ../gVCFs/Chr5_sample_names_map.txt \
--tmp-dir /tmp
This guide is only really useful for people who already have in-depth knowledge of all the tools. A step-by-step guide with examples would be greatly appreciated.
This is super useful. However, none of the WDL workflows linked at the top actually use VariantRecalibrator or ApplyVQSR. It would be nice if there was an example workflow somewhere using these tools.
Thanks!
Hi, I'm new to GATK. I'm trying to use GATK to call germline mutations from my tumor data. I see that there are two modules, per-sample calling and joint genotyping, and I wonder whether the per-sample module is compatible with the hg19 reference genome? The attached picture shows that the per-sample module only supports hg38. Hope to hear your response!
Reference for HaplotypeCaller commands:
https://gatk.broadinstitute.org/hc/en-us/articles/360042913231-HaplotypeCaller
Prior to that, the GATK subtools can be found via `gatk -h`
Hi, Thank you for this best practices documentation. For single-sample variant calling for a non-model organism, is it correct that one can use HaplotypeCaller to generate a gVCF and skip the GenotypeGVCFs step? Or do you still need to run GenotypeGVCFs for this single sample?
Thank you for clarifying.
Best,
Hello Derek,
I came to update my GATK4 pipeline using the latest best practices and was surprised with the following:
1. Both universal per-sample and joint genotyping pipeline links point to the same page (per-sample calling).
2. No mention of the mark duplicates step (is it no longer needed?)
3. No mention of the BQSR step (is it no longer needed?)
4. Though VQSR is mentioned on this page, the universal pipeline page lacks this step.
Please advise regarding these notes, as I am trying to follow the most recent best practices and this article seems to have confused me instead of helping me.
Wow, I'm pretty disappointed to see that the really great step-by-step descriptions for GATK's best practices are now gone, and what's in their place now boils down to 'just use Terra'. Searching for useful information on using command-line GATK is pretty terrible; I don't always need an end-to-end WDL workflow when a single step will do, so now the process is pretty inflexible.
The germline short variant calling GitHub link here is about three repositories out-of-date.