Identify somatic short variants (SNVs and Indels) in one or more tumor samples from a single individual, with or without a matched normal sample.
Reference Implementations
Pipeline | Summary | Notes | Github | Terra |
---|---|---|---|---|
Somatic short variants tumor-normal pair | T-N BAMs to VCF | universal | yes | b37 |
Somatic short variants PON creation | Normal BAMs to PON | universal | yes | b37 |
Expected input
This workflow requires BAM files for each input tumor and normal sample. Input BAMs should be pre-processed as described in the GATK Best Practices for data pre-processing.
Main steps
This workflow has two main phases: first we generate a large set of candidate somatic variants, and then we filter them to obtain a more confident set of somatic variant calls.
Call candidate variants
Tools involved: Mutect2
Like HaplotypeCaller, Mutect2 calls SNVs and indels simultaneously via local de-novo assembly of haplotypes in an active region. That is, when Mutect2 encounters a region showing signs of somatic variation, it discards the existing mapping information and completely reassembles the reads in that region in order to generate candidate variant haplotypes. Like HaplotypeCaller, Mutect2 then aligns each read to each haplotype via the Pair-HMM algorithm to obtain a matrix of likelihoods. Finally, it applies a Bayesian somatic likelihoods model to obtain the log odds for alleles to be somatic variants versus sequencing errors.
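As a minimal sketch of how the calling step might be invoked, the Python helper below only assembles a `gatk Mutect2` argument list and does not run it. The helper name `mutect2_command` and all file and sample names are hypothetical. The flags shown (`-R`, `-I`, `-normal`, `--germline-resource`, `--panel-of-normals`, `--f1r2-tar-gz`, `-O`) are the commonly documented GATK4 ones; confirm them against `gatk Mutect2 --help` for your GATK version.

```python
from typing import List, Optional


def mutect2_command(
    reference: str,
    tumor_bams: List[str],
    output_vcf: str,
    normal_bam: Optional[str] = None,
    normal_sample: Optional[str] = None,
    germline_resource: Optional[str] = None,
    panel_of_normals: Optional[str] = None,
    f1r2_tar_gz: Optional[str] = None,
) -> List[str]:
    """Build (but do not run) a `gatk Mutect2` argument list.

    Hypothetical helper for illustration; works for tumor-only and
    tumor-normal calling.
    """
    if not tumor_bams:
        raise ValueError("at least one tumor BAM is required")
    if (normal_bam is None) != (normal_sample is None):
        raise ValueError("normal_bam and normal_sample must be given together")

    cmd = ["gatk", "Mutect2", "-R", reference]
    for bam in tumor_bams:
        cmd += ["-I", bam]
    if normal_bam is not None:
        # -normal takes the sample name (the SM tag in the BAM header), not a path.
        cmd += ["-I", normal_bam, "-normal", normal_sample]
    if germline_resource is not None:
        cmd += ["--germline-resource", germline_resource]
    if panel_of_normals is not None:
        cmd += ["--panel-of-normals", panel_of_normals]
    if f1r2_tar_gz is not None:
        # Optional F1R2 counts, used later to learn the orientation bias model.
        cmd += ["--f1r2-tar-gz", f1r2_tar_gz]
    cmd += ["-O", output_vcf]
    return cmd


# Example with placeholder file and sample names.
example = mutect2_command(
    reference="ref.fasta",
    tumor_bams=["tumor.bam"],
    output_vcf="unfiltered.vcf.gz",
    normal_bam="normal.bam",
    normal_sample="NORMAL_SAMPLE_NAME",
    f1r2_tar_gz="f1r2.tar.gz",
)
print(" ".join(example))
```

Returning a list rather than a shell string keeps the arguments safe to pass to `subprocess.run` without quoting problems.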
Calculate Contamination
Tools involved: GetPileupSummaries, CalculateContamination
This step emits an estimate of the fraction of reads due to cross-sample contamination for each tumor sample and an estimate of the allelic copy number segmentation of each tumor sample. Unlike other contamination tools, CalculateContamination is designed to work well without a matched normal even in samples with significant copy number variation and makes no assumptions about the number of contaminating samples.
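The two tools run in sequence: GetPileupSummaries is run first on each sample, and CalculateContamination then consumes the resulting tables. The sketch below builds the commands in that order without running them. The helper names, the table file names, and the common-sites VCF name are hypothetical; the flags (`-I`, `-V`, `-L`, `-O`, `--matched-normal`, `--tumor-segmentation`) are the commonly documented GATK4 ones and should be checked against each tool's `--help`.

```python
from typing import List, Optional


def pileup_command(bam: str, common_sites_vcf: str, output_table: str) -> List[str]:
    """Step 1: summarize read support at common germline sites for one sample."""
    return [
        "gatk", "GetPileupSummaries",
        "-I", bam,
        "-V", common_sites_vcf,  # resource with population allele frequencies
        "-L", common_sites_vcf,  # restrict traversal to those same sites
        "-O", output_table,
    ]


def contamination_command(
    tumor_table: str,
    contamination_table: str,
    segments_table: str,
    normal_table: Optional[str] = None,
) -> List[str]:
    """Step 2: estimate contamination from the pileup table(s)."""
    cmd = ["gatk", "CalculateContamination", "-I", tumor_table]
    if normal_table is not None:
        # The matched normal is optional; the tool is designed to work without it.
        cmd += ["--matched-normal", normal_table]
    cmd += ["--tumor-segmentation", segments_table, "-O", contamination_table]
    return cmd


def contamination_steps(
    tumor_bam: str, common_sites_vcf: str, normal_bam: Optional[str] = None
) -> List[List[str]]:
    """Return the commands in the order they should run (placeholder file names)."""
    steps = [pileup_command(tumor_bam, common_sites_vcf, "tumor.pileups.table")]
    normal_table = None
    if normal_bam is not None:
        normal_table = "normal.pileups.table"
        steps.append(pileup_command(normal_bam, common_sites_vcf, normal_table))
    steps.append(
        contamination_command(
            "tumor.pileups.table", "contamination.table", "segments.table", normal_table
        )
    )
    return steps


for step in contamination_steps("tumor.bam", "common_sites.vcf.gz", normal_bam="normal.bam"):
    print(" ".join(step))
```

The contamination and segmentation tables produced here are later passed to FilterMutectCalls through its `--contamination-table` and `--tumor-segmentation` arguments.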
Learn Orientation Bias Artifacts
Tools involved: LearnReadOrientationModel
This tool uses the optional F1R2 counts output of Mutect2 to learn the parameters of a model for orientation bias. For each trinucleotide context, it estimates the prior probability of single-stranded substitution errors that arise before sequencing. This step is extremely important for FFPE tumor samples.
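A minimal sketch of this step, assuming Mutect2 was run with `--f1r2-tar-gz` so that one or more F1R2 archives exist. The helper name and file names are hypothetical, and the command is only assembled, not executed; `-I` and `-O` are the commonly documented arguments, to be confirmed with `gatk LearnReadOrientationModel --help`.

```python
from typing import List


def learn_orientation_command(f1r2_archives: List[str], output_priors: str) -> List[str]:
    """Build a `gatk LearnReadOrientationModel` argument list (hypothetical helper)."""
    if not f1r2_archives:
        raise ValueError("at least one F1R2 archive from Mutect2 is required")
    cmd = ["gatk", "LearnReadOrientationModel"]
    for archive in f1r2_archives:
        # One -I per archive, e.g. one per scattered Mutect2 shard.
        cmd += ["-I", archive]
    cmd += ["-O", output_priors]
    return cmd


# Placeholder file names for two scattered Mutect2 shards.
command = learn_orientation_command(
    ["shard1.f1r2.tar.gz", "shard2.f1r2.tar.gz"], "read-orientation-model.tar.gz"
)
print(" ".join(command))
```

The resulting model archive is what FilterMutectCalls accepts through its `--ob-priors` argument.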
Filter Variants
Tools involved: FilterMutectCalls
Mutect2’s somatic likelihoods model assumes that read errors are independent, so that, for example, four reads each with an error probability of 1/1000 yield odds of roughly 1000^4 (a log10 odds of about 12) in favor of being a real variant versus a sequencing error. FilterMutectCalls accounts for correlated errors, that is, the possibility that all variant reads at a site were due to some common source of error. It accomplishes this through several hard filters to detect alignment artifacts and probabilistic models for strand and orientation bias artifacts, polymerase slippage artifacts, germline variants, and contamination. Additionally, it learns a Bayesian model for the overall SNV and indel mutation rate and allele fraction spectrum of the tumor to refine the log odds emitted by Mutect2. It then automatically sets a filtering threshold to optimize the F score, the harmonic mean of sensitivity and precision.
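The independence arithmetic above can be checked directly: four independent errors at 1/1000 each have a joint probability of 10^-12, which corresponds to a log10 odds of about 12. The sketch below also illustrates choosing a threshold that maximizes an expected F score from per-call error probabilities. This is a simplified toy model written for illustration, not GATK's actual implementation, and all function names and input values are hypothetical.

```python
import math
from typing import Sequence, Tuple


def log10_odds_independent(error_probs: Sequence[float]) -> float:
    """Toy model: odds of 'real variant' vs 'every read is an independent error'
    are taken as 1 / product(error_probs); return log10 of those odds."""
    return -sum(math.log10(p) for p in error_probs)


def f_score(true_pos: float, false_pos: float, false_neg: float) -> float:
    """Harmonic mean of precision and sensitivity."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    sensitivity = true_pos / (true_pos + false_neg)
    return 2 * precision * sensitivity / (precision + sensitivity)


def best_threshold(call_error_probs: Sequence[float]) -> Tuple[float, float]:
    """Pick the error-probability cutoff that maximizes the expected F score.

    Calls with error probability <= cutoff are kept. Each kept call contributes
    (1 - p) expected true positives and p expected false positives; each
    filtered call contributes (1 - p) expected false negatives.
    """
    probs = sorted(call_error_probs)
    total_real = sum(1 - p for p in probs)
    best_cutoff, best_f = 0.0, 0.0
    exp_tp = exp_fp = 0.0
    for p in probs:
        exp_tp += 1 - p
        exp_fp += p
        exp_fn = total_real - exp_tp
        f = f_score(exp_tp, exp_fp, exp_fn)
        if f > best_f:
            best_cutoff, best_f = p, f
    return best_cutoff, best_f


print(log10_odds_independent([1e-3] * 4))
print(best_threshold([0.01, 0.02, 0.05, 0.9, 0.95]))
```

In this toy input, three calls are very likely real and two are very likely artifacts, so the cutoff that maximizes the expected F score keeps the first three and filters the rest.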
Annotate Variants
Tools involved: Funcotator
At this step we run tools to add information to the discovered variants in our dataset. One of those tools, Funcotator, can be used to add gene-level information to each variant. Funcotator is a functional annotation tool in the core GATK toolset and was designed to handle both somatic and germline use cases. Funcotator reads in a VCF file, labels each variant with one of twenty-three distinct variant classifications, produces gene information (e.g. affected gene, predicted variant amino acid sequence, etc.), and associations to information in datasources. Supported datasources include GENCODE (gene information and protein change prediction), dbSNP, gnomAD, and COSMIC (among others). The corpus of datasources is extensible and user-configurable and includes cloud-based datasources supported with Google Cloud Storage. Funcotator produces either a Variant Call Format (VCF) file (with annotations in the INFO field) or a Mutation Annotation Format (MAF) file.
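A minimal sketch of an annotation command, assuming a local Funcotator data sources directory has already been downloaded. The helper name and all paths are hypothetical, and the command is only assembled, not run. The arguments (`--variant`, `--reference`, `--ref-version`, `--data-sources-path`, `--output`, `--output-file-format`) are the commonly documented ones; check `gatk Funcotator --help` for your version, including the accepted `--ref-version` values.

```python
from typing import List

# The two output formats named in the text above.
OUTPUT_FORMATS = ("VCF", "MAF")


def funcotator_command(
    variants_vcf: str,
    reference: str,
    ref_version: str,
    data_sources_dir: str,
    output_path: str,
    output_format: str = "VCF",
) -> List[str]:
    """Build a `gatk Funcotator` argument list (hypothetical helper)."""
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"output_format must be one of {OUTPUT_FORMATS}")
    return [
        "gatk", "Funcotator",
        "--variant", variants_vcf,
        "--reference", reference,
        "--ref-version", ref_version,
        "--data-sources-path", data_sources_dir,
        "--output", output_path,
        "--output-file-format", output_format,
    ]


# Placeholder paths; the ref_version value is an assumption for this example.
annotate = funcotator_command(
    variants_vcf="filtered.vcf.gz",
    reference="ref.fasta",
    ref_version="hg19",
    data_sources_dir="funcotator_data_sources/",
    output_path="annotated.maf",
    output_format="MAF",
)
print(" ".join(annotate))
```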
Additional Information
- Somatic calling is NOT simply a difference between two callsets
- Funcotator Information and Tutorial
- ActiveRegion determination (HaplotypeCaller & Mutect2)
- Evaluating the evidence for haplotypes and variant alleles (HaplotypeCaller & Mutect2)
- Local re-assembly and haplotype determination (HaplotypeCaller & Mutect2)
4 comments
Hi! It is not clear how to use the GetPileupSummaries and CalculateContamination in the text above. Which tool do I use first? Or are they supposed to be used simultaneously?
Can this pipeline be used for calling SNVs + Indels on single-cell ATAC-seq or single-cell RNA-seq data?
Is this for germline only, or does it apply to somatic variants as well?
Can this pipeline be used for bulk RNA-seq?