In this article, you'll learn how to identify structural variants in one or more individuals to produce a callset in VCF format.
Structural variants (SVs) are DNA rearrangements involving at least 50 nucleotides. By genetic standards, these mutations are fairly large, and they are highly abundant in the genome. As such, SVs are among the strongest forces shaping genome evolution and disease.
In addition, SVs are impactful because they encompass a number of mutational classes that can disrupt protein-coding genes and cis-regulatory architecture through different mechanisms.
The different kinds of SVs that can be detected in genomic analysis are summarized in the figure below.
If you are interested in using SV data in your own research, a good starting point is gnomAD-SV, a reference callset of SVs from 14,891 genomes. It is part of the Genome Aggregation Database (gnomAD), which provides exome and genome sequencing data aggregated from a variety of large-scale sequencing projects.
- Overview of the GATK-SV pipeline
- Tools used in the GATK-SV pipeline
- Single-Sample Mode vs Cohort Mode
- gCNV Training
1. Overview of the GATK-SV pipeline
The GATK-SV pipeline is used for discovering, genotyping, and annotating structural variants in Illumina short-read whole-genome sequencing (WGS) data.
The variants this pipeline can detect include:
- Copy number variants (CNVs), including deletions and duplications
- Reciprocal chromosomal translocations
- Complex structural variants involving two or more distinct SV signatures in a single mutational event
The pipeline starts with evidence collection from the BAM/CRAM files, extracting SV signatures from the sequencing reads. The evidence collected includes calls from a set of different algorithms as well as data extracted directly from the CRAM file: binned read depth, split-read positions, discordant-pair positions, and allele fractions at a set of SNP sites. This step also runs quality checks on the resulting data.
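To make the kinds of evidence concrete, here is a deliberately simplified Python sketch. The `Read` record and `collect_evidence` function are hypothetical illustrations of binning read depth and recording split-read and discordant-pair positions; the real pipeline operates on BAM/CRAM alignments with dedicated GATK tools, not on in-memory records like these.

```python
from dataclasses import dataclass

@dataclass
class Read:
    """Hypothetical, simplified alignment record (a real pipeline reads BAM/CRAM)."""
    pos: int                # leftmost mapped position on the contig
    is_split: bool          # read has a supplementary/clipped alignment
    mate_discordant: bool   # mate maps with unexpected orientation or distance

def collect_evidence(reads, bin_size=100):
    """Bin read depth and record split-read / discordant-pair positions."""
    depth = {}               # bin index -> read count
    split_positions = []     # candidate breakpoint positions from split reads
    discordant_positions = []
    for r in reads:
        b = r.pos // bin_size
        depth[b] = depth.get(b, 0) + 1
        if r.is_split:
            split_positions.append(r.pos)
        if r.mate_discordant:
            discordant_positions.append(r.pos)
    return depth, split_positions, discordant_positions
```

A deletion, for example, would show up here as a run of low-depth bins flanked by clustered split-read and discordant-pair positions.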
Next, data is passed to a “clustering” stage, where similar variants detected by multiple different algorithms are merged to reduce repeated computation.
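The clustering idea can be sketched as follows. This is an illustrative toy, not GATK-SV's actual clustering logic: it merges calls of the same SV type whose breakpoints fall within an assumed distance window, so that the same event reported by several callers becomes one candidate site.

```python
def cluster_calls(calls, window=50):
    """Merge SV calls of the same type whose start and end breakpoints both lie
    within `window` bp of a cluster's first member. Each call is a tuple
    (algorithm, svtype, start, end). Illustrative sketch only."""
    clusters = []
    for call in sorted(calls, key=lambda c: (c[1], c[2])):
        for cl in clusters:
            rep = cl[0]  # compare against the cluster's representative call
            if (call[1] == rep[1]
                    and abs(call[2] - rep[2]) <= window
                    and abs(call[3] - rep[3]) <= window):
                cl.append(call)
                break
        else:
            clusters.append([call])  # no match: start a new cluster
    return clusters
```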
Then, the list of candidate SV sites from the ensemble of algorithms is filtered to remove low-quality sites, and breakpoints are refined based on the evidence collected from the BAM/CRAM files.
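One way to picture breakpoint refinement is snapping a candidate breakpoint to the nearby position with the most split-read support. The function below is a hypothetical sketch of that idea, not the refinement model GATK-SV actually uses.

```python
from collections import Counter

def refine_breakpoint(candidate_pos, split_read_positions, window=50):
    """Move a candidate breakpoint to the split-read position with the most
    support within `window` bp; keep the original if no split reads are nearby."""
    nearby = [p for p in split_read_positions if abs(p - candidate_pos) <= window]
    if not nearby:
        return candidate_pos
    # most_common(1) returns the single highest-count position
    return Counter(nearby).most_common(1)[0][0]
```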
During the genotyping stage, evidence (discordant read pairs, split reads, and read depth) is evaluated for every sample at each of the candidate SV sites called across all of the algorithms. This is “joint genotyping,” which increases sensitivity and allows us to provide a genotype for every individual at every site.
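The "every sample at every site" idea can be sketched with a crude read-count model. The thresholds and the two functions below are hypothetical and far simpler than GATK-SV's genotyping machinery; the point is that each sample gets a genotype at each candidate site, including `./.` when there is no usable evidence.

```python
def genotype_sample(ref_support, alt_support):
    """Crude diploid genotype from supporting read counts (illustrative;
    the 0.2/0.8 allele-fraction cutoffs are assumptions, not GATK-SV's model)."""
    total = ref_support + alt_support
    if total == 0:
        return "./."          # no evidence at this site for this sample
    frac = alt_support / total
    if frac < 0.2:
        return "0/0"
    if frac < 0.8:
        return "0/1"
    return "1/1"

def joint_genotype(sites, evidence):
    """evidence[sample][site] -> (ref_support, alt_support). Every sample is
    genotyped at every site, even where it was not originally called."""
    return {
        site: {s: genotype_sample(*evidence[s].get(site, (0, 0)))
               for s in evidence}
        for site in sites
    }
```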
After genotyping, the variants go through several refinement stages. They are clustered again, taking into account sample overlap, so that each site that is common across individuals will only be represented as one record in the VCF. Then, complex variants are resolved and genotyped, and multi-allelic CNVs are collapsed into single records. Variants then go through another round of filtering.
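The "one record per shared site" step can be illustrated with a simple collapse over matching coordinates. The record layout below is hypothetical; GATK-SV's actual merging also weighs sample overlap and evidence, but the outcome is the same in spirit: duplicate sites become a single VCF record carrying all samples' genotypes.

```python
def collapse_records(records):
    """Merge records sharing (svtype, chrom, start, end) into one VCF-like
    record that carries every sample's genotype (illustrative sketch)."""
    merged = {}
    for rec in records:
        key = (rec["svtype"], rec["chrom"], rec["start"], rec["end"])
        if key not in merged:
            merged[key] = {"svtype": rec["svtype"], "chrom": rec["chrom"],
                           "start": rec["start"], "end": rec["end"],
                           "genotypes": {}}
        # later records contribute their sample genotypes to the shared record
        merged[key]["genotypes"].update(rec["genotypes"])
    return list(merged.values())
```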
To aid in interpretation of the variants, the VCF is then annotated with allele frequency in the cohort, external allele frequency from gnomAD-SV, and any overlap with genes and exons and the predicted functional consequences.
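Two of these annotations are easy to sketch in isolation: cohort allele frequency from the genotypes, and gene overlap from interval intersection. Both functions are simplified illustrations (real annotation handles multi-allelic sites, ploidy exceptions, and richer consequence prediction).

```python
def allele_frequency(genotypes):
    """Alternate-allele frequency over called genotypes; no-calls (./.) are skipped."""
    alt = ref = 0
    for gt in genotypes:
        if gt == "./.":
            continue
        for allele in gt.split("/"):
            if allele == "0":
                ref += 1
            else:
                alt += 1
    total = ref + alt
    return alt / total if total else 0.0

def overlapping_genes(start, end, genes):
    """genes: list of (name, g_start, g_end) on the same chromosome.
    Returns names of genes whose interval intersects [start, end)."""
    return [name for name, gs, ge in genes if start < ge and gs < end]
```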
Using BAM/CRAM files as inputs, the GATK-SV pipeline outputs a jointly called VCF with genotypes at each site for every input sample.
2. Tools used in the GATK-SV pipeline
The GATK-SV pipeline runs multiple SV callers to increase sensitivity and leverage multiple types of evidence. These tools are:
- GATK gCNV detects germline copy number variants from variations in read depth, combining a negative-binomial factor analysis module with a hierarchical hidden Markov model to account for sequencing biases and regions of high variation across samples.
- MELT is a Java package that discovers, annotates, and genotypes non-reference Mobile Element Insertions (MEIs) in paired-end WGS data. Due to licensing restrictions, we cannot provide a public Docker image or reference panel VCFs for this algorithm.
- Manta calls structural variants (SVs) and indels from mapped paired-end sequencing reads based on split read and discordant read pair evidence.
- Whamg “integrates mate-pair mapping, split read mapping, soft-clipping, alternative alignment and consensus sequence based evidence to predict SV breakpoints with single-nucleotide accuracy.”
- cn.MOPS detects CNVs from variations in read depth using a mixture of Poissons model.
3. Single-Sample Mode vs Cohort Mode
GATK-SV has two modes: cohort and single-sample.
In cohort mode, groups of at least 100 samples can be processed together; this produces the highest-quality results at the lowest cost per sample.
Smaller numbers of samples can be processed one at a time in single-sample mode. This mode uses pre-computed statistics from a reference panel for joint genotyping. Single-sample mode is a great option when analyzing only a few samples; however, it carries a higher cost per sample and lower sensitivity.
4. gCNV Training
The GATK-SV pipeline relies on gCNV as a depth calling tool, which requires a trained model as input.
It is important that the samples used to create the gCNV model match the samples the model is run on as closely as possible.
Among other characteristics, this means that samples used to create the model and the samples to which it will be applied should be consistent in terms of collection type, library preparation, and sequencing protocol.
For the GATK-SV cohort mode pipeline, we strongly recommend training a separate gCNV model on each batch of samples. For the single-sample pipeline, this means that the case sample should be matched as closely to the samples in the reference panel as possible.
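As one small illustration of per-batch matching, samples could be grouped by a shared metric such as median coverage before training a model on each group. This function and its input layout are assumptions for illustration; real batching also considers collection type, library preparation, sequencing protocol, and other QC metrics, as described above.

```python
def batch_by_coverage(samples, batch_size=100):
    """Group samples into batches of similar median coverage so each gCNV model
    is trained on relatively homogeneous data (simplified sketch; `samples` is
    a hypothetical list of dicts with a 'median_coverage' key)."""
    ordered = sorted(samples, key=lambda s: s["median_coverage"])
    # adjacent samples in coverage order end up in the same batch
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
```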