Structural variants (SVs) are key players in human evolution and disease, but they are difficult to discover and analyze. Many tools exist for discovering structural variants; some algorithms are specialized to detect certain structural variants or only work with certain SV categories. GATK-SV is an ensemble method that brings together the best evidence from each tool and reports it in a high-quality format. In this article, we will go over the evidence categories for detecting structural variants and the structural variant types that we report in GATK-SV.
Structural Variant - A genetic alteration of more than 50 base pairs. These include deletions, duplications, insertions, inversions, and translocations, as well as complex rearrangements.
Breakpoint - A discontinuity in the DNA sequence of a sample when compared to a reference. In simple terms, this can be thought of as a junction or a “jump” from one section of the reference sequence to another. The sample’s sequence may match the reference or its reverse complement on both sides of the breakpoint, or one side may be novel. A breakpoint is defined by the coordinates and orientation of the reference bases that match the sample on either side of the breakpoint, or the novel sequence content. All structural variants are composed of one or more breakpoints.
Breakend - A representation of a breakpoint tied to one specific location on the reference. For example, a deletion is formed by a single breakpoint: a section of sequence is “skipped,” so a subsequence of the sample’s DNA is joined directly to a subsequence that matches a position further downstream in the reference. We nevertheless represent this as two breakends, one at the 5’ end of the deleted segment on the reference and one at the 3’ end. The following image demonstrates breakends and a breakpoint with a deletion.
Evidence categories to find SVs
GATK-SV and the tools contained in the GATK-SV pipeline primarily use three evidence types in order to detect and categorize structural variants: anomalously paired ends, split reads, and read depth. The figure below shows how the evidence categories help to find structural variants (Adapted from Mahmoud, M., et al.).
Paired-end (PE) - Paired-end short-read sequencing data provides useful information for structural variant calling because the expected distance between the two mates of a pair can be estimated and the normal alignment orientation of each mate is known. In sample preparation for WGS short-read DNA sequencing, the entire length of the DNA fragment including the adapters is called the insert size. If the distance between the start of read one and the start of read two on the reference is substantially different from the expected insert size, this may indicate a structural variant in the region. To properly measure the distance between the starting locations of the forward and reverse reads along the reference, structural variant algorithms take into account the orientation and alignment length of the paired-end reads. Different types of structural variants therefore produce characteristic patterns. See the short read signature diagram to view what paired-end reads look like when structural variants exist in a sample genome. For example, when a deletion is present in the sample, the reads align farther apart on the reference, so the apparent insert size is larger than the standard insert size for the reads.
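The comparison described above can be sketched in a few lines of Python. This is only an illustration, not how GATK-SV or its component callers evaluate paired-end evidence: the function name `classify_pair`, the three-standard-deviation cutoff, and the example library statistics are all assumptions made for the example.

```python
def classify_pair(observed_span, expected_mean, expected_sd, n_sd=3.0):
    """Compare a read pair's span on the reference with the expected insert size.

    observed_span: distance on the reference covered by the aligned pair.
    expected_mean, expected_sd: insert size statistics estimated for the library.
    n_sd: hypothetical cutoff, in standard deviations, for calling a pair anomalous.
    """
    if expected_sd <= 0:
        raise ValueError("expected_sd must be positive")
    upper = expected_mean + n_sd * expected_sd
    lower = expected_mean - n_sd * expected_sd
    if observed_span > upper:
        # Mates align farther apart than expected, as with a deletion in the sample.
        return "larger_than_expected"
    if observed_span < lower:
        return "smaller_than_expected"
    return "concordant"


# Hypothetical library with a mean insert size of 500 bp and SD of 15 bp.
for span in (505, 2500, 300):
    print(span, classify_pair(span, expected_mean=500, expected_sd=15))
```

A real caller would also check mate orientation and mapping quality before treating a pair as evidence, as the paragraph above notes.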
Split reads (SR) - When a read cannot be aligned to the reference genome as one contiguous piece, so that only part of it aligns at a given location, this is called a split read. These alignments provide evidence of a breakpoint and the type of structural variant present in the sample genome. To see how split reads can indicate certain structural variant types, take a look at the short read signature diagram above. When there is a deletion in a sample genome, reads that span the breakpoint will map to two locations in the reference genome separated by the deleted sequence. The distance separating the two aligned portions of a split read indicates the length of the deletion. Split reads with high mapping quality are very valuable to the algorithms detecting structural variants because they can provide precise information about the breakpoint coordinates of the structural variant.
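As a small worked example of the deletion case, the sketch below computes the deletion length from the two aligned portions of a split read. It assumes 1-based, inclusive reference coordinates on the same chromosome and strand; the function name and the example coordinates are hypothetical.

```python
def deletion_length_from_split(left_end, right_start):
    """Return the number of reference bases skipped between two parts of a split read.

    left_end: last reference position matched by the first part of the read.
    right_start: first reference position matched by the second part of the read.
    Both coordinates are assumed to be 1-based and inclusive.
    """
    if right_start <= left_end:
        raise ValueError("right_start must lie downstream of left_end")
    # Bases left_end + 1 through right_start - 1 are absent from the sample.
    return right_start - left_end - 1


# The first part of the read ends at position 1000 and the second part
# resumes at position 1501, so reference bases 1001..1500 are missing.
print(deletion_length_from_split(1000, 1501))
```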
Read depth (RD) - Read depth is a measure of how many reads cover each portion of the reference genome and can be used to find copy number variants (deletions and duplications). The amount of DNA from a region of the sample genome directly affects the number of reads representing that region in the WGS data. As shown in the previous image, a duplicated region has proportionally more sequenced reads than its neighbors, because the original sample preparation contained proportionally more DNA from that region; for example, doubling the number of copies doubles the expected depth. The depth-based tools cn.MOPS and GATK gCNV are part of the GATK-SV pipeline and look for structural variation in the pileup of reads throughout the genome.
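The proportionality between depth and copy number can be illustrated with a naive calculation. This sketch simply rescales a region's depth by a baseline depth and rounds; it does not reflect how cn.MOPS or GATK gCNV model depth, and the function name, the diploid baseline, and the example depths are assumptions.

```python
def estimate_copy_number(region_depth, baseline_depth, baseline_copies=2):
    """Naively estimate copy number from read depth.

    region_depth: mean depth observed over the region of interest.
    baseline_depth: typical depth over regions assumed to have baseline_copies copies.
    """
    if baseline_depth <= 0:
        raise ValueError("baseline_depth must be positive")
    return round(baseline_copies * region_depth / baseline_depth)


# Hypothetical sample sequenced to 30x over normal diploid regions.
for depth in (15, 30, 45, 60):
    print(depth, estimate_copy_number(depth, baseline_depth=30))
```

Real depth data is noisy and affected by GC content and mappability, which is why dedicated depth callers use statistical models rather than simple rounding.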
B-Allele Frequency (BAF) - The B allele frequency is the allele fraction of the alternate allele at a variant site and is used to determine whether there is a copy number change. In a normal diploid region of the human genome, a variant will typically have an allele fraction of 0.5 if it is heterozygous or 1.0 if it is homozygous. When there is a copy number change at a region in the genome, the allele fractions may change. For example, in a region spanned by a heterozygous deletion, variants on the non-deleted haplotype will have BAF of 0 or 1, but not 0.5. Alternatively, in a duplicated region with three copies, BAF will tend to be 0, 0.33, 0.67, or 1.
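The expected BAF values quoted above follow from simple counting: with n copies of a region, the alternate allele can sit on 0, 1, ..., n of them. The sketch below enumerates those fractions. It is an idealized illustration with a hypothetical function name; observed allele fractions scatter around these values because of sampling noise.

```python
from fractions import Fraction


def expected_bafs(copy_number):
    """List the idealized B allele frequencies possible at a given copy number."""
    if copy_number < 1:
        raise ValueError("BAF is undefined when no copies are present")
    # k copies carry the alternate allele out of copy_number copies in total.
    return [Fraction(k, copy_number) for k in range(copy_number + 1)]


for cn in (1, 2, 3):
    print(cn, [round(float(f), 2) for f in expected_bafs(cn)])
```

A value of 0 corresponds to a site where no copy carries the alternate allele, so the informative signal is the absence of 0.5 at copy number 1 and the shift toward thirds at copy number 3.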
Structural Variant Types
Here is a description of the eight structural variant types that you will find in the output of GATK-SV.
Copy Number Variant (CNV) SV Types
First, we’ll describe three types of structural variants that fall into the category of copy number variants (CNVs). CNVs are large genomic alterations that change the number of copies of a genomic sequence in the sample DNA as compared to the reference. DNA is lost or added in these cases, so they are referred to as unbalanced structural variants.
- Deletion (DEL) - A deletion is when a segment of DNA is lost compared to the reference sequence.
- Duplication (DUP) - A duplication is when a large genomic region is copied one or more times, so that a sample has more copies of the region than the reference genome. A tandem duplication is when copies of the duplicated sequence appear next to one another in the sample genome, whereas a dispersed duplication is when the duplicated sequence is inserted at another locus. If the insertion point of a dispersed duplication can be detected, GATK-SV will call the dispersed duplication a complex structural variant to distinguish it from a tandem duplication and to provide information about the insertion coordinates.
- Multiallelic CNV (MCNV) - A site where more than one alternate CNV allele is observed. Typically, if a site exhibits a loss of copy number in some samples and gain in others, then the site exhibits both deletion and duplication alleles and therefore would be called an MCNV. For example, if some samples have homozygous deletions, some have heterozygous deletions, some are homozygous reference, some have one copy of a tandem duplication allele, and some have two copies of a tandem duplication allele, then copy states 0, 1, 2, 3, and 4 exist in the population at the MCNV site. Multiple (theoretically infinite) duplication alleles can exist, as any integer number of copies of the site could be made. Also note that genotypes are not reported for MCNVs because they cannot always be unambiguously inferred from the copy number. GATK-SV imposes restrictions on multiallelic CNVs to ensure they are high-confidence multiallelic sites, rather than a biallelic site at which a few outlier samples are skewing the distribution.
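The typical loss-and-gain pattern described for multiallelic CNVs can be illustrated with a short check over per-sample copy numbers. This is a toy sketch only: it does not implement the restrictions GATK-SV applies before reporting an MCNV, and the function name and the diploid reference copy number are assumptions.

```python
def shows_loss_and_gain(copy_numbers, reference_copies=2):
    """Return True if a site has samples below and samples above the reference copy number."""
    has_loss = any(cn < reference_copies for cn in copy_numbers)
    has_gain = any(cn > reference_copies for cn in copy_numbers)
    return has_loss and has_gain


# Copy states 0 through 4 observed across a hypothetical cohort.
print(shows_loss_and_gain([0, 1, 2, 2, 3, 4]))
# Only deletions observed, consistent with a biallelic deletion site.
print(shows_loss_and_gain([1, 2, 2, 2]))
```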
Balanced Structural Variant Types
Next, there are structural variant types that fall into the category of balanced structural variants. These variants involve rearrangement of large swathes of DNA or insertion of sequences not present in the reference, but they do not change the copy number of genomic sequences present in the reference sequence.
- Insertion (INS) - An insertion is when a novel DNA sequence or mobile element is inserted into the genome. There are three types of mobile elements that we annotate with GATK-SV: LINE1, SVA, and ALU.
- Inversion (INV) - An inversion is when the orientation of a segment of DNA is flipped at the same position as the original sequence in the genome.
- Translocation (CTX) - A translocation is a chromosomal rearrangement in which a chromosomal segment breaks off and reattaches to a nonhomologous chromosome or a new site on the same chromosome. There are multiple types of translocations, including reciprocal and nonreciprocal translocations. Reciprocal translocations are when parts of two chromosomes exchange places. Nonreciprocal translocations are when a large part of one chromosome is transferred to another chromosome.
Complex Structural Variants
- Complex Structural Variant (CPX) - A complex structural variant involves multiple structural variant types. As you can imagine, with some of these major genomic structural shifts, when one event occurs, other events can also occur. Complex structural variants are kept together as single variant records because they occur together in the population.
Unresolved Structural Variants
At some sites, methods can determine that there is evidence of a breakpoint but are unable to resolve the structural variant type.
- Breakends (BND) - Breakends represent discontinuity in the sample genome alignment with respect to the reference and can indicate the presence of an unresolved structural variant. This can result from ambiguity when looking at the different categories of evidence at a certain site or from assembly issues. Note that each BND record in GATK-SV VCFs represents a pair of breakends.
Mahmoud, M., Gobet, N., Cruz-Dávalos, D.I. et al. Structural variant calling: the long and the short of it. Genome Biol 20, 246 (2019). https://doi.org/10.1186/s13059-019-1828-7
Collins, R.L., Brand, H., Karczewski, K.J. et al. A structural variation reference for medical and population genetics. Nature 581, 444–451 (2020). https://doi.org/10.1038/s41586-020-2287-8
While I very much appreciate your definitions and illustrations of SV-related concepts, the definition of "insert size" seems to differ from other sources.
Here, 'insert size' is defined as:
But it is mostly defined as the length of the DNA fragment between the adapters by many other online resources, such as Illumina and a plausible post.
While it is fine to adopt either definition (as long as the definition is stated clearly), I wonder which one applies to the Picard tool "CollectInsertSizeMetrics", as it will affect the way I interpret its output. I ask this question here because this Picard tool refers to the GATK dictionary as a reference (although not to this specific post).