GATK-SV is a structural variation discovery pipeline for Illumina short-read whole-genome sequencing (WGS) data.
Before you begin processing, please read the full pipeline documentation available within the README file located in the GATK-SV GitHub repository. This article will focus on additional information specific to Terra, and cannot substitute for the full documentation. Additional resources are linked below:
- GATK-SV repository on GitHub, hosting the full and updated information on the pipeline and all associated WDLs.
- Structural variant (SV) discovery algorithm article, explaining how to identify structural variants in one or more individuals to produce a callset in VCF format.
- Structural variants glossary article, explaining useful concepts you should know.
- How to interpret SV VCFs article, explaining how to understand the VCFs produced by the GATK-SV pipeline.
- Troubleshooting GATK-SV article, to help you understand error messages or other issues you may (but hopefully will not) encounter.
- Troubleshooting GATK-SV Error Messages on Terra YouTube video, with a visual breakdown of GATK-SV error messages.
The GATK-SV workspace contains a fully reproducible workflow for discovering and resolving structural variation in single samples from Illumina short-read whole-genome sequencing (WGS) data. It can identify, genotype, and annotate the following types of structural variation:
- Copy number variants (CNVs), including deletions and duplications
- Insertions
- Inversions
- Reciprocal chromosomal translocations
- Additional forms of complex structural variation
NOTE: This pipeline is designed to call many forms of structural variation in whole genome sequencing data obtained from a single sample. For SV detection and joint genotyping on at least 100 samples, we recommend running GATK-SV in cohort mode. More information is available on the GATK-SV webpage.
Pipeline Background
The single-sample pipeline is based upon the GATK-SV cohort pipeline, which jointly analyzes WGS data from large research cohorts. The cohort pipeline has been used to create SV call sets for gnomAD-SV and the SFARI SSC autism research study.
Extending SV detection to small datasets
The single-sample pipeline in this workspace is designed to facilitate running the methods developed for the cohort-mode GATK-SV pipeline on small data sets or in clinical contexts where batching large numbers of samples is not an option.
To do this, it uses precomputed data: SV calls and model parameters computed by the cohort pipeline on a reference panel of similar samples. The pipeline integrates this precomputed information with signals extracted from the input CRAM file to produce a call set in a computationally manageable and reproducible manner.
GATK-SV uses Manta, Wham, GATK gCNV, and cn.MOPS as raw-calling algorithms, then integrates, filters, refines, and annotates the calls from these tools to produce a final output.
RESTRICTION NOTICE: Please note that most of the large published joint call sets produced by GATK-SV (including gnomAD-SV) include the MELT tool as part of the pipeline, which is a state-of-the-art mobile element insertion (MEI) detector. Due to licensing restrictions, we cannot provide a public docker image or reference panel VCFs for this algorithm. The version of the pipeline configured in this workspace does not run MELT, or include MELT calls for the reference panel. Therefore, the output will be less sensitive to MEI calls that might appear in gnomAD or other joint call sets.
Data
Case Sample
This workspace includes an NA12878 input WGS CRAM file that has been configured in the workspace samples data table. This file is part of the high coverage (30X) WGS data for the 1000 Genomes Project samples generated by the New York Genome Center and hosted in AnVIL.
Reference Panel
The reference panel configured in this workspace consists of data and calls computed from 156 publicly available samples chosen from the NYGC/AnVIL 1000 Genomes high coverage data linked above.
Inputs to the pipeline for the reference panel include:
- A precomputed SV callset VCF, and joint-called depth-based CNV call files
- Raw calls for the reference panel samples from Manta and Wham
- Trained models for calling copy number variation in GATK gCNV case mode
- Parameters learned by the cohort mode pipeline during the training of machine learning models for filtering sites and genotyping samples when run on the reference panel samples.
These resources are primarily configured in the "Workspace Data" for this workspace. However, several of the resources need to be passed to the workflow as large lists of files or strings. Due to Terra limitations on uploading data containing lists to the workspace data table, these resources are specified directly in the workflow configuration.
Reference Resources
The pipeline uses a number of resource and data files computed for the hg38 reference:
- Reference sequences and indices
- Genome annotation tracks such as segmental duplication and RepeatMasker tracks
- Data used for annotation of called variants, including GENCODE gene annotations and gnomAD site allele frequencies
GATKSVSingleSample Workflow
What does it do?
The workflow calls structural variants on a single input CRAM by running the GATK-SV Single Sample Pipeline end-to-end.
What does it require as input?
The workflow accepts a single CRAM or BAM file as input, configured in the following parameters:
Input Type | Input Name | Description |
---|---|---|
String | `sample_id` | Case sample identifier. |
File | `bam_or_cram_file` | Path to the GCS location of the input CRAM or BAM file. |
String | `batch` | Arbitrary name to be assigned to the run. |
Boolean | `requester_pays_cram` | Set to true if the case data is stored in a requester-pays GCS bucket. |
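For reference, the four case-sample parameters above can be expressed as a workflow inputs JSON. The sketch below is a minimal, hypothetical example: the `GATKSVSingleSample.` prefix and the GCS path are placeholders, so confirm the exact workflow name and paths in your workspace's method configuration.

```python
import json

# Hypothetical values for the four case-sample inputs listed in the table above.
# The "GATKSVSingleSample." prefix and the gs:// path are placeholders; verify
# the actual workflow name in your method configuration before use.
inputs = {
    "GATKSVSingleSample.sample_id": "NA12878",
    "GATKSVSingleSample.bam_or_cram_file": "gs://my-bucket/NA12878.final.cram",
    "GATKSVSingleSample.batch": "single_sample_run_1",
    "GATKSVSingleSample.requester_pays_cram": False,
}

# Serialize to the JSON shape a Cromwell-style inputs file uses.
print(json.dumps(inputs, indent=2))
```

In Terra these values are normally filled in through the workflow configuration UI (e.g. `this.sample_id` from the data table) rather than a hand-written JSON, but the parameter names and types are the same.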
Raw call VCFs and evidence files can optionally be provided along with or in lieu of a CRAM/BAM. This can be useful if these files were previously generated for the case sample.
Input Type | Input Name | Description |
---|---|---|
File | `case_manta_vcf` | Manta VCF from GatherSampleEvidence. |
File | `case_melt_vcf` | MELT VCF from GatherSampleEvidence. |
File | `case_wham_vcf` | Wham VCF from GatherSampleEvidence. |
File | `case_counts_file` | Counts file from GatherSampleEvidence. |
File | `case_pe_file` | Discordant pairs file from GatherSampleEvidence. |
File | `case_sr_file` | Split reads file from GatherSampleEvidence. |
Additional workspace-level inputs
- Reference resources for hg38
- Input data for the reference panel
- The set of docker images used in the pipeline.
Please contact GATK-SV developers if you are interested in customizing these inputs beyond their defaults.
What does it return as output?
Output Type | Output Name | Description |
---|---|---|
File | `final_vcf` | SV VCF output for the pipeline. Includes all sites genotyped as non-reference in the case sample and genotypes for all samples in the reference panel. Sites are annotated with information about their overlap with functional genomic elements including genes and exons, and with the allele frequencies of matching variants in gnomAD. |
File | `final_vcf_idx` | Index file for `final_vcf`. |
File | `final_bed` | Final output in BED format. Filter status, list of variant samples, and all VCF INFO fields are reported as additional columns. |
File | `metrics_file` | Metrics computed from the input data and intermediate and final VCFs. Includes metrics on the SV evidence, and on the number of variants called, broken down by type and size range. |
File | `qc_file` | Quality-control check file. This extracts several key metrics from the `metrics_file` and compares them to pre-specified threshold values. If any QC checks evaluate to FAIL, further diagnostics may be required. |
File | `ploidy_matrix` | Matrix of contig ploidy estimates computed by GATK gCNV. |
File | `ploidy_plots` | Plots of contig ploidy generated from `ploidy_matrix`. |
File | `non_genotyped_unique_depth_calls` | This VCF file contains any depth-based calls made in the case sample that did not pass genotyping checks and do not match a depth-based call from the reference panel. If very high sensitivity is desired, examine this file for additional large CNV calls. Calls from this file should be scrutinized carefully to ensure that they are not false positives. |
File | `non_genotyped_unique_depth_calls_idx` | Index file for `non_genotyped_unique_depth_calls`. |
File | `pre_cleanup_vcf` | VCF output in a representation used internally in the pipeline. This file is less compliant with the VCF spec and is intended for debugging purposes. |
File | `pre_cleanup_vcf_idx` | Index file for `pre_cleanup_vcf`. |
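Because the `qc_file` is the recommended first stop after a run, it can be useful to screen it programmatically for FAIL entries. The sketch below assumes a simple tab-delimited layout with metric, value, and status columns; the real file's format may differ, so check it against your pipeline version before relying on this.

```python
import csv
import io

# Hypothetical qc_file contents used for illustration; the actual columns and
# metric names produced by the pipeline may differ.
qc_text = """\
metric\tvalue\tstatus
sv_count_DEL\t3210\tPASS
ploidy_chrX\t1.02\tFAIL
"""

def failed_checks(text):
    """Return the names of metrics whose status column reads FAIL."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [row["metric"] for row in reader if row["status"] == "FAIL"]

# Any FAIL entries warrant further diagnostics, per the qc_file description.
print(failed_checks(qc_text))
```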
Example time and cost run on sample data
Sample Name | Sample Size | Time | Cost (USD) |
---|---|---|---|
NA12878 | 18.17 GiB | 23 hrs | ~$7.34 |
To use this workflow on your own data
If you would like to run this workflow on your own samples (which must be medium-to-high coverage WGS data):
- Clone this workspace into a Terra project you have access to.
- In the cloned workspace, upload rows to the Sample and (optionally) the Participant data table that describe your samples. Ensure that the rows you add to the Sample table contain the columns `sample_id`, `bam_or_cram_file`, and `requester_pays_cram`, populated appropriately.
- There is no need to modify values in the workspace data or method configuration. If you are interested in modifying the reference genome resources or reference panel, please contact the GATK team for support as listed below.
- Launch the workflow from the "Workflows" tab, selecting your samples as the inputs. Please check the `qc_file` output for each sample at the end of the run to screen for data quality issues.
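Terra data tables are populated by uploading a tab-separated load file whose first column is named `entity:<table>_id`. The sketch below generates a minimal Sample-table TSV with the columns named above; the sample IDs and `gs://` paths are placeholders you would replace with your own data.

```python
import csv

# Sketch of a Terra "Sample" data-table load file. Terra load TSVs start with
# an "entity:sample_id" column; the IDs and file paths here are placeholders.
rows = [
    {"entity:sample_id": "sample_001",
     "bam_or_cram_file": "gs://my-bucket/sample_001.cram",
     "requester_pays_cram": "false"},
    {"entity:sample_id": "sample_002",
     "bam_or_cram_file": "gs://my-bucket/sample_002.cram",
     "requester_pays_cram": "false"},
]

with open("samples.tsv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=list(rows[0]), delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)
```

The resulting `samples.tsv` can be uploaded through the workspace "Data" tab; each row then appears as a selectable sample when launching the workflow.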
Contact information
For questions about the GATK-SV workspace, please visit the Featured Workspaces topic on the Terra forum. Use the search box to see if other users have asked the same question previously. If not, post and tag @Chris Whelan or @Mark Walker so that they are notified and can respond to your inquiry more quickly.
This material is provided by the GATK Team. Please post any questions or concerns regarding the GATK tool to the GATK forum.
Workspace Citation
The source code for all GATK-SV pipelines is stored and documented in the GATK-SV GitHub repository.
When referring to these methods, please cite the Collins et al. gnomAD-SV publication in Nature (2020).
Details on citing Terra workspaces can be found here: How to cite Terra