On March 2-5, 2020, instructors from the Broad Institute taught a 4-day GATK course in Gujarat, India. The goal of this workshop was to teach the GATK tools used for both germline and somatic analysis. It also empowered attendees, including those who may be entirely new to cloud computing, to use Terra to access data, run analysis tools, and collaborate -- all in a secure and scalable environment.
Materials
You can find all the materials at broad.io/GATK2003, though links below may point to a specific file or subfolder within this top-level directory. If you would like to follow along with this workshop, please use the GATK bundle in that top-level directory, along with these installation instructions. This workshop was run on GATK version 4.1.4.1 and IGV version 2.8.0.
Day 1: Introductions and Germline Variant Discovery
Slides: The slides for Day 1 can be found here. All presentations are accompanied by slide decks located in this folder unless otherwise noted.
1. Opening Remarks
What is this workshop going to teach you over four days about GATK? We outline the schedule and order of topics before beginning detailed instruction.
Instructor: Eric Banks, Senior Director, Data Sciences Platform, Broad Institute
2. Introduction to Sequencing Data
Both germline and somatic studies start with sequenced data. We go over the differences between whole genome and whole exome data and why you might want to use one over the other.
Instructor: Mark Fleharty, Computational Scientist II, Data Sciences Platform, Broad Institute
3. Introduction to Data Preprocessing
Data fresh off the sequencer needs to be processed before further variant calling can be performed. In this talk, we go over mapping, marking duplicates, and recalibrating base quality scores.
Instructor: Joel Thibault, Senior Software Engineer, Data Sciences Platform, Broad Institute
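As background for the base quality scores mentioned in this talk: a Phred-scaled quality Q encodes an estimated error probability p as Q = -10 * log10(p). The following is a minimal Python sketch of that conversion only. The function names are illustrative and are not part of GATK, and the sketch does not attempt to reproduce the recalibration itself.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred-scaled score."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)
```

For example, a base with Q30 corresponds to an estimated 1-in-1000 chance that the base call is wrong.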
4. Introduction to Variant Discovery
Variant discovery looks a bit different between somatic and germline studies. Here we cover both germline and somatic workflows for SNP & indel variant discovery at a high level, followed by a foray into some of our newest methods, germline CNV & SV. Later talks will go into further detail on the tools involved in SNP & indel variant discovery.
Instructor: Eric Banks, Senior Director, Data Sciences Platform, Broad Institute
5. HaplotypeCaller
How does HaplotypeCaller make its variant calls? This talk walks you through HaplotypeCaller's function without requiring an advanced degree in programming and statistics.
Instructor: Eric Banks, Senior Director, Data Sciences Platform, Broad Institute
6. Joint Calling
Both power and speed improve when variants are joint-called across samples. We walk you through the changes to HaplotypeCaller's default single-sample mode and explain the benefits of joint calling.
Instructor: Mark Fleharty, Computational Scientist II, Data Sciences Platform, Broad Institute
7. Introduction to Pipelining Platforms
With pipelines like the germline one we've been discussing, running workflows is an important step. In this talk, we will outline our solution for running workflows, and introduce you to Terra, the platform on which the workshop's tutorials will be run.
Instructor: Joel Thibault, Senior Software Engineer, Data Sciences Platform, Broad Institute
8. Terra Orientation
Here you will learn further details about the structure of Terra and discover how Terra can work with the structure of your analysis.
Instructor: Joel Thibault, Senior Software Engineer, Data Sciences Platform, Broad Institute
9. Germline Variant Discovery Tutorial
For our last item of the day, we will get hands-on with the tools! In this tutorial you will go onto Terra to run a notebook, which will walk you through germline variant discovery in some toy samples.
Materials: GATKTutorials-Germline notebook: 1-gatk-germline-variant-discovery-tutorial
Instructor: Anton Kovalsky, Science Writer, Data Sciences Platform, Broad Institute
Day 2: Germline Variant Filtering & Case Study
Slides: The slides for Day 2 can be found here.
1. Variant Filtering
Following variant discovery, we have a raw callset that requires filtering. Learn the three options for variant filtering: hard filtering, VQSR, and CNN.
Instructor: Mark Fleharty, Computational Scientist II, Data Sciences Platform, Broad Institute
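To make the hard-filtering option concrete: a hard filter flags a variant whenever one of its annotations crosses a fixed cutoff. The sketch below illustrates that idea in Python. The SNP thresholds shown are the ones commonly cited in GATK's hard-filtering guidance, but treat them, the dictionary layout, and the function name as assumptions for illustration. This is not the GATK implementation, and the workshop notebooks may use different values.

```python
import operator

# Assumed SNP cutoffs: annotation -> (comparison that marks a failure, cutoff).
SNP_HARD_FILTERS = {
    "QD": (operator.lt, 2.0),
    "FS": (operator.gt, 60.0),
    "MQ": (operator.lt, 40.0),
    "SOR": (operator.gt, 3.0),
    "MQRankSum": (operator.lt, -12.5),
    "ReadPosRankSum": (operator.lt, -8.0),
}


def failed_filters(annotations, filters=SNP_HARD_FILTERS):
    """Return the sorted names of annotations that cross their cutoff.

    Annotations missing from the record are skipped rather than
    treated as failures.
    """
    failed = []
    for name, (fails, cutoff) in filters.items():
        value = annotations.get(name)
        if value is not None and fails(value, cutoff):
            failed.append(name)
    return sorted(failed)
```

A variant whose returned list is empty passes every filter in the table. Tightening or loosening a cutoff trades sensitivity against specificity.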
2. Genotype Refinement
Depending on your study, genotype refinement can be an essential step to narrow down the list of putative causal variants. We will show you one such example using pedigree and population data to refine genotype calls in a sample study searching for de novo variants.
Instructor: Eric Banks, Senior Director, Data Sciences Platform, Broad Institute
3. Callset Evaluation
Following filtering and genotype refinement, we've made a number of adjustments to our callset. In this talk, we will cover evaluating that callset to assess its overall validity using metrics such as SNP and indel counts, indel and Ti/Tv ratios, and genotype concordance.
Instructor: Joel Thibault, Senior Software Engineer, Data Sciences Platform, Broad Institute
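One of the metrics named in this talk, the transition/transversion (Ti/Tv) ratio, is simple to compute by hand. Transitions are A<->G and C<->T substitutions, and every other single-base substitution is a transversion. The following Python sketch assumes the SNPs have already been extracted as (ref, alt) pairs. The function name and input format are illustrative and do not correspond to a GATK tool.

```python
# The four transition substitutions; all other SNP changes are transversions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def titv_ratio(snps):
    """Compute the Ti/Tv ratio for an iterable of (ref, alt) base pairs."""
    ti = 0
    tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref == alt:
            continue  # not a substitution, so it counts as neither
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions found; ratio is undefined")
    return ti / tv
```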
4. Germline Hard Filtering Tutorial
Let's get hands-on again with another notebook in Terra. In this tutorial, we will explore annotations in the callset and compare the sensitivity & specificity before and after applying our best-practices hard filters. Try further tuning the filters on your own to improve the sensitivity and specificity!
Materials: GATKTutorials-Germline notebook: 2-gatk-hard-filtering-tutorial
Instructor: Anton Kovalsky, Science Writer, Data Sciences Platform, Broad Institute
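As a rough illustration of the before-and-after comparison in this tutorial, the sketch below compares a callset against a truth set in Python. It reports sensitivity together with precision, which is an assumption on our part: precision is often used as a stand-in for specificity in callset evaluation because true negatives are hard to count. The tutorial notebook may compute its metrics differently. The variant keys, here (chrom, pos, ref, alt) tuples, and the function name are hypothetical.

```python
def sensitivity_and_precision(truth, calls):
    """Compare called variants against a truth set.

    Both arguments are iterables of hashable variant keys, for example
    (chrom, pos, ref, alt) tuples. Returns (sensitivity, precision).
    """
    truth, calls = set(truth), set(calls)
    tp = len(truth & calls)   # called and present in the truth set
    fn = len(truth - calls)   # in the truth set but not called
    fp = len(calls - truth)   # called but absent from the truth set
    sensitivity = tp / (tp + fn) if truth else float("nan")
    precision = tp / (tp + fp) if calls else float("nan")
    return sensitivity, precision
```

Running this once on the raw callset and once on the filtered callset shows how a given set of filters shifts the balance between the two metrics.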
5. Germline CNN Tutorial
When you have a single sample to work with, CNN is your best filtering option. We get hands-on again in a new notebook in Terra to run CNN and examine the resulting sensitivity and specificity of the callset.
Materials: GATKTutorials-Germline notebook: 3-gatk-cnn-tutorial
Instructor: Anton Kovalsky, Science Writer, Data Sciences Platform, Broad Institute
6. GWAS Reproducible Case Study
After running your samples through the GATK Best Practices for Germline Variant Discovery, a reasonable next step for your VCFs may be to run a GWAS. In this workspace, we will first explore some genotype and phenotype data and refine it using LD-pruning and PCA. The data set will then be used in a workflow to generate a Manhattan plot pointing to causal variants for the phenotype of your choice.
Materials: 2019_ASHG_Reproducible_GWAS notebook: GWAS_initial_analysis_cleared AND workflow: genesis_GWAS
Instructor: Anton Kovalsky, Science Writer, Data Sciences Platform, Broad Institute
Day 3: Somatic Variant Discovery
Slides: The slides for Day 3 can be found here.
1. Intro to Somatic Variant Discovery (recap)
As we haven't discussed the overall variant discovery pipeline since the first day, let's recap what somatic variant discovery is all about.
Instructor: Eric Banks, Senior Director, Data Sciences Platform, Broad Institute
2. Somatic SNVs and Indels
We use Mutect2 to call variants in somatic pipelines, and it functions quite similarly to HaplotypeCaller in germline cases. In this talk, you will learn the basic operation of Mutect2 and its related tools.
Instructor: Mark Fleharty, Computational Scientist II, Data Sciences Platform, Broad Institute
3. GATK4 Mutect2 Tutorial
It's time for some more hands-on exercises with the tools. This notebook will walk you through using Mutect2 on a case sample we've prepared for you--everything from creating a panel of normals all the way up through functionally annotating your resulting variants with Funcotator.
Materials: GATKTutorials-Somatic notebook: 1-somatic-mutect2-tutorial
Instructor: Anton Kovalsky, Science Writer, Data Sciences Platform, Broad Institute
4. Somatic CNAs
A variety of events in the genome can lead to copy number alterations. Here we go over how copy numbers can be detected using the ModelSegments CNA workflow.
Instructor: Eric Banks, Senior Director, Data Sciences Platform, Broad Institute
5. GATK4 Somatic CNA Tutorial
Now that we've learned about it, let's get hands-on with the tools. In this notebook we will run through the ModelSegments CNA workflow and show you the transformations to data at each step.
Materials: GATKTutorials-Somatic notebook: 2-somatic-cna-tutorial
Instructor: Anton Kovalsky, Science Writer, Data Sciences Platform, Broad Institute
6. Mitochondria Pipeline Demo
Though we don't have enough time today to run this as a hands-on exercise, follow along with your instructor as they demonstrate our best practices pipeline for calling variants in mitochondria.
Materials: Mitochondria-SNPs-Indels-hg38 workflow: 1-Mitochondria_Pipeline
Instructor: Mark Fleharty, Computational Scientist II, Data Sciences Platform, Broad Institute
Day 4: Pipelining locally & in the cloud
Slides: The slides for Day 4 can be found here.
Exercises: The exercises for Day 4 can be found here.
1. WDL and Cromwell Basics
WDL is a language designed to be human readable and writable. In this presentation you will see just how true that is while you learn the basic format for this pipelining language.
Instructor: Joel Thibault, Senior Software Engineer, Data Sciences Platform, Broad Institute
2. Hello World WDL Tutorial
To further your understanding of WDL, we now move on to running some WDL scripts on your laptops, eventually graduating to uploading a script and running it in the cloud!
Materials: HelloWDL Tutorial worksheet followed by the GATKTutorials-Pipelining dashboard
Instructor: Anton Kovalsky, Science Writer, Data Sciences Platform, Broad Institute
3. WDL Puzzles
Now we will step back and allow for time to run through some exercises on your own. These puzzles give you partially filled-in WDL scripts and ask you to accomplish a series of tasks to complete them. This exercise should take several hours, and we expect it will require further time at home to finish.
Materials: WDL Puzzles worksheet alongside the puzzles folder in your bundle
Instructor: Anton Kovalsky, Science Writer, Data Sciences Platform, Broad Institute
4. Docker
Running workflows on the cloud requires a container to run in. In this presentation, we show you how to work with containers and create your own docker image.
Instructor: Eric Banks, Senior Director, Data Sciences Platform, Broad Institute
5. BigQuery Tutorial
Accessing and working with publicly-available datasets can be tricky. In this tutorial, you will use Terra's built-in Data Explorer to build queries and access data from the 1000 Genomes project.
Instructor: Anton Kovalsky, Science Writer, Data Sciences Platform, Broad Institute