We've talked about Terra here before, but we thought it might be useful to reintroduce the platform for those of you who may have joined us since then, or who missed the announcements the first time around. This matters because Terra is our preferred platform for providing hands-on educational resources, including test data, GATK pipelines and workshop tutorials; ergo, if you don't know about it, you might be missing out.
In a nutshell, Terra (formerly called FireCloud) is a cloud-based bioinformatics platform that the Broad Institute makes available to the research community. The main goal of Terra is to make it possible for researchers (like you!) to access large datasets efficiently, and run analyses on them without having to worry about managing computational resources.
We use it to provide access to fully worked-out examples of GATK analyses, so even if you don't want to use Terra for your full-time work, you might find it useful for learning, testing new tools and evaluating version updates. Specifically, let's talk about the Best Practices workspaces.
GATK Best Practices workspaces: preloaded with fully functional pipelines
The GATK Best Practices workspaces are probably the most useful set of GATK resources available out of the box in Terra.
For context, Terra workspaces are a kind of project "sandbox" in which you can set up analyses to run on cloud-hosted data, either in the form of workflows (aka pipelines) or in interactive applications like Jupyter Notebooks. We maintain public workspaces for each of the GATK Best Practices use cases, as well as a few other "side project" use cases (see current list at the bottom of this post).
In each workspace, we set up all relevant workflows to run on some example data that we also provide (typically there's both a small scale test dataset, and a full-scale test dataset). You can clone the workspace and run the preconfigured pipelines on the example data, or tweak the configurations to run them on your own data. When you run one of the pipelines, Terra handles all the work of dispatching jobs to Google Cloud, managing resources and collecting outputs. It's not quite a single-click process... but it's close.
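Under the hood, a workflow configuration boils down to a set of input bindings, much like a Cromwell-style inputs JSON, where keys follow the WDL convention "WorkflowName.input_name". As a rough sketch of what "tweak the configurations to run on your own data" means (the workflow name and input names below are hypothetical, not taken from any actual Best Practices workspace):

```python
import json

# A preconfigured inputs file, as you might export it from a workspace.
# The workflow and input names here are made up for illustration.
preconfigured = {
    "MyGermlinePipeline.ref_fasta": "gs://example-bucket/ref/hg38.fasta",
    "MyGermlinePipeline.input_bam": "gs://example-bucket/test-data/small.bam",
    "MyGermlinePipeline.sample_name": "NA12878_small",
}

# To run on your own data, override only the sample-specific inputs,
# leaving shared resources like the reference genome untouched.
overrides = {
    "MyGermlinePipeline.input_bam": "gs://my-bucket/my-sample.bam",
    "MyGermlinePipeline.sample_name": "my_sample",
}
inputs = {**preconfigured, **overrides}

print(json.dumps(inputs, indent=2))
```

In Terra itself you would typically make these edits through the workflow's inputs form in the browser rather than by editing JSON, but the underlying idea is the same.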
All that, and you still have full access to the workflow code, input configuration etc. if you want to customize the analysis, or run it on a different system — whether it's on premises, or on a different cloud platform. The whole system is designed to increase the portability and reproducibility of genomic analysis.
You can read more about some of the reasons why you might want to use Terra for GATK-related work in these older blog posts:
- Getting started with GATK? Terra can make it easier
- Test drive GATK Best Practices workflows on Terra
- The future of GATK tutorials is written in Jupyter Notebooks
- From Python Magic to embedded IGV: A closer look at GATK tutorial notebooks
- Learn GATK through workshop tutorials
Why run on the cloud, though?
This is a fair question — the majority of research work today is still done on local institutional infrastructure, and moving to the cloud can be a non-trivial endeavor. So, why bother?
For many research groups we work with (and for the federal agencies who fund a lot of the big data generation projects in the USA) moving to the cloud has become inevitable. This is because the amounts of data involved in large human genomics studies have become so huge (measured in terabytes!) that it's simply not sustainable to expect everyone to download a copy to work on locally. It makes a lot more sense (logistically, at least) to make the data available on the cloud, where anyone can go in and run analyses on it without having to make any copies.
Using the cloud also makes it easier to share analysis tools in fully portable and reproducible forms, and to give everyone access to the same kind of hardware. However, all of this introduces new challenges of its own.
How can we make sure researchers can run analyses without special training?
This is where Terra comes in.
The idea is to provide built-in data access and analysis capabilities through a user-friendly web interface, so you don't have to learn how cloud infrastructure works under the hood in order to get your work done, while still being able to bring in your own data and tools. As a bonus, you can access far more computational resources than most home institutions can offer, and your analyses typically don't have to wait in a queue.
Of course, there are other facets to this question, like data security (yes, it is safe enough for human clinical data) and cost (Terra is free, but Google Cloud charges for compute and storage; Google may offer free credits for newcomers to evaluate the platform).
None of this is 100% easy, and in our experience, the question of whether moving to the cloud is "worth it" depends a lot on the priorities and constraints of each research group.
Nevertheless, we do encourage you to check out the GATK resources we provide in Terra. If nothing else, we hope this will help many of you spend less time figuring out how to run GATK, and more time doing interesting science with your results.
Getting started with Terra
If you're interested in trying out Terra on your own, we recommend starting with the videos in the Getting Started With Terra video playlist, which walk you through the platform’s main features and demonstrate how to use them. The workspaces shown in the videos are public in Terra, so you can work through the examples yourself. The Workflows Quickstart and Notebooks Quickstart tutorial workspaces should also be very helpful for learning how to use the GATK workspaces.
List of GATK Best Practice Workspaces currently available in Terra
For your convenience, we've compiled a list of the GATK Best Practices workspaces that are currently available in the platform, categorized by use case.
Note that we have several different use cases listed for germline short variant discovery, with a few distinct implementations, so make sure to read the full descriptions when selecting a workspace (or ask for clarifications in the forum if you're confused).
Keep in mind also that our team makes regular updates to these workspaces and occasionally adds new ones, so have a look at the Terra Showcase & Tutorials page in Terra (requires a free Terra login) if you don't find what you're looking for here.
Germline variant discovery
Germline SNPs and indels
- Whole-Genome-Analysis-Pipeline (Broad Institute's production implementation) - This workflow takes unmapped paired-end sequencing BAMs, pre-processes the data for germline short variant discovery, and returns a GVCF ready for joint genotyping along with accompanying metrics. This workspace holds the Broad's production sequence processing pipeline, which includes several quality control tasks within the workflow in addition to the regular data processing tasks.
- Exome-Analysis-Pipeline (Broad Institute's production implementation) - Pre-process exome sequence data and then perform germline short variant discovery. Takes unmapped human exome sequencing BAMs as input and produces CRAM files, indices, MD5 checksums, GVCFs, and metrics reports.
- GATK4-Germline-Preprocessing-VariantCalling-JointCalling (Tutorial implementation) - A series of workflows that cover pre-processing, SNP and indel variant calling, and joint calling. Outputs from each workflow are designed to become inputs to the next, so the entire pipeline can be run end-to-end or in parts.
- Variant_Calling_Spark_Multicore (Spark implementation) - Call variants from aligned input data on a single multicore machine using the ReadsPipelineSpark pipeline. (beta)
- cnn-variant-filter (Single-genome filtering) - Filter variants using the GATK CNN tool, a deep learning filter with additional options for advanced users to generate and evaluate their own trained models. Create variant evaluations and summary plots from input VCF and BED files, using our data or yours. This workflow is meant as a drop-in replacement for the VQSR filtering normally done in germline short variant discovery, for cases where the dataset is too small for VQSR.
- Mitochondria-SNPs-Indels-hg38 (Mitochondrial DNA) - Use whole genome sequencing data to call mitochondrial variants, even those at low allele frequencies in the 1-5% range. Mitochondrial DNA has several particular characteristics that can make this kind of analysis difficult, such as its circular structure and nuclear mitochondrial DNA segments (NuMTs), but this workflow will walk you through how to process your data and call variants reliably.
- GATK4-RNA-Germline-VariantCalling (Bulk RNAseq) - This workflow calls germline short variants from RNAseq data, taking an unmapped BAM and a corresponding GTF annotation file as input. The workflow produces a recalibrated BAM file, a VCF file, and a filtered VCF, along with index files for each output.
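To illustrate the handoff described for the tutorial workspace above (pre-processing, then per-sample variant calling, then joint calling), here is a minimal sketch of how each stage's outputs become the next stage's inputs. The stage functions and file names are invented for illustration; in Terra the real handoff happens through workspace data tables, and the real work is done by the workflows themselves.

```python
# Hypothetical stand-ins for the three tutorial workflow stages.
# Each returns an outputs dict whose values feed the next stage.

def preprocess(unmapped_bam):
    # Would map reads and recalibrate base qualities; here we just
    # derive the name of the analysis-ready BAM.
    return {"analysis_ready_bam": unmapped_bam.replace(".unmapped.bam", ".bam")}

def call_variants(analysis_ready_bam):
    # Would run per-sample short variant calling in GVCF mode.
    return {"gvcf": analysis_ready_bam.replace(".bam", ".g.vcf.gz")}

def joint_call(gvcfs):
    # Would jointly genotype a cohort of GVCFs into a single VCF.
    return {"cohort_vcf": "cohort.vcf.gz", "n_samples": len(gvcfs)}

# Run the pipeline end-to-end: per-sample stages first, then jointly.
samples = ["sampleA.unmapped.bam", "sampleB.unmapped.bam"]
gvcfs = []
for bam in samples:
    stage1 = preprocess(bam)
    stage2 = call_variants(stage1["analysis_ready_bam"])
    gvcfs.append(stage2["gvcf"])
result = joint_call(gvcfs)
print(result)
```

Because the stages are decoupled this way, you can also stop after any stage and resume later with the intermediate outputs, which is what "run in parts" means in practice.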
Germline copy number variants (CNVs)
- Germline-CNVs-GATK4 - Use GATK's GermlineCNVCaller to call CNVs on a cohort of samples, build a denoising model from that cohort, then call additional case samples using the previously built model. This analysis detects germline CNVs in exome sequence data.
Somatic variant discovery
Somatic SNVs and indels
- Somatic-SNVs-Indels-GATK4 - A Mutect2 workflow for discovering somatic SNVs and indels. A detailed notebook tutorial walks you through the steps needed to run the workflow on your own data.
Somatic copy number variants (CNVs)
- Somatic-CNVs-GATK4 - A step-by-step walkthrough to create a panel of normals (PON) and then call CNVs using the GATK CNV pipeline with Oncotator.
Functional annotation of variants
- Variant-Functional-Annotation-With-Funcotator - Use Funcotator to functionally annotate variants and write the annotations to a specified output file. This workspace uses the default set of data sources for the human somatic use case. You can also modify the configuration to use germline data sources, or your own custom data sources.