This guide introduces select elements of the broadinstitute/gatk GitHub repository to researchers on the GATK forum whom we have pointed to the repo for any of a variety of reasons and who are unfamiliar with GitHub.
The labels in the screenshot number the seven elements this article covers.
Understanding the first three elements (Sections 1–3) should enable researchers to (i) interpret, for example, the status of a feature request or bug fix for a particular GATK release version and (ii) take part in the discussion that drives GATK development forward.
The remaining four elements (Sections 4–7) are of interest to those who wish to read about the mathematics behind GATK algorithms, view versioned WDL-format pipelines for workflows under recent development, learn how to use engine features, e.g. streaming from Google Cloud Storage, and build GATK from the source code.
Jump to a section
- Issues: Submit new or discuss existing bugs and feature requests
- Pull requests: Make or track changes to the codebase
- # releases: Download releases and read release notes
- Branch: Control the version of the code in view
- docs: Mathematical whitepapers on select algorithms
- scripts: Tested versioned WDL pipeline scripts
- README.md: Instructions to build and run GATK in the required environment
1. Issues: Submit new or discuss existing bugs and feature requests
Issue tickets are where discussion happens and where plans are set to make changes to the codebase.
- When the issue ticket has an Open label, the discussion remains unresolved.
- When the issue ticket has a Closed label, consider the discussion closed. It is still fine to comment on a closed ticket, and closed issues can be reopened.
An issue ticket that discusses plans, or that has a Closed status, does not necessarily mean that GATK has implemented or will implement what is discussed within. Skim the discussion and look for associated pull requests, which are often referred to as PRs, and their status (screenshot below). If you are unclear on any point, ask for clarification by writing a comment in the issue ticket. You will need a GitHub account and must be signed in to do so.
Here's an example issue ticket where the community drove the implementation of a feature, specifically the --include-non-variant-sites option of GenotypeGVCFs: https://github.com/broadinstitute/gatk/issues/2865.
2. Pull requests: Make or track changes to the codebase
Read the discussion in the pull request and any associated issue ticket for specifics on the changes.
- An Open status indicates the changes are ongoing and being worked on away from the master codebase, which is the main code.
- A Merged status means the master code reflects the changes. The master branch reflects the changes immediately upon merging a PR, but this does not mean the latest GATK release does. To determine which releases contain the changes, note the date of the merge: a GATK release that comes after the merge date will have the changes, and a release that comes before it will not.
- An associated issue ticket appears like so and clicking on the link will open it.
Here's an example pull request that pairs with the previous example issue ticket: https://github.com/broadinstitute/gatk/pull/5219.
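The date comparison described in this section can be sketched in a few lines of Python. This is only an illustration: the helper name first_release_containing is hypothetical, and the release names and dates are made-up placeholders, not actual GATK release dates. Look up the real dates on the releases page and the merge date on the pull request.

```python
from datetime import date
from typing import Dict, Optional


def first_release_containing(merge_date: date, releases: Dict[str, date]) -> Optional[str]:
    """Return the earliest release dated after the PR merge date, or None if there is none yet."""
    # Keep only releases published strictly after the merge. A release cut on
    # the same day as the merge is ambiguous, so it is treated as not containing it.
    later = [(released, tag) for tag, released in releases.items() if released > merge_date]
    if not later:
        return None
    # Tuples sort by date first, so min() picks the earliest qualifying release.
    return min(later)[1]


# Placeholder release dates (assumptions for illustration only).
releases = {
    "release-A": date(2020, 1, 15),
    "release-B": date(2020, 3, 1),
    "release-C": date(2020, 5, 20),
}

print(first_release_containing(date(2020, 2, 10), releases))  # release-B
```

A result of None corresponds to a change that is merged into master but has not yet shipped in any release.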
3. # releases: Download releases and read release notes
In the overview screenshot we see 35 releases for GATK4. The releases page presents releases in reverse-chronological order, so the latest release is at the top.
- The release date is immediately underneath the release version tag.
- Click the gatk-4.x.x.x.zip link under Assets to download the pre-built release. When you expand the zip bundle, you will get a folder named gatk-4.x.x.x containing a working launch script that you use to invoke tools from the command line. Typing /path/to/gatk-4.x.x.x/gatk --list into a terminal prompt lists the available tools in the toolkit along with their production status: experimental (labeled EXPERIMENTAL Tool), in beta testing (labeled BETA Tool), or fit for production (no label).
- Each release comes with release notes. Release notes are the definitive place to learn about changes in GATK. Our engineers curate the notes to be meaningful and human-readable and derive them from git commit messages, a source of more technical detail that this article does not cover. Often, a bullet point in the release notes will have a link to the relevant pull request. If you need clarification on some point, please ask in the associated issue or on the GATK forum.
4. Branch: Control the version of the code in view
The branch is set to master by default, which reflects the latest development to the broadinstitute/gatk codebase. To view a snapshot of the code for a particular version of GATK, click the Branch button, then switch to the Tags tab. Selecting a tag version, e.g. 4.0.0.0, will allow you to travel back in time to the codebase as it looked for that particular release. This is useful, e.g. if you are looking for WDL pipeline scripts that work for past versions of GATK4 (see Section 6).
5. docs: Mathematical whitepapers on select algorithms
The PDFs within this folder and subfolders outline the mathematics behind select GATK algorithms. If the GATK forum seems sparse on mathematical details, that is because it is not set up to display complex LaTeX equations. The whitepapers are provided by the generosity of GATK methods developers. Be sure to take into consideration the datestamps associated with the articles, as development takes priority over documentation and the mathematical details can fall behind the latest algorithmic improvements.
6. scripts: Tested versioned WDL pipeline scripts
For certain GATK4 workflows, the developers maintain working WDL pipeline scripts for every release. See Section 4 for instructions on accessing tagged versioned scripts.
Take for example the mutect2_wdl directory. It contains pipeline scripts for creating a Mutect2 PoN, for running Mutect2 on a tumor-normal pair, etc. The view will show the development or master codebase by default. The following portions of the highlighted script illustrate a difference between the v4.0.0.0 and the v4.1.0.0 WDLs in how each workflow invokes its M2 task.
- https://github.com/broadinstitute/gatk/blob/4.0.0.0/scripts/mutect2_wdl/mutect2.wdl#L77-L99
- https://github.com/broadinstitute/gatk/blob/4.1.0.0/scripts/mutect2_wdl/mutect2.wdl#L198-L223
Notice the URL elements that differ: the tag version and the highlighted lines. We see the latter pipeline defines a number of additional parameters, e.g. artifact_prior_table, that are not present in the earlier pipeline. If we check the details of the respective M2 tasks, then we also see differences. For this reason, if you are testing out workflows using broadinstitute/gatk repository WDL scripts, be sure to match the script version to the version of the toolkit you are running.
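The URL pattern in the two links above can be reproduced programmatically. The Python sketch below assembles such a link from a tag, a file path, and a line range; the helper name tagged_blob_url is hypothetical, and the pattern is inferred only from the two example links in this section.

```python
def tagged_blob_url(tag: str, path: str, first_line: int, last_line: int) -> str:
    """Build a GitHub URL showing `path` as of `tag`, with a line range highlighted."""
    base = "https://github.com/broadinstitute/gatk/blob"
    # The tag takes the place of the branch name; the #L..-L.. fragment selects lines.
    return f"{base}/{tag}/{path}#L{first_line}-L{last_line}"


wdl = "scripts/mutect2_wdl/mutect2.wdl"
print(tagged_blob_url("4.0.0.0", wdl, 77, 99))
print(tagged_blob_url("4.1.0.0", wdl, 198, 223))
```

Swapping only the tag argument is the programmatic equivalent of choosing a different tag from the Branch button described in Section 4.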
7. README.md: Instructions to build and run GATK in the required environment
The README.md is a document that the repository landing page displays below the list of folders and files. For the broadinstitute/gatk repository, it presents a wealth of information, organized by a Table of Contents at the top.
Of interest to researchers are the following sections.
- Instructions to build GATK4, e.g. from a development state branch for testing.
- Instructions to install R dependencies and install Python dependencies that certain GATK tools require. For example, the gCNV workflow's GermlineCNVCaller requires an activated Python conda environment.
- The alternative to installing dependencies is to run GATK using a preconfigured GATK Docker. Container images are mirrored at DockerHub and the US Google Cloud Repository.
- Example commands illustrating useful engine-level arguments, such as setting a Java memory limit, streaming Google Cloud Storage data directly into a GATK analysis, writing GATK analysis results directly to Google Cloud Storage, configuring Spark settings for GATK Spark tools, and launching GATK analyses locally to then run on a Google Cloud Dataproc cluster.
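As a rough illustration of two of these engine-level arguments, the Python sketch below assembles a GATK command line that sets a Java memory limit and reads from and writes to Google Cloud Storage paths. The helper gatk_command, the bucket name, and the file names are hypothetical; the README remains the authoritative source for the exact arguments and for which tools accept gs:// paths.

```python
import shlex
from typing import List, Optional


def gatk_command(tool: str, input_path: str, output_path: str,
                 java_memory: Optional[str] = None,
                 gatk: str = "/path/to/gatk-4.x.x.x/gatk") -> List[str]:
    """Assemble an argument list for the gatk launch script (nothing is executed)."""
    cmd = [gatk]
    if java_memory is not None:
        # --java-options passes JVM settings, here a maximum heap size.
        cmd += ["--java-options", f"-Xmx{java_memory}"]
    cmd += [tool, "-I", input_path, "-O", output_path]
    return cmd


# Hypothetical bucket and file names, for illustration only.
cmd = gatk_command("PrintReads",
                   "gs://my-bucket/input.bam",
                   "gs://my-bucket/output.bam",
                   java_memory="4G")
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps each argument separate, so the list can be handed to subprocess.run without extra shell quoting.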