Interval lists define subsets of genomic regions, sometimes even just individual positions in the genome. You can provide GATK tools with intervals or lists of intervals when you want to restrict them to operating on a subset of genomic regions. There are four main types of reasons for doing so:
- You want to run a quick test on a subset of data (often used in troubleshooting)
- You want to parallelize execution of an analysis across genomic regions
- You need to exclude regions that have bad or uninformative data where a tool is getting stuck
- The analysis you're running should only take data from those subsets due to how the underlying algorithm works
Regarding the latter case, see the Best Practices workflow recommendations and tool example commands for guidance regarding when to restrict analysis to intervals.
Interval-related arguments and syntax
Arguments for specifying and modifying intervals are provided by the engine and can be applied to most if not all tools. The main arguments you need to know about are the following:
--intervalsallows you to specify an interval or list of intervals to include.
--exclude-intervalsallows you to specify an interval or list of intervals to exclude.
--interval-paddingallows you to add padding (in bp) to the intervals you include.
--interval-exclusion-paddingallows you to add padding (in bp) to the intervals you exclude.
By default the engine will merge any intervals that abut (i.e. they are contiguous, they touch without overlapping) or overlap into a single interval. This behavior can be modified by specifying an alternate interval merging rule (see
--interval-merging-rule in the Tool Docs).
The syntax for using
-L is as follows; it applies equally to
-L chr20for contig chr20.
-L chr20:1-100for contig chr20, positions 1-100.
intervals.bed) when specifying a text file containing intervals (see supported formats below).
-L variants.vcfwhen specifying a VCF file containing variant records; their genomic coordinates will be used as intervals.
If you want to provide several intervals or several interval lists, just pass them in using separate
-XL arguments (you can even use both of them in the same command). You can use all the different formats within the same command line. By default, the GATK engine will take the UNION of all the intervals in all the sets. This behavior can be modified by specifying an alternate interval set rule (see
--interval-set-rule in the Tool Docs).
In general, using intervals (
-L) can introduce artifacts if you choose the intervals unwisely. Reads will get discarded that are outside the interval when the reads may have been used in making variant calls (for example, lost mates). When we use intervals in our production pipelines with targeted sequencing, we make sure to give sufficient padding around the targeted sites (100 bp on each side). There are many GATK tools which require you to be careful regaring interval usage — this is because it can change the result, depending on how you use intervals.
Supported interval list formats
GATK supports several types of interval list formats: Picard-style
.list, BED files with extension
.bed, and VCF files. The intervals MUST be sorted by coordinate (in increasing order) within contigs; and the contigs must be sorted in the same order as in the sequence dictionary. This is require for efficiency reasons.
Picard-style interval files have a SAM-like header that includes a sequence dictionary. The intervals are given in the form
<chr>:<start>-<stop> + <target_name>, with fields separated by tabs, and the coordinates are 1-based (first position in the genome is position 1, not position 0).
@HD VN:1.0 SO:coordinate @SQ SN:1 LN:249250621 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:1b22b98cdeb4a9304cb5d48026a85128 SP:Homo Sapiens @SQ SN:2 LN:243199373 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:a0d9851da00400dec1098a9255ac712e SP:Homo Sapiens 1 30366 30503 + target_1 1 69089 70010 + target_2 1 367657 368599 + target_3 1 621094 622036 + target_4 1 861320 861395 + target_5 1 865533 865718 + target_6
This is the preferred format because the explicit sequence dictionary safeguards against accidental misuse (e.g. apply hg18 intervals to an hg19 BAM file). Note that this file is 1-based, not 0-based (the first position in the genome is position 1).
This is a simpler format, where intervals are in the form
<chr>:<start>-<stop>, and no sequence dictionary is necessary. This file format also uses 1-based coordinates. Note that only the
<chr> part is strictly required; if you just want to specify chromosomes/ contigs as opposed to specific coordinate ranges, you don't need to specify the rest. Both
<chr> can be present in the same file. You can also specify intervals in this format directly at the command line instead of writing them in a file.
C. BED files with extension
We also accept the widely-used BED format, where intervals are in the form
<chr> <start> <stop>, with fields separated by tabs. However, you should be aware that this file format is 0-based for the start coordinates, so coordinates taken from 1-based formats (e.g. if you're cooking up a custom interval list derived from a file in a 1-based format) should be offset by 1. The GATK engine recognizes the
.bed extension and interprets the coordinate system accordingly.
D. VCF files
Yeah, I bet you didn't expect that was a thing! It's very convenient. Say you want to redo a variant calling run on a set of variant calls that you were given by a colleague, but with the latest version of HaplotypeCaller. You just provide the VCF, slap on some padding on the fly using e.g.
-ip 100 in the HC command, and boom, done. Each record in the VCF will be interpreted as a single-base interval, and by adding padding you ensure that the caller sees enough context to reevaluate the call appropriately.
Obtaining suitable interval lists
So where do those intervals come from? It depends a lot on what you're working with (everyone's least favorite answer, I know). The most important distinction is the sequencing experiment type: is it whole genome, or targeted sequencing of some sort?
Targeted sequencing (exomes, gene panels etc.)
For exomes and similarly targeted data types, the interval list should correspond to the capture targets used for the library prep, and is typically provided by the prep kit manufacturer (with versions for each ref genome build of course).
We make our exome interval lists available, but be aware that they are specific to the custom exome targeting kits used at the Broad. If you got your sequencing done somewhere else, you should seek to get the appropriate intervals list from the sequencing provider.
Whole genomes (WGS)
For whole genome sequence, the intervals lists don’t depend on the prep (since in principle you captured the “whole genome”) so instead it depends on what regions of the genome you want to blacklist (e.g. centromeric regions that waste your time for nothing) and how the reference genome build enables you to cut up regions (separated by Ns) for scatter-gather parallelizing.
We make our WGS interval lists available, and the good news is that, as long as you're using the same genome reference build as us, you can use them with your own data even if it comes from somewhere else -- assuming you agree with our decisions about which regions to blacklist! Which you can examine by looking at the intervals themselves. However, we don't currently have documentation on their provenance, sorry -- baby steps.