Interval lists define subsets of genomic regions, sometimes even just individual positions in the genome. You can provide GATK tools with intervals or lists of intervals when you want to restrict them to operating on a subset of genomic regions. There are four main types of reasons for doing so:
- You want to run a quick test on a subset of data (often used in troubleshooting)
- You want to parallelize execution of an analysis across genomic regions
- You need to exclude regions that have bad or uninformative data where a tool is getting stuck
- The analysis you're running should only take data from those subsets due to how the underlying algorithm works
Regarding the latter case, see the Best Practices workflow recommendations and tool example commands for guidance regarding when to restrict analysis to intervals.
Interval-related arguments and syntax
Arguments for specifying and modifying intervals are provided by the engine and can be applied to most if not all tools. The main arguments you need to know about are the following:
-L
/--intervals
allows you to specify an interval or list of intervals to include.-XL
/--exclude-intervals
allows you to specify an interval or list of intervals to exclude.-ip
/--interval-padding
allows you to add padding (in bp) to the intervals you include.-ixp
/--interval-exclusion-padding
allows you to add padding (in bp) to the intervals you exclude.
By default the engine will merge any intervals that abut (i.e. they are contiguous, they touch without overlapping) or overlap into a single interval. This behavior can be modified by specifying an alternate interval merging rule (see --interval-merging-rule
in the Tool Docs).
The syntax for using -L
is as follows; it applies equally to -XL
:
-L chr20
for contig chr20.-L chr20:1-100
for contig chr20, positions 1-100.-L intervals.list
(orintervals.interval_list
, orintervals.bed
) when specifying a text file containing intervals (see supported formats below).-L variants.vcf
when specifying a VCF file containing variant records; their genomic coordinates will be used as intervals.
If you want to provide several intervals or several interval lists, just pass them in using separate -L
or -XL
arguments (you can even use both of them in the same command). You can use all the different formats within the same command line. By default, the GATK engine will take the UNION of all the intervals in all the sets. This behavior can be modified by specifying an alternate interval set rule (see --interval-set-rule
in the Tool Docs).
In general, using intervals (-L
) can introduce artifacts if you choose the intervals unwisely. Reads will get discarded that are outside the interval when the reads may have been used in making variant calls (for example, lost mates). When we use intervals in our production pipelines with targeted sequencing, we make sure to give sufficient padding around the targeted sites (100 bp on each side). There are many GATK tools which require you to be careful regaring interval usage — this is because it can change the result, depending on how you use intervals.
Supported interval list formats
GATK supports several types of interval list formats: Picard-style .interval_list
, GATK-style .list
, BED files with extension .bed
, and VCF files. The intervals MUST be sorted by coordinate (in increasing order) within contigs; and the contigs must be sorted in the same order as in the sequence dictionary. This is require for efficiency reasons.
A. Picard-style .interval_list
Picard-style interval files have a SAM-like header that includes a sequence dictionary. The intervals are given in the form <chr>:<start>-<stop> + <target_name>
, with fields separated by tabs, and the coordinates are 1-based (first position in the genome is position 1, not position 0).
@HD VN:1.0 SO:coordinate @SQ SN:1 LN:249250621 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:1b22b98cdeb4a9304cb5d48026a85128 SP:Homo Sapiens @SQ SN:2 LN:243199373 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:a0d9851da00400dec1098a9255ac712e SP:Homo Sapiens 1 30366 30503 + target_1 1 69089 70010 + target_2 1 367657 368599 + target_3 1 621094 622036 + target_4 1 861320 861395 + target_5 1 865533 865718 + target_6
This is the preferred format because the explicit sequence dictionary safeguards against accidental misuse (e.g. apply hg18 intervals to an hg19 BAM file). Note that this file is 1-based, not 0-based (the first position in the genome is position 1).
B. GATK-style .list
or .intervals
This is a simpler format, where intervals are in the form <chr>:<start>-<stop>
, and no sequence dictionary is necessary. This file format also uses 1-based coordinates. Note that only the <chr>
part is strictly required; if you just want to specify chromosomes/ contigs as opposed to specific coordinate ranges, you don't need to specify the rest. Both <chr>:<start>-<stop>
and <chr>
can be present in the same file. You can also specify intervals in this format directly at the command line instead of writing them in a file.
C. BED files with extension .bed
We also accept the widely-used BED format, where intervals are in the form <chr> <start> <stop>
, with fields separated by tabs. However, you should be aware that this file format is 0-based for the start coordinates, so coordinates taken from 1-based formats (e.g. if you're cooking up a custom interval list derived from a file in a 1-based format) should be offset by 1. The GATK engine recognizes the .bed
extension and interprets the coordinate system accordingly.
D. VCF files
Yeah, I bet you didn't expect that was a thing! It's very convenient. Say you want to redo a variant calling run on a set of variant calls that you were given by a colleague, but with the latest version of HaplotypeCaller. You just provide the VCF, slap on some padding on the fly using e.g. -ip 100
in the HC command, and boom, done. Each record in the VCF will be interpreted as a single-base interval, and by adding padding you ensure that the caller sees enough context to reevaluate the call appropriately.
Obtaining suitable interval lists
So where do those intervals come from? It depends a lot on what you're working with (everyone's least favorite answer, I know). The most important distinction is the sequencing experiment type: is it whole genome, or targeted sequencing of some sort?
Targeted sequencing (exomes, gene panels etc.)
For exomes and similarly targeted data types, the interval list should correspond to the capture targets used for the library prep, and is typically provided by the prep kit manufacturer (with versions for each ref genome build of course).
We make our exome interval lists available, but be aware that they are specific to the custom exome targeting kits used at the Broad. If you got your sequencing done somewhere else, you should seek to get the appropriate intervals list from the sequencing provider.
Whole genomes (WGS)
For whole genome sequence, the intervals lists don’t depend on the prep (since in principle you captured the “whole genome”) so instead it depends on what regions of the genome you want to blacklist (e.g. centromeric regions that waste your time for nothing) and how the reference genome build enables you to cut up regions (separated by Ns) for scatter-gather parallelizing.
We make our WGS interval lists available, and the good news is that, as long as you're using the same genome reference build as us, you can use them with your own data even if it comes from somewhere else -- assuming you agree with our decisions about which regions to blacklist! Which you can examine by looking at the intervals themselves. However, we don't currently have documentation on their provenance, sorry -- baby steps.
15 comments
I can't find anything convincing on how to create a valid Picard interval file although the above information suggests a recipe involving creating the header with "samtools -H" and then adding the required intervals by hand or otherwise. That may be a dirty hack that could problems in the long run though. One of the online discussion forums has a thread about this issue and points to a Broad Institute GATK page that no longer exists ("Preparing the essential GATK input files"),
William
Took me a while to figure this out, but the GATK list format is actually:
<chr>:<start>-<stop>
How do we download these blacklists that you state you made available?
Hi,
How can I access these WDS interval lists?
"We make our WGS interval lists available, and the good news is that, as long as you're using the same genome reference build as us, you can use them with your own data even if it comes from somewhere else -- assuming you agree with our decisions about which regions to blacklist!"
Dear GATK developers,
I'm trying to run Mutect2 for WES cancer data.
However, since the Resource bundle only supports h19 seems I cannot proceed.
I've been looking for some hg38 interval_list file and I found: ''hg38_v0_HybSelOligos_whole_exome_illumina_coding_v1_whole_exome_illumina_coding_v1.Homo_sapiens_assembly38.targets.interval_list''
However, when I run the GenomicsDBImport I get the error (no matter if I use my own hg38 reference and .dict or the ones from your Resource Bundle):
''A USER ERROR has occurred: Badly formed genome unclippedLoc: Contig chr1 given as location, but this contig isn't present in the Fasta sequence dictionary''
So, my questions are:
1. Is there any release date for this hg38 based exome interval file? will it be soon?
2. Or the file I put is ok and the error is coming from somewhere else?
How to set Chr01 and Chr02?
This article was very helpful when I perform 'GenomicsDBimport'.
However, the working GATK format is actually "<chr>:<start>-<stop>", not "<chr> <start> <stop>" as 'registered_user' said when I run 'GenomicsDBimport'.
Why don't you update this issue on this article?
It it not easy to find a solution for the beginner.
This article is pretty informative.
If I want to do an interval of chromosomes, should I use:
Any guidance will be appreciated.
How do i include sex chromosomes? Also, can you do an interval as stated above like:
Neev Liberman and Carolina Paez: I believe that you can not use this syntax. You can either use multiple -L arguments:
-L chr1 -L chr2 -L chr3 -L chr4 -L chr5
or use an interval list/bed file with the chromosomes you are after:
Great to know! Thank you, Dror Kessler (דרור קסלר) for your help.
So, what do I put down for the -L argument if I just want to look across the whole genome and not just a specific region?
Great explanation, thank you! Since many of the library prep kits are well established and somewhat standard, does Broad have a repository of the most commonly used interval lists? I see that whole_exome_illumina_coding_v1 is used in some workflows (i.e. Exome-Analysis-Pipeline - featured workspace), but what if my sequencing was done using a v6 kit? Where can I find that information? Shouldn't there be a repository? Thank you so much!
I want to confirm the format of bed file. The article says <chr>:<start>-<stop>, but as far as I know, it seems to be:
Derek Caetano-Anolles, thanks for this helpful article! I have the same questions as Enrico Cocchi and Aldhair Médico - which list are you referring to for WGS blacklist (Mutect2 in hg38 cancer dataset). Here, gatk-best-practices/somatic-hg38 I found:
I believe #1 is the most comprehensive based on the title and the non header line counts. Please correct me if I'm wrong, and thanks again!
Please sign in to leave a comment.