The GATK resource bundle is a collection of standard files for working with human resequencing data with the GATK. We provide several versions of the bundle corresponding to the various reference builds, but be aware that we no longer actively support very old versions (b36/hg18). In addition, we are currently transitioning to support the GRCh38/hg38 reference build, but have not yet generated all of the files necessary for all use cases (in particular, we are still missing the hg38 version of the Broad's exome intervals).
As of April 2020, we actively support the following human genome reference builds:
- GRCh38/hg38 and b37/hg19 - For Best Practices short variant discovery in WGS (uBAM to GVCF).
- b37/hg19 - For Best Practices short variant discovery in exome and other targeted sequencing.
Please see this article for further details on the content of this resource bundle.
Accessing the Resource Bundle
The resource bundle is hosted on a Google Cloud bucket. This is useful for people who plan to run analyses on Google Cloud, because they can refer to the resource files directly by their bucket paths without needing to copy or download the files first. The files can also be downloaded from the bucket for processing on your local machine.
The bucket can be browsed with a regular web browser at the following Google Cloud Platform site; you will need a valid Google account (which can be obtained for free from Google).
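To illustrate the two access modes described above (using a bucket path directly versus downloading a file), here is a minimal Python sketch that builds both forms of address for one bundle file. The bucket name `example-resource-bundle` and the `b37/` folder prefix are hypothetical placeholders, not the real bundle location; only the file name comes from this article. The HTTPS form assumes the object is publicly readable.

```python
from urllib.parse import quote

# Hypothetical bucket name, for illustration only; substitute the bucket
# that actually hosts the bundle.
BUCKET = "example-resource-bundle"


def gs_uri(bucket: str, object_path: str) -> str:
    """Return the gs:// path that cloud-based tools can read in place."""
    return f"gs://{bucket}/{object_path.lstrip('/')}"


def https_url(bucket: str, object_path: str) -> str:
    """Return the HTTPS download URL for a publicly readable object."""
    # quote() percent-encodes unsafe characters but keeps "/" separators.
    encoded = quote(object_path.lstrip("/"))
    return f"https://storage.googleapis.com/{bucket}/{encoded}"


if __name__ == "__main__":
    # File name from the b37 list in this article; the "b37/" prefix is assumed.
    path = "b37/Broad.human.exome.b37.interval_list"
    print(gs_uri(BUCKET, path))
    print(https_url(BUCKET, path))
```

The `gs://` form is what you would pass to a tool running in the cloud, while the HTTPS form can be fetched with any ordinary download client.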
The following resources are available through our Google Cloud buckets:
GRCh38/hg38 Resources: the Standard Set
- This contains all the resource files needed for Best Practices germline short variant discovery in whole-genome sequencing data (WGS).
- Exome files and an itemized resource list will come soon.
- Somatic resources are in development.
b37 Resources: the Standard Data Set, pending completion of the hg38 bundle
- Reference sequence (standard 1000 Genomes fasta), along with fai and dict files
- dbSNP in VCF. This includes two files:
  - A recent dbSNP release (build 138)
  - The same file subsetted to only sites discovered in or before dbSNPBuildID 129, which excludes the impact of the 1000 Genomes project and is useful for evaluation of dbSNP rate and Ti/Tv values at novel sites.
- HapMap genotypes and sites VCFs
- OMNI 2.5 genotypes and sites VCFs for 1000 Genomes samples
- The current best set of known indels to be used for local realignment (note that we don't use dbSNP for this anymore); use both files:
  - 1000G_phase1.indels.b37.vcf (currently from the 1000 Genomes Phase I indel calls)
- The latest set from 1000G phase 3 (v4) for genotype refinement: 1000Gphase3v4_20130502.sites.vcf
- A large-scale standard single-sample BAM file for testing:
  - NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.bam containing ~64x reads of NA12878 on chromosome 20
  - A callset produced by running UnifiedGenotyper on the dataset above. Note that this resource is out of date and does not represent the results of our Best Practices. This will be updated in the near future.
- The Broad's custom exome targets list: Broad.human.exome.b37.interval_list (note that you should always use the exome targets list that is appropriate for your data, which typically depends on the prep kit that was used, and should be available from the kit manufacturer's website)
Additionally, these files all have supplementary indices, statistics, and other QC data available.
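The dbSNP build-129 subset listed above is described as useful for evaluating dbSNP rate and Ti/Tv at novel sites. As a rough sketch of what those two metrics mean, the Python below computes them from a handful of made-up records. It assumes that each record is a biallelic SNP given as `(id, ref, alt)`, that a known site carries an rsID in the ID field, and that a novel site has `.` there. This is an illustration of the arithmetic, not the GATK evaluation tooling.

```python
# The four transition substitutions (purine<->purine, pyrimidine<->pyrimidine).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def dbsnp_rate(records):
    """Fraction of records whose ID field is not '.', i.e. known sites."""
    records = list(records)
    if not records:
        raise ValueError("no records given")
    known = sum(1 for rec_id, _, _ in records if rec_id != ".")
    return known / len(records)


def titv_ratio(records):
    """Transitions divided by transversions over biallelic SNP records."""
    ti = tv = 0
    for _, ref, alt in records:
        if (ref.upper(), alt.upper()) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("Ti/Tv is undefined without any transversions")
    return ti / tv


if __name__ == "__main__":
    # Toy records invented for this example: (ID, REF, ALT).
    calls = [
        ("rs1", "A", "G"),
        (".", "C", "T"),
        (".", "A", "C"),
        ("rs2", "G", "T"),
        (".", "G", "A"),
    ]
    novel = [rec for rec in calls if rec[0] == "."]
    print("dbSNP rate:", dbsnp_rate(calls))
    print("novel Ti/Tv:", titv_ratio(novel))
```

Annotating calls against the build-129 subset rather than the full release changes which sites count as novel, which is why the choice of dbSNP file matters for this kind of evaluation.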
Note that many of these resources are out of date and will eventually be retired. All new development is being done against GRCh38/hg38.