The GATK resource bundle is a collection of standard files for working with human resequencing data with the GATK. We provide several versions of the bundle corresponding to the various reference builds, but be aware that we no longer actively support very old versions (b36/hg18). In addition, we are currently transitioning to support the GRCh38/hg38 reference build, but have not yet generated all of the files necessary for all use cases (in particular, we are still missing the hg38 version of the Broad's exome intervals).
As of August 2016, we actively support the following human genome reference builds:
- GRCh38/hg38 and b37/hg19 - For Best Practices short variant discovery in WGS (uBAM to GVCF).
- b37/hg19 - For Best Practices short variant discovery in exome and other targeted sequencing.
Please see this article for further details on the content of this resource bundle.
Accessing the Resource Bundle
The resource bundle is hosted in a Google Cloud bucket. This is convenient for people who plan to run analyses on Google Cloud, since they can refer to the resource files directly by their bucket paths without copying or downloading them first. The files can also be downloaded from the cloud for processing on your local machine. The bucket can be browsed in a regular web browser at the following Google Cloud Platform site, using a valid Google account (which can be obtained for free from Google):
https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0/
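The browser URL above and the bucket path used by cloud tools name the same location, so converting between them is plain string manipulation. The following is a minimal sketch, not an official tool: the helper names `console_url_to_gs` and `gs_to_https` are made up for this example, and the assumption that a file called `Homo_sapiens_assembly38.fasta` sits directly in this folder should be checked against the actual bucket listing. The `https://storage.googleapis.com/<bucket>/<object>` form is the usual direct-download URL for publicly readable Cloud Storage objects.

```python
from urllib.parse import quote, urlsplit

# Path prefix used by the Cloud Console storage browser.
CONSOLE_PREFIX = "/storage/browser/"


def console_url_to_gs(console_url: str) -> str:
    """Turn a Cloud Console browser URL into the equivalent gs:// path."""
    path = urlsplit(console_url).path
    if not path.startswith(CONSOLE_PREFIX):
        raise ValueError(f"not a Cloud Storage browser URL: {console_url}")
    return "gs://" + path[len(CONSOLE_PREFIX):]


def gs_to_https(gs_path: str) -> str:
    """Turn gs://bucket/object into a direct-download HTTPS URL."""
    if not gs_path.startswith("gs://"):
        raise ValueError(f"not a gs:// path: {gs_path}")
    bucket, _, obj = gs_path[len("gs://"):].partition("/")
    # quote() leaves "/" alone, so nested object names keep their structure.
    return f"https://storage.googleapis.com/{bucket}/{quote(obj)}"


bundle = console_url_to_gs(
    "https://console.cloud.google.com/storage/browser/"
    "genomics-public-data/resources/broad/hg38/v0/"
)
print(bundle)  # gs://genomics-public-data/resources/broad/hg38/v0/

# Assumption: this file name and its location are illustrative only.
print(gs_to_https(bundle + "Homo_sapiens_assembly38.fasta"))
```

The `gs://` form is what you would pass to cloud-aware tools, while the HTTPS form can be used with an ordinary downloader when the object is publicly readable.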
The following resources are available through our Google Cloud buckets:
GRCh38/hg38 Resources: the Standard Set
- This contains all the resource files needed for Best Practices germline short variant discovery in whole-genome sequencing data (WGS).
- Exome files and an itemized resource list will come soon.
- Somatic resources are in development.
b37 Resources: the Standard Data Set, pending completion of the hg38 bundle
- Reference sequence (standard 1000 Genomes fasta), along with fai and dict files
- dbSNP in VCF. This includes two files:
- A recent dbSNP release (build 138)
- The same file subsetted to only sites discovered in or before dbSNPBuildID 129, which excludes the impact of the 1000 Genomes project and is useful for evaluation of dbSNP rate and Ti/Tv values at novel sites.
- HapMap genotypes and sites VCFs
- OMNI 2.5 genotypes for 1000 Genomes samples, as well as a sites VCF
- The current best set of known indels to be used for local realignment (note that we don't use dbSNP for this anymore); use both files:
1000G_phase1.indels.b37.vcf (currently from the 1000 Genomes Phase I indel calls)
Mills_and_1000G_gold_standard.indels.b37.sites.vcf
- The latest set from 1000G phase 3 (v4) for genotype refinement:
1000G_phase3_v4_20130502.sites.vcf
- A large-scale standard single sample BAM file for testing:
NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.bam
containing ~64x reads of NA12878 on chromosome 20
- A callset produced by running UnifiedGenotyper on the dataset above. Note that this resource is out of date and does not represent the results of our Best Practices. This will be updated in the near future.
- The Broad's custom exome targets list:
Broad.human.exome.b37.interval_list
(note that you should always use the exome targets list that is appropriate for your data, which typically depends on the prep kit that was used, and should be available from the kit manufacturer's website).
Additionally, these files all have supplementary indices, statistics, and other QC data available.
Note that many of these resources are out of date and will eventually be retired. All new development is being done against GRCh38/hg38.
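After downloading the b37 files, it can be handy to confirm that the ones named in the list above actually landed in your working directory. The sketch below is one hedged way to do that: the file names are copied from this article, the function name `missing_resources` is hypothetical, and the idea that a file may also appear with a `.gz` suffix is an assumption about how copies are sometimes distributed.

```python
from pathlib import Path

# File names exactly as they are listed in this article.
B37_FILES = [
    "1000G_phase1.indels.b37.vcf",
    "Mills_and_1000G_gold_standard.indels.b37.sites.vcf",
    "1000G_phase3_v4_20130502.sites.vcf",
    "NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.bam",
    "Broad.human.exome.b37.interval_list",
]


def missing_resources(directory, expected=B37_FILES):
    """Return the expected file names that are absent from `directory`.

    A file counts as present if it exists under its plain name or with a
    ".gz" suffix (assumption: compressed copies keep the same base name).
    """
    present = {p.name for p in Path(directory).iterdir() if p.is_file()}
    return [
        name for name in expected
        if name not in present and name + ".gz" not in present
    ]


if __name__ == "__main__":
    # Check the current directory and report anything that is missing.
    for name in missing_resources("."):
        print("missing:", name)
```

Index files are deliberately not counted as a substitute for the resource itself, so a leftover index without its VCF is still reported as missing.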
Notable Google Buckets
The following Google buckets are useful for GATK users.
Owned by Google
genomics-public-data
- Description: Cloud Life Sciences provides a variety of public datasets that can be accessed for free and integrated into your applications. Google hosts these datasets and provides public access to the data.
Owned by the Broad Institute
The following Google buckets are accessible to the public. They contain an assortment of reference, resource, and sample test data that can be used in GATK workflows.
gcp-public-data--broad-references
- Bucket path:
gs://gcp-public-data--broad-references
- Description: This is the Broad's public hg38 and b37 reference and resource data. Additional information can be found in the GATK Resource Bundle article. This bucket is controlled by Broad, but hosted by Google. Example workspaces include:
- Whole-Genome-Analysis-Pipeline
- GATK4-Germline-Preprocessing-VariantCalling-JointCalling
gatk-legacy-bundles
- Bucket path:
gs://gatk-legacy-bundles
- Description: Broad public legacy b37 and hg19 reference and resource data.
broad-public-datasets
- Bucket path:
gs://broad-public-datasets
- Description: Stores public test data, often used to test workflows. For example, it contains NA12878 CRAM, gVCF, and unmapped BAM files.
gatk-best-practices
- Bucket path:
gs://gatk-best-practices
- Description: Stores GATK workflow-specific plumbing, reference, and resource data. Example workspaces include:
- Somatic-SNVs-Indels-GATK4
gatk-test-data
- Bucket path:
gs://gatk-test-data
- Description: Additional public test data focusing on smaller data sets, for example whole genome BAM, FASTQ, gVCF, and VCF files. Example workspaces include:
- Somatic-CNVs-GATK4
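For scripting against the Broad-owned buckets, it can help to keep their names in one place and build paths from them. This is a minimal sketch under stated assumptions: the bucket names and short descriptions are taken from the list above, the helper names `bucket_uri` and `gsutil_ls_command` are hypothetical, and the code only builds a `gsutil ls` command line (assuming `gsutil` is installed) without running it.

```python
import shlex

# Bucket names and summaries as listed in this article.
BROAD_BUCKETS = {
    "gcp-public-data--broad-references": "hg38 and b37 reference and resource data",
    "gatk-legacy-bundles": "legacy b37 and hg19 reference and resource data",
    "broad-public-datasets": "public test data (NA12878 CRAM, gVCF, unmapped BAM)",
    "gatk-best-practices": "workflow-specific plumbing, reference, and resource data",
    "gatk-test-data": "smaller public test data sets",
}


def bucket_uri(bucket: str, prefix: str = "") -> str:
    """Build a gs:// path for one of the buckets listed above."""
    if bucket not in BROAD_BUCKETS:
        raise KeyError(f"unknown bucket: {bucket}")
    return f"gs://{bucket}/{prefix.lstrip('/')}"


def gsutil_ls_command(bucket: str, prefix: str = "") -> str:
    """Return (but do not run) a shell command that lists the bucket."""
    return shlex.join(["gsutil", "ls", bucket_uri(bucket, prefix)])


for name, description in BROAD_BUCKETS.items():
    print(f"{gsutil_ls_command(name)}  # {description}")
```

Rejecting unknown bucket names early catches typos before a path is handed to a workflow.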
FTP Server Access
NOTE: FTP Server Access will soon be disabled, and code using FTP file paths must be updated with Google Bucket file paths by June 1, 2020.
The FTP server is intended for people who wish to download files and run analyses on them locally. However, the FTP server is local to the Broad Institute (there are no mirrors), it has tight limits on concurrent downloads, and users in some countries have reported difficulties accessing it, for example because of firewalls. For these (and other) reasons, FTP Server Access will be disabled by June 1, 2020.
Instead, please use the resources available in our Google Cloud bucket, and available through our cloud-based analysis portal Terra, in workspaces that are preconfigured for the major Best Practices analysis use cases.
To access the bundle on the FTP server, use the following login credentials in your favorite FTP client:
location: ftp.broadinstitute.org/bundle
username: gsapubftp-anonymous
password:
If you are using your browser as an FTP client, make sure to include the login information in the address, otherwise you will access the general Broad Institute FTP instead of our team FTP. This should work as a direct link:
ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/
The bundle directory contains five subdirectories, one for each build of the human genome that we have resources for: b36, b37, hg18, hg19 and hg38 (aka GRCh38). Be aware that the hg38 resource set is provided as-is, and its contents may still be incomplete.
Currently, the following resources are available only through FTP:
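The same login details can be used from a script. The sketch below uses Python's standard `ftplib` and is illustrative only: the function names `parse_ftp_url` and `list_bundle` are hypothetical, the password is assumed to be blank as in the credentials above, and the server may no longer respond given the announced shutdown. Only the URL parsing runs when the block is executed; the network call is wrapped in a function you would invoke yourself.

```python
from ftplib import FTP
from urllib.parse import urlsplit

# Direct link given in this article.
BUNDLE_URL = "ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/"


def parse_ftp_url(url: str):
    """Split an ftp:// URL into (host, user, password, path)."""
    parts = urlsplit(url)
    if parts.scheme != "ftp":
        raise ValueError(f"not an FTP URL: {url}")
    # Assumption: a missing password means a blank one.
    return (
        parts.hostname,
        parts.username or "anonymous",
        parts.password or "",
        parts.path or "/",
    )


def list_bundle(url: str = BUNDLE_URL, timeout: float = 30.0):
    """Log in and return the entry names found in the bundle directory."""
    host, user, password, path = parse_ftp_url(url)
    with FTP(host, timeout=timeout) as ftp:
        ftp.login(user=user, passwd=password)
        ftp.cwd(path)
        return ftp.nlst()


# No network access here: this only shows how the URL is taken apart.
print(parse_ftp_url(BUNDLE_URL))
```

Passing the user name explicitly matters here, since a plain anonymous login would land on the general Broad Institute FTP rather than the team area.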
hg19 Resources: lifted over from b37
- Includes the UCSC-style hg19 reference along with all lifted over VCF files.
hg18 Resources: lifted over from b37
- Includes the UCSC-style hg18 reference along with all lifted over VCF files. The refGene track and BAM files are not available. We only provide data files for this genome build that can be lifted over "easily" from our master b37 repository. We apologize for any inconvenience this might cause.
- Also includes a chain file to lift over to b37.
b36 Resources: lifted over from b37
- Includes the 1000 Genomes pilot b36 formatted reference sequence (human_b36_both.fasta) along with all lifted over VCF files. The refGene track and BAM files are not available. We only provide data files for this genome build that can be lifted over "easily" from our master b37 repository. We apologize for any inconvenience this might cause.
- Also includes a chain file to lift over to b37.
3 comments
Hello! Could you improve the readme file so that it introduces the resource files?
Hi,
I am trying to run BaseRecalibrator with my WGS data.
My reference is hg19. Where can I get SNP and indel VCF files for hg19?
I found the b37 versions (below) in the Google Cloud bucket gs://gatk-legacy-bundles, but nothing for hg19.
dbsnp_138.b37.vcf
1000G_phase1.indels.b37.vcf (currently from the 1000 Genomes Phase I indel calls)
Mills_and_1000G_gold_standard.indels.b37.sites.vcf
Hi
I have another question, about the hg38 genome reference FASTA file.
I downloaded "Homo_sapiens_assembly38.fasta" from your Google Cloud bucket,
but this FASTA file does not contain the chrEBV sequence.
For comparison, I then downloaded the "hg38" file from here: https://support.illumina.com/sequencing/sequencing_software/igenome.html
hg38.fasta does contain the chrEBV sequence (just checked by "grep hg38.fasta").
Is there any reason for excluding chrEBV from the hg38 reference in your bundle?