The GATK resource bundle is a collection of standard files for working with human resequencing data with the GATK. We provide several versions of the bundle corresponding to the various reference builds, but be aware that we no longer actively support very old versions (b36/hg18). In addition, we are currently transitioning to support the GRCh38/hg38 reference build, but have not yet generated all of the files necessary for all use cases (in particular, we are still missing the hg38 version of the Broad's exome intervals).
As of August 2016, we actively support the following human genome reference builds:
- GRCh38/hg38 and b37/hg19 - For Best Practices short variant discovery in WGS (uBAM to GVCF).
- b37/hg19 - For Best Practices short variant discovery in exome and other targeted sequencing. Please see this article for further details on the content of this resource bundle.
Accessing the Resource Bundle
The resource bundle is hosted in a Google Cloud bucket. This bucket is useful for people who plan to run analyses on Google Cloud, and can therefore point to the resource files directly using the bucket paths, without needing to copy or download the files first. The files can also be downloaded from the cloud for processing on your local machine. The bucket can be accessed with a regular web browser at the following Google Cloud Platform site using a valid Google account (which can be obtained for free from Google).
https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0/
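If you prefer the command line to the web console, the same files can be fetched with gsutil (part of the Google Cloud SDK). The sketch below is illustrative only: it assumes gsutil is installed, and the reference file names shown (Homo_sapiens_assembly38.fasta and its companion index and dictionary) reflect the bundle at the time of writing and may change.
# List the GRCh38/hg38 resource files in the public bucket
gsutil ls gs://genomics-public-data/resources/broad/hg38/v0/
# Download the hg38 reference FASTA plus its index and sequence dictionary
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta .
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta.fai .
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dict .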
The following resources are available through our Google Cloud buckets:
GRCh38/hg38 Resources: the Standard Set
- This contains all the resource files needed for Best Practices germline short variant discovery in whole-genome sequencing data (WGS); a minimal calling sketch follows this list.
- Exome files and an itemized resource list are coming soon.
- Somatic resources are in development.
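To illustrate the "uBAM to GVCF" use case these files support, a minimal GATK4 HaplotypeCaller invocation in GVCF mode might look like the following sketch. The input BAM and output names are placeholders, and the reference file name is assumed from the hg38 bundle; this is not an official pipeline command.
# Hypothetical per-sample GVCF calling against the hg38 bundle reference
gatk HaplotypeCaller \
    -R Homo_sapiens_assembly38.fasta \
    -I sample.hg38.bam \
    -O sample.hg38.g.vcf.gz \
    -ERC GVCF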
b37 Resources: the Standard Data Set, pending completion of the hg38 bundle
- Reference sequence (standard 1000 Genomes fasta), along with fai and dict files
- dbSNP in VCF. This includes two files:
- A recent dbSNP release (build 138)
- The same file subsetted to only sites discovered in or before dbSNPBuildID 129, which excludes the impact of the 1000 Genomes project and is useful for evaluation of dbSNP rate and Ti/Tv values at novel sites.
- HapMap genotypes and sites VCFs
- Omni 2.5 genotypes for 1000 Genomes samples, as well as a sites VCF
- The current best set of known indels to be used for local realignment (note that we don't use dbSNP for this anymore); use both of the following files (a usage sketch follows this resource list):
1000G_phase1.indels.b37.vcf
(currently from the 1000 Genomes Phase I indel calls)
Mills_and_1000G_gold_standard.indels.b37.sites.vcf
- The latest set from 1000G phase 3 (v4) for genotype refinement:
1000G_phase3_v4_20130502.sites.vcf
- A large-scale standard single-sample BAM file for testing:
NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.bam
containing ~64x reads of NA12878 on chromosome 20
- A callset produced by running UnifiedGenotyper on the dataset above. Note that this resource is out of date and does not represent the results of our Best Practices. This will be updated in the near future.
- The Broad's custom exome targets list:
Broad.human.exome.b37.interval_list
(note that you should always use the exome targets list that is appropriate for your data, which typically depends on the prep kit that was used, and should be available from the kit manufacturer's website).
Additionally, these files all have supplementary indices, statistics, and other QC data available.
Note that many of these resources are out of date and will eventually be retired. All new development is being done against Grch38/hg38.
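As a usage sketch (not a command prescribed by this bundle), the dbSNP and known-indels resources above are commonly passed to GATK4 BaseRecalibrator as known sites. The reference file name and BAM path below are placeholders; adjust them to your own data.
# Hypothetical Base Quality Score Recalibration using the b37 known-sites resources
gatk BaseRecalibrator \
    -R human_g1k_v37.fasta \
    -I sample.b37.bam \
    --known-sites dbsnp_138.b37.vcf \
    --known-sites 1000G_phase1.indels.b37.vcf \
    --known-sites Mills_and_1000G_gold_standard.indels.b37.sites.vcf \
    -O sample.recal_data.table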
Notable Google Buckets
The following are useful Google Cloud buckets for GATK users.
Owned by Google
genomics-public-data
- Description: Cloud Life Sciences provides a variety of public datasets that can be accessed for free and integrated into your applications. Google hosts these datasets and provides public access to the data.
Owned by the Broad Institute
This article lists Google buckets that are accessible to the public. The buckets contain an assortment of reference, resource, and sample test data that can be used in GATK workflows.
gcp-public-data--broad-references
- Bucket path:
gs://gcp-public-data--broad-references
- Description: This is the Broad's public hg38 and b37 reference and resource data. Additional information can be found in the GATK Resource Bundle article. This bucket is controlled by the Broad Institute but hosted by Google; a brief listing sketch follows the workspace list below. Example workspaces include:
- Whole-Genome-Analysis-Pipeline
- GATK4-Germline-Preprocessing-VariantCalling-JointCalling
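A quick way to inspect this bucket without the console is a gsutil listing; the hg38/v0 subdirectory in the second command is an assumption about the bucket layout rather than a documented path.
# Browse the Broad public reference bucket (no credentials needed for listing)
gsutil ls gs://gcp-public-data--broad-references/
gsutil ls gs://gcp-public-data--broad-references/hg38/v0/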
gatk-legacy-bundles
- Bucket path:
gs://gatk-legacy-bundles
- Description: Broad public legacy b37 and hg19 reference and resource data.
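Because the directory layout of the legacy bundle is not spelled out here, a recursive wildcard listing is a simple way to locate specific b37 files (for example, the dbSNP and known-indels VCFs mentioned above) without guessing paths:
# Search the legacy bundle for the b37 known-sites VCFs
gsutil ls gs://gatk-legacy-bundles/** | grep -E 'dbsnp_138.b37|indels.b37'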
broad-public-datasets
- Bucket path:
gs://broad-public-datasets
- Description: Stores public test data, often used to test workflows. For example, it contains NA12878 CRAM, gVCF, and unmapped BAM files.
gatk-best-practices
- Bucket path:
gs://gatk-best-practices
- Description: Stores GATK workflow-specific plumbing, reference, and resource data. Example workspaces include:
- Somatic-SNVs-Indels-GATK4
gatk-test-data
- Bucket path:
gs://gatk-test-data
- Description: Additional public test data focusing on smaller datasets, for example whole-genome BAM, FASTQ, gVCF, and VCF files. Example workspaces include:
- Somatic-CNVs-GATK4
Cromwell on Azure
Cromwell is a workflow management system for scientific workflows, orchestrating the computing tasks needed for genomics analysis. Originally developed by the Broad Institute, the workflow engine now has a Microsoft Genomics-supported implementation on Azure, which can be used for the GATK Best Practices genome analysis pipeline. Cromwell supports running scripts on your local machine, on a computing cluster, or on the cloud.
Cromwell on Azure configures all Azure resources needed to run workflows through Cromwell on the Azure cloud, and uses the GA4GH TES backend for orchestrating the tasks that create a workflow. The installation sets up a VM host to run the Cromwell server and uses Azure Batch to spin up virtual machines that run each task in a workflow.
Cromwell workflows can be written using WDL or CWL scripting languages. Examples of WDL and CWL scripts are located here and here, respectively.
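For reference, a WDL workflow is usually submitted to a running Cromwell server through its REST API. The sketch below assumes a server listening on Cromwell's default port 8000 and hypothetical hello.wdl / hello.inputs.json files; it is not specific to the Azure deployment described here.
# Submit a workflow (WDL source plus JSON inputs) to a local Cromwell server
curl -X POST "http://localhost:8000/api/workflows/v1" \
    -F workflowSource=@hello.wdl \
    -F workflowInputs=@hello.inputs.json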
More information about deploying your own instance of Cromwell on Azure is located in the Microsoft CromwellOnAzure repository. The Azure GATK Resource Bundle page also catalogs the standard files used for working with human re-sequencing data with the GATK, including instructions on how to access the following data stores:
- datasetgatkbestpractices
- datasetgatklegacybundles
- datasetgatktestdata
- datasetpublicbroadref
- datasetbroadpublic
FTP Server Access
NOTE: FTP Server Access will soon be disabled, and code using FTP file paths must be updated with Google Bucket file paths by June 1, 2020.
The FTP server is intended for people who wish to download files and work on them locally. However, FTP access is local to the Broad Institute (there are no mirrors), it has tight limits on concurrent downloads, and users in some countries have reported difficulties accessing it due to, e.g., firewalls. For these (and other) reasons, FTP server access will be disabled by June 1, 2020.
Instead, please use the resources available in our Google Cloud bucket, which are also available through our cloud-based analysis portal Terra, in workspaces that are preconfigured for the major Best Practices analysis use cases.
To access the bundle on the FTP server, use the following login credentials in your favorite FTP client:
location: ftp.broadinstitute.org/bundle
username: gsapubftp-anonymous
password: (leave blank)
If you are using your browser as an FTP client, make sure to include the login information in the address, otherwise you will access the general Broad Institute FTP instead of our team FTP. This should work as a direct link:
ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/
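For scripted downloads (while the FTP server remains available), command-line clients accept the same anonymous credentials. For example, wget can mirror one build's subdirectory of the bundle; note that this will stop working once FTP access is retired.
# Recursively download the b37 portion of the bundle over anonymous FTP
wget -r "ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/"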
The bundle directory contains five subdirectories, one for each build of the human genome that we have resources for: b36, b37, hg18, hg19, and hg38 (aka GRCh38). Be aware that the hg38 resource set is provided as-is, and its contents may still be incomplete.
Currently, the following resources are available exclusively through FTP:
hg19 Resources: lifted over from b37
- Includes the UCSC-style hg19 reference along with all lifted over VCF files.
hg18 Resources: lifted over from b37
- Includes the UCSC-style hg18 reference along with all lifted-over VCF files. The refGene track and BAM files are not available. We only provide data files for this genome build that can be lifted over "easily" from our master b37 repository. Sorry for whatever inconvenience this might cause.
- Also includes a chain file to lift over to b37.
b36 Resources: lifted over from b37
- Includes the 1000 Genomes pilot b36-formatted reference sequence (human_b36_both.fasta) along with all lifted-over VCF files. The refGene track and BAM files are not available. We only provide data files for this genome build that can be lifted over "easily" from our master b37 repository. Sorry for whatever inconvenience this might cause.
- Also includes a chain file to lift over to b37.
16 comments
Hello! Could you provide a README file to introduce the resource files?
Hi,
I am trying to run BaseRecalibrator with my WGS data.
My ref is hg19 reference. Where can I get SNP and Indel vcf files in hg19 version?
I found b37 version files (below) in google cloud gs://gatk-legacy-bundles, but not for hg19.
dbsnp_138.b37.vcf
1000G_phase1.indels.b37.vcf
(currently from the 1000 Genomes Phase I indel calls)
Mills_and_1000G_gold_standard.indels.b37.sites.vcf
Hi
I have another question about hg38 genome reference fasta file.
I downloaded "Homo_sapiens_assembly38.fasta" from your Google Cloud bucket
But, this fasta file does not have chrEBV seq.
To compare, next, I downloaded "hg38" file from here, https://support.illumina.com/sequencing/sequencing_software/igenome.html
hg38.fasta has chrEBV seq (just checked by "grep hg38.fasta").
Is there any reason for excluding chrEBV from your bundle-reference, hg38?
Hi Ashi,
I'm facing the same problem in retrieving hg19 resources. In particular:
Did you find a way to get them?
Thanks
Joy
Hello,
First, thank you to the members of the Broad for putting this bundle together for the genomics community.
I had a question regarding the specific assembly build of the hg38 reference genome. In the documentation there is a reference to the GRCh38.p7 release in "Technical Documentation->Glossary->Reference Genome Components", and it is mentioned again in "Technical Documentation->Glossary->Human genome reference builds - GRCh38 or hg38 - b37 - hg19". However, in the same paragraph it states, "Note that the GATK team rarely if ever adopts patches due to constraints from our production operations. We are not currently able to provide support for the use of patches."
Does this mean that the current FASTA file (Homo_sapiens_assembly38.fasta) in the resource bundle is in fact NOT GRCh38.p7? Instead it is the primary release GRCh38 from 2013 with no patches included? This was unclear to me as I searched all the documentation.
Thank you,
Patrick
Hi GATK Team,
First, thank you for this post. I want to download hg19 version resources for VariantRecalibrator. From this page, it seems like these resources were available through FTP Server, which is now disabled. Is there any official platform that still provides these resources? Thank you so much.
Best
Lingyu
To add to my previous questions, it seems that the 'genomics-public-data' bucket also does not contain the complete list of b37 resources (for VariantRecalibrator) as indicated. I would like to know if there are any other buckets that contain a complete set of b37 resources. Thank you so much.
Best
Lingyu
Hello,
I know that `Homo_sapiens_assembly38.fasta.64.amb` is one of the BWA index files, but what does `.64` mean in the file name, given that the original fasta file does NOT have `.64`? Why add `.64`?
Is it possible to create a Readme.txt to explain what each file does?
Thanks,
Best,
LC
Can't access Google Cloud; it says permission is required.
Hello,
I previously accessed ftp.broad.mit.edu/pub/human_STS_releases/july97/ to get “07-97.YAC2STS.txt”. Now, how can I get the file?
With best regards.
Take
progress report
I was able to access the ftp server and get the file.
Thank you for the support.
Take
Dear GATK team and community,
I have WES data and have aligned in my previous steps with bwa-mem with the ref genome hg38. I am now looking to do the BaseRecalibrator and BQSR steps with the same reference genome hg38.
However, the text above "In addition, we are currently transitioning to support the GRCh38/hg38 reference build, but have not yet generated all of the files necessary for all use cases (in particular, we are still missing the hg38 version of the Broad's exome intervals)" has made me reconsider. Should I be using a different ref genome?
Any advice or clarification would be great!
I can't find a GTF file for Homo_sapiens_assembly38 in the resource bundle v0 bucket in the Google Cloud console.
Hi GATK team
Thanks for making this resource bundle.
I was looking for an annotation file with gene symbols and their strand and exon/intron coordinates on the GRCh38/hg38 build. I looked through the resource bundle and found the following file - Homo_sapiens_assembly38.fasta.64.ann
When I browsed the file I didn't see gene symbols (maybe I missed something). If you have the annotation file that I am looking for, do you also have it in GTF/GFF/BED format?
Many thanks,
Gil
Hi GATK team, I currently plan to use GATK4 to find SNPs and compare variants between samples; however, I couldn't find the reference resource files for Vibrio spp. Where do I get these files, and how do I set up and run GATK4 for my project?
Thanks for your patience on my questions. Thank you!!
Is there an estimate for when the exome files for the hg38 build will be released? The GTF files for that build are also still missing. Many thanks!