Resource bundle

GATK Team

Updated: January 17, 2025 03:29

The GATK resource bundle is a collection of standard files for working with human resequencing data with the GATK. We provide several versions of the bundle, one for each of the various reference builds, but be aware that we no longer actively support very old versions (b36/hg18). In addition, we are currently transitioning to support the GRCh38/hg38 reference build, but have not yet generated all of the files necessary for all use cases (in particular, we are still missing the hg38 version of the Broad's exome intervals).

As of August 2016, we actively support the following human genome reference builds:

GRCh38/hg38 and b37/hg19 - For Best Practices short variant discovery in WGS (uBAM to GVCF).
b37/hg19 - For Best Practices short variant discovery in exome and other targeted sequencing.

Please see this article for further details on the content of this resource bundle.

Accessing the Resource Bundle

The resource bundle is hosted in a Google Cloud bucket. If you plan to run analyses on Google Cloud, you can reference the resource files directly by their bucket paths, without copying or downloading them first. The files can also be downloaded from the cloud for processing on your local machine. The bucket can be browsed with a regular web browser at the following Google Cloud Platform site, using a valid Google account (which can be obtained for free from Google).

https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0/
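
For scripted downloads that do not go through the console, a bucket path can be turned into a plain HTTPS URL. The following is a minimal sketch, assuming the object is publicly readable and using the standard `https://storage.googleapis.com/<bucket>/<object>` form. The helper name `gs_to_https` is hypothetical and is not part of any GATK or Google tool.

```python
from urllib.parse import quote


def gs_to_https(gs_path: str) -> str:
    """Convert a gs://bucket/object path to a storage.googleapis.com URL.

    Assumes the object allows anonymous reads; this function only builds
    the URL and does not check that the object exists.
    """
    prefix = "gs://"
    if not gs_path.startswith(prefix):
        raise ValueError(f"not a gs:// path: {gs_path!r}")
    bucket, _, obj = gs_path[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name: {gs_path!r}")
    # quote() percent-encodes unusual characters but keeps '/' separators.
    return f"https://storage.googleapis.com/{bucket}/{quote(obj)}"


if __name__ == "__main__":
    print(gs_to_https(
        "gs://genomics-public-data/resources/broad/hg38/v0/"
        "Homo_sapiens_assembly38.fasta"
    ))
```

The resulting URL can be passed to any HTTP client. For large files such as the reference FASTA, a tool that can resume interrupted transfers is usually the safer choice.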

The following resources are available through our Google Cloud buckets:

GRCh38/hg38 Resources: the Standard Set

This contains all the resource files needed for Best Practices germline short variant discovery in whole-genome sequencing data (WGS).
Exome files and an itemized resource list will come soon.
Somatic resources are in development.

b37 Resources: the Standard Data Set, pending completion of the hg38 bundle

Reference sequence (standard 1000 Genomes fasta), along with fai and dict files
dbSNP in VCF. This includes two files:
- A recent dbSNP release (build 138)
- The same file subsetted to only sites discovered in or before dbSNPBuildID 129, which excludes the impact of the 1000 Genomes project and is useful for evaluation of dbSNP rate and Ti/Tv values at novel sites.
HapMap genotypes and sites VCFs
OMNI 2.5 genotypes for 1000 Genomes samples, as well as a sites VCF
The current best set of known indels to be used for local realignment (note that we don't use dbSNP for this anymore); use both files:
- 1000G_phase1.indels.b37.vcf (currently from the 1000 Genomes Phase I indel calls)
- Mills_and_1000G_gold_standard.indels.b37.sites.vcf
The latest set from 1000G phase 3 (v4) for genotype refinement: 1000G_phase3_v4_20130502.sites.vcf
A large-scale standard single-sample BAM file for testing:
- NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.bam, containing ~64x reads of NA12878 on chromosome 20
- A callset produced by running UnifiedGenotyper on the dataset above. Note that this resource is out of date and does not represent the results of our Best Practices. This will be updated in the near future.
The Broad's custom exome targets list: Broad.human.exome.b37.interval_list (note that you should always use the exome targets list that is appropriate for your data, which typically depends on the prep kit that was used, and should be available from the kit manufacturer's website).

Additionally, these files all have supplementary indices, statistics, and other QC data available.

Note that many of these resources are out of date and will eventually be retired. All new development is being done against GRCh38/hg38.
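
The build-129 subset of dbSNP described above amounts to keeping only records whose `dbSNPBuildID` INFO value is 129 or lower. The sketch below illustrates that idea for an uncompressed VCF, assuming the INFO column carries a `dbSNPBuildID=<n>` entry. The function names are hypothetical, the example records are synthetic, and this is not necessarily how the bundled file was produced.

```python
from typing import Iterable, Iterator, Optional


def dbsnp_build_id(info: str) -> Optional[int]:
    """Return the dbSNPBuildID value from a VCF INFO string, or None."""
    for field in info.split(";"):
        key, _, value = field.partition("=")
        if key == "dbSNPBuildID" and value.isdigit():
            return int(value)
    return None


def subset_by_build(vcf_lines: Iterable[str], max_build: int = 129) -> Iterator[str]:
    """Yield header lines plus records first seen in or before max_build."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # keep all header lines unchanged
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 8:
            continue  # malformed record: no INFO column
        build = dbsnp_build_id(columns[7])
        if build is not None and build <= max_build:
            yield line


if __name__ == "__main__":
    # Synthetic records for illustration only (not taken from dbSNP).
    example = [
        "##fileformat=VCFv4.1",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "20\t1000\trs_old\tA\tG\t.\t.\tRS=1;dbSNPBuildID=100;VC=SNV",
        "20\t2000\trs_new\tC\tT\t.\t.\tRS=2;dbSNPBuildID=138;VC=SNV",
    ]
    for kept_line in subset_by_build(example):
        print(kept_line)
```

Records without a `dbSNPBuildID` entry are dropped here, since their discovery build cannot be determined. Whether that is the right choice depends on the evaluation being run.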

Notable Google Buckets

The following Google buckets are useful for GATK users.

Owned by Google

genomics-public-data

Description: Cloud Life Sciences provides a variety of public datasets that can be accessed for free and integrated into your applications. Google hosts these datasets, providing public access to the data through the following methods.

Owned by the Broad Institute

The following Google buckets are accessible to the public. They contain an assortment of reference, resource, and sample test data that can be used in GATK workflows.

gcp-public-data--broad-references

Bucket path: gs://gcp-public-data--broad-references
Description: This is the Broad's public hg38 and b37 reference and resource data. Additional information can be found in the GATK Resource Bundle article. This bucket is controlled by Broad, but hosted by Google. Example workspaces include:
- Whole-Genome-Analysis-Pipeline
- GATK4-Germline-Preprocessing-VariantCalling-JointCalling

gatk-legacy-bundles

Bucket path: gs://gatk-legacy-bundles
Description: Broad public legacy b37 and hg19 reference and resource data.

broad-public-datasets

Bucket path: gs://broad-public-datasets
Description: Stores public test data, often used to test workflows. For example, it contains NA12878 CRAM, gVCF, and unmapped BAM files.

gatk-best-practices

Bucket path: gs://gatk-best-practices
Description: Stores GATK workflow-specific plumbing, reference, and resource data. Example workspaces include:
- Somatic-SNVs-Indels-GATK4

gatk-test-data

Bucket path: gs://gatk-test-data
Description: Additional public test data, focusing on smaller data sets such as whole genome BAM, FASTQ, gVCF, and VCF files. Example workspaces include:
- Somatic-CNVs-GATK4

Cromwell on Azure

Cromwell is a workflow management system for scientific workflows that orchestrates the computing tasks needed for genomics analysis. It was originally developed by the Broad Institute. The implementation supported by Microsoft Genomics runs on Azure and can be used for the GATK Best Practices genome analysis pipeline. Cromwell supports running workflows on your local machine, on a computing cluster, and in the cloud.

Cromwell on Azure configures all Azure resources needed to run workflows through Cromwell on the Azure cloud, and uses the GA4GH TES backend to orchestrate the tasks that make up a workflow. The installation sets up a VM host to run the Cromwell server and uses Azure Batch to spin up virtual machines that run each task in a workflow.

Cromwell workflows can be written using WDL or CWL scripting languages. Examples of WDL and CWL scripts are located here and here, respectively.

More information about deploying your own instance of Cromwell on Azure is located in the Microsoft CromwellOnAzure repository. The Azure GATK Resource Bundle page also catalogs the standard files used for working with human re-sequencing data with the GATK, including instructions on how to access the following data stores:

datasetgatkbestpractices
datasetgatklegacybundles
datasetgatktestdata
datasetpublicbroadref
datasetbroadpublic

FTP Server Access

NOTE: FTP Server Access will soon be disabled, and code using FTP file paths must be updated with Google Bucket file paths by June 1, 2020.

The FTP server is intended for people who wish to download files and work with them locally. However, the FTP server is local to the Broad Institute (there are no mirrors), it has tight limits on concurrent downloads, and users in some countries have reported difficulties accessing it, for example because of firewalls. For these (and other) reasons, FTP Server Access will be disabled by June 1, 2020.

Instead, please use the resources available in our Google Cloud bucket, and available through our cloud-based analysis portal Terra, in workspaces that are preconfigured for the major Best Practices analysis use cases.
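
Since code that uses FTP file paths has to be updated, a first step is to locate those paths. The sketch below scans text (for example, the contents of a workflow script) for references to `ftp.broadinstitute.org/bundle`. The function name and the example file path are hypothetical and used only for illustration. The sketch deliberately does not rewrite paths, because the matching bucket location has to be confirmed for each file.

```python
import re
from typing import List, Tuple

# Matches ftp.broadinstitute.org/bundle with an optional ftp:// scheme
# and an optional "user@" part, followed by the rest of the path.
FTP_BUNDLE_RE = re.compile(
    r"(?:ftp://(?:[\w.-]+@)?)?ftp\.broadinstitute\.org/bundle\S*"
)


def find_ftp_bundle_refs(text: str) -> List[Tuple[int, str]]:
    """Return (line number, matched path) for each FTP bundle reference."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in FTP_BUNDLE_RE.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


if __name__ == "__main__":
    # Illustrative script contents; the file location is an assumption.
    script = (
        "DBSNP=ftp://gsapubftp-anonymous@ftp.broadinstitute.org"
        "/bundle/b37/dbsnp_138.b37.vcf\n"
        "REF=gs://gcp-public-data--broad-references/placeholder.fasta\n"
    )
    for lineno, path in find_ftp_bundle_refs(script):
        print(f"line {lineno}: {path}")
```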

To access the bundle on the FTP server, use the following login credentials in your favorite FTP client:

location: ftp.broadinstitute.org/bundle
username: gsapubftp-anonymous
password:

If you are using your browser as an FTP client, make sure to include the login information in the address, otherwise you will access the general Broad Institute FTP instead of our team FTP. This should work as a direct link:

ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/

The bundle/ directory contains five subdirectories, one for each build of the human genome that we have resources for: b36, b37, hg18, hg19 and hg38 (aka GRCh38). Be aware that the hg38 resource set is provided as-is, and its contents may still be incomplete.

Currently, the following resources are available only through FTP:

hg19 Resources: lifted over from b37

Includes the UCSC-style hg19 reference along with all lifted over VCF files.

hg18 Resources: lifted over from b37

Includes the UCSC-style hg18 reference along with all lifted over VCF files. The refGene track and BAM files are not available. For this genome build, we only provide data files that can be lifted over "easily" from our master b37 repository. We apologize for any inconvenience this may cause.
Also includes a chain file to lift over to b37.

b36 Resources: lifted over from b37

Includes the 1000 Genomes pilot b36 formatted reference sequence (human_b36_both.fasta) along with all lifted over VCF files. The refGene track and BAM files are not available. For this genome build, we only provide data files that can be lifted over "easily" from our master b37 repository. We apologize for any inconvenience this may cause.
Also includes a chain file to lift over to b37.
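
Because the hg19 and hg18 sets use UCSC-style references while b37 does not, it can help to check which contig naming convention a reference uses before pairing it with resource files. The sketch below is an illustration, assuming a standard FASTA index (`.fai`) whose first tab-separated column is the contig name. UCSC-style names carry a `chr` prefix (`chr1`, `chrM`), whereas b37 names do not (`1`, `MT`). The function name and the index lines in the example are hypothetical.

```python
from typing import Iterable


def contig_style(fai_lines: Iterable[str]) -> str:
    """Classify contig names in a .fai index by their 'chr' prefix.

    Returns "chr-prefixed", "unprefixed", or "mixed".
    """
    names = [line.split("\t", 1)[0] for line in fai_lines if line.strip()]
    if not names:
        raise ValueError("no contigs found in index")
    prefixed = sum(name.startswith("chr") for name in names)
    if prefixed == len(names):
        return "chr-prefixed"
    if prefixed == 0:
        return "unprefixed"
    return "mixed"


if __name__ == "__main__":
    # Illustrative index lines; offsets and line widths are made up.
    ucsc_like = ["chr1\t249250621\t6\t50\t51", "chrM\t16571\t100\t50\t51"]
    b37_like = ["1\t249250621\t52\t60\t61", "MT\t16569\t200\t60\t61"]
    print(contig_style(ucsc_like))
    print(contig_style(b37_like))
```

A "mixed" result is not necessarily an error, since some references combine prefixed chromosomes with differently named extra contigs. It is mainly a cue to inspect the index by hand.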

18 comments

Nickier

September 02, 2020 11:09
Hello! Could you provide a README file that introduces the resource files?
5

Ashi

November 16, 2020 19:39
Hi,

I am trying to run BaseRecalibrator with my WGS data.

My ref is hg19 reference. Where can I get SNP and Indel vcf files in hg19 version?

I found b37 version files (below) in google cloud gs://gatk-legacy-bundles, but not for hg19.

dbsnp_138.b37.vcf

1000G_phase1.indels.b37.vcf (currently from the 1000 Genomes Phase I indel calls)

Mills_and_1000G_gold_standard.indels.b37.sites.vcf
6

Ashi

December 01, 2020 14:42
Hi

I have another question about hg38 genome reference fasta file.

I downloaded "Homo_sapiens_assembly38.fasta" from your Google Cloud bucket
```
https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0/
```
But, this fasta file does not have chrEBV seq.

To compare, next, I downloaded "hg38" file from here, https://support.illumina.com/sequencing/sequencing_software/igenome.html

hg38.fasta has chrEBV seq (just checked by "grep hg38.fasta").

Is there any reason for excluding chrEBV from your bundle-reference, hg38?
1

Joy Bordini

September 14, 2021 10:38

Edited
Hi Ashi,

I'm facing the same problem in retrieving hg19 resources. In particular:
- 1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg19.vcf
- Axiom_Exome_Plus.genotypes.all_populations.poly.hg19.vcf.gz
- Homo_sapiens_assembly19.known_indels.vcf.gz
Did you find a way to get them?

Thanks

Joy
-1

Patrick Blaney

December 06, 2021 17:42
Hello,

First, thank you to the members of the Broad for putting this bundle together for the genomics community.

I had a question regarding the specific assembly build of the hg38 reference genome. In the documentation there is a reference to the GRCh38.p7 release in "Technical Documentation->Glossary->Reference Genome Components" and then again it is mentioned in "Technical Documentation->Glossary-Human genome reference builds - GRCh38 or hg38 - b37 - hg19". However in the same paragraph it states "Note that the GATK team rarely if ever adopts patches due to constraints from our production operations. We are not currently able to provide support for the use of patches."

Does this mean that the current FASTA file (Homo_sapiens_assembly38.fasta) in the resource bundle is in fact NOT GRCh38.p7? Instead it is the primary release GRCh38 from 2013 with no patches included? This was unclear to me as I searched all the documentation.

Thank you,

Patrick
0

Lingyu Zhan

March 31, 2022 17:05
Hi GATK Team,

First, thank you for this post. I want to download hg19 version resources for VariantRecalibrator. From this page, it seems like these resources were available through FTP Server, which is now disabled. Is there any official platform that still provides these resources? Thank you so much.

Best

Lingyu
0

Lingyu Zhan

March 31, 2022 18:01
To add to my previous question, it seems that the 'genomics-public-data' bucket does not contain the complete list of b37 resources (for VariantRecalibrator) described above either. I would like to know whether any other bucket contains the complete set of b37 resources. Thank you so much.

Best

Lingyu
0

Limin Chen

April 20, 2022 19:58
Hello,

I know that `Homo_sapiens_assembly38.fasta.64.amb` is one of the bwa index file, but what does `.64` mean in the file name while the original fasta file DOES NOT have `.64`. Why add `.64`?

Is it possible to create Readme.txt to explain what each file does?

Thanks,

Best,

LC
1

HQ Zhao

October 27, 2022 01:56
I can't access Google Cloud; it says permission is required:
```
https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0/
```
0

Take Murata

November 24, 2022 02:29
Hello,

I previously accessed ftp.broad.mit.edu/pub/human_STS_releases/july97/ to get "07-97.YAC2STS.txt". How can I get the file now?

With best regards.

Take
0

Take Murata

November 28, 2022 23:47
progress report
I was able to access the ftp server and get the file.

Thank you for the support.

Take
0

Emily

February 21, 2023 12:53

Edited
Dear GATK team and community,

I have WES data and have aligned in my previous steps with bwa-mem with the ref genome hg38. I am now looking to do the BaseRecalibrator and BQSR steps with the same reference genome hg38.

However, the text above " In addition, we are currently transitioning to support the Grch38/hg38 reference build, but have not yet generated all of the files necessary for all use cases (in particular we are still missing the Hg38 version of the Broad's exome intervals)" has made me reconsider. Should I be using a different ref genome?

Any advice or clarification would be great!
0

Rahul Yadav

April 04, 2023 06:32
I can't find a gtf file for Homo_sapiens_assembly38 in the resource bundle v0 - genom…blic-data – Bucket details – Cloud Storage – Google Cloud console
0

Gil Stelzer

May 12, 2023 21:19
Hi GATK team

Thanks for making this resource bundle.

I was looking for an annotation file with gene symbols and their strand, exon\intron coordinates on the Grch38/hg38 build. I looked through the resource bundle and found the following file - Homo_sapiens_assembly38.fasta.64.ann

When I browsed the file I didn't see gene symbols (maybe I missed something). If you have an annotation file that I am looking for do you also have it in gtf \ gff \ bed format?

Many thanks,

Gil
0

Yap Sing Yee

April 28, 2024 16:26
Hi, GATK, currently I plan to use GATK 4 to find snp and compare the variants between samples, however I couldn't find the resource reference file for Vibrio spp., where do i get this file?? And how to setup and run GATK4 for my project??

Thanks for your patience on my questions. Thank you!!
0

Julia Wiggeshoff

July 03, 2024 11:44
Is there an estimate for when the exome files for the hg38 build will be released? The gtf files for that build are also still missing. Many thanks!
0

Elena S Kim

November 06, 2024 17:17
Hello everyone, just wanted to share where I finally found the legacy b37 (hg19) files I needed:

gsutil ls gs://gatk-best-practices/somatic-b37/

gs://gatk-best-practices/somatic-b37/
gs://gatk-best-practices/somatic-b37/CNV.hg19.bypos.v1.CR1_event_added.mod.seg
gs://gatk-best-practices/somatic-b37/CNV_and_centromere_blacklist.hg19.list
gs://gatk-best-practices/somatic-b37/HCC1143.bai
gs://gatk-best-practices/somatic-b37/HCC1143.bam
gs://gatk-best-practices/somatic-b37/HCC1143_normal.bai
gs://gatk-best-practices/somatic-b37/HCC1143_normal.bam
gs://gatk-best-practices/somatic-b37/Homo_sapiens_assembly19.dict
gs://gatk-best-practices/somatic-b37/Homo_sapiens_assembly19.fasta
gs://gatk-best-practices/somatic-b37/Homo_sapiens_assembly19.fasta.fai
gs://gatk-best-practices/somatic-b37/Mutect2-WGS-panel-b37.vcf
gs://gatk-best-practices/somatic-b37/Mutect2-WGS-panel-b37.vcf.idx
gs://gatk-best-practices/somatic-b37/Mutect2-exome-panel.vcf
gs://gatk-best-practices/somatic-b37/Mutect2-exome-panel.vcf.idx
gs://gatk-best-practices/somatic-b37/README.txt
gs://gatk-best-practices/somatic-b37/af-only-gnomad.raw.sites.vcf
gs://gatk-best-practices/somatic-b37/af-only-gnomad.raw.sites.vcf.idx
gs://gatk-best-practices/somatic-b37/final_centromere_hg19.seg
gs://gatk-best-practices/somatic-b37/onco_config.txt
gs://gatk-best-practices/somatic-b37/oncotator_v1_ds_April052016.tar.gz
gs://gatk-best-practices/somatic-b37/small_exac_common_3.vcf
gs://gatk-best-practices/somatic-b37/small_exac_common_3.vcf.idx
gs://gatk-best-practices/somatic-b37/whole_exome_agilent_1.1_refseq_plus_3_boosters.Homo_sapiens_assembly19.baits.interval_list
gs://gatk-best-practices/somatic-b37/whole_exome_agilent_1.1_refseq_plus_3_boosters.Homo_sapiens_assembly19.targets.interval_list
1

Elizabeth McMillan

January 17, 2025 01:19
Hi guys. There have been several questions about the missing gtf file for hg38. Do you guys have plans of adding it? thanks-
1
