Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Resource bundle

24 comments

  • Nickier

    Hello! Could you add a README file that introduces the resource files?

    5
  • Ashi

    Hi,

    I am trying to run BaseRecalibrator with my WGS data.

    My reference is hg19. Where can I get the SNP and indel VCF files for hg19?

    I found the b37 versions of these files (listed below) in Google Cloud at gs://gatk-legacy-bundles, but not the hg19 versions.

    dbsnp_138.b37.vcf

    1000G_phase1.indels.b37.vcf (currently from the 1000 Genomes Phase I indel calls)

    Mills_and_1000G_gold_standard.indels.b37.sites.vcf

     

    6
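
For readers following the BaseRecalibrator question above: each known-sites resource is passed with its own `--known-sites` argument. The sketch below only assembles such a command line in Python; the BAM, reference, and output file names are placeholders (assumptions, not files from the bundle), and the three VCF names are the b37 files listed in the comment. The known-sites files need to use the same contig naming as the reference, which is why b37 files are paired here with a b37 reference rather than an hg19 one.

```python
import shlex


def base_recalibrator_cmd(bam, reference, known_sites, out_table):
    """Assemble a GATK4 BaseRecalibrator command line as a list of arguments."""
    cmd = ["gatk", "BaseRecalibrator", "-I", bam, "-R", reference]
    for vcf in known_sites:
        # Each known-sites resource gets its own --known-sites flag.
        cmd += ["--known-sites", vcf]
    cmd += ["-O", out_table]
    return cmd


# Placeholder inputs (assumptions); only the VCF names come from the comment above.
cmd = base_recalibrator_cmd(
    bam="sample.bam",
    reference="reference.b37.fasta",
    known_sites=[
        "dbsnp_138.b37.vcf",
        "1000G_phase1.indels.b37.vcf",
        "Mills_and_1000G_gold_standard.indels.b37.sites.vcf",
    ],
    out_table="recal_data.table",
)
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so it can be handed to `subprocess.run(cmd, check=True)` without shell quoting issues.
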
  • Ashi

    Hi

    I have another question, this time about the hg38 reference genome FASTA file.

    I downloaded "Homo_sapiens_assembly38.fasta" from your Google Cloud bucket

    https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0/

    However, this FASTA file does not contain the chrEBV sequence.

    For comparison, I then downloaded the "hg38" file from https://support.illumina.com/sequencing/sequencing_software/igenome.html

    That hg38.fasta does contain the chrEBV sequence (I checked with "grep hg38.fasta").

     

    Is there a reason chrEBV is excluded from the hg38 reference in your bundle?

     

     

     

    1
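
The contig check described in the comment above can be made explicit by listing the sequence names in a FASTA file's header lines. This is a minimal sketch, assuming a plain (uncompressed) FASTA; the function names are illustrative and not part of any GATK tool.

```python
def fasta_contig_names(path):
    """Return the sequence names taken from the '>' header lines of a FASTA file."""
    names = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                # The name is the first whitespace-delimited token after '>'.
                fields = line[1:].split()
                if fields:
                    names.append(fields[0])
    return names


def has_contig(path, contig):
    """Report whether a contig name appears among the FASTA headers."""
    return contig in fasta_contig_names(path)
```

For example, `has_contig("Homo_sapiens_assembly38.fasta", "chrEBV")` would report whether a local copy of the bundle FASTA has a header named chrEBV, and the same call on `hg38.fasta` would allow a like-for-like comparison.
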
  • Joy Bordini

    Hi Ashi,

     

    I'm facing the same problem in retrieving hg19 resources. In particular:

    • 1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg19.vcf
    • Axiom_Exome_Plus.genotypes.all_populations.poly.hg19.vcf.gz
    • Homo_sapiens_assembly19.known_indels.vcf.gz

    Did you find a way to get them?

     

    Thanks

     

    Joy

    -1
  • Patrick Blaney

    Hello,

    First, thank you to the members of the Broad for putting this bundle together for the genomics community.

    I had a question regarding the specific assembly build of the hg38 reference genome. In the documentation there is a reference to the GRCh38.p7 release in "Technical Documentation->Glossary->Reference Genome Components", and it is mentioned again in "Technical Documentation->Glossary->Human genome reference builds - GRCh38 or hg38 - b37 - hg19". However, in the same paragraph it states: "Note that the GATK team rarely if ever adopts patches due to constraints from our production operations. We are not currently able to provide support for the use of patches."

    Does this mean that the current FASTA file (Homo_sapiens_assembly38.fasta) in the resource bundle is in fact NOT GRCh38.p7, and is instead the primary GRCh38 release from 2013 with no patches included? This remained unclear to me after searching all the documentation.

    Thank you,

    Patrick

    0
  • Lingyu Zhan

    Hi GATK Team,

    First, thank you for this post. I want to download the hg19 versions of the resources for VariantRecalibrator. From this page, it seems these resources used to be available through the FTP server, which is now disabled. Is there any official platform that still provides them? Thank you so much.

    Best

    Lingyu

     

    0
  • Lingyu Zhan

    To add to my previous question, it seems that the 'genomics-public-data' bucket does not contain the complete list of b37 resources (for VariantRecalibrator) as indicated, either. I would like to know whether any other buckets contain the complete set of b37 resources. Thank you so much.

    Best

    Lingyu

    0
  • Limin Chen

    Hello, 

    I know that `Homo_sapiens_assembly38.fasta.64.amb` is one of the bwa index files, but what does `.64` mean in the file name, given that the original FASTA file name does not have `.64`? Why was `.64` added?

    Would it be possible to create a Readme.txt that explains what each file does?

    Thanks,

    Best,

    LC

    1
  • HQ Zhao

    I can't access Google Cloud; it says permission is required:

    https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0/
    1
  • Take Murata

    Hello, 

    I previously accessed ftp.broad.mit.edu/pub/human_STS_releases/july97/ to get “07-97.YAC2STS.txt”. How can I get the file now?

    With best regards.

    Take

    0
  • Take Murata

    Progress report: I was able to access the FTP server and get the file.

    Thank you for the support.

    Take

    0
  • Emily

    Dear GATK team and community, 

    I have WES data that I aligned in my previous steps with bwa-mem against the hg38 reference genome. I am now looking to run the BaseRecalibrator and BQSR steps with the same hg38 reference.

    However, this text above has made me reconsider: "In addition, we are currently transitioning to support the Grch38/hg38 reference build, but have not yet generated all of the files necessary for all use cases (in particular we are still missing the Hg38 version of the Broad's exome intervals)". Should I be using a different reference genome?

    Any advice or clarification would be great! 

    0
  • Rahul Yadav

    I can't find a gtf file for Homo_sapiens_assembly38 in the resource bundle v0 - genom…blic-data – Bucket details – Cloud Storage – Google Cloud console

    0
  • Gil Stelzer

    Hi GATK team

    Thanks for making this resource bundle.

    I was looking for an annotation file with gene symbols and their strand and exon/intron coordinates on the GRCh38/hg38 build. I looked through the resource bundle and found the following file: Homo_sapiens_assembly38.fasta.64.ann

    When I browsed the file I didn't see gene symbols (maybe I missed something). If you have the kind of annotation file I am looking for, do you also have it in GTF, GFF, or BED format?

    Many thanks,

    Gil

    0
  • Yap Sing Yee

    Hi GATK team, I plan to use GATK4 to find SNPs and compare the variants between samples. However, I couldn't find a reference resource file for Vibrio spp. Where can I get this file? And how do I set up and run GATK4 for my project?

    Thank you for your patience with my questions!

    0
  • Julia Wiggeshoff

    Is there an estimate for when the exome files for the hg38 build will be released? The gtf files for that build are also still missing. Many thanks!

    0
  • Elena S Kim

    Hello everyone, I just wanted to share where I finally found the legacy b37 (GRCh37) files I needed:

    gsutil ls gs://gatk-best-practices/somatic-b37/

    gs://gatk-best-practices/somatic-b37/
    gs://gatk-best-practices/somatic-b37/CNV.hg19.bypos.v1.CR1_event_added.mod.seg
    gs://gatk-best-practices/somatic-b37/CNV_and_centromere_blacklist.hg19.list
    gs://gatk-best-practices/somatic-b37/HCC1143.bai
    gs://gatk-best-practices/somatic-b37/HCC1143.bam
    gs://gatk-best-practices/somatic-b37/HCC1143_normal.bai
    gs://gatk-best-practices/somatic-b37/HCC1143_normal.bam
    gs://gatk-best-practices/somatic-b37/Homo_sapiens_assembly19.dict
    gs://gatk-best-practices/somatic-b37/Homo_sapiens_assembly19.fasta
    gs://gatk-best-practices/somatic-b37/Homo_sapiens_assembly19.fasta.fai
    gs://gatk-best-practices/somatic-b37/Mutect2-WGS-panel-b37.vcf
    gs://gatk-best-practices/somatic-b37/Mutect2-WGS-panel-b37.vcf.idx
    gs://gatk-best-practices/somatic-b37/Mutect2-exome-panel.vcf
    gs://gatk-best-practices/somatic-b37/Mutect2-exome-panel.vcf.idx
    gs://gatk-best-practices/somatic-b37/README.txt
    gs://gatk-best-practices/somatic-b37/af-only-gnomad.raw.sites.vcf
    gs://gatk-best-practices/somatic-b37/af-only-gnomad.raw.sites.vcf.idx
    gs://gatk-best-practices/somatic-b37/final_centromere_hg19.seg
    gs://gatk-best-practices/somatic-b37/onco_config.txt
    gs://gatk-best-practices/somatic-b37/oncotator_v1_ds_April052016.tar.gz
    gs://gatk-best-practices/somatic-b37/small_exac_common_3.vcf
    gs://gatk-best-practices/somatic-b37/small_exac_common_3.vcf.idx
    gs://gatk-best-practices/somatic-b37/whole_exome_agilent_1.1_refseq_plus_3_boosters.Homo_sapiens_assembly19.baits.interval_list
    gs://gatk-best-practices/somatic-b37/whole_exome_agilent_1.1_refseq_plus_3_boosters.Homo_sapiens_assembly19.targets.interval_list

    2
  • Elizabeth McMillan

    Hi guys. There have been several questions about the missing GTF file for hg38. Do you have plans to add it? Thanks.

    1
  • Michelle Paredes Escobar

    Hello GATK team,

    I am using the GRCh38 reference from the GATK resource bundle (Homo_sapiens_assembly38.fasta), and while validating the FASTA, I found a small number of IUPAC ambiguous bases (e.g., R, Y, S, W, K, M, B, D, H, V) still present in the sequence.

    I have already tried variant calling with Mutect2 / HaplotypeCaller; however, because of these ambiguous bases, the pipeline crashes and fails to complete.

    Has anyone else encountered this issue?

    Could you please confirm whether the presence of ambiguous bases is expected in the current Broad-distributed GRCh38 reference?

    If so, is there a recommended workaround or best practice for handling them (e.g., masking, replacing them with N, or using an alternative Broad-provided FASTA) to ensure stable and reproducible variant calling?

    Any guidance would be greatly appreciated.

    0
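
The FASTA validation described in the comment above can be reproduced with a short scan that counts IUPAC ambiguity codes per contig. This is a minimal sketch, assuming a plain (uncompressed) FASTA; the function name is illustrative. It only reports where such codes occur, does not modify the reference, and is not presented as a fix for the reported crash.

```python
from collections import Counter

# IUPAC ambiguity codes other than N, as listed in the comment above.
IUPAC_AMBIGUITY_CODES = set("RYSWKMBDHV")


def count_ambiguous_bases(path):
    """Count IUPAC ambiguity codes (excluding N) per contig in a FASTA file."""
    counts = {}
    contig = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                fields = line[1:].split()
                contig = fields[0] if fields else ""
                counts[contig] = Counter()
            elif contig is not None:
                # Tally the whole line once, then keep only the ambiguity codes.
                line_counts = Counter(line.upper())
                for base in IUPAC_AMBIGUITY_CODES:
                    if line_counts[base]:
                        counts[contig][base] += line_counts[base]
    # Report only contigs where at least one ambiguity code was found.
    return {name: dict(c) for name, c in counts.items() if c}
```

Calling `count_ambiguous_bases("Homo_sapiens_assembly38.fasta")` on a local copy returns a dictionary keyed by contig name, which shows whether the ambiguous bases fall on primary chromosomes or on other contigs.
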
  • Steven P. Vensko II

    I've been able to download files from the GATK bundle previously, but I'm currently getting 403 errors when attempting to fetch files. Others on different networks are reporting the same thing. Has something changed in how we should be accessing the files?

    3
  • Landon Luesing

    As of 02/13/2026, something is wrong with the Google bucket permissions for both the https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0/ link given and the https://console.cloud.google.com/storage/browser/genomics-public-data links in the article. Both give "Access Denied" when attempting to view or access anything in the buckets. The bucket owned by the Broad Institute still works. Can this be corrected to allow file access? Thanks!

    1
  • Landon Luesing

    The Azure Data Access links given on the linked Azure support page all fail as well. They all result in DNS errors for the FQDNs listed on the https://learn.microsoft.com/en-us/azure/open-datasets/dataset-gatk-resource-bundle page.

    0
  • Elise H

    Landon Luesing, did you ever figure this out? I am having the same issue.

    1
  • Landon Luesing

    Hi, Elise H. I did! I was able to locate an AWS S3 bucket at https://s3.amazonaws.com/gatk-test-data, which you can grab data from. There's also a "read me" file located at https://s3.amazonaws.com/gatk-test-data/gatk-test-data-readme.html. If you want to access a file without using an S3 manager or another command-line-driven option, you can do so by taking the HTTPS link to the bucket and appending the "Key" value of the file you want. For example: https://s3.amazonaws.com/gatk-test-data/cnv/somatic/SM-74P4M.bam.

    I've also used some resources from the Broad Institute GCP bucket: https://console.cloud.google.com/storage/browser/gcp-public-data--broad-references/hg38/v0. You can also download FASTA and SNP reference files from places like Illumina (https://support.illumina.com/downloads/genome-fasta-files.html/ and https://knowledge.illumina.com/software/on-premises-software/software-on-premises-software-troubleshooting-list/000007409), Ensembl (https://www.ensembl.org/Homo_sapiens/Info/Index), and NCBI (https://www.ncbi.nlm.nih.gov/datasets/taxonomy/9606/). I hope that this helps you!

    0
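
The bucket-link-plus-key approach described in the comment above can be sketched in Python. The bucket URL and the example key are the ones quoted in the comment; the helper names are illustrative, and the download helper assumes the object is publicly readable over HTTPS.

```python
from urllib.parse import quote
from urllib.request import urlretrieve

BUCKET_URL = "https://s3.amazonaws.com/gatk-test-data"


def object_url(key, bucket_url=BUCKET_URL):
    """Build the HTTPS URL for an object by appending its key to the bucket URL."""
    # Percent-encode unsafe characters but keep '/' so the key's path is preserved.
    return bucket_url.rstrip("/") + "/" + quote(key.lstrip("/"), safe="/")


def download(key, destination):
    """Download one object over HTTPS to a local path (needs network access)."""
    urlretrieve(object_url(key), destination)


print(object_url("cnv/somatic/SM-74P4M.bam"))
# prints https://s3.amazonaws.com/gatk-test-data/cnv/somatic/SM-74P4M.bam
```

A call such as `download("cnv/somatic/SM-74P4M.bam", "SM-74P4M.bam")` would then fetch that file into the current directory.
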