Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Known Sites for BQSR

5 comments

  • Genevieve Brandt (she/her)

    Hello,

    The GATK support team is focused on resolving questions about GATK tool-specific errors and abnormal results from the tools. For all other questions, such as this one, we are building a backlog to work through when we have the capacity.

    Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools.

    We cannot guarantee a reply; however, we ask other community members to help out if they know the answer.

    For context, check out our support policy.

     

  • Carlos Uziel

    Hello,

    I am also interested in this topic, as there isn't enough documentation on which known-sites files and resources to use for BQSR and VQSR, respectively, with human WES data. I have curated a list of the resource files that I think should be used in both cases (my intuition is based entirely on reading the GATK documentation for the relevant tools and looking up the source databases).

    This is the list of resources I have come up with so far, including every file that needs to be downloaded (index files included):

     

    dbSNP

    From NCBI:

    • GCF_000001405.38.gz
    • GCF_000001405.38.gz.tbi

    Alternatively (though probably different versions), from the GATK Resource Bundle:

    • Homo_sapiens_assembly38.dbsnp138.vcf
    • Homo_sapiens_assembly38.dbsnp138.vcf.idx

     

    1000G

    From the GATK Resource Bundle:

    • 1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf
    • 1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf.idx

    This data is also available from EBI. I think the file corresponding to the one hosted in the GATK Resource Bundle would be "ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz", but I could be mistaken (at least the compressed size of 1.8 GB checks out). More information here.

     

    1000G omni

    From the GATK Resource Bundle:

    • 1000G_omni2.5.hg38.vcf.gz
    • 1000G_omni2.5.hg38.vcf.gz.tbi

     

    HapMap

    From the GATK Resource Bundle:

    • hapmap_3.3.hg38.vcf.gz
    • hapmap_3.3.hg38.vcf.gz.tbi

     

    Indels

    From the GATK Resource Bundle:

    • Homo_sapiens_assembly38.known_indels.vcf.gz
    • Homo_sapiens_assembly38.known_indels.vcf.gz.tbi
    • Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
    • Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi

     

    I would greatly appreciate any feedback on this list of resources, including whether more resources should be added and whether the list is suitable for human whole exome sequencing data aligned to hg38.
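
    As a rough illustration of how part of this list could be wired into BQSR, here is a minimal Python sketch. It assumes that the BQSR known sites are the bundle dbSNP file plus the two indel files; that choice is an assumption and is not confirmed anywhere in this thread. The helper names (index_name, missing_files, base_recalibrator_cmd) and the example paths are hypothetical. The only GATK arguments used are -I, -R, --known-sites and -O of BaseRecalibrator.

```python
from pathlib import Path

# Files taken from the list above. Treating exactly these three as the
# BQSR known sites is an assumption, not something this thread confirms.
BQSR_KNOWN_SITES = [
    "Homo_sapiens_assembly38.dbsnp138.vcf",
    "Homo_sapiens_assembly38.known_indels.vcf.gz",
    "Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
]


def index_name(vcf):
    """Return the index file name expected next to a resource file.

    Mirrors the pairing in the list above: compressed files (.gz) are
    listed with a .tbi index, plain .vcf files with a .idx index.
    """
    if vcf.endswith(".gz"):
        return vcf + ".tbi"
    if vcf.endswith(".vcf"):
        return vcf + ".idx"
    raise ValueError(f"unexpected resource file name: {vcf}")


def missing_files(resource_dir, vcfs=BQSR_KNOWN_SITES):
    """List resource files and indexes that are absent from resource_dir."""
    root = Path(resource_dir)
    wanted = [name for vcf in vcfs for name in (vcf, index_name(vcf))]
    return [name for name in wanted if not (root / name).is_file()]


def base_recalibrator_cmd(bam, reference, resource_dir, output_table,
                          known_sites=BQSR_KNOWN_SITES):
    """Build a BaseRecalibrator argument list with one --known-sites per file."""
    cmd = ["gatk", "BaseRecalibrator", "-I", bam, "-R", reference]
    for vcf in known_sites:
        cmd += ["--known-sites", str(Path(resource_dir) / vcf)]
    cmd += ["-O", output_table]
    return cmd
```

    Calling missing_files("resources") before a run gives a quick list of anything that still needs to be downloaded, and shlex.join(base_recalibrator_cmd("sample.bam", "Homo_sapiens_assembly38.fasta", "resources", "recal.table")) produces a command line that can be pasted into a shell. The file and directory names in that call are placeholders.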

  • Genevieve Brandt (she/her)

    Thank you for putting together this resource, Carlos Uziel. Our general recommendations, which will work for most use cases, can also be found in our public workflows; you can look through those pipelines to see what we are using for our own analysis.

  • Carlos Uziel

    Thank you Genevieve Brandt (she/her)! I will check these workflows out.

  • Yanick Paco Hagemeijer

    Hi,
    As I am working from home, I am running into a problem downloading big files. Most of the VCF files in the hg38 resource bundle are provided as .vcf.gz, but not the two biggest ones, where compression would arguably save the most bandwidth. Is there a (legacy) reason why those two files were not provided bgzipped?

     
