Known Sites for BQSR

T. Li

December 07, 2020 22:26

Which known site references should I use for BQSR of WGS/Exome data?

On https://console.cloud.google.com/storage/browser/gcp-public-data--broad-references/hg38/v0, I see a bunch of VCFs:

1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38
1000G_omni2.5.hg38
1000G_phase1.snps.high_confidence.hg38
1000G_phase3_v4_20130502.sites.hg38
Axiom_Exome_Plus.genotypes.all_populations.poly.hg38
Homo_sapiens_assembly38.dbsnp138
Homo_sapiens_assembly38.known_indels
Mills_and_1000G_gold_standard.indels.hg38

a) Should I be using all of these for BQSR? If not, how do I choose which ones to use, and when would you want to use the other datasets?

b) The 1000G phase3 dataset is new to me. What's the difference between integrated sites only and v4?

c) What's the Axiom_Exome_Plus data set?

Many thanks for your help.


Genevieve Brandt (she/her)

December 08, 2020 00:54
Hello,

The GATK support team is focused on resolving questions about GATK tool-specific errors and abnormal results from the tools. For all other questions, such as this one, we are building a backlog to work through when we have the capacity.

Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools.

We cannot guarantee a reply; however, we ask other community members to help out if they know the answer.

For context, check out our support policy.

Carlos Uziel

March 03, 2021 06:07

Hello,

I am also interested in this topic, as there isn't enough documentation on which known-sites files to use for BQSR, or which resources to use for VQSR, with human WES data. I have curated a list of the resource files that I think should be used in both cases. The choice is based entirely on reading the GATK documentation of the different tools and looking up the source databases.

This is the list of resources I have come up with so far, including every file that needs to be downloaded (index files included):

dbSNP

From NCBI:
- GCF_000001405.38.gz
- GCF_000001405.38.gz.tbi
Alternatively (though probably a different version), from the GATK Resource Bundle:
- Homo_sapiens_assembly38.dbsnp138.vcf
- Homo_sapiens_assembly38.dbsnp138.vcf.idx

1000G

From the GATK Resource Bundle:
- 1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf
- 1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf.idx

This data is also available from EBI. I think the file corresponding to the one hosted in the GATK Resource Bundle is "ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz", but I could be mistaken (at least the compressed size of 1.8GB checks out). More information here.

1000G omni

From the GATK Resource Bundle:
- 1000G_omni2.5.hg38.vcf.gz
- 1000G_omni2.5.hg38.vcf.gz.tbi

HapMap

From the GATK Resource Bundle:
- hapmap_3.3.hg38.vcf.gz
- hapmap_3.3.hg38.vcf.gz.tbi

Indels

From the GATK Resource Bundle:
- Homo_sapiens_assembly38.known_indels.vcf.gz
- Homo_sapiens_assembly38.known_indels.vcf.gz.tbi
- Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
- Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi

I would greatly appreciate any feedback on this list: whether more resources should be added, and whether it is suitable for human whole exome sequencing data aligned to hg38.
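
As a rough sketch of how files from a list like this might be handed to BQSR, the Python snippet below builds a `gatk BaseRecalibrator` command with one `--known-sites` argument per VCF. The selection of dbSNP plus the two indel files is an assumption made here for illustration, not a recommendation confirmed in this thread, and the BAM, reference, and output names are hypothetical placeholders.

```python
import shlex

# Assumed selection of known-sites files (illustrative only, not confirmed
# in this thread); file names are taken from the resource list above.
KNOWN_SITES = [
    "Homo_sapiens_assembly38.dbsnp138.vcf",
    "Homo_sapiens_assembly38.known_indels.vcf.gz",
    "Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
]


def base_recalibrator_cmd(bam, reference, known_sites, out_table):
    """Return the argument list for a `gatk BaseRecalibrator` call.

    Each known-sites VCF gets its own --known-sites flag.
    """
    cmd = ["gatk", "BaseRecalibrator", "-I", bam, "-R", reference]
    for vcf in known_sites:
        cmd += ["--known-sites", vcf]
    cmd += ["-O", out_table]
    return cmd


# Hypothetical input and output names, for illustration only.
cmd = base_recalibrator_cmd(
    "sample.bam",
    "Homo_sapiens_assembly38.fasta",
    KNOWN_SITES,
    "sample.recal.table",
)

# Print a shell-safe command line; nothing is executed here.
print(shlex.join(cmd))
```

The command is only printed, so the list can be reviewed or edited before anything is run. Each VCF also needs its index file (.idx or .tbi) next to it.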

Genevieve Brandt (she/her)

March 02, 2021 19:28
Thank you for putting together this resource, Carlos Uziel. Our general recommendations, which will work for most use cases, can also be found in our public workflows; you can look through those pipelines to see what we use for our own analyses.

Carlos Uziel

March 03, 2021 06:03
Thank you, Genevieve Brandt (she/her)! I will check these workflows out.

Yanick Paco Hagemeijer

August 02, 2021 16:34
Hi,
As I am working from home, I am running into problems downloading big files. Most of the VCF files in the hg38 resource bundle are provided as .vcf.gz, but not the two biggest ones, where compression would arguably save the most bandwidth. Is there a (legacy) reason why those two files are not provided bgzipped?
