Known Sites for BQSR
Which known site references should I use for BQSR of WGS/Exome data?
On https://console.cloud.google.com/storage/browser/gcp-public-data--broad-references/hg38/v0, I see a bunch of VCFs:
1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38
1000G_omni2.5.hg38
1000G_phase1.snps.high_confidence.hg38
1000G_phase3_v4_20130502.sites.hg38
Axiom_Exome_Plus.genotypes.all_populations.poly.hg38
Homo_sapiens_assembly38.dbsnp138
Homo_sapiens_assembly38.known_indels
Mills_and_1000G_gold_standard.indels.hg38
a) Should I be using all of these for BQSR? If not, how do I choose which ones to use, and when would I want to use the other datasets?
b) The 1000G phase3 dataset is new to me. What's the difference between integrated sites only and v4?
c) What's the Axiom_Exome_Plus data set?
Many thanks for your help.
-
Hello,
The GATK support team is focused on resolving questions about GATK tool-specific errors and abnormal results from the tools. For all other questions, such as this one, we are building a backlog to work through when we have the capacity.
Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools.
We cannot guarantee a reply; however, we ask other community members to help out if they know the answer.
For context, check out our support policy.
-
Hello,
I am also interested in this topic, as there isn't enough documentation on which known-sites files to use for BQSR, or which resources to use for VQSR, with human WES data. I have curated a list of the resource files that I think should be used in both cases (my intuition is based entirely on reading the GATK documentation for the relevant tools and looking up the source databases).
This is the list of resources I have come up with so far, including all files that need to be downloaded (and therefore the index files as well):
dbSNP
From NCBI:
- GCF_000001405.38.gz
- GCF_000001405.38.gz.tbi
Alternatively (though probably different versions), from the GATK Resource Bundle:
- Homo_sapiens_assembly38.dbsnp138.vcf
- Homo_sapiens_assembly38.dbsnp138.vcf.idx
1000G
From the GATK Resource Bundle:
- 1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf
- 1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf.idx
This data is also available from EBI. I think the file corresponding to the one hosted in the GATK Resource Bundle would be "ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz", but I could be mistaken (at least the compressed size of 1.8GB checks out). More information here.
1000G omni
From GATK Resource Bundle:
- 1000G_omni2.5.hg38.vcf.gz
- 1000G_omni2.5.hg38.vcf.gz.tbi
HapMap
From GATK Resource Bundle:
- hapmap_3.3.hg38.vcf.gz
- hapmap_3.3.hg38.vcf.gz.tbi
Indels
From GATK Resource Bundle:
- Homo_sapiens_assembly38.known_indels.vcf.gz
- Homo_sapiens_assembly38.known_indels.vcf.gz.tbi
- Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
- Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi
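Since every resource above comes paired with an index, here is a minimal Python sketch of the naming pattern the list follows: gzip-compressed files carry a `.tbi` index and plain `.vcf` files carry a `.idx` index. The helper name `expected_index` is hypothetical (not part of GATK), and the rule is inferred only from the file names listed above.

```python
def expected_index(vcf_name: str) -> str:
    """Return the index file name that accompanies a resource in the list above.

    Assumption: compressed files (*.gz) use a tabix index (.tbi) and
    uncompressed *.vcf files use a .idx index, as in the list above.
    """
    if vcf_name.endswith(".gz"):
        return vcf_name + ".tbi"
    if vcf_name.endswith(".vcf"):
        return vcf_name + ".idx"
    raise ValueError(f"not a recognised VCF file name: {vcf_name}")


# File names copied from the list above.
resources = [
    "GCF_000001405.38.gz",
    "Homo_sapiens_assembly38.dbsnp138.vcf",
    "1000G_omni2.5.hg38.vcf.gz",
    "Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
]

for name in resources:
    print(name, "->", expected_index(name))
```

This only checks naming; it does not verify that an index is actually valid for its VCF.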
I would greatly appreciate any feedback on this list of resources, whether there are more resources that should be added and its suitability for human whole exome sequencing data aligned to hg38.
-
Thank you for putting together this resource, Carlos Uziel. Our general recommendations, which will work for most use cases, can also be found in our public workflows; you can look through those pipelines to see what we use for our own analysis.
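As a rough illustration of how several known-sites files are passed to BQSR, the Python sketch below only builds and prints a `gatk BaseRecalibrator` command line; it does not run GATK. The helper `base_recalibrator_cmd`, the input names `sample.bam` and `sample.recal.table`, and the reference file name are placeholders. The three known-sites files shown (dbSNP138 plus the two indel sets) are, to my understanding, the combination used in the public hg38 workflows, but that is an assumption to verify against the workflow inputs themselves.

```python
import shlex


def base_recalibrator_cmd(bam, reference, known_sites, out_table):
    """Build the argument list for `gatk BaseRecalibrator` (hypothetical helper)."""
    cmd = ["gatk", "BaseRecalibrator", "-I", bam, "-R", reference]
    # --known-sites may be given multiple times, once per resource file.
    for vcf in known_sites:
        cmd += ["--known-sites", vcf]
    cmd += ["-O", out_table]
    return cmd


# Assumed selection of known sites; check the public workflow inputs before relying on it.
known_sites = [
    "Homo_sapiens_assembly38.dbsnp138.vcf",
    "Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
    "Homo_sapiens_assembly38.known_indels.vcf.gz",
]

cmd = base_recalibrator_cmd(
    bam="sample.bam",                           # placeholder input BAM
    reference="Homo_sapiens_assembly38.fasta",  # placeholder reference name
    known_sites=known_sites,
    out_table="sample.recal.table",             # placeholder output table
)
print(shlex.join(cmd))
```

Building the command as a list keeps each file name a separate argument, so paths with spaces do not need manual quoting.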
-
Thank you Genevieve Brandt (she/her)! I will check these workflows out.
-
Hi,
As I am working from home, I am running into a problem downloading big files. Most of the VCF files in the hg38 resource bundle are provided as .vcf.gz, but not the two largest, which would arguably save the most bandwidth. Is there a (legacy) reason why those two files were not provided bgzipped?