Questions Regarding VCF Files Found in hg38 Public Resource Bundle
Can you please provide:
a) GATK version used: GATK 4.1.7.0
b) Exact GATK commands used: N/A
c) The entire error log, if applicable: N/A
Hello! I am attempting to implement the GATK 4.1.7.0 "best practices pipelines" for (1) Data Pre-processing for Variant Discovery, (2) Somatic short variant discovery (SNVs + Indels), and (3) Somatic Copy Number Variant Discovery on human WES and WGS data.
I downloaded the most recent version of the hg38 resource bundle from here:
https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0?pli=1
I have two overarching questions whose answers I could not find in the documentation:
1. What are the differences between, and the origins of, the various known-variation VCF files in the hg38 resource bundle? The bundle I downloaded contains the following VCF files:
1000G_omni2.5.hg38.vcf.gz
1000G_phase1.snps.high_confidence.hg38.vcf.gz
Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz
hapmap_3.3.hg38.vcf.gz
Homo_sapiens_assembly38.known_indels.vcf.gz
Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
2. For the "best practices" pipelines for (1) Data Pre-processing for Variant Discovery, (2) Somatic short variant discovery (SNVs + Indels), and (3) Somatic Copy Number Variant Discovery on human WES and WGS data, which of these VCF files should I use at each step that requires known variants?
Thank you,
Sara
-
Hi,
The GATK support team is focused on resolving questions about GATK tool-specific errors and abnormal results from the tools. For all other questions, such as this one, we are building a backlog to work through when we have the capacity.
Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools.
We cannot guarantee a reply; however, we ask other community members to help out if they know the answer.
For context, check out our support policy.
-
Hi Sara Coder
The answer to your first question is easy to find with a web search, since each filename contains the name of the study or project that produced the VCF.
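Beyond the filename, a VCF's own header sometimes records where the file came from. The following is a minimal sketch for listing those header lines, assuming the bundle files are ordinary gzip/bgzip-compressed VCFs whose meta-information lines start with `##`. The function names `read_vcf_header` and `header_keys` are hypothetical, and nothing in this thread confirms which header keys each bundle file actually carries.

```python
import gzip


def read_vcf_header(path):
    """Return the meta-information lines (starting with '##') of a gzipped VCF."""
    header = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if not line.startswith("##"):
                break  # reached the #CHROM column line or the first record
            header.append(line.rstrip("\n"))
    return header


def header_keys(header_lines):
    """List the distinct keys of '##key=value' lines in first-seen order."""
    keys = []
    for line in header_lines:
        key = line[2:].split("=", 1)[0]
        if key not in keys:
            keys.append(key)
    return keys
```

For example, `header_keys(read_vcf_header("hapmap_3.3.hg38.vcf.gz"))` would list the kinds of metadata that file declares, which can then be read in full for hints about its source.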
As for the second question, most of these VCFs are used in the VQSR steps. The documentation page for VariantRecalibrator has all the necessary information, including how to use these VCFs.
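To make the VQSR usage concrete, here is a hedged sketch that assembles a SNP-mode VariantRecalibrator command line from three of the bundle files listed in the question. The resource labels, the known/training/truth flags, and the priors are written as recalled from the example on the VariantRecalibrator documentation page and should be checked there for the GATK version in use. The helper name `build_vqsr_snp_command`, the default annotation list, and the output file names are assumptions made for illustration. Note also that VQSR belongs to the germline short variant workflow, so it may not apply to the somatic pipelines asked about.

```python
# Resource settings as recalled from the VariantRecalibrator documentation
# example; treat the flags and priors as assumptions and verify before use.
SNP_RESOURCES = [
    ("hapmap",
     {"known": "false", "training": "true", "truth": "true", "prior": "15.0"},
     "hapmap_3.3.hg38.vcf.gz"),
    ("omni",
     {"known": "false", "training": "true", "truth": "false", "prior": "12.0"},
     "1000G_omni2.5.hg38.vcf.gz"),
    ("1000G",
     {"known": "false", "training": "true", "truth": "false", "prior": "10.0"},
     "1000G_phase1.snps.high_confidence.hg38.vcf.gz"),
]


def build_vqsr_snp_command(reference, input_vcf, out_prefix,
                           annotations=("QD", "MQ", "FS")):
    """Build (but do not run) a SNP-mode VariantRecalibrator argument list."""
    cmd = ["gatk", "VariantRecalibrator", "-R", reference, "-V", input_vcf]
    for label, flags, path in SNP_RESOURCES:
        tags = ",".join(f"{key}={value}" for key, value in flags.items())
        cmd += [f"--resource:{label},{tags}", path]
    for annotation in annotations:  # hypothetical default annotation set
        cmd += ["-an", annotation]
    cmd += ["-mode", "SNP",
            "-O", f"{out_prefix}.recal",
            "--tranches-file", f"{out_prefix}.tranches"]
    return cmd
```

Returning a list rather than running the tool keeps the sketch safe to inspect; the list can be printed for review or passed to `subprocess.run` once the arguments have been checked against the documentation.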
-
Hi danilovkiri,
I know how the 1000 Genomes data were generated, but I can't find any information about the 1000G_phase1.snps.high_confidence.hg38.vcf.gz file on the IGSR website or FTP server.
Is this just the "95% of SNPs at 1% frequency" mentioned in the paper, or did the Broad filter the findings in this file for inclusion in the Broad Resource Bundle? In other words, what makes the SNPs in the file high confidence?
Specifically, I was wondering if they are common SNPs or just all SNPs that were detected with certainty?
Thanks.
-
Hi Mark Godek,
The GATK support team is focused on resolving questions about GATK tool-specific errors and abnormal results from the tools. For all other questions, such as this one, we are building a backlog to work through when we have the capacity.
Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools.
We cannot guarantee a reply; however, we ask other community members to help out if they know the answer.
For context, check out our support policy.