af-only-gnomad.hg38.vcf.gz details
Hello,
Are there details available about how exactly af-only-gnomad.hg38.vcf.gz was created?
I've seen elsewhere that it was generated from one of the Gnomad v2 versions (not sure which exactly) and is a merge of the exome and genome data lifted over to hg38, but could not find specifics and could not easily reverse engineer the values.
Specifically, what Gnomad field(s) were used to compute the AF value in af-only-gnomad.hg38.vcf.gz? Is it just computed from a merged tally of AC and AN from the exome and genome files?
Also, has any consideration been given to using AF_popmax and/or including Gnomad v3? Either of these could potentially make a big difference for previously underrepresented populations.
Thanks very much.
-
Additionally, there are non_cancer allele counts in the Gnomad files which I believe exclude cancer samples. Do we know if these samples are included in the af-only-gnomad.hg38.vcf.gz AF?
-
Hi lmose,
Please see this Mutect2 FAQ; question 19 has information about the af-only-gnomad.hg38.vcf.gz file. There are also many discussions on the forum regarding that file, for example this one: https://gatk.broadinstitute.org/hc/en-us/community/posts/360058276951-Which-file-is-af-only-gnomad-hg38-vcf-gz-
We are not the group who creates the gnomad files so we only have specific information about how the file was modified for GATK use.
Genevieve
-
Hi Genevieve,
Thanks for responding. Yes, I spent time looking at the various threads including the one you mentioned. I could not find the specific details in any of them or in the FAQ. My apologies if I'm missing something.
My questions above are specifically about the details of how this file was created. The details of the original Gnomad files are already well documented by the Gnomad team.
To clarify, from what I understand, af-only-gnomad.hg38.vcf.gz appears to have re-computed AF values from the Gnomad exomes and genomes (from some version of Gnomad 2). It would be good to understand exactly how these values were re-computed and exactly which Gnomad version was used.
Thanks again.
-
I don't believe that the AF values have been recomputed. According to our documentation, the only change is the removal of the other INFO fields.
If this is not the case, please provide more information so that I can thoroughly look into it.
-
The Gnomad datasets are provided for exomes and genomes separately with AF computed separately for each. My understanding is that af-only-gnomad.hg38.vcf.gz includes both genome and exome data and so this would require either recomputing the AF values, or just picking one from either the exome or the genome datasets. It would be good to have clarification on what was done either way.
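To make the "recompute" option concrete, here is a minimal sketch of what pooling the two callsets would look like arithmetically. This is only an illustration of the calculation being asked about, not a description of how af-only-gnomad.hg38.vcf.gz was actually built; the function name and the counts in the usage note are made up.

```python
from typing import Optional


def pooled_af(ac_exome: int, an_exome: int,
              ac_genome: int, an_genome: int) -> Optional[float]:
    """Allele frequency from summed exome and genome counts.

    Hypothetical helper: AC is the alternate allele count and AN the
    total number of called alleles, as in the gnomAD INFO fields.
    """
    total_an = an_exome + an_genome
    if total_an == 0:
        # No called alleles in either callset, so AF is undefined.
        return None
    return (ac_exome + ac_genome) / total_an
```

For example, with made-up counts of AC=10, AN=1000 in exomes and AC=30, AN=1000 in genomes, the pooled value would be 40/2000 = 0.02, which differs from both the exome AF (0.01) and the genome AF (0.03).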
Also, there are multiple versions of Gnomad v2. It would be good to know which one is used here. The latest is v2.1.1.
-
Hi lmose,
We don't have this information currently available but I can submit a documentation request.
Our first priority is resolving questions about GATK tool-specific errors and abnormal results from the tools. For more information, you can view our support policy. We are not able to guarantee a solution for this request. If other community members know this answer and can help out, please do so!
Genevieve
-
Hi lmose,
I found a WDL script that is used by our developers to make the Mutect2 resources. You can view it for more detailed information about how the file is made.
https://github.com/broadinstitute/gatk/blob/master/scripts/mutect2_wdl/mutect_resources.wdl
Genevieve
-
Thanks. That script appears to just remove the non-AF fields.
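For anyone reading later, "remove the non-AF fields" amounts to something like the sketch below. It is plain Python written for illustration under my own assumptions, not the commands the WDL actually runs, and the helper names and the example record are hypothetical.

```python
def keep_only_af(info: str) -> str:
    """Keep only the AF=... entry of a VCF INFO string."""
    kept = [field for field in info.split(";") if field.startswith("AF=")]
    # VCF uses "." for an empty INFO column.
    return ";".join(kept) if kept else "."


def strip_record(line: str) -> str:
    """Return a VCF data line with INFO reduced to AF only.

    Header lines (starting with "#") are returned unchanged; a real tool
    would also prune the matching ##INFO header definitions.
    """
    if line.startswith("#"):
        return line
    cols = line.rstrip("\n").split("\t")
    cols[7] = keep_only_af(cols[7])  # INFO is the 8th VCF column
    return "\t".join(cols)
```

Note that a step like this only copies whatever AF is already in the input, so it does not by itself answer how the exome and genome values were combined.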