Resource bundle hg38 hosts corrupted VCF
If you are seeing an error, please provide(REQUIRED) :
a) GATK version used: 4.1.7.0
b) Exact command used:
gatk CalculateGenotypePosteriors -R hg38.fa -V my_WES.vcf \
-O my_WES_CGP.vcf --tmp-dir /home/rbottcher/tmp/ --supporting 1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf.gz
Dear GATK team,
I am currently performing germline variant calling for WES samples using the hg38 resource bundle downloaded from Broad.
After variant calling, I am running CalculateGenotypePosteriors on my samples, supplying the "1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf" via the parameter ---supporting as outlined in the GATK documentation. Unfortunately, this command fails throwing an error about differing chromosome lengths between the reference and the VCF:
Input files reference and features have incompatible contigs: Found contigs with the same name but different lengths: contig reference = chr15 / 101991189 contig features = chr15 / 90338345.
Since both files were downloaded from the same source, I had a closer look at the chromosome lengths in each of them. Here I found that the chromosome lengths of chr16 and chr17 in the VCF are identical:
##contig=<ID=chr1,assembly=GCF_000001405.26,length=248956422>
##contig=<ID=chr2,assembly=GCF_000001405.26,length=242193529>
##contig=<ID=chr3,assembly=GCF_000001405.26,length=198295559>
##contig=<ID=chr4,assembly=GCF_000001405.26,length=190214555>
##contig=<ID=chr5,assembly=GCF_000001405.26,length=181538259>
##contig=<ID=chr6,assembly=GCF_000001405.26,length=170805979>
##contig=<ID=chr7,assembly=GCF_000001405.26,length=159345973>
##contig=<ID=chr8,assembly=GCF_000001405.26,length=145138636>
##contig=<ID=chr9,assembly=GCF_000001405.26,length=138394717>
##contig=<ID=chr10,assembly=GCF_000001405.26,length=133797422>
##contig=<ID=chr11,assembly=GCF_000001405.26,length=135086622>
##contig=<ID=chr12,assembly=GCF_000001405.26,length=133275309>
##contig=<ID=chr13,assembly=GCF_000001405.26,length=114364328>
##contig=<ID=chr14,assembly=GCF_000001405.26,length=107043718>
##contig=<ID=chr15,assembly=GCF_000001405.26,length=90338345>
##contig=<ID=chr16,assembly=GCF_000001405.26,length=83257441>
##contig=<ID=chr17,assembly=GCF_000001405.26,length=83257441>
##contig=<ID=chr18,assembly=GCF_000001405.26,length=80373285>
##contig=<ID=chr19,assembly=GCF_000001405.26,length=58617616>
##contig=<ID=chr20,assembly=GCF_000001405.26,length=64444167>
##contig=<ID=chr21,assembly=GCF_000001405.26,length=46709983>
##contig=<ID=chr22,assembly=GCF_000001405.26,length=50818468>
##contig=<ID=chrX,assembly=GCF_000001405.26,length=156040895>
##contig=<ID=chrY,assembly=GCF_000001405.26,length=57227415>
while the chromosome lengths in the hg38.fa look fine:
chr1 AC:CM000663.2 gi:568336023 LN:248956422 AS:GRCh38
chr2 AC:CM000664.2 gi:568336022 LN:242193529 AS:GRCh38
chr3 AC:CM000665.2 gi:568336021 LN:198295559 AS:GRCh38
chr4 AC:CM000666.2 gi:568336020 LN:190214555 AS:GRCh38
chr5 AC:CM000667.2 gi:568336019 LN:181538259 AS:GRCh38
chr6 AC:CM000668.2 gi:568336018 LN:170805979 AS:GRCh38
chr7 AC:CM000669.2 gi:568336017 LN:159345973 AS:GRCh38
chr8 AC:CM000670.2 gi:568336016 LN:145138636 AS:GRCh38
chr9 AC:CM000671.2 gi:568336015 LN:138394717 AS:GRCh38
chr10 AC:CM000672.2 gi:568336014 LN:133797422 AS:GRCh38
chr11 AC:CM000673.2 gi:568336013 LN:135086622 AS:GRCh38
chr12 AC:CM000674.2 gi:568336012 LN:133275309 AS:GRCh38
chr13 AC:CM000675.2 gi:568336011 LN:114364328 AS:GRCh38
chr14 AC:CM000676.2 gi:568336010 LN:107043718 AS:GRCh38
chr15 AC:CM000677.2 gi:568336009 LN:101991189 AS:GRCh38
chr16 AC:CM000678.2 gi:568336008 LN:90338345 AS:GRCh38
chr17 AC:CM000679.2 gi:568336007 LN:83257441 AS:GRCh38
chr18 AC:CM000680.2 gi:568336006 LN:80373285 AS:GRCh38
chr19 AC:CM000681.2 gi:568336005 LN:58617616 AS:GRCh38
chr20 AC:CM000682.2 gi:568336004 LN:64444167 AS:GRCh38
chr21 AC:CM000683.2 gi:568336003 LN:46709983 AS:GRCh38
chr22 AC:CM000684.2 gi:568336002 LN:50818468 AS:GRCh38
chrX AC:CM000685.2 gi:568336001 LN:156040895 AS:GRCh38
chrY AC:CM000686.2 gi:568336000 LN:57227415 AS:GRCh38
This leads me to conclude that the "1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf" file currently hosted on Google is corrupted / faulty and should NOT be used for any analysis in its current state (I downloaded it twice to be sure, the file & error remain the same).
Since this file is referred to in the documentation of CalculateGenotypePosteriors, I am posting in hopes of raising awareness about this issue.
Kind regards
-
Hi René Böttcher,
Thank you for posting about this and starting a discussion! We try to maintain the files as possible, but sometimes there can be issues and we do not have the capacity to make the changes. This file specifically is from the 1000G site and you can find the original version there.
There were two discussions on our old forum site that will give you more information about how to work around this issue.
- https://sites.google.com/a/broadinstitute.org/legacy-gatk-forum-discussions/2017-01-18-2016-08-11/8694-1000Gphase3integratedsitesonlynoMATCHEDREVhg38vcf-corrupted
- https://sites.google.com/a/broadinstitute.org/legacy-gatk-forum-discussions/2019-02-11-2018-08-12/23411-supporting-dataset-for-CalculateGenotypePosteriors
I hope this helps you!
Genevieve
-
I am also concerned about this and posting here to continue this discussion for those who also run into this problem. I was looking at the vcf of 1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf and noticed that it ends at Chr15. So, from the links that you provided, I gather that if I am using the current reference hg38, then I need to liftover the 1000G phase3 file from b37 (from the ftp in your links) to hg38 and then use it in CalculateGenotypePosteriors?
Since the links are from back in 2017 and 2019, are there any alternative phase3 supporting files that CalculateGenotypePosteriors can use that you currently recommend?
Thank you in advance!
-
Hi ,
I am going to move your post into our Community Discussions -> General Discussion topic, as this topic is for reporting bugs and issues with GATK.
You can read more about our forum guidelines and the topics here: Forum Guidelines.
Best,
Bhanu
-
Linda Do To get the most up to date version of this file, you can look at the 1000 genomes website and follow the recommendations in the legacy links I provided above.
I'll reach out to the developers who work on CalculateGenotypePosteriors and see if we can replace that documentation or replace the file. I can't guarantee a date for this, however.
-
Hi Genevieve Brandt (she/her),
Just checking to see if this issue can be revisited. I am trying to use the 1000 genomes file in question to CalculateGenotpyePosteriors just as René originally posted about and the file in the resource bundle is still corrupted. I have tried the suggestions in the forum posts from 2016 and 2019, as well as multiple other workarounds not suggested, but none have succeeded. I also cannot locate the original on the 1000 genomes website.
The documentation for CalculateGenotypePosteriors still lists the 1000 genomes file as the file that should be passed in under the --supporting flag, but when I look through more recent documentation about relevant pipelines (both on Terra and using WDL), I do not see the CalculateGenotypePosteriors step even included. Our lab would like to be able to use this step to improve our genotyping calls, but we are wondering if this step is even still included in best practice recommendations.
For reference we are using gatk-4.2.6.1
Thank you for any clarity you are able to provide!
-
June 9th, 2023
If I run the following script:
#!/bin/bash
set -e
echo "# Downloading VCF"
wget -qc https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf
echo "# sha256sum of VCF"
sha256sum 1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf
echo "# Lengths of chr15 / chr16 from VCF header"
head -1000 1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf | grep -E -i '^##.*contig.*chr1[56].*'
echo "# Last five lines of VCF"
tail -5 1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf
I see the following output:# Downloading VCF
# sha256sum of VCF
d8a2e764e30d774618f64a681a72d32218e4b65e3906d21c278283c88877b13e 1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf
# Lengths of chr15 / chr16 from VCF header
##contig=<ID=chr15,assembly=GCF_000001405.26,length=90338345>
##contig=<ID=chr16,assembly=GCF_000001405.26,length=83257441>
# Last five lines of VCF
chr15 90276852 rs558111382 G C 100 PASS AC=2;AF=0.000399361;AFR_AF=0.0000;AMR_AF=0.0000;AN=5008;ASP;DP=16982;EAS_AF=0.0020;EUR_AF=0.0000;MATCHED_FWD;NS=2504;SAS_AF=0.0000;ssID=ss1354573158
chr15 90276855 rs371577297 C T 100 PASS AC=4;AF=0.000798722;AFR_AF=0.0030;AMR_AF=0.0000;AN=5008;ASP;DP=16924;EAS_AF=0.0000;EUR_AF=0.0000;MATCHED_FWD;NS=2504;SAS_AF=0.0000;ssID=ss1354573159
chr15 90276856 rs149628226 C T 100 PASS AC=38;AF=0.00758786;AFR_AF=0.0287;AMR_AF=0.0000;AN=5008;ASP;DP=16975;EAS_AF=0.0000;EUR_AF=0.0000;MATCHED_FWD;NS=2504;SAS_AF=0.0000;ssID=ss1354573160
chr15 90276899 rs562404415 C T 100 PASS AC=5;AF=0.000998403;AFR_AF=0.0000;AMR_AF=0.0000;AN=5008;ASP;DP=17609;EAS_AF=0.0000;EUR_AF=0.0000;MATCHED_FWD;NS=2504;SAS_AF=0.0051;ssID=ss1354573161
chr15 90276957 rs529800547 G C 100 PASS AC=1;AF=0.000199681;AFR_AF=0.0000;AMR_AF=0.0000;AN=5008;ASP;DP=19063;EAS_AF=0.0000;EUR_AF=0.From this analysis, it seems like the chromosome lengths are correctly reported in the header of the 1000G VCF, but that the VCF itself only reports up to the end of Chromosome 15. It's unclear why this VCF only reports on a subset of the chromosomes, but this has been tripping up people for a few years, so a fix is warranted. Please keep an eye on this thread for updates.
As a workaround until a fix is available, you can try more recent population variant catalogs such as v5 of the 1000G project or gnomad.
-
Jack Koskinen you can also use this 1000G vcf, from the same google bucket as the truncated version, instead.
Please sign in to leave a comment.
7 comments