Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Resource bundle hg38 hosts corrupted VCF

7 comments

  • Genevieve Brandt (she/her)

    Hi René Böttcher,

    Thank you for posting about this and starting a discussion! We try to maintain the files as best we can, but sometimes there are issues and we do not have the capacity to make the changes. This file specifically comes from the 1000G site, and you can find the original version there.

    There were two discussions on our old forum site that will give you more information about how to work around this issue. 

    1. https://sites.google.com/a/broadinstitute.org/legacy-gatk-forum-discussions/2017-01-18-2016-08-11/8694-1000Gphase3integratedsitesonlynoMATCHEDREVhg38vcf-corrupted
    2. https://sites.google.com/a/broadinstitute.org/legacy-gatk-forum-discussions/2019-02-11-2018-08-12/23411-supporting-dataset-for-CalculateGenotypePosteriors

    I hope this helps you!

    Genevieve

  • Linda Do

    Hi Genevieve Brandt (she/her) 

    I am also concerned about this and am posting here to continue the discussion for others who run into this problem. I was looking at 1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf and noticed that it ends at chr15. So, from the links that you provided, I gather that if I am using the current hg38 reference, I need to lift over the 1000G phase 3 file from b37 (from the FTP site in your links) to hg38 and then use it in CalculateGenotypePosteriors?

    Since the links date back to 2017 and 2019, are there any alternative phase 3 supporting files for CalculateGenotypePosteriors that you currently recommend?

    Thank you in advance!

  • Bhanu Gandham

    Hi,

    I am going to move your post into our Community Discussions -> General Discussion topic, as this topic is for reporting bugs and issues with GATK.

    You can read more about our forum guidelines and the topics here: Forum Guidelines.

    Best,

    Bhanu

  • Genevieve Brandt (she/her)

    Linda Do, to get the most up-to-date version of this file, you can look at the 1000 Genomes website and follow the recommendations in the legacy links I provided above.

    I'll reach out to the developers who work on CalculateGenotypePosteriors and see if we can replace that documentation or replace the file. I can't guarantee a date for this, however.

  • Jack Koskinen

     

    Hi Genevieve Brandt (she/her),

    Just checking to see whether this issue can be revisited. I am trying to use the 1000 Genomes file in question with CalculateGenotypePosteriors, just as René originally posted about, and the file in the resource bundle is still corrupted. I have tried the suggestions in the forum posts from 2016 and 2019, as well as several other workarounds not suggested there, but none have succeeded. I also cannot locate the original on the 1000 Genomes website.

    The documentation for CalculateGenotypePosteriors still lists the 1000 Genomes file as the file that should be passed in under the --supporting flag, but when I look through more recent documentation about the relevant pipelines (both on Terra and using WDL), I do not see the CalculateGenotypePosteriors step included at all. Our lab would like to use this step to improve our genotype calls, but we are wondering whether it is still part of the best practice recommendations.

    For reference, we are using gatk-4.2.6.1.

    Thank you for any clarity you are able to provide!

  • Giles Hall

    June 9th, 2023

    If I run the following script:

    #!/bin/bash
    set -e

    echo "# Downloading VCF"
    wget -qc https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf
    echo "# sha256sum of VCF"
    sha256sum 1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf
    echo "# Lengths of chr15 / chr16 from VCF header"
    head -1000 1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf | grep -E -i '^##.*contig.*chr1[56].*'
    echo "# Last five lines of VCF"
    tail -5 1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf

    I see the following output:
    # Downloading VCF
    # sha256sum of VCF
    d8a2e764e30d774618f64a681a72d32218e4b65e3906d21c278283c88877b13e  1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf
    # Lengths of chr15 / chr16 from VCF header
    ##contig=<ID=chr15,assembly=GCF_000001405.26,length=90338345>
    ##contig=<ID=chr16,assembly=GCF_000001405.26,length=83257441>
    # Last five lines of VCF
    chr15   90276852        rs558111382     G       C       100       PASS    AC=2;AF=0.000399361;AFR_AF=0.0000;AMR_AF=0.0000;AN=5008;ASP;DP=16982;EAS_AF=0.0020;EUR_AF=0.0000;MATCHED_FWD;NS=2504;SAS_AF=0.0000;ssID=ss1354573158
    chr15   90276855        rs371577297     C       T       100       PASS    AC=4;AF=0.000798722;AFR_AF=0.0030;AMR_AF=0.0000;AN=5008;ASP;DP=16924;EAS_AF=0.0000;EUR_AF=0.0000;MATCHED_FWD;NS=2504;SAS_AF=0.0000;ssID=ss1354573159
    chr15   90276856        rs149628226     C       T       100       PASS    AC=38;AF=0.00758786;AFR_AF=0.0287;AMR_AF=0.0000;AN=5008;ASP;DP=16975;EAS_AF=0.0000;EUR_AF=0.0000;MATCHED_FWD;NS=2504;SAS_AF=0.0000;ssID=ss1354573160
    chr15   90276899        rs562404415     C       T       100       PASS    AC=5;AF=0.000998403;AFR_AF=0.0000;AMR_AF=0.0000;AN=5008;ASP;DP=17609;EAS_AF=0.0000;EUR_AF=0.0000;MATCHED_FWD;NS=2504;SAS_AF=0.0051;ssID=ss1354573161
    chr15   90276957        rs529800547     G       C       100       PASS    AC=1;AF=0.000199681;AFR_AF=0.0000;AMR_AF=0.0000;AN=5008;ASP;DP=19063;EAS_AF=0.0000;EUR_AF=0.

    From this analysis, it seems that the header of the 1000G VCF declares contigs beyond chr15 (chr16 is listed with a length), but the records themselves stop on chromosome 15: the last one is at position 90276957, and that final line is cut off mid-record. It is unclear why the file ends there, but this has been tripping people up for a few years, so a fix is warranted. Please keep an eye on this thread for updates.
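
    The manual check in the script above can be turned into a small reusable sketch for any uncompressed VCF. It assumes tab-delimited records and `##contig=<ID=...>` header lines; the function names are made up for illustration and are not part of GATK. A missing final newline or a last record with fewer than eight columns is treated only as a hint of truncation, not as proof.

```shell
#!/bin/bash
# Illustrative helpers (hypothetical names, not GATK tools) for spotting
# a truncated, uncompressed, tab-delimited VCF.

# Print each contig that has at least one record, followed by a tab and
# its record count, in order of first appearance.
vcf_record_contigs() {
    awk -F'\t' '
        !/^#/ && NF > 0 {
            if (!($1 in n)) order[++k] = $1
            n[$1]++
        }
        END { for (i = 1; i <= k; i++) print order[i] "\t" n[order[i]] }
    ' "$1"
}

# Print contigs declared in the header (##contig=<ID=...>) that have no
# records anywhere in the file.
vcf_missing_contigs() {
    awk -F'\t' '
        /^##contig=<ID=/ {
            id = $0
            sub(/^##contig=<ID=/, "", id)   # drop the prefix
            sub(/[,>].*$/, "", id)          # drop everything after the ID
            if (!(id in declared)) { declared[id] = 1; hdr[++h] = id }
            next
        }
        !/^#/ && NF > 0 { seen[$1] = 1 }
        END { for (i = 1; i <= h; i++) if (!(hdr[i] in seen)) print hdr[i] }
    ' "$1"
}

# Exit 0 if the file looks cut off: either the last byte is not a newline,
# or the last line is a record with fewer than the 8 fixed VCF columns.
vcf_looks_truncated() {
    # Command substitution strips a trailing newline, so a non-empty
    # result means the file does not end with one.
    if [ -n "$(tail -c 1 "$1")" ]; then
        return 0
    fi
    tail -n 1 "$1" |
        awk -F'\t' '!/^#/ && NF < 8 { bad = 1 } END { exit (bad ? 0 : 1) }'
}
```

    After sourcing these functions, running `vcf_missing_contigs` on the downloaded file lists every contig that the header declares but that has no records, and `vcf_looks_truncated file.vcf && echo "possibly truncated"` flags a suspicious ending. Both read the whole file, so they take a while on a VCF of this size.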

    As a workaround until a fix is available, you can try a more recent population variant catalog, such as v5 of the 1000G project or gnomAD.

  • Chris Kachulis

    Jack Koskinen, you can also use this 1000G VCF, from the same Google bucket as the truncated version, instead.

