Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

GenotypeGVCFs and the death of the dot Follow

8 comments

  • Avatar
    Eric C. Anderson

    I love the title!  Thanks so much for this blog post.  I think it will be valuable for people dealing with low-coverage WGS data.

    1
    Comment actions Permalink
  • Avatar
    Derek Caetano-Anolles

    I agree, Eric C. Anderson! I think it's good to have this info accessible somewhere that Google (or our site's search bar) can easily find it.

    Also — thank you for inspiring this post with your comments in the forum. You definitely pushed us over the edge towards seeing this issue as deserving its own blog post, rather than it getting buried in the forum.

    1
    Comment actions Permalink
  • Avatar
    SAMUEL ANDREW

    Hi, is there an option to get GATK vcf outputs to code missing data as “./.”? I can't see anything that looks like what I want in the documentation. my colleague wrote a script to convert the missing data to "./." which does not seem to be a super slow job and I think should be a default for GATK outputs.

    If I pass .vcf files from GATK, with missing data coded as `0/0:0,0:0:0:0,0,0 (GT:AD:DP:GQ:PL)`, to your internal `gatk VariantsToTable` function it outputs missing data as if it were genotype calls.

    e.g. by using the below script.

    ```

      gatk VariantsToTable \

         -V input.vcf \

         -F CHROM -F POS -F TYPE -GF GT \

         -O output.table

    ```

    If I am going to output a table of genotypes, I should be able to differentiate missing data from genotype calls without referring to other files. At least that is what I think most people expect.

    If 0 has multiple meanings surely your own functions should be able to work out the difference between missing data and genotypes by default.

    We have a step in our pipeline where we use `gatk VariantFiltration` with `--filter-expression "DP < 10"` but GATK seems to just returns the filtered genotypes as `0/0`. In my mind after running such a function I should be able to differentiate missing or filtered out data from genotype calls.

    Your functions and other programs don’t seem to be able to tell the difference between missing data and reference genotypes by default. At what point do we differentiate the missing data from genotype calls?

    This seems like a massive trap for people because it is not reasonable to expect people to anticipate that 0/0 has two meanings. especially when filtering or reformatting the data does not seem to make sense.

    1
    Comment actions Permalink
  • Avatar
    Eddy W

    Hi, I'm just wondering, so when converting these VCF files to plink file format, does plink recognize those variant sites (without the dot) as missing genotype? Thanks

    1
    Comment actions Permalink
  • Avatar
    Derek Caetano-Anolles

    Thank you for your input, SAMUEL ANDREW ~

    As you know, you can still determine the missing genotypes because the FORMAT DP will be 0 even within the current GenotypeGVCFs format.

    So it is still possible to convert the current output VCFs by replacing samples with FORMAT DP=0 with the standard ./. representation.

    That said, I completely agree that it is a bit burdensome to expect GATK users to do this any time they want to use their GenotypeGVCFs outputs with another tool outside of GATK. As you said, even your script was a bit slower to run than what might be possible directly from within GATK.

    As such, I have opened a feature request on GitHub for a conversion tool (or an option within GenotypeGVCFs) that will output missing genotypes in the standard ./. representation.

    You can follow the status of this feature request at the following GitHub ticket link.

    1
    Comment actions Permalink
  • Avatar
    Derek Caetano-Anolles

    Eddy W - By default the VCF files will not recognize those variant sites as missing genotype. This would require converting the GenotypeGVCFs output to a format that PLINK will recognize, by replacing 0/0 with ./. in any samples where DP=0.

    As I said in my previous comment, we have an open GitHub ticket on this issue that you can follow.

    1
    Comment actions Permalink
  • Avatar
    Stephen Schaffner

    Coming to this late because I'm only now seeing GATK 4 output. Doesn't this choice violate the VCF format definition? If so, how is it anything other than an error?

    0
    Comment actions Permalink
  • Avatar
    Nick Dietz

    We have avoided upgrading to the versions of GATK that have this change because we have several downstream tools that rely on the './.' format to detect a missing genotype call. Currently, we are using GATK 4.2.0.0. When I check our VCF files generated using this version, I see several sites where the DP value does not equal zero, but the genotype call is missing. This makes sense to me because there are other reasons a genotype call might be missing, besides having no read coverage. Does this mean that, as of GATK 4.2.3.0, the DP values associated with a missing genotype call will be "over-written" with a zero, even in cases where there were reads covering the region, but a genotype could not be called for some other reason?

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk