Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

GenotypeGVCFs and the death of the dot Follow

17 comments

  • Avatar
    Eric C. Anderson

    I love the title!  Thanks so much for this blog post.  I think it will be valuable for people dealing with low-coverage WGS data.

    0
    Comment actions Permalink
  • Avatar
    Derek Caetano-Anolles

    I agree, Eric C. Anderson! I think it's good to have this info accessible somewhere that Google (or our site's search bar) can easily find it.

    Also — thank you for inspiring this post with your comments in the forum. You definitely pushed us over the edge towards seeing this issue as deserving its own blog post, rather than it getting buried in the forum.

    0
    Comment actions Permalink
  • Avatar
    SAMUEL ANDREW

    Hi, is there an option to get GATK vcf outputs to code missing data as “./.”? I can't see anything that looks like what I want in the documentation. my colleague wrote a script to convert the missing data to "./." which does not seem to be a super slow job and I think should be a default for GATK outputs.

    If I pass .vcf files from GATK, with missing data coded as `0/0:0,0:0:0:0,0,0 (GT:AD:DP:GQ:PL)`, to your internal `gatk VariantsToTable` function it outputs missing data as if it were genotype calls.

    e.g. by using the below script.

    ```

      gatk VariantsToTable \

         -V input.vcf \

         -F CHROM -F POS -F TYPE -GF GT \

         -O output.table

    ```

    If I am going to output a table of genotypes, I should be able to differentiate missing data from genotype calls without referring to other files. At least that is what I think most people expect.

    If 0 has multiple meanings surely your own functions should be able to work out the difference between missing data and genotypes by default.

    We have a step in our pipeline where we use `gatk VariantFiltration` with `--filter-expression "DP < 10"` but GATK seems to just returns the filtered genotypes as `0/0`. In my mind after running such a function I should be able to differentiate missing or filtered out data from genotype calls.

    Your functions and other programs don’t seem to be able to tell the difference between missing data and reference genotypes by default. At what point do we differentiate the missing data from genotype calls?

    This seems like a massive trap for people because it is not reasonable to expect people to anticipate that 0/0 has two meanings. especially when filtering or reformatting the data does not seem to make sense.

    2
    Comment actions Permalink
  • Avatar
    Eddy W

    Hi, I'm just wondering, so when converting these VCF files to plink file format, does plink recognize those variant sites (without the dot) as missing genotype? Thanks

    1
    Comment actions Permalink
  • Avatar
    Derek Caetano-Anolles

    Thank you for your input, SAMUEL ANDREW ~

    As you know, you can still determine the missing genotypes because the FORMAT DP will be 0 even within the current GenotypeGVCFs format.

    So it is still possible to convert the current output VCFs by replacing samples with FORMAT DP=0 with the standard ./. representation.

    That said, I completely agree that it is a bit burdensome to expect GATK users to do this any time they want to use their GenotypeGVCFs outputs with another tool outside of GATK. As you said, even your script was a bit slower to run than what might be possible directly from within GATK.

    As such, I have opened a feature request on GitHub for a conversion tool (or an option within GenotypeGVCFs) that will output missing genotypes in the standard ./. representation.

    You can follow the status of this feature request at the following GitHub ticket link.

    0
    Comment actions Permalink
  • Avatar
    Derek Caetano-Anolles

    Eddy W - By default the VCF files will not recognize those variant sites as missing genotype. This would require converting the GenotypeGVCFs output to a format that PLINK will recognize, by replacing 0/0 with ./. in any samples where DP=0.

    As I said in my previous comment, we have an open GitHub ticket on this issue that you can follow.

    0
    Comment actions Permalink
  • Avatar
    Stephen Schaffner

    Coming to this late because I'm only now seeing GATK 4 output. Doesn't this choice violate the VCF format definition? If so, how is it anything other than an error?

    5
    Comment actions Permalink
  • Avatar
    Nick Dietz

    We have avoided upgrading to the versions of GATK that have this change because we have several downstream tools that rely on the './.' format to detect a missing genotype call. Currently, we are using GATK 4.2.0.0. When I check our VCF files generated using this version, I see several sites where the DP value does not equal zero, but the genotype call is missing. This makes sense to me because there are other reasons a genotype call might be missing, besides having no read coverage. Does this mean that, as of GATK 4.2.3.0, the DP values associated with a missing genotype call will be "over-written" with a zero, even in cases where there were reads covering the region, but a genotype could not be called for some other reason?

    2
    Comment actions Permalink
  • Avatar
    David Carlson

    Also coming to this late, thanks to a recent twitter thread on this.  It seems an odd choice that the output of GATK will no longer meet the VCF standard. From the current VCF specification (emphasis added):

    GT (String): Genotype, encoded as allele value preceded by either of / or | depending on whether that allele
    is considered phased. The first phasing indicator may be omitted and is implicitly defined as / if any phasing
    indicators are / and | otherwise. The allele values are 0 for the reference allele (what is in the REF field), 1 for
    the first allele listed in ALT, 2 for the second allele list in ALT and so on. For diploid calls examples could be
    0/1, 1 | 0, /0/1, or 1/2, etc. Haploid calls, e.g. on Y, male non-pseudoautosomal X, or mitochondria, should be
    indicated by having only one allele value. A triploid call might look like 0/0/1, and a partially phased triploid
    call could be |0/1/2 to indicate that the first allele is phased with another variant in the VCF. If a call cannot
    be made for a sample at a given locus, ‘.’ must be specified for each missing allele in the GT field (for example
    ‘./.’ for a diploid genotype and ‘.’ for haploid genotype).

    6
    Comment actions Permalink
  • Avatar
    Yokofakun

    This feature will change the internal AF isn't it ?

    1
    Comment actions Permalink
  • Avatar
    David Roazen

    This change was originally made to allow our joint calling pipeline to scale to handle the hundreds of thousands of samples in the AllOfUs project. I agree that it's problematic and confusing for downstream users. Since AllOfUs recently switched back to using the ./. convention in their callsets in response to similar complaints, we are going to revisit this decision in GATK GenotypeGVCFs and consider reverting back to ./. as well in a future release.

    6
    Comment actions Permalink
  • Avatar
    Eric C. Anderson

    Oh Dear!  And so, in the third year from its untimely demise the dot rose from its grave to wreak vengeance upon all who doubted it.

    Yikes! I hope that isn't too much work for y'all, but it sounds like a worthy revisit.  

    Thanks again for all you do!

    1
    Comment actions Permalink
  • Avatar
    David Lawrence

    Please follow the VCF spec!

    > a call cannot be made for a sample at a given locus, ‘.’ MUST be specified for each missing
    allele in the GT field

    If the performance is so vital to you, write a different file format rather than violating an existing spec

    5
    Comment actions Permalink
  • Avatar
    Old Account

    Uh oh! This breaks thousands of pipelines especially those that matter most (e.g., diagnostics). It is also no longer a VCF file. What should be the new name for this file format? Perhaps something like .GKCF, (GatK Call Format) or GKVCF.

    3
    Comment actions Permalink
  • Avatar
    Old Account

    Some people have VCF pipelines that are set up to take in "minimum" input and do not use those fields. This is not lazy: there are certain sources of variant data (e.g., some microarray) that you cannot populate this field.

    Is the slight efficacy boost worth the potential misdiagnoses that will occur? There are so many labs that do this that at least some will not notice this change and have no reason to suspect it. And why should they? 0 is a call!

    4
    Comment actions Permalink
  • Avatar
    Old Account

    If you need the change for whatever efficiency boost you're going for, that's fine. Add it as a flag option. But do not make it the default

    4
    Comment actions Permalink
  • Avatar
    Philipp Schneider

    Any comment on the question of @Yokofakun?

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk