⚠️ NOTE ⚠️ This article describes behavior present in GATK versions 4.2.3.0 through 4.5.0.0, and is obsolete as of GATK 4.6.0.0, which (by overwhelming popular demand!) reverted back to the standard ./.
representation in our joint genotyping tools and GenomicsDB.
───
If you have been keeping up with our GATK release notes, then you know that we have been rolling out a number of backend changes, tools, and features to GATK that we are hoping will improve the efficiency and ease-of-use of our tools.
The funny thing about backend changes, though, is that it's not always obvious to us which of those changes will affect you enough to deserve including as a bullet point in our release notes.
One such change appears to have been having a negative effect for some people's pipelines, so in the name of transparency, here's some explanation about a recent change we've made that may have affected you.
What changed
GenotypeGVCFs has historically represented missing genotypes as a .
dot in the VCF output (or ./.
for diploids). In a multisample VCF file, missing genotypes occur in locations where the genotype of the variant is not known, even though they are known in other samples.
However, as of GATK 4.2.3.0, missing genotypes might appear as 0
instead of the .
you might expect in a diploid case. You’ll still be able to determine the genotype as missing because the FORMAT DP
field will be 0
, though.
So, by way of an actual example, some missing genotypes in the FORMAT field might look like this in an old VCF file, versus a new one:
Old VCF: ./.:0,0:.:0:0,0,0 (GT:AD:DP:GQ:PL)
New VCF: 0/0:0,0:0:0:0,0,0 (GT:AD:DP:GQ:PL)
The genotype quality is zero in both cases, only now it is much more explicit for the "new" VCF.
As the sun rose on my failed analysis, I finally realized why it's called mourning.
How could you? That dot had a family!
Alright, alright — I can already hear the questions being typed about why this change was needed at all.
No spoilers, but we're making changes to the GATK backend to make it run faster and more efficiently for large-scale joint calling. This will make room for some larger scale updates we have planned for upcoming releases.
Sample sets are getting bigger every day, so trying to find ways to decrease the computing resources we need to use will make our tools future-proof, while making our current projects faster.
Decreasing the amount of possible genotype "states" that GenotypeGVCFs needs to differentiate between (ie. "missing" vs. "no-call" genotypes) has a very minor boost in efficiency. As minor as this might seem, it still adds up over the course of millions of individual calls... so the .
dot had to go.
For those who need the dot
Since our tools don't look for missing genotypes using these .
dots at all, it was a pretty obvious change for us to make. For users who are explicitly searching for missing genotypes in GVCFs, that information is still preserved in the FORMAT DP
field — missing values will be marked as DP=0
.
Keep in mind that in some cases with phased variants, you might still see a missing genotype with a .
dot.
Concluding thoughts
It's a balancing act, trying to decide what is or is not important enough to warrant a blog post. We want to be as transparent as possible with the changes we have, while not outlining every banal change we make to GATK and boring you out of ever reading our release notes again.
It seems like we might have overlooked a change to something that some of you find important, and we want to apologize if you were inconvenienced as a result. Hopefully, now you have a better sense for why we made these changes, and what you need to do in order to adjust your pipelines for the future.
As always, we read all of your comments on the forum, so don't hesitate to go there, and let us know what you think!
───
May 18, 2022 - This article was originally published.
May 17, 2023 (Update) - Turns out that you did, in fact, let us know what you thought. A GitHub ticket was made that suggested implementing a way to revert the change for users who wanted to.
June 30, 2024 (Update) - You asked, we listened! Due to user comments, this update was implemented into GATK 4.6.0, which reverts to the previous no-call behavior. As a result, the current article presents information that is no longer relevant, following an update to GATK 4.6 and later.
24 comments
I love the title! Thanks so much for this blog post. I think it will be valuable for people dealing with low-coverage WGS data.
I agree, Eric C. Anderson! I think it's good to have this info accessible somewhere that Google (or our site's search bar) can easily find it.
Also — thank you for inspiring this post with your comments in the forum. You definitely pushed us over the edge towards seeing this issue as deserving its own blog post, rather than it getting buried in the forum.
Hi, is there an option to get GATK vcf outputs to code missing data as “./.”? I can't see anything that looks like what I want in the documentation. my colleague wrote a script to convert the missing data to "./." which does not seem to be a super slow job and I think should be a default for GATK outputs.
If I pass .vcf files from GATK, with missing data coded as `0/0:0,0:0:0:0,0,0 (GT:AD:DP:GQ:PL)`, to your internal `gatk VariantsToTable` function it outputs missing data as if it were genotype calls.
e.g. by using the below script.
```
gatk VariantsToTable \
-V input.vcf \
-F CHROM -F POS -F TYPE -GF GT \
-O output.table
```
If I am going to output a table of genotypes, I should be able to differentiate missing data from genotype calls without referring to other files. At least that is what I think most people expect.
If 0 has multiple meanings surely your own functions should be able to work out the difference between missing data and genotypes by default.
We have a step in our pipeline where we use `gatk VariantFiltration` with `--filter-expression "DP < 10"` but GATK seems to just returns the filtered genotypes as `0/0`. In my mind after running such a function I should be able to differentiate missing or filtered out data from genotype calls.
Your functions and other programs don’t seem to be able to tell the difference between missing data and reference genotypes by default. At what point do we differentiate the missing data from genotype calls?
This seems like a massive trap for people because it is not reasonable to expect people to anticipate that 0/0 has two meanings. especially when filtering or reformatting the data does not seem to make sense.
Hi, I'm just wondering, so when converting these VCF files to plink file format, does plink recognize those variant sites (without the dot) as missing genotype? Thanks
Thank you for your input, SAMUEL ANDREW ~
As you know, you can still determine the missing genotypes because the FORMAT
DP
will be0
even within the current GenotypeGVCFs format.So it is still possible to convert the current output VCFs by replacing samples with FORMAT
DP=0
with the standard./.
representation.That said, I completely agree that it is a bit burdensome to expect GATK users to do this any time they want to use their GenotypeGVCFs outputs with another tool outside of GATK. As you said, even your script was a bit slower to run than what might be possible directly from within GATK.
As such, I have opened a feature request on GitHub for a conversion tool (or an option within GenotypeGVCFs) that will output missing genotypes in the standard
./.
representation.You can follow the status of this feature request at the following GitHub ticket link.
Eddy W - By default the VCF files will not recognize those variant sites as missing genotype. This would require converting the GenotypeGVCFs output to a format that PLINK will recognize, by replacing
0/0
with./.
in any samples whereDP=0
.As I said in my previous comment, we have an open GitHub ticket on this issue that you can follow.
Coming to this late because I'm only now seeing GATK 4 output. Doesn't this choice violate the VCF format definition? If so, how is it anything other than an error?
We have avoided upgrading to the versions of GATK that have this change because we have several downstream tools that rely on the './.' format to detect a missing genotype call. Currently, we are using GATK 4.2.0.0. When I check our VCF files generated using this version, I see several sites where the DP value does not equal zero, but the genotype call is missing. This makes sense to me because there are other reasons a genotype call might be missing, besides having no read coverage. Does this mean that, as of GATK 4.2.3.0, the DP values associated with a missing genotype call will be "over-written" with a zero, even in cases where there were reads covering the region, but a genotype could not be called for some other reason?
Also coming to this late, thanks to a recent twitter thread on this. It seems an odd choice that the output of GATK will no longer meet the VCF standard. From the current VCF specification (emphasis added):
This feature will change the internal AF isn't it ?
This change was originally made to allow our joint calling pipeline to scale to handle the hundreds of thousands of samples in the AllOfUs project. I agree that it's problematic and confusing for downstream users. Since AllOfUs recently switched back to using the ./. convention in their callsets in response to similar complaints, we are going to revisit this decision in GATK GenotypeGVCFs and consider reverting back to ./. as well in a future release.
Oh Dear! And so, in the third year from its untimely demise the dot rose from its grave to wreak vengeance upon all who doubted it.
Yikes! I hope that isn't too much work for y'all, but it sounds like a worthy revisit.
Thanks again for all you do!
Please follow the VCF spec!
> a call cannot be made for a sample at a given locus, ‘.’ MUST be specified for each missing
allele in the GT field
If the performance is so vital to you, write a different file format rather than violating an existing spec
Uh oh! This breaks thousands of pipelines especially those that matter most (e.g., diagnostics). It is also no longer a VCF file. What should be the new name for this file format? Perhaps something like .GKCF, (GatK Call Format) or GKVCF.
Some people have VCF pipelines that are set up to take in "minimum" input and do not use those fields. This is not lazy: there are certain sources of variant data (e.g., some microarray) that you cannot populate this field.
Is the slight efficacy boost worth the potential misdiagnoses that will occur? There are so many labs that do this that at least some will not notice this change and have no reason to suspect it. And why should they? 0 is a call!
If you need the change for whatever efficiency boost you're going for, that's fine. Add it as a flag option. But do not make it the default
Any comment on the question of @Yokofakun?
Previously, I had made a comment that questioned the tone of this article, along with the original decision itself. Key among my points was the use of the fairly insensitive meme picture, which is still quite offensive. I can no longer see my post, and I'd like to know why my comment was deleted. Thank you.
Sorry, I can't see another post. Your profile says you've only posted this one ¯\_(ツ)_/¯
Anyway, it's totally fine to question the decision to change the output format... it's also why we reverted the feature due to community reaction. Thanks for being so passionate!
What do you find offensive about the tone of the article or the image?
Thanks for writing back. Not sure what happened to the post, as I definitely had another comment here. Regardless, you're setting a mocking tone. This post delivered news that was going to cause folks to have to fundamentally change their approach to analysis, which creates work. It does not seem professional to mock the change, and in my opinion loss and grief should never be mocked in this way.
I'm sorry that you feel that way.
The tone of the article was intended to make a technical topic more engaging and draw attention to a change that might otherwise be overlooked in the release notes.
I understand that different readers may have varying reactions to this style, and that's okay.
I appreciate your passion and commitment to improving our tools and community. While I can't promise that everything I write will meet your standards of professionalism, I'll consider your feedback for future communications.
I just read this post today following the email update of GATK 4.6.0. I saw that this post was just updated this week. Hence, I am confused about whether GATK 4.6.0 still gets rid of ./. or if the ./. is actually back.
additional edit: after posting my comment, I saw that the lastest edit was actually including the time I posted my comment. So it seems like ./. would be gone, and this post update platform might actually be buggy.
Good question, Bhoom Suktitipat. Unfortunately, the date on that article is not accurate. Maybe someone else on the GATK team made an update?
I've updated the article to list out the relevant dates for changes that have affected the content.
Long story short, the article was originally published June 2022, but as of GATK 4.6.0 (updated June 2024), the previous no-call behavior was reverted.
I have edited the title and opening paragraph of this article to make it clear that GATK 4.6 reverts the behavior described here and switches back to the standard ./. representation for no-calls in GenotypeGVCFs and GenomicsDB.
Please sign in to leave a comment.