If you have been keeping up with our GATK release notes, then you know that we have been rolling out a number of backend changes, tools, and features to GATK that we are hoping will improve the efficiency and ease-of-use of our tools.
The funny thing about backend changes, though, is that it's not always obvious to us which of those changes will affect you enough to deserve including as a bullet point in our release notes.
One such change appears to have had a negative effect on some people's pipelines, so in the name of transparency, here's an explanation of what changed and how it may have affected you.
What changed
GenotypeGVCFs has historically represented missing genotypes as a . dot in the VCF output (or ./. for diploids). In a multisample VCF file, missing genotypes occur at sites where a sample's genotype is not known, even though it is known in other samples.
However, as of GATK 4.2.3.0, missing genotypes might appear as 0/0 instead of the ./. you might expect in a diploid case. You'll still be able to determine that the genotype is missing, though, because the FORMAT DP field will be 0.
So, by way of an actual example, the sample field for a missing genotype might look like this in an old VCF file, versus a new one:
Old VCF: ./.:0,0:.:0:0,0,0 (GT:AD:DP:GQ:PL)
New VCF: 0/0:0,0:0:0:0,0,0 (GT:AD:DP:GQ:PL)
The genotype quality is zero in both cases, only now it is much more explicit for the "new" VCF.
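If your own scripts test for missing genotypes, a check along the following lines would cover both representations. This is only a minimal sketch: the function name is hypothetical, and it assumes that a GT made up entirely of dots, or a FORMAT DP of 0, is what marks a call as missing, as described above.

```python
def is_missing_genotype(format_keys: str, sample_field: str) -> bool:
    """Return True if a sample's genotype should be treated as missing.

    Assumption (from the post above): a call is missing when GT consists
    only of dots (old style) or when FORMAT DP is 0 (new style).
    """
    # Pair each FORMAT key with its value, e.g. {"GT": "0/0", "DP": "0", ...}
    values = dict(zip(format_keys.split(":"), sample_field.split(":")))

    # Old style: every allele in GT is a dot (handles "/" and "|" separators)
    alleles = values.get("GT", ".").replace("|", "/").split("/")
    if all(allele == "." for allele in alleles):
        return True

    # New style: GT may read 0/0, but a depth of zero still flags it as missing
    return values.get("DP") == "0"


print(is_missing_genotype("GT:AD:DP:GQ:PL", "./.:0,0:.:0:0,0,0"))  # old VCF
print(is_missing_genotype("GT:AD:DP:GQ:PL", "0/0:0,0:0:0:0,0,0"))  # new VCF
```

Both example lines from above are reported as missing, while a genotype backed by actual reads (a non-zero DP) is not.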
As the sun rose on my failed analysis, I finally realized why it's called mourning.
How could you? That dot had a family!
Alright, alright — I can already hear the questions being typed about why this change was needed at all.
No spoilers, but we're making changes to the GATK backend to make it run faster and more efficiently for large-scale joint calling. This will make room for some larger-scale updates we have planned for upcoming releases.
Sample sets are getting bigger every day, so finding ways to reduce the computing resources we need will help future-proof our tools while making our current projects faster.
Decreasing the number of possible genotype "states" that GenotypeGVCFs needs to differentiate between (i.e. "missing" vs. "no-call" genotypes) gives a very minor boost in efficiency. As minor as this might seem, it still adds up over the course of millions of individual calls... so the . dot had to go.
For those who need the dot
Since our tools don't look for missing genotypes using these . dots at all, it was a pretty obvious change for us to make. For users who are explicitly searching for missing genotypes in GVCFs, that information is still preserved in the FORMAT DP field: missing values will be marked as DP=0.
Keep in mind that in some cases with phased variants, you might still see a missing genotype with a . dot.
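If a downstream tool in your pipeline really does need the dots, one option is to put them back yourself based on DP. The sketch below is not a GATK feature; the function name is hypothetical, and it assumes a tab-delimited VCF data line in which every sample with FORMAT DP equal to 0 should be treated as missing.

```python
def restore_missing_dots(vcf_line: str) -> str:
    """Rewrite GT to dots for every sample whose FORMAT DP is 0.

    Assumptions: the line is tab-delimited VCF, and DP=0 means the
    genotype is missing (as described in the post above).
    """
    if vcf_line.startswith("#"):
        return vcf_line  # header lines pass through unchanged

    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return vcf_line  # no FORMAT/sample columns to edit

    keys = fields[8].split(":")
    if "GT" not in keys or "DP" not in keys:
        return vcf_line
    gt_idx, dp_idx = keys.index("GT"), keys.index("DP")

    for i in range(9, len(fields)):
        values = fields[i].split(":")
        # Trailing FORMAT values may be dropped, so check the length first
        if max(gt_idx, dp_idx) < len(values) and values[dp_idx] == "0":
            gt = values[gt_idx]
            sep = "|" if "|" in gt else "/"  # keep the original separator
            ploidy = len(gt.replace("|", "/").split("/"))
            values[gt_idx] = sep.join(["."] * ploidy)
            fields[i] = ":".join(values)

    newline = "\n" if vcf_line.endswith("\n") else ""
    return "\t".join(fields) + newline


line = "chr1\t100\t.\tA\tG\t50\t.\t.\tGT:AD:DP:GQ:PL\t0/0:0,0:0:0:0,0,0\t0/1:12,9:21:99:255,0,255"
print(restore_missing_dots(line).split("\t")[9])  # ./.:0,0:0:0:0,0,0
```

Only the GT value is touched; DP and the other FORMAT values are left exactly as GenotypeGVCFs wrote them, so the second sample in the example comes through unchanged.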
Concluding thoughts
It's a balancing act, trying to decide what is or is not important enough to warrant a blog post. We want to be as transparent as possible about the changes we make, while not outlining every banal change to GATK and boring you out of ever reading our release notes again.
It seems like we might have overlooked a change to something that some of you find important, and we want to apologize if you were inconvenienced as a result. Hopefully, now you have a better sense for why we made these changes, and what you need to do in order to adjust your pipelines for the future.
As always, we read all of your comments on the forum, so don't hesitate to go there, and let us know what you think!
2 comments
I love the title! Thanks so much for this blog post. I think it will be valuable for people dealing with low-coverage WGS data.
I agree, Eric C. Anderson! I think it's good to have this info accessible somewhere that Google (or our site's search bar) can easily find it.
Also — thank you for inspiring this post with your comments in the forum. You definitely pushed us over the edge towards seeing this issue as deserving its own blog post, rather than it getting buried in the forum.