If you have been keeping up with our GATK release notes, then you know that we have been rolling out a number of backend changes, tools, and features to GATK that we are hoping will improve the efficiency and ease-of-use of our tools.
The funny thing about backend changes, though, is that it's not always obvious to us which of those changes will affect you enough to deserve including as a bullet point in our release notes.
One such change appears to have been having a negative effect for some people's pipelines, so in the name of transparency, here's some explanation about a recent change we've made that may have affected you.
What changed
GenotypeGVCFs has historically represented missing genotypes as a .
dot in the VCF output (or ./.
for diploids). In a multisample VCF file, missing genotypes occur in locations where the genotype of the variant is not known, even though they are known in other samples.
However, as of GATK 4.2.3.0, missing genotypes might appear as 0
instead of the .
you might expect in a diploid case. You’ll still be able to determine the genotype as missing because the FORMAT DP
field will be 0
, though.
So, by way of an actual example, some missing genotypes in the FORMAT field might look like this in an old VCF file, versus a new one:
Old VCF: ./.:0,0:.:0:0,0,0 (GT:AD:DP:GQ:PL)
New VCF: 0/0:0,0:0:0:0,0,0 (GT:AD:DP:GQ:PL)
The genotype quality is zero in both cases, only now it is much more explicit for the "new" VCF.
As the sun rose on my failed analysis, I finally realized why it's called mourning.
How could you? That dot had a family!
Alright, alright — I can already hear the questions being typed about why this change was needed at all.
No spoilers, but we're making changes to the GATK backend to make it run faster and more efficiently for large-scale joint calling. This will make room for some larger scale updates we have planned for upcoming releases.
Sample sets are getting bigger every day, so trying to find ways to decrease the computing resources we need to use will make our tools future-proof, while making our current projects faster.
Decreasing the amount of possible genotype "states" that GenotypeGVCFs needs to differentiate between (ie. "missing" vs. "no-call" genotypes) has a very minor boost in efficiency. As minor as this might seem, it still adds up over the course of millions of individual calls... so the .
dot had to go.
For those who need the dot
Since our tools don't look for missing genotypes using these .
dots at all, it was a pretty obvious change for us to make. For users who are explicitly searching for missing genotypes in GVCFs, that information is still preserved in the FORMAT DP
field — missing values will be marked as DP=0
.
Keep in mind that in some cases with phased variants, you might still see a missing genotype with a .
dot.
Concluding thoughts
It's a balancing act, trying to decide what is or is not important enough to warrant a blog post. We want to be as transparent as possible with the changes we have, while not outlining every banal change we make to GATK and boring you out of ever reading our release notes again.
It seems like we might have overlooked a change to something that some of you find important, and we want to apologize if you were inconvenienced as a result. Hopefully, now you have a better sense for why we made these changes, and what you need to do in order to adjust your pipelines for the future.
As always, we read all of your comments on the forum, so don't hesitate to go there, and let us know what you think!
8 comments
I love the title! Thanks so much for this blog post. I think it will be valuable for people dealing with low-coverage WGS data.
I agree, Eric C. Anderson! I think it's good to have this info accessible somewhere that Google (or our site's search bar) can easily find it.
Also — thank you for inspiring this post with your comments in the forum. You definitely pushed us over the edge towards seeing this issue as deserving its own blog post, rather than it getting buried in the forum.
Hi, is there an option to get GATK vcf outputs to code missing data as “./.”? I can't see anything that looks like what I want in the documentation. my colleague wrote a script to convert the missing data to "./." which does not seem to be a super slow job and I think should be a default for GATK outputs.
If I pass .vcf files from GATK, with missing data coded as `0/0:0,0:0:0:0,0,0 (GT:AD:DP:GQ:PL)`, to your internal `gatk VariantsToTable` function it outputs missing data as if it were genotype calls.
e.g. by using the below script.
```
gatk VariantsToTable \
-V input.vcf \
-F CHROM -F POS -F TYPE -GF GT \
-O output.table
```
If I am going to output a table of genotypes, I should be able to differentiate missing data from genotype calls without referring to other files. At least that is what I think most people expect.
If 0 has multiple meanings surely your own functions should be able to work out the difference between missing data and genotypes by default.
We have a step in our pipeline where we use `gatk VariantFiltration` with `--filter-expression "DP < 10"` but GATK seems to just returns the filtered genotypes as `0/0`. In my mind after running such a function I should be able to differentiate missing or filtered out data from genotype calls.
Your functions and other programs don’t seem to be able to tell the difference between missing data and reference genotypes by default. At what point do we differentiate the missing data from genotype calls?
This seems like a massive trap for people because it is not reasonable to expect people to anticipate that 0/0 has two meanings. especially when filtering or reformatting the data does not seem to make sense.
Hi, I'm just wondering, so when converting these VCF files to plink file format, does plink recognize those variant sites (without the dot) as missing genotype? Thanks
Thank you for your input, SAMUEL ANDREW ~
As you know, you can still determine the missing genotypes because the FORMAT
DP
will be0
even within the current GenotypeGVCFs format.So it is still possible to convert the current output VCFs by replacing samples with FORMAT
DP=0
with the standard./.
representation.That said, I completely agree that it is a bit burdensome to expect GATK users to do this any time they want to use their GenotypeGVCFs outputs with another tool outside of GATK. As you said, even your script was a bit slower to run than what might be possible directly from within GATK.
As such, I have opened a feature request on GitHub for a conversion tool (or an option within GenotypeGVCFs) that will output missing genotypes in the standard
./.
representation.You can follow the status of this feature request at the following GitHub ticket link.
Eddy W - By default the VCF files will not recognize those variant sites as missing genotype. This would require converting the GenotypeGVCFs output to a format that PLINK will recognize, by replacing
0/0
with./.
in any samples whereDP=0
.As I said in my previous comment, we have an open GitHub ticket on this issue that you can follow.
Coming to this late because I'm only now seeing GATK 4 output. Doesn't this choice violate the VCF format definition? If so, how is it anything other than an error?
We have avoided upgrading to the versions of GATK that have this change because we have several downstream tools that rely on the './.' format to detect a missing genotype call. Currently, we are using GATK 4.2.0.0. When I check our VCF files generated using this version, I see several sites where the DP value does not equal zero, but the genotype call is missing. This makes sense to me because there are other reasons a genotype call might be missing, besides having no read coverage. Does this mean that, as of GATK 4.2.3.0, the DP values associated with a missing genotype call will be "over-written" with a zero, even in cases where there were reads covering the region, but a genotype could not be called for some other reason?
Please sign in to leave a comment.