GenotypeGVCFs -stand-call-conf filtering high-QUAL variants
I am comparing variant calls in a batch of 7 samples, genotyped with and without an additional control sample. Adding this 8th sample changes the QUAL scores during genotyping. Some variants that are present in the gVCFs, and that appeared in the VCF when genotyped without the additional sample, are missing once the control sample is added. These originally had borderline QUAL scores (~10), so they must have dropped below the threshold. This makes sense.
However, there are a handful of variants that originally had higher QUAL scores, some in the 100 - 200 range, that are also missing from the VCF when run with the additional sample. I doubted that their QUAL scores could have dropped so far, so I re-ran GenotypeGVCFs with the additional sample and -stand-call-conf set to zero. Those variants now reappear in the VCF with QUAL scores similar to or higher than their original scores.
Actual question:
So, why does switching the -stand-call-conf from 10 (default) to zero decide whether these variants with QUAL > 10 are filtered or not? Are the QUAL scores normalized somehow before filtering? I noticed that the majority of the "missing" variants with QUAL > 20 seem to be small, repetitive indels, most with >=3 alt alleles. Is the QUAL filtering normalized over alleles somehow? Or is there another filter being applied that depends on the value of the QUAL filter?
I must confess, I'm using GATK/3.7 for this, so I understand if this is out of scope, or if this was maybe a known issue in the past that has been resolved.
-
Hi Tyler,
We recommend that you upgrade to the latest version of GATK4 and try again. We no longer support GATK3 and, as you mentioned, there is a high possibility that this issue has been resolved in newer versions.
-
Hi Bhanu,
I've rerun a batch using GATK/4.1.7.0, and I'm still seeing the same effect. In this particular batch:
28 variants were 'missing' when run with my additional control sample.
> of these, 24 originally had 30<QUAL<32, so these make sense since the default cut-off is 30
> the other 4 originally had 40<QUAL<60
I then re-ran with stand-call-conf = 0, as before, to check if the missing variants' QUAL scores truly did fall far enough to fail the original stand-call-conf = 30 filter. No variants are 'missing', as expected.
Of the 28 variants that were previously missed:
> 23 now have QUAL < 30
> 1 went from QUAL = 31.37 --> 43.05
> The other 4 that originally had 40<QUAL<60 are now 40<QUAL<80
So I'm still seeing the same effect. I don't know why these 5 variants are being filtered during genotyping, or why the stand-call-conf level is determining their filtering. However, I do observe once again that when run with my additional control sample, these 5 variants each have at least 2 alternate alleles, whereas the 23 low-QUAL variants each only have 1. Again, this makes me think that the QUAL filtering is normalized by alleles somehow, but it's just a hunch.
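For anyone who wants to repeat this kind of comparison, here is a minimal Python sketch of how the 'missing' sites could be listed along with their original QUAL and alt allele count. It assumes uncompressed, tab-delimited VCF text and matches sites on CHROM, POS and REF only; the function names and the example records are made up for illustration and are not part of GATK.

```python
def parse_sites(vcf_lines):
    """Map (CHROM, POS, REF) -> (QUAL, list of ALT alleles) for each record."""
    sites = {}
    for line in vcf_lines:
        # Skip blank lines and header lines ("##..." and "#CHROM...").
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt, qual = line.rstrip("\n").split("\t")[:6]
        qual_value = None if qual == "." else float(qual)
        sites[(chrom, int(pos), ref)] = (qual_value, alt.split(","))
    return sites


def missing_sites(before_lines, after_lines):
    """Sites present in `before` but absent from `after`.

    Returns (site key, QUAL in `before`, number of ALT alleles in `before`).
    """
    before = parse_sites(before_lines)
    after = parse_sites(after_lines)
    return [
        (key, qual, len(alts))
        for key, (qual, alts) in sorted(before.items())
        if key not in after
    ]


# Hypothetical records, only to show the call pattern.
without_control = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t31.37\t.\t.",
    "chr1\t200\t.\tCAA\tC,CA,CAAA\t57.0\t.\t.",
]
with_control = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t43.05\t.\t.",
]
dropped = missing_sites(without_control, with_control)
```

Tabulating the alt allele count next to QUAL is what makes the multi-allelic pattern easy to spot.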
-
Hi again,
I think I found my answer in GitHub open issue 5793.
Long story short, looks like the QUAL score filtering is performed per-allele. Alleles that fail are removed, but the output QUAL score is unaffected. Hence the presence of seemingly high-QUAL variants with lots of alleles when stand-call-conf is turned down.
From davidbenjamin's comment:
"It's kind of tricky because suppose eg that we have three alt alleles each with an allele qual of 19, so that the overall variant qual is roughly 3x19 = 57. If we filter alleles with a confidence of 20, we get no alleles and the variant qual changes to 0.
Now, if instead of filtering by allele we only filter by overall variant qual we then have to keep an arbitrary number of sketchy alleles. I mean, what if we have 30 alleles each with a qual of 1? The current behavior seems preferable to me because the usual question users would ask downstream is whether some allele is real, not whether some site exhibits variation. As long as we define -stand-call-conf to pertain to alleles everything is consistent."
From ldgauthier's comment:
"So we definitely don't update the QUAL if we drop alternate alleles" ... "Note that the QUAL is based off of the AFResult that had alleles removed if they exceeded the output limit, but not if they had less evidence than the calling confidence threshold."
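To make the arithmetic in those comments concrete, here is a toy Python model of per-allele filtering. It is only a sketch of the behaviour as described in the quotes, not GATK's actual code: the site QUAL is approximated as the sum of the allele quals (the "roughly 3x19 = 57" from the first quote), the use of >= for the threshold comparison is my assumption, and the function and allele names are hypothetical.

```python
def filter_alleles(allele_quals, stand_call_conf):
    """Toy model: keep alleles whose own qual meets the threshold.

    allele_quals: dict mapping an alt allele label to its per-allele qual.
    Returns (kept alleles, approximate site QUAL, whether the site is emitted).
    """
    # Assumption: an allele passes if its qual is >= the threshold.
    kept = {a: q for a, q in allele_quals.items() if q >= stand_call_conf}
    # Rough approximation from the quoted comment; it is computed over all
    # alleles, so dropping low-confidence alleles does not change it.
    approx_site_qual = sum(allele_quals.values())
    # With no surviving alt alleles there is nothing left to report.
    emitted = bool(kept)
    return kept, approx_site_qual, emitted


# Three alt alleles with an allele qual of 19 each, as in the quoted example.
alleles = {"alt1": 19.0, "alt2": 19.0, "alt3": 19.0}
strict = filter_alleles(alleles, stand_call_conf=20)
lenient = filter_alleles(alleles, stand_call_conf=0)
```

In this toy model, the strict call keeps no alleles even though the approximate site QUAL is well above 20, while the lenient call keeps all three with the same site QUAL. That matches what I saw: multi-allelic sites with a seemingly high QUAL only show up when stand-call-conf is lowered.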