MQRankSum Bug?
Hello,
I'm wondering if this is a bug.
Please see the variant called by GATK HaplotypeCaller below. There are 6549 reads supporting the alt allele, and only 1 read supporting the reference allele. According to the docs, MQRankSum is described as follows: "A positive value means the mapping qualities of the reads supporting the alternate allele are higher than those supporting the reference allele; a negative value indicates the mapping qualities of the reference allele are higher than those supporting the alternate allele".
In this example below we see the MQRankSum = -13, which means that "the mapping qualities of the reference allele are higher than those supporting the alternate allele". Is this suggesting that the mapping quality of the single read supporting the reference allele is so much greater than the mapping qualities of all 6549 reads supporting the alternate allele such that the MQRankSum value is so negative?
I 22345 . G A 295797.04 MQRankSum_filter AC=1;AF=1.00;AN=1;BaseQRankSum=-0.562;DP=6606;FS=0.000;MLEAC=1;MLEAF=1.00;MQ=59.99;MQRankSum=-13.050;QD=29.02;ReadPosRankSum=-0.114;SOR=0.258 GT:AD:DP:GQ:PL 1:1,6549:6550:99:295807,0
Another example from an adjacent SNP with an even more negative MQRankSum:
I 22346 . G C 295419.01 MQRankSum_filter AC=1;AF=1.00;AN=1;BaseQRankSum=-0.459;DP=6498;FS=0.000;MLEAC=1;MLEAF=1.00;MQ=60.00;MQRankSum=-23.266;QD=32.90;ReadPosRankSum=-0.189;SOR=0.258 GT:AD:DP:GQ:PL 1:1,6495:6496:99:295429,0
To make things more puzzling, here is the SNP that directly precedes these two SNPs and has an MQRankSum very close to 0. I'm providing this to show that the reads supporting the alt alleles above cannot be that poor in terms of mapping quality:
I 22344 . G A 259089.01 PASS AC=1;AF=1.00;AN=1;BaseQRankSum=3.174;DP=6542;FS=0.000;MLEAC=1;MLEAF=1.00;MQ=59.99;MQRankSum=-0.141;QD=25.42;ReadPosRankSum=2.858;SOR=0.157 GT:AD:DP:GQ:PL 1:9,6327:6336:99:259099,0
-
Hi mk
I checked with our dev team and this is what they said:
When we calculate the z-score for any ranksum statistic (which is the value written out in the corresponding INFO annotation of the VCF), we are comparing a Mann-Whitney test statistic to its expected distribution under the null hypothesis. To get the expected distribution under the null hypothesis, we can either calculate it exactly, by counting the number of permutations of ref and alt labels that would lead to a smaller test statistic, or use a normal approximation. Our current implementation of the exact calculation is rather inefficient, so it can only be used when we have a small number (less than ~10 to 20) of total reads. The problem is that the normal approximation is only good when there are both a large number of ref AND a large number of alt reads. So there are scenarios, such as a large number of alt reads but a small number of ref reads, where we have to use the normal approximation even though it may be wildly inaccurate. This appears to be exactly the situation you are observing.
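To make the difference between the two approaches concrete, here is a minimal Python sketch. It is not GATK's implementation: the function names and the mapping-quality values are made up for illustration, the normal approximation is the textbook one without a tie correction, and the exact calculation is the brute-force relabeling described above, so it is only feasible for a handful of reads.

```python
from itertools import combinations
from math import erfc, sqrt


def u_statistic(ref, alt):
    """Mann-Whitney U for the alt group: count (alt, ref) pairs where the
    alt value is higher; tied pairs count as 0.5."""
    return sum(1.0 if a > r else 0.5 if a == r else 0.0
               for a in alt for r in ref)


def normal_z(ref, alt):
    """Z-score from the textbook normal approximation (no tie correction).
    Positive means alt values tend to be higher than ref values."""
    n_ref, n_alt = len(ref), len(alt)
    mean = n_ref * n_alt / 2
    sd = sqrt(n_ref * n_alt * (n_ref + n_alt + 1) / 12)
    return (u_statistic(ref, alt) - mean) / sd


def exact_upper_p(ref, alt):
    """Exact upper-tail p-value: the fraction of ways to relabel the pooled
    reads as ref/alt that give a U at least as large as the observed one."""
    pooled = list(ref) + list(alt)
    observed = u_statistic(ref, alt)
    hits = total = 0
    for ref_idx in combinations(range(len(pooled)), len(ref)):
        chosen = set(ref_idx)
        relabeled_ref = [pooled[i] for i in chosen]
        relabeled_alt = [v for i, v in enumerate(pooled) if i not in chosen]
        total += 1
        hits += u_statistic(relabeled_ref, relabeled_alt) >= observed
    return hits / total


# Hypothetical mapping qualities: one ref read, three alt reads.
ref_mq = [10]
alt_mq = [20, 30, 40]

z = normal_z(ref_mq, alt_mq)
approx_p = 0.5 * erfc(z / sqrt(2))  # upper tail of the standard normal
exact_p = exact_upper_p(ref_mq, alt_mq)
print(f"z = {z:.3f}, normal-approx p = {approx_p:.3f}, exact p = {exact_p:.3f}")
```

With a single ref read, the exact p-value can never drop below 1 / (total reads), because the ref read is equally likely to land at any rank under the null hypothesis. The normal approximation does not respect that limit, which is one way it can mislead when one group is tiny.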
Fortunately, after searching through the literature, it appears that there are some much more efficient algorithms for performing the exact calculation. I’m working through some thoughts on how to implement one of them, so a possible improvement is incoming. This is something we are investigating. Unfortunately I cannot promise a timeline on when we will resolve this, but rest assured we are looking into it.
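For readers curious what a more efficient exact calculation can look like, below is a sketch of one classic approach: a recurrence that counts how many ref/alt labelings give each value of U, without enumerating the labelings themselves. This is not necessarily the algorithm the dev team has in mind. The function names are hypothetical, and the sketch assumes there are no tied values, which real mapping qualities usually violate, so a practical version would need extra handling for ties.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_u(n_alt, n_ref, u):
    """Number of orderings of n_alt alt reads and n_ref ref reads (no ties)
    in which exactly u (alt, ref) pairs have the alt value higher."""
    if u < 0:
        return 0
    if n_alt == 0 or n_ref == 0:
        return 1 if u == 0 else 0
    # Look at the highest-ranked read. If it is an alt read, it beats all
    # n_ref ref reads; if it is a ref read, it adds nothing to U.
    return count_u(n_alt - 1, n_ref, u - n_ref) + count_u(n_alt, n_ref - 1, u)


def exact_upper_tail(n_alt, n_ref, observed_u):
    """Exact probability under the null hypothesis that U >= observed_u."""
    total = comb(n_alt + n_ref, n_alt)
    tail = sum(count_u(n_alt, n_ref, u)
               for u in range(observed_u, n_alt * n_ref + 1))
    return tail / total


# Three alt reads that all rank above a single ref read: U = 3.
print(exact_upper_tail(3, 1, 3))
```

Because intermediate counts are cached, the work grows polynomially with the number of reads rather than with the number of possible labelings. The recursion depth grows with the total read count, so an iterative table would be a better fit for thousands of reads.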
-
Thank you.