BaseQualityRankSumTest

GATK Team

July 30, 2024 15:32
Updated

Rank sum test of REF versus ALT base quality scores (BaseQRankSum)

Category Variant Annotations

Overview

Rank Sum Test of REF versus ALT base quality scores

This variant-level annotation compares the base qualities of the data supporting the reference allele with those supporting the alternate allele. The ideal result is a value close to zero, which indicates there is little to no difference. A negative value indicates that the bases supporting the alternate allele have lower quality scores than those supporting the reference allele. Conversely, a positive value indicates that the bases supporting the alternate allele have higher quality scores than those supporting the reference allele. Finding a statistically significant difference either way suggests that the sequencing process may have been biased or affected by an artifact.

Statistical notes

The value output for this annotation is the U-based z-approximation from the Mann-Whitney-Wilcoxon Rank Sum Test for base qualities (bases supporting REF vs. bases supporting ALT). See the method document on statistical tests for a more detailed explanation of the rank sum test.
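As a rough illustration of the statistic described above, here is a textbook Mann-Whitney U z-approximation in Python. It is a sketch, not GATK's implementation: the function name rank_sum_z, the tie correction, and the sign convention (ALT rank sum relative to its expected value, so that lower ALT qualities give a negative z as in the Overview) are assumptions made for this example.

```python
from collections import Counter
from math import sqrt

def rank_sum_z(ref_quals, alt_quals):
    """Textbook U-based z-approximation (illustrative, not GATK's code).

    Returns None when either allele has no supporting bases or when all
    qualities are identical, since no z-score can be computed then.
    """
    n_ref, n_alt = len(ref_quals), len(alt_quals)
    if n_ref == 0 or n_alt == 0:
        return None
    n = n_ref + n_alt

    # Assign each distinct quality the average rank of its tied group.
    counts = Counter(list(ref_quals) + list(alt_quals))
    rank_of = {}
    next_rank = 1
    for value in sorted(counts):
        ties = counts[value]
        rank_of[value] = next_rank + (ties - 1) / 2
        next_rank += ties

    # U statistic for ALT and its mean under the null hypothesis.
    rank_sum_alt = sum(rank_of[q] for q in alt_quals)
    u_alt = rank_sum_alt - n_alt * (n_alt + 1) / 2
    mean_u = n_ref * n_alt / 2

    # Variance of U with the standard correction for ties.
    tie_term = sum(t**3 - t for t in counts.values()) / (n * (n - 1))
    var_u = n_ref * n_alt / 12 * ((n + 1) - tie_term)
    if var_u == 0:
        return None
    return (u_alt - mean_u) / sqrt(var_u)

# Toy qualities, invented for illustration only.
print(rank_sum_z([40, 40, 40], [20, 20, 20]))  # ALT lower than REF: negative
print(rank_sum_z([30, 31, 32], [30, 31, 32]))  # identical groups: zero
print(rank_sum_z([30, 30], []))                # no ALT bases: None
```

With these toy inputs the sign follows the interpretation given in the Overview, and the function returns None when one allele has no supporting bases.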

Caveat

The base quality rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

GATK version 4.2.0.0-SNAPSHOT built at Mon, 22 Feb 2021 13:44:49 -0800.

2 comments

Manuel Dominguez Becerra

July 30, 2024 12:39

We are developing a pipeline and we think that we can detect False Positive (FP) variations using this approach. We found VariantAnnotator and this parameter very useful but we don't understand the results of this BaseQualityRankSum test in some scenarios.

We have a training dataset, so we know which variants are FP and which are True Positive (TP). Consider the following TPs:

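CHROM | POS | ID | REF | ALT | QUAL | FILTER | BaseQRankSum | DP | ReadPosRankSum | AMPLICON | VAF | REF_MEAN_STD_QUAL | ALT_MEAN_STD_QUAL
chr13 | 32355095 | . | . | . | . | PASS | -35.99 | . | -2.167 | BRCA2_14B_F_P5 | 50.34483 | 38.90±1.13 | 37.82±1.23
chr15 | 48437011 | . | . | . | . | PASS | -45.383 | . | 1.956 | FBN1_52_F_P5_02 | 50.49712 | 38.76±1.78 | 38.15±2.03
chr13 | 32355095 | . | . | . | . | PASS | -40.661 | . | 1.995 | BRCA2_14B_F_P5 | 48.87064 | 38.85±1.36 | 37.69±2.03
chr10 | 87957934 | . | . | . | . | PASS | -47.142 | . | -1.828 | PTEN_07_F_P5 | 48.84685 | 38.80±1.62 | 37.86±1.96
chr17 | 31326275 | . | . | . | . | PASS | -41.377 | . | 0.354 | NF1_28B_F_P5 | 52.77097 | 38.89±0.92 | 37.83±1.35
chr13 | 32339049 | . | AGACC | . | . | PASS | -37.756 | . | 42.155 | BRCA2_11H_F_P5 | . | 38.33±1.41 | NaN

("." marks a cell left empty in the post. In the last row, the post does not show whether AGACC is the REF or the ALT allele, or which column 42.155 belongs to; both are placed here by their order in the post.)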

Why do these TP variants get such a low BaseQRankSum? The table shows the average base quality, with SD, of the reads supporting REF and of the reads supporting ALT. The VAF is around 50% and the depth is always >1000 reads, so there should be enough reads in both groups to compute a better Z-score.
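One way to get a feel for the magnitudes involved is a back-of-the-envelope z computed from the summary statistics in the table. The sketch below uses a normal-approximation two-sample z (difference of means over its standard error), which is a different statistic from the rank sum test that GATK reports. The helper name approx_z and the read counts per allele (500, a guess from depth >1000 and VAF near 50%, and 10 for contrast) are assumptions, not values from the post.

```python
from math import sqrt

def approx_z(ref_mean, ref_sd, alt_mean, alt_sd, n_ref, n_alt):
    """Two-sample z from summary statistics: (ALT mean - REF mean) / standard error."""
    standard_error = sqrt(ref_sd**2 / n_ref + alt_sd**2 / n_alt)
    return (alt_mean - ref_mean) / standard_error

# First table row: REF 38.90±1.13, ALT 37.82±1.23.
# Read counts are assumed for illustration, not taken from the post.
z_deep = approx_z(38.90, 1.13, 37.82, 1.23, n_ref=500, n_alt=500)
z_shallow = approx_z(38.90, 1.13, 37.82, 1.23, n_ref=10, n_alt=10)

print(f"500 reads per allele: z = {z_deep:.1f}")
print(f" 10 reads per allele: z = {z_shallow:.1f}")
```

The same difference of about one quality point gives a much larger |z| at 500 reads per allele than at 10, because the standard error shrinks with the square root of the read count.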

I have plotted all my data to see the VAF against the BaseQRankSum.

I am looking for an approach to reduce FP without misclassifying TP. With the results I have in the previous plot, this task is not possible.

However, if I plot VAF against the average base quality of reads supporting REF minus the average base quality of reads supporting ALT, we can see a better distribution of FP and TP.

Z = -40

The two-tailed P value is less than 0.0001.
By conventional criteria, this difference is considered to be extremely statistically significant. So why is the test telling me that the TP variants in the table are, with extreme statistical significance, FP?
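For reference, the two-tailed P value quoted above can be reproduced from a z-score using the standard normal distribution. This is a minimal sketch using only Python's math module; two_tailed_p is a hypothetical helper name. The P value refers to the null hypothesis that REF and ALT base qualities come from the same distribution.

```python
from math import erfc, sqrt

def two_tailed_p(z):
    """Two-tailed p-value for a z-score under the standard normal distribution."""
    return erfc(abs(z) / sqrt(2))

for z in (-1.96, -40.0):
    print(f"z = {z:6.2f}  two-tailed p = {two_tailed_p(z):.3g}")
```

A z-score of -1.96 corresponds to a P value of about 0.05, while a z-score of -40 gives a P value far below 0.0001.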

The code I used is

apptainer exec --bind /mnt:/mnt docker://broadinstitute/gatk:4.6.0.0 \
        /gatk/gatk VariantAnnotator \
            -R $bwarefgenomepath \
            -V "$RunID"_"$Sample_Name".vcf \
            -I "$RunID"_"$Sample_Name".bam \
            -O "$RunID"_"$Sample_Name"_annotated_variants.vcf \
            -L "$ampliconscoordenatesbed" \
            -A BaseQualityRankSumTest \
            -A ReadPosRankSumTest \
            --verbosity DEBUG
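The annotations requested by the command above end up as key=value entries in the INFO column of the output VCF. A minimal sketch of pulling them out of one record with plain Python follows. The helper name rank_sum_annotations is hypothetical, and the example record is made up: its position and annotation values are copied from the first table row, while the REF and ALT alleles are placeholders.

```python
def rank_sum_annotations(vcf_line, keys=("BaseQRankSum", "ReadPosRankSum")):
    """Return (chrom, pos, {key: value}) for the requested INFO keys of one VCF record."""
    fields = vcf_line.rstrip("\n").split("\t")
    info = fields[7]  # INFO is the eighth tab-separated VCF column
    parsed = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        # Flag-style entries have no "=" and are skipped.
        if sep and key in keys:
            parsed[key] = float(value)
    return fields[0], int(fields[1]), parsed

# Made-up record: A and G are placeholder alleles, not from the post.
example_line = "chr13\t32355095\t.\tA\tG\t.\tPASS\tBaseQRankSum=-35.99;ReadPosRankSum=-2.167"
print(rank_sum_annotations(example_line))
```

Records where an annotation could not be computed simply lack the key, so the returned dictionary will not contain it.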

Manuel Dominguez Becerra

July 30, 2024 15:32
Interestingly, MQRankSum uses the same statistical test as BaseQualityRankSumTest, and that analysis seems to work fine.

Are both Z-scores calculated in the same way?