Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

BaseQualityRankSumTest

2 comments

  • Manuel Dominguez Becerra

    We are developing a pipeline, and we think we can detect false positive (FP) variants using this approach. We found VariantAnnotator and this annotation very useful, but we don't understand the results of the BaseQualityRankSum test in some scenarios.

    We have a training dataset, so we know which variants are FP and which are true positives (TP). Consider the following TP variants:

    | CHROM | POS | ID | REF | ALT | QUAL | FILTER | BaseQRankSum | DP | ReadPosRankSum | AMPLICON | VAF | REF_MEAN_STD_QUAL | ALT_MEAN_STD_QUAL |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|---|
    | chr13 | 32355095 | | A | G | 100 | PASS | -35.99 | 1447 | -2.167 | BRCA2_14B_F_P5 | 50.34483 | 38.90±1.13 | 37.82±1.23 |
    | chr15 | 48437011 | | T | C | 100 | PASS | -45.383 | 7601 | 1.956 | FBN1_52_F_P5_02 | 50.49712 | 38.76±1.78 | 38.15±2.03 |
    | chr13 | 32355095 | | A | G | 100 | PASS | -40.661 | 1939 | 1.995 | BRCA2_14B_F_P5 | 48.87064 | 38.85±1.36 | 37.69±2.03 |
    | chr10 | 87957934 | | T | C | 100 | PASS | -47.142 | 3843 | -1.828 | PTEN_07_F_P5 | 48.84685 | 38.80±1.62 | 37.86±1.96 |
    | chr17 | 31326275 | | T | C | 100 | PASS | -41.377 | 2035 | 0.354 | NF1_28B_F_P5 | 52.77097 | 38.89±0.92 | 37.83±1.35 |
    | chr13 | 32339049 | | A | AGACC | 100 | PASS | -37.756 | 1779 | 42.155 | BRCA2_11H_F_P5 | 0 | 38.33±1.41 | NaN |
     

    Why do these TP variants get such a low BaseQRankSum? The table shows the mean base quality (± SD) of the reads supporting REF and of the reads supporting ALT. The VAF is around 50% and the depth is always above 1000 reads, so there should be enough reads in both groups to compute a better Z-score.
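One possible reading of these numbers: a rank-sum Z-score measures how consistently one group ranks below the other, scaled by the number of observations, so its magnitude tends to grow with depth even when the shift in mean quality is small. The sketch below uses the textbook Wilcoxon rank-sum formula (normal approximation with tie correction); it is not GATK's code, and GATK's implementation may differ in detail. The base qualities are simulated as rounded Gaussians around the means and SDs of the first table row, which is an assumption made purely for illustration, and `rank_sum_z` and `simulate` are hypothetical helper names.

```python
import math
import random

def rank_sum_z(alt, ref):
    """Textbook Wilcoxon rank-sum z-score (normal approximation, tie-corrected).

    Negative when ALT values tend to rank below REF values.
    This is an illustrative sketch, not GATK's implementation.
    """
    n1, n2 = len(alt), len(ref)
    n = n1 + n2
    # Label 0 = ALT, 1 = REF, then sort the pooled values.
    pooled = sorted([(v, 0) for v in alt] + [(v, 1) for v in ref])
    rank_sum_alt = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i                      # size of this group of tied values
        mid_rank = (i + 1 + j) / 2.0   # average of the 1-based ranks i+1..j
        alt_in_group = sum(1 for k in range(i, j) if pooled[k][1] == 0)
        rank_sum_alt += mid_rank * alt_in_group
        tie_term += t ** 3 - t
        i = j
    u = rank_sum_alt - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return 0.0  # every value tied: no ranking information
    return (u - mean_u) / math.sqrt(var_u)

def simulate(depth, seed=0):
    """Simulated (assumed Gaussian) qualities, split 50/50 between REF and ALT."""
    rng = random.Random(seed)
    half = depth // 2
    ref = [round(rng.gauss(38.90, 1.13)) for _ in range(half)]
    alt = [round(rng.gauss(37.82, 1.23)) for _ in range(half)]
    return rank_sum_z(alt, ref)

if __name__ == "__main__":
    # Same assumed quality distributions, increasing depth.
    for depth in (20, 100, 1447):
        print(depth, round(simulate(depth), 2))
```

With the same assumed distributions, the Z-score becomes more negative as depth increases. A depth above 1000 therefore makes a small but consistent quality difference look extreme rather than making the Z-score "better".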

    I have plotted all my data to compare VAF against BaseQRankSum.

    I am looking for an approach that reduces FP without misclassifying TP. With the results in that plot, this is not possible.

    However, if I plot VAF against the mean base quality of the reads supporting REF minus the mean base quality of the reads supporting ALT, FP and TP separate much better.
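The quality difference described here can be computed directly from the table columns. The following is a minimal sketch: the rows are copied from the TP table above, while `parse_mean` and `quality_delta` are hypothetical helper names introduced only for illustration. Unlike a rank-sum Z-score, this difference does not grow with read depth.

```python
import math

# (VAF, REF_MEAN_STD_QUAL, ALT_MEAN_STD_QUAL) copied from the table of TP variants.
ROWS = [
    (50.34483, "38.90±1.13", "37.82±1.23"),
    (50.49712, "38.76±1.78", "38.15±2.03"),
    (48.87064, "38.85±1.36", "37.69±2.03"),
    (48.84685, "38.80±1.62", "37.86±1.96"),
    (52.77097, "38.89±0.92", "37.83±1.35"),
    (0.0, "38.33±1.41", "NaN"),
]

def parse_mean(cell):
    """Return the mean from a 'mean±sd' cell; a bare 'NaN' parses to float('nan')."""
    return float(cell.split("±")[0])

def quality_delta(ref_cell, alt_cell):
    """Mean REF base quality minus mean ALT base quality."""
    return parse_mean(ref_cell) - parse_mean(alt_cell)

if __name__ == "__main__":
    for vaf, ref_cell, alt_cell in ROWS:
        delta = quality_delta(ref_cell, alt_cell)
        print(f"VAF={vaf:9.5f}  delta_Q={delta:.2f}")
```

The insertion row has no ALT mean in the table, so its difference is NaN and would need separate handling before plotting.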

     

    For Z = -40, the two-tailed P value is less than 0.0001. By conventional criteria, this difference is considered extremely statistically significant. So why does the test seem to tell me that the TP variants in the table are, with extreme statistical significance, FP?
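For reference, the conversion from a Z-score to a two-tailed P value can be sketched with the standard normal distribution, using only the Python standard library (`two_tailed_p` is a hypothetical helper name). As far as the statistic itself goes, the null hypothesis being rejected is that the REF-supporting and ALT-supporting base qualities come from the same distribution. It is not, on its own, a statement that the variant is a false positive.

```python
import math

def two_tailed_p(z):
    """Two-tailed p-value for a z-score under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))

if __name__ == "__main__":
    print(two_tailed_p(-1.96))  # about 0.05, the conventional threshold
    print(two_tailed_p(-40.0))  # far below 0.0001
```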

    The command I used is:

     
    apptainer exec --bind /mnt:/mnt docker://broadinstitute/gatk:4.6.0.0 \
        /gatk/gatk VariantAnnotator \
            -R "$bwarefgenomepath" \
            -V "${RunID}_${Sample_Name}.vcf" \
            -I "${RunID}_${Sample_Name}.bam" \
            -O "${RunID}_${Sample_Name}_annotated_variants.vcf" \
            -L "$ampliconscoordenatesbed" \
            -A BaseQualityRankSumTest \
            -A ReadPosRankSumTest \
            --verbosity DEBUG
  • Manuel Dominguez Becerra

    Interestingly, MQRankSum uses the same statistical test as BaseQualityRankSumTest, and that analysis seems to work fine.

    Are both Z-scores calculated in the same way?
