Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

strand_bias filtering is never emitted

Answered
0

7 comments

  • Avatar
    Genevieve Brandt (she/her)

    Hi GE, could you supply more information so that we can thoroughly look into this? https://gatk.broadinstitute.org/hc/en-us/articles/360053424571-How-to-Write-a-Post

    For this, it would be helpful to see the commands you ran in your Mutect2 pipeline, so we can see if there are possible issues.

    Best,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    G E

    Hi,

    After further investigation, my prior post is not exactly correct. Let me explain the full situation, and sorry for the confusion:

    I'm using FilterMutectCalls from GATK 4.2.0.0 (I reran and confirmed this happens with 4.2.0.0), using the standard Terra Mutect2 workflow:

    For one of my samples called in tumor-only mode, only 3 variants out of 248,244 were labeled with "strand_bias".

    chr3 130509501 . C G . base_qual;panel_of_normals;strand_bias AS_FilterStatus=base_qual,strand_bias;AS_SB_TABLE=25,7|0,18;DP=52;ECNT=2;GERMQ=9;MBQ=30,10;MFRL=430,392;MMQ=60,60;MPOS=48;PON;POPAF=7.3;ROQ=70;TLOD=46.8 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:32,18:0.354:50:17,2:11,5:25,7,0,18
    chr8 98288909 . T G . base_qual;strand_bias;weak_evidence AS_FilterStatus=weak_evidence,base_qual,strand_bias;AS_SB_TABLE=24,4|0,8;DP=37;ECNT=1;GERMQ=22;MBQ=30,10;MFRL=399,398;MMQ=60,60;MPOS=51;POPAF=7.3;ROQ=39;TLOD=4.55 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:28,8:0.182:36:13,0:15,2:24,4,0,8
    chr11 27722908 . C G . panel_of_normals;strand_bias AS_FilterStatus=strand_bias;AS_SB_TABLE=24,2|0,15;DP=41;ECNT=1;GERMQ=10;MBQ=30,20;MFRL=468,415;MMQ=60,60;MPOS=47;PON;POPAF=7.3;ROQ=93;TLOD=52.49 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:26,15:0.372:41:12,4:13,5:24,2,0,15

    However, upon investigating the final PASS variants, I found a large number of T>G and A>C variants which had clear strand bias, but perhaps at lower levels than the strand_bias threshold? Note: I believe this is a sequencing artifact from Novaseq data.

    A few examples:

    chr1 113957274 . A C . PASS AS_FilterStatus=SITE;AS_SB_TABLE=8,17|5,0;DP=31;ECNT=1;GERMQ=27;MBQ=30,20;MFRL=371,474;MMQ=60,60;MPOS=27;POPAF=7.3;ROQ=54;TLOD=6.78 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:25,5:0.179:30:13,2:7,1:8,17,5,0

    chr3 36857684 . T G . PASS AS_FilterStatus=SITE;AS_SB_TABLE=15,20|7,0;DP=43;ECNT=1;GERMQ=41;MBQ=30,30;MFRL=426,386;MMQ=60,60;MPOS=61;POPAF=7.3;ROQ=15;TLOD=8.45 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:35,7:0.17:42:10,6:23,0:15,20,7,0

    chr22 25973864 . A C . PASS AS_FilterStatus=SITE;AS_SB_TABLE=19,12|0,7;DP=39;ECNT=1;GERMQ=25;MBQ=30,20;MFRL=383,432;MMQ=60,60;MPOS=48;POPAF=7.3;ROQ=22;TLOD=8.33 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:31,7:0.192:38:13,0:15,6:19,12,0,7

    https://app.terra.bio/#workspaces/help-gatk/Somatic-SNVs-Indels-GATK4/workflows/help-gatk/2-Mutect2-GATK4

     

    So the questions are:

    1) Why were these variants not called as 'strand_bias'?

    2) Is there any way to adjust the sensitivity/parameters of the strand_bias filter in FilterMutectCalls?

    3) Are you aware of this (I think Novaseq) artifact, and should there be an additional filter targeted for it specifically?

    Thanks.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi G E,

    I spoke with our Mutect2 dev team regarding this issue. FilterMutectCalls uses a model to figure out the rate of strand bias in your sample and uses that probability to determine the likelihood of strand bias at each site. In our testing, we have found that strand bias artifacts occur much less frequently than people would think.

    Generally, we find the likelihood of strand bias is 1/10,000. And for example in your first PASS variant example with SB=8,17,5,0: 5,0 or 0,5 has a 1/16 likelihood of occurring, which is much more probable than the 1/10,000. You can read more about the strand bias filter in the Notes on Mutect2 paper, in section 2D: Strand Artifact Model.

    Here are the more specific answers to your questions:

    1. FilterMutectCalls did not find that they were likely to be an example of strand bias. We have done significant testing for strand bias artifacts and found that FilterMutectCalls is successful in filtering these out, so you can feel confident in the results. 
    2. There are no arguments to adjust the strand bias filter. We wouldn't recommend changing this filter, as we have done significant testing and found that FilterMutectCalls is successful and that strand bias artifacts are less common than people think.
    3. For NovaSeq data, we recommend running the read orientation bias filter, which you are already doing. We have evaluated NovaSeq data and found that this filter does a good job removing orientation bias artifacts. You should not need to make any other adjustments.

    Hope this helps, 

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    G E

    Hi Genevieve,

    Thank you very much for this detailed follow-up. However, I have seen in Novaseq data following Mutect2, many T>G and A>C variants way out of proportion to expected numbers of variants, which all have this low level strand bias (0 reads in one strand, and 3-10 reads in the other strand). For some I have validated by Sanger that they are artifacts. This has been published by others, for example:

    https://www.nature.com/articles/s41598-019-55636-3

    "Even after collapsing the mismatch types, we saw more T > G/A > C mismatches in the NovaSeq data. "  and "We therefore think that a lot of these T > G variants called only in NovaSeq data may in fact be artifactual calls that could be filtered out if we include more NovaSeq samples in our panel of normals. "

    I have also informed Illumina of this systematic artifact, and they do not have a mechanistic explanation for how this happens.

    I suggest you all might want to take a more detailed look, specifically at T>G and A>C strand-biased variants in normal tissue Novaseq samples in tumor-only mode. Some samples have more of this artifact than others.

    Note also that a variant showing up at low level with 0 reads in one strand and 5 reads in the other strand formally has a 1/16 likelihood of occurring, however, in the context of an artifact with these characteristics, the threshold (prior probability for filtering) would be adjusted accordingly.

    0
    Comment actions Permalink
  • Avatar
    G E

    I also wanted to add an idea for detecting residual systematic strand bias: Analyzing what % of variants have >3 reads in one strand and 0 in the other strand relative to expectation, for different sequence contexts.

    0
    Comment actions Permalink
  • Avatar
    David Benjamin

    G E We studied those particular Novaseq artifacts extensively a couple years ago and also informed Illumina, and found the read orientation filter to be effective.  Your data show some questionable PASS variants with fairly low ROQ (read orientation phred-scaled quality scores), which suggest that the model is doing its job but the default balance between sensitivity and precision in FilterMutectCalls allows them to pass filtering.  You can adjust the `-f-score-beta` parameter if you want more precision at the cost of some sensitivity.

    If you want to aim for a particular maximum false discovery rate rather than rely on the F score optimization you can set `-threshold-strategy FALSE_DISCOVERY_RATE -false-discovery-rate <desired max FDR goes here>` in the FilterMutectCalls command.

    0
    Comment actions Permalink
  • Avatar
    G E

    Thanks very much! This is very helpful.

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk