Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

Mutect2 issue with --force-active: not all regions are active

1

23 comments

  • Avatar
    Genevieve Brandt (she/her)

    Hi Maria Litovchenko,

    Thank you for this detailed explanation! We are looking into this closely to determine what is causing this pattern you are seeing.

    Could you give more details regarding how you downsampled your reads and in what step of your pipeline?

    It would also be helpful to see these comparison variants after FilterMutectCalls. The raw output of Mutect2 contains many false positives, and so I am wondering if these variants would be filtered or not.

    Best,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Maria Litovchenko

    Hi Genevieve,

    Thank you for coming back to me!

    Regarding the downsampling, I posted the code in the GitHub folder mentioned in the original message: https://github.com/McGranahanLab/mutect2_forceActive_behaviour  , the script name is 1_downsample_consequent.sh. In my pipeline I downsample the bam file just after ApplyBQSR, so the bam is essentially ready for further processing such as variant calling. Shortly, I perform downsampling with samtools view -bs command, with seed fixed at 42. I do it sequentially to avoid non-overlapping between the different percentages of downsampling reads. I think the sequential downsampling is easier to grasp with the code so I also put the code from 1_downsample_consequent.sh below:

    #!/bin/bash
    # -----------------------------------------------------------------------------
    # DESCRIPTION
    # Script to perform CONSEQUITIVE downsampling of bam files to 25%, 50%, 75%,
    # 90% of total reads. Consequitive downsampling means that to create, for
    # example, 75% downsampled bam, we will downsample 90% bam, rather than
    # original 100% bam. This techniques removes a factor of random downsampling
    # and therefore presence/absence of some variants due to change of sampling or
    # not sampling alternative allele. Downsampling is performed with samtools.
    #
    # HOW TO RUN: chmod +x 1_downsample_consequent.sh
    # ./1_downsample_consequent.sh
    #
    # INPUT FILE: S6.bam
    # INPUT VALUES: FRACT - fractions (percentages) to downsample bam to and SEED
    # seed to be used in samtools
    #
    # OUTPUT FILE(S): folder downsampled_bams will be created. It will contain
    # downsampled bam files. Names fill follow pattern:
    # S6.downsampling_percentage.bam
    #
    # ATTENTION: MAKE SURE SAMTOOLS v1.9 IS IN YOUR PATH
    # -----------------------------------------------------------------------------
    # -----------------------------------------------------------------------------
    # LOAD ALL NESSECARY SOFTWARE
    # -----------------------------------------------------------------------------
    # MAKE SURE SAMTOOLS v1.9 IS IN YOUR PATH
    # -----------------------------------------------------------------------------
    # Inputs
    # -----------------------------------------------------------------------------
    BASE_DIR=`pwd`
    FULL_BAM='S6.bam'
    # fractions we want to downsample to. Always in descending order! Always
    # include 100 as first element
    FRACT=( 100 90 75 50 25 )
    # seed to be used in samtools
    SEED=42
    # -----------------------------------------------------------------------------
    # Outputs
    # -----------------------------------------------------------------------------
    DOWNSAMP_DIR=$BASE_DIR"/downsampled_bams/"
    mkdir -p $DOWNSAMP_DIR
    # -----------------------------------------------------------------------------
    # Perform CONSEQUITIVE downsampling
    # -----------------------------------------------------------------------------
    sample=`basename $FULL_BAM`
    sample=`echo $sample | sed 's/.bam//g'`
    for i in "${!FRACT[@]}"; do
        if [ "$i" -gt 0 ]; then
           fract="${FRACT[$i]}"
           # calculate updated fraction: a % which is needed to take from bam
           # from the previous step in order to achieve original desired fraction
           prevIdx=$(expr $i - 1)
           prevFract="${FRACT[$prevIdx]}"
           updFract=$(expr "${FRACT[$i]}" \* 100 / $prevFract)
           updFract=$(bc -l <<<$updFract)
           # get bam to downsample from: either original, or
           if [ "$i" -eq 1 ]; then
              bamToDowns=$FULL_BAM
           else
              bamToDowns=$DOWNSAMP_DIR$sample"."$prevFract".bam"
           fi
           downsampledBAM=$DOWNSAMP_DIR$sample"."$fract".bam"
           CURR_TIME=`date`
           echo '[' $CURR_TIME '] Started downsampling to '$fract' from '$bamToDowns
           echo '[' $CURR_TIME '] Updated fraction: '$updFract
           samtools view -bs $SEED"."$updFract $bamToDowns > $downsampledBAM
           samtools index $downsampledBAM
           CURR_TIME=`date`
           echo '[' $CURR_TIME '] Finished downsampling to '$fract' from '$bamToDowns
        fi
    done

    # copy full bam to downsampling folder to have everything in one place
    cp $FULL_BAM $DOWNSAMP_DIR$sample'.100.bam'
    samtools index $DOWNSAMP_DIR$sample'.100.bam'

    Also to improve the reproducibility of the issue I uploaded my original downsampled bam files here: https://github.com/McGranahanLab/mutect2_forceActive_behaviour/tree/master/downsampled_bams .

    As for the filtering, here things get even more interesting. The command to perform filtering was:

    gatk FilterMutectCalls --java-options "-Xmx4g -Xms4g -XX:ParallelGCThreads=1" \
    --reference $refGenDir$refGenName \
    -V $MUTECT2_DIR'/'$TUMOR_ID"."$fract".m2.vcf" \
    -O $MUTECT2_DIR'/'$TUMOR_ID"."$fract".m2.filt.vcf"

    and the command to extract passed variants:

    grep '^#' $MUTECT2_DIR'/'$TUMOR_ID"."$fract".m2.filt.vcf" > $MUTECT2_DIR'/'$TUMOR_ID"."$fract".m2.passFilt.vcf"
    grep -w PASS $MUTECT2_DIR'/'$TUMOR_ID"."$fract".m2.filt.vcf" | grep -v '^#' >> $MUTECT2_DIR'/'$TUMOR_ID"."$fract".m2.passFilt.vcf"

    Next, I uploaded vcf files containing variants which passed the filters into IGV:

    As you can see, sets of passed  filter variants between different percentages of downsampling are almost mutually exclusive. Variants located inside the pink rectangle are mostly indels, and for now I would like not to consider them. Yet, variants in the blue box are SNPs, and I zoomed in on them:


    Somehow, despite that variants were identified in 25% and 75% downsampled bam, they were not picked up in 50%, 90% and full (100%) bam. According to the vcf file (and to the IGV screenshot of the downsampled bams I posted in the original message) coverage should be sufficient:

    VCF containing passed filter variants for 25% downsampled bam (positions 124664488 and 124664491 from the blue box only):

    2	124664488	.	T	A	.	PASS	AS_FilterStatus=SITE;AS_SB_TABLE=162,89|3,2;DP=257;ECNT=2;GERMQ=93;MBQ=40,25;MFRL=169,208;MMQ=60,60;MPOS=4;NALOD=2.18;NLOD=43.87;POPAF=6.00;TLOD=5.38	GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB	0|1:76,5:0.068:81:38,3:32,0:0|1:124664488_T_A:124664488:47,29,3,2	0|0:175,0:6.694e-03:175:82,0:82,0:0|1:124664488_T_A:124664488:115,60,0,0
    2 124664491 . A G . PASS AS_FilterStatus=SITE;AS_SB_TABLE=161,91|3,2;DP=262;ECNT=2;GERMQ=93;MBQ=38,23;MFRL=169,208;MMQ=60,60;MPOS=1;NALOD=2.17;NLOD=43.56;POPAF=6.00;TLOD=5.34 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:77,5:0.067:82:41,4:34,0:0|1:124664488_T_A:124664488:47,30,3,2 0|0:175,0:6.694e-03:175:82,0:87,0:0|1:124664488_T_A:124664488:114,61,0,0

    VCF containing passed filter variants for 75% downsampled bam (positions 124664488 and 124664491 from the blue box only):

    2	124664488	.	T	A	.	PASS	AS_FilterStatus=SITE;AS_SB_TABLE=53,112|3,4;DP=172;ECNT=2;GERMQ=93;MBQ=39,25;MFRL=175,208;MMQ=60,60;MPOS=4;NALOD=1.92;NLOD=24.53;POPAF=6.00;TLOD=5.13	GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB	0|1:78,6:0.061:84:37,3:32,0:0|1:124664488_T_A:124664488:25,53,3,3	0|0:87,1:0.012:88:42,0:40,1:0|1:124664488_T_A:124664488:28,59,0,1
    2 124664491 . A G . PASS AS_FilterStatus=SITE;AS_SB_TABLE=53,114|3,4;DP=181;ECNT=2;GERMQ=93;MBQ=38,23;MFRL=173,208;MMQ=60,60;MPOS=1;NALOD=1.93;NLOD=24.53;POPAF=6.00;TLOD=5.10 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:79,6:0.060:85:42,5:36,0:0|1:124664488_T_A:124664488:25,54,3,3 0|0:88,1:0.012:89:42,0:46,0:0|1:124664488_T_A:124664488:28,60,0,1


    I also updated the variant calling script (https://github.com/McGranahanLab/mutect2_forceActive_behaviour/blob/master/2_mutect2_downsample.sh ) in the GitHub folder with the FilterMutectCalls command.

    I hope this information helps your investigation!

    Best regards,

    Maria.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Maria,

    Yes, thank you! I took a look at previous examples where users were facing similar issues. There can be issues at certain levels of coverage with the assembly engine, causing variants at some sites to not be called. I'm not particularly familiar of why this occurs, so maybe David Benjamin can comment. 

    I do know some things you could try to fix this issue with the assembly engine. First, I would recommend the options --recover-all-dangling-branches and --linked-de-bruijn-graph. You can try these separately and together. These troubleshooting options are covered in a very helpful article here: When HaplotypeCaller and Mutect2 do not call an expected variant. That article is also just generally helpful to look into why certain variants are called.

    Another Mutect2 option that you may want to look into is to adjust the downsampling in Mutect2. Mutect2 performs downsampling, which can affect which variants are called. If one loci has way too high of coverage, it may be thrown out as a possible error. The option you can move around is --max-reads-per-alignment-start.

    Finally, I would recommend verifying that your downsampling algorithm is downsampling as you want it to and that you aren't splitting up pairs. There is a helpful article on how to downsample with GATK from one of our devs here: Downsampling in GATK

    Here are the other posts I read that might be helpful:

    Let me know if you have further questions and if you find any of these options specifically help!

    Best,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Maria Litovchenko

    Hi Genevieve,

    Thank you for the suggestions! I read the articles and posts you’ve suggested as well as I tried all of the options in various combinations and I’m sorry to report that they unfortunately didn’t resolve my issue.

    First of all, I tried --recover-all-dangling-branches and --linked-de-bruijn-graph solo as well as combined. Here and everywhere below --force-active option is set to true just as in all analysis. I looked directly at created mutect2 bam because if there is a region which for some reason has no coverage just as shown in my original post, there is no way MUTECT2 could call a variant in that region.

    --linked-de-bruijn-graph solo:

    --recover-all-dangling-branches solo:

    --linked-de-bruijn-graph and --recover-all-dangling-branches together:

    As you can see from the screenshots above, these options did not do much of difference and coverage of whole regions present in lower percentage of downsampling is still lost in higher percentage of downsampling.

    I was aware of the --max-reads-per-alignment-start option and it's possible influence at the time I first posted. However, I read before about it and, according to this post by GATK team https://github.com/broadinstitute/gatk/issues/3808  it should only kick in with coverages > 5000, which is not the case for my data. Nonetheless, I tried to run Mutect2 with --max-reads-per-alignment-start set to 0 too:

    The situation unfortunately is not resolved by this option either.

    In one of the posts and articles you pointed out an issue was resolved by using --mitochondria mode which essentially sets --initial-tumor-lod to 0 and --tumor-lod-to-emit to 0. In my analysis I already had --tumor-lod-to-emit set to 0.0, so I Run Mutect2 with --initial-tumor-lod set to 0 and --max-reads-per-alignment-start set to 0 too:

    This also did not lead to resolution.

    In my attempts to fix the issue, I also tried various combinations of --recover-all-dangling-branches, --linked-de-bruijn-graph, --max-reads-per-alignment-start and --initial-tumor-lod options but they all resulted into nothing. I did not post IGV screenshots of my attempts to keep this post brief.

    Thank you pointing out the new DownsampleByDuplicateSet tool. However, I don't think it suits my goal as it is designed to answer to question "what if we have less material for sequencing?". In contrary, my original question was way simpler: "What if we sequence less?". In the same articled it's noticed that one can answer this question via downsampling with Picard tool DownsampleSam or samtools view. It's also noted that both Picard and samtools do take into consideration that reads come in pairs and treat them fairly. Therefore, I concluded that there is no problem with the downsampling with samtools.

    I really appreciate your help in solving this issue! Please let me know if I can provide any other data in addition to those on GitHub.

    Best regards,

    Maria.

     

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Thanks for the update Maria and for providing these screenshots. I'm continuing to look into possible explanations. From the examples you chose, are there any specific coverage levels that you think have real variants and are the most accurate?

    0
    Comment actions Permalink
  • Avatar
    Maria Litovchenko

    Hi Genevieve,

    I would recommend variants chr2:124664488 T|A and chr2:124664491 A|G for a deeper check. They were detected at 25% downsampling and 75% downsampling and even passed the filters, but not at 50% downsampling. Moreover, reads covering that region were completely lost at 90% and 100% Mutect2 bams which in my opinion is the main problem here. Also, one could notice that the coverage of the positions declines in Mutect2 bams with increase in the percentage of downsampling which I also find disturbing because it steadily increases in pre-mutect2 bams.

    IGV view of Mutect2 bams, position 124664488:


    IGV view of Mutect2 bams, position 124664491:


    IGV view of pre-Mutect2 bams, position 124664488:

    Best regards,
    Maria.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Maria,

    Thank you so much for the update and the screenshots! I took a close look at all your information with our developer team to get their thoughts and they looked into the code to see why you are seeing these discrepancies. 

    On the --force-active option:

    In the code, Mutect2 first distinguishes between nonactive and active regions. If --force-active is true, all regions are kept as active. Next, the assembly engine performs local re-assembly and finds all possible haplotypes in the active regions. This would be in all of your regions since you have force active true. If there is no sign of variation present, then the region is not printed to the bamout and these haplotypes are not kept for the final algorithm step: Evaluating the evidence for haplotypes and variant alleles

    Force active only changes whether or not the assembly engine double checks for possible variation at all sites. However, the code that distinguishes between active and nonactive regions is actually quite sensitive so we generally don't find that --force-active helps to find real variants. It's not usually worth the extra compute time.

    We think that the reason you are not seeing any reads in the bamout for some sites even though you are using --force-active is because there is not enough evidence for true variation at those coverage levels.

    Downsampling:

    It doesn't look like there are any issues with your downsampling code. Regarding Mutect2's downsampling, there are three arguments: --max-reads-per-alignment-start (which I mentioned before), --max-suspicious-reads-per-stride, and --downsampling-stride. Suspicious reads are reads with bad mapping quality. We agreed with what you wrote earlier, that changing any of these parameters wouldn't affect your results but I wanted to make sure I mentioned those options.

    Why are some variants only appearing at lower coverage?

    We took a look at the variants that you shared and for the chr2:124664488 T|A variant, it looks like this variant had low enough coverage of the A that it could be from random effects of downsampling. If the downsampling removed enough Ts by chance, then this could arise. At 100% coverage and with all the reads present, the reads supporting A could be below the calling threshold. In addition, there seems to be some strand bias, which could result in the variant not seen as a true variant when there is more read coverage.

    This also seems to be the case with the chr2:124664491 A|G variant. There can be variants that pass read filters but when more evidence is present, they do not hold up. This is just the case with downsampling. There is a chance that these are real variants that do not show up in the 100% coverage sample but they do seem like they are false positives.

    I hope this helps you understand Mutect2 better and please let me know if you have further questions.

    Best,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Maria Litovchenko

    Hi Genevieve,
    Thank you for the detailed response, I think now I understand Mutect2 workflow better! I fully agree with you that some cases of variants' detection at lower percentages of downsampling but not at the higher ones can be caused by downsampling introducing bias and just by chance picking up alternative allele more often than reference.

    However, I would expect this to have very little to zero effect while comparing bam file downsampled to 90% versus not downsampled, 100% bam. Yet, while performing comparison of the variants detected at 90% bam vs 100% bam on complete chr2 I found 24 variants which were detected at 90% bam, but not 100% bam. As previously, I would like to concentrate on SNPs rather than indels as they are easier to debug. Eight out of 24 variants were SNPs. One of the variants in the region chr2:181,134,751-181,135,048 caught my attention (2:181134924_G/A)


    As you can see from IGV screenshot, the same situation as I described in my previous posts takes place: this regions has coverage at 90% Mutect2 bam, however, it has no coverage at 100% Mutect2 bam. According to your previous post that would be due to Mutect2 detecting "no sign of variation present, then the region is not printed to the bamout and these haplotypes are not kept for the final algorithm step". But, if we zoom in we'll see that there is plenty of variation detected at 90% bam: variant at position 2:181134924 G/A has 53 "A" reads and 74 "G" reads, there is a clear 1bp deletion left to it (181134923), there is also a variant at position 181134910 G to T, which also has quite a decent ratio between alternative and reference allele (G:79, T:53). Same goes for the variants at positions 181134909 and 181134907. Somehow, it was enough of variation in order to preserve this region in 90% bam, but not at 100% bam. My question now is why addition of just 10% reads changes so much the behavior of the algorithm? Variable positions I mentioned above would not disappear with the additional 10% of the reads even if all of those reads would bring in reference allele because there is clearly enough reads supporting alternative allele (>50). Another question is why variants at positions 181134909 and 181134907 were not detected at all? I also don't want to bring in the discussion here the argument that mutations at positions 181134907 and 181134909 might later be considered as clustered events because filtering is performed at the later stages and we might consider relaxed filtering as well.

    Because this time I run Mutect2 on the whole chr2, the input files are too big to put them on GitHub, so I uploaded them to my drive:

    Downsampled bams: https://drive.google.com/file/d/1gjX4_I86xWKMZIrvP_C-Qxs1Hf6GJZ3A/view?usp=sharing 

    Reference genome: https://drive.google.com/file/d/1zeV9pCLAOl6kbAAIgA2vQyCI1aMFHKz5/view?usp=sharing 

    Command used to run Mutect2 was:

    downsampleBamDir="downsampled_bams_chr2full/"
    TUMOR_ID='S6'
    GERMLINE_ID='S6_GL'
    fract=90
    COMMON_SUFFIX="chr2full"
    tumorBAM=$downsampleBamDir$TUMOR_ID"."$fract"."$COMMON_SUFFIX".bam"
    glBAM=$downsampleBamDir$GERMLINE_ID"."$COMMON_SUFFIX".bam"

    refGenDir="reference_genome/"
    refGenName="GRCm38_68_chr2.fa"
    PROBES=$refGenDir"probes_chr2.bed"

    gatk Mutect2 --java-options "-Xmx8g -Xms8g -XX:ParallelGCThreads=1" \
    -R $refGenDir$refGenName \
    -L $PROBES \
    -I $tumorBAM -I $glBAM \
    --tumor-sample $TUMOR_ID --normal-sample $GERMLINE_ID \
    --initial-tumor-lod 0.0 --tumor-lod-to-emit 0.0 --force-active true \
    --max-reads-per-alignment-start 0 \
    --linked-de-bruijn-graph \
    -bamout $TUMOR_ID"."$fract".m2."$COMMON_SUFFIX".chr2full.bam" \
    -O $TUMOR_ID"."$fract".m2."$COMMON_SUFFIX".chr2full.vcf"

    Best regards,

    Maria

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Maria, 

    Thanks for the update so that we can continue to get to the bottom of this issue. The variant in the screenshot you shared looks strand biased since the reads are only on the negative strand. Is there something about your data that would cause more strand bias? All the variants we have seen have had some strand bias.

    I was also wondering if you could check these 8 SNPs to see how many pass filtering. If you could also check the other 7 SNPs to see if any are not strand biased, then we can look into the assembly graph closer at those sites.

    Best,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Maria Litovchenko

    Hi Genevieve,

    Thank you for coming back to me so fast! All of the 8 variants I talked about earlier do pass FilterMutectCalls. Here is the corresponding lines from the vcf obtained after FilterMutectCalls:

    2 26484311 . G A . PASS AS_FilterStatus=SITE;AS_SB_TABLE=119,107|2,2;DP=236;ECNT=1;GERMQ=93;MBQ=39,28;MFRL=172,124;MMQ=60,60;MPOS=27;NALOD=1.99;NLOD=28.48;POPAF=6.00;TLOD=6.93 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:113,4:0.031:117:54,4:59,0:60,53,2,2 0/0:113,0:0.010:113:43,0:68,0:59,54,0,0
    2 28679027 . G A . PASS AS_FilterStatus=SITE;AS_SB_TABLE=73,44|2,3;DP=126;ECNT=2;GERMQ=93;MBQ=34,20;MFRL=194,100;MMQ=60,60;MPOS=3;NALOD=1.84;NLOD=18.86;POPAF=6.00;TLOD=9.91 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:52,5:0.103:57:22,0:29,4:0|1:28679027_G_A:28679027:30,22,2,3 0|0:65,0:0.015:65:31,0:32,0:0|1:28679027_G_A:28679027:43,22,0,0
    2 28679029 . TTCAAG T . PASS AS_FilterStatus=SITE;AS_SB_TABLE=74,49|2,3;DP=130;ECNT=2;GERMQ=93;MBQ=38,23;MFRL=194,100;MMQ=60,60;MPOS=2;NALOD=1.84;NLOD=18.87;POPAF=6.00;TLOD=9.86 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:58,5:0.102:63:20,0:26,5:0|1:28679027_G_A:28679027:31,27,2,3 0|0:65,0:0.015:65:31,0:33,0:0|1:28679027_G_A:28679027:43,22,0,0
    2 49946234 . T G . PASS AS_FilterStatus=SITE;AS_SB_TABLE=104,181|1,3;DP=294;ECNT=2;GERMQ=93;MBQ=37,23;MFRL=184,89;MMQ=60,60;MPOS=3;NALOD=2.16;NLOD=44.27;POPAF=6.00;TLOD=4.97 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:128,3:0.032:131:60,2:62,1:0|1:49946225_TGCTTGGA_T:49946225:47,81,1,2 0|0:157,1:7.365e-03:158:65,1:85,0:0|1:49946225_TGCTTGGA_T:49946225:57,100,0,1
    2 65506118 . G A . PASS AS_FilterStatus=SITE;AS_SB_TABLE=127,52|2,2;DP=196;ECNT=2;GERMQ=93;MBQ=39,26;MFRL=178,135;MMQ=60,60;MPOS=4;NALOD=2.02;NLOD=28.25;POPAF=6.00;TLOD=4.85 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:78,4:0.041:82:41,3:37,1:53,25,2,2 0/0:101,0:9.768e-03:101:50,0:51,0:74,27,0,0
    2 130101294 . C A . PASS AS_FilterStatus=SITE;AS_SB_TABLE=391,324|2,1;DP=735;ECNT=1;GERMQ=93;MBQ=39,40;MFRL=184,291;MMQ=60,60;MPOS=34;NALOD=2.50;NLOD=92.85;POPAF=6.00;TLOD=4.26 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:345,3:0.013:348:168,2:176,1:195,150,2,1 0/0:370,0:3.174e-03:370:194,0:173,0:196,174,0,0
    2 173227936 . C T . PASS AS_FilterStatus=SITE;AS_SB_TABLE=16,7|4,1;DP=30;ECNT=1;GERMQ=43;MBQ=41,25;MFRL=214,145;MMQ=60,60;MPOS=4;NALOD=1.18;NLOD=3.90;POPAF=6.00;TLOD=9.22 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:10,5:0.351:15:4,1:6,3:6,4,4,1 0/0:13,0:0.062:13:6,0:7,0:10,3,0,0
    2 181008619 . T C . PASS AS_FilterStatus=SITE;AS_SB_TABLE=8,91|1,5;DP=107;ECNT=2;GERMQ=93;MBQ=40,22;MFRL=198,107;MMQ=60,60;MPOS=1;NALOD=1.79;NLOD=16.53;POPAF=6.00;TLOD=10.01 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:43,6:0.107:49:18,6:21,0:0|1:181008614_CA_C:181008614:5,38,1,5 0|0:56,0:0.016:56:25,0:30,0:0|1:181008614_CA_C:181008614:3,53,0,0
    2 181134924 . G A . PASS AS_FilterStatus=SITE;AS_SB_TABLE=2,24|0,5;DP=33;ECNT=2;GERMQ=48;MBQ=41,24;MFRL=195,159;MMQ=60,60;MPOS=3;NALOD=1.16;NLOD=3.30;POPAF=6.00;TLOD=11.22 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:15,5:0.277:20:8,2:7,2:0|1:181134922_CG_C:181134922:1,14,0,5 0|0:11,0:0.067:11:7,0:4,0:0|1:181134922_CG_C:181134922:1,10,0,0

    I selected the variant I showed previously on IGV because it had the biggest BAF among the 8 variants from above. Also, as you may notice not all of the variants display strand bias, i.e. 130101294, 173227936 and 26484311 don't, while the rest have a varying degree of it.

    I also selected 7 variants without strand bias which do pass the FilterMutectCalls upon your request. I thought that text would be more brief and informative way to present them rather than IGV:

    2 19452953 . G T . PASS AS_FilterStatus=SITE;AS_SB_TABLE=125,334|7,30;DP=522;ECNT=1;GERMQ=93;MBQ=34,38;MFRL=173,162;MMQ=60,60;MPOS=31;NALOD=2.35;NLOD=69.30;POPAF=6.00;TLOD=84.94 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:195,35:0.157:230:102,11:91,23:47,148,7,28 0/0:264,2:4.751e-03:266:140,1:121,1:78,186,0,2
    2 19452953 . G T . PASS AS_FilterStatus=SITE;AS_SB_TABLE=125,334|7,30;DP=522;ECNT=1;GERMQ=93;MBQ=34,38;MFRL=173,162;MMQ=60,60;MPOS=31;NALOD=2.35;NLOD=69.30;POPAF=6.00;TLOD=84.94 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:195,35:0.157:230:102,11:91,23:47,148,7,28 0/0:264,2:4.751e-03:266:140,1:121,1:78,186,0,2
    2 35720098 . A T . PASS AS_FilterStatus=SITE;AS_SB_TABLE=57,66|8,7;DP=138;ECNT=1;GERMQ=93;MBQ=38,30;MFRL=208,156;MMQ=60,60;MPOS=34;NALOD=1.77;NLOD=17.37;POPAF=6.00;TLOD=37.73 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:55,15:0.182:70:24,11:30,4:28,27,8,7 0/0:68,0:0.017:68:32,0:34,0:29,39,0,0
    2 104256229 . C A . PASS AS_FilterStatus=SITE;AS_SB_TABLE=214,343|15,35;DP=624;ECNT=1;GERMQ=93;MBQ=41,35;MFRL=187,192;MMQ=60,60;MPOS=31;NALOD=2.46;NLOD=82.56;POPAF=6.00;TLOD=120.66 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:241,50:0.172:291:111,19:127,30:107,134,15,35 0/0:316,0:3.583e-03:316:172,0:140,0:107,209,0,0
    2 112668039 . A T . PASS AS_FilterStatus=SITE;AS_SB_TABLE=446,357|18,17;DP=865;ECNT=1;GERMQ=93;MBQ=37,36;MFRL=164,173;MMQ=60,60;MPOS=24;NALOD=2.57;NLOD=110.52;POPAF=6.00;TLOD=75.68 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:325,33:0.098:358:175,14:146,19:175,150,17,16 0/0:478,2:2.712e-03:480:258,1:218,0:271,207,1,1
    2 112833882 . A T . PASS AS_FilterStatus=SITE;AS_SB_TABLE=91,133|8,6;DP=258;ECNT=2;GERMQ=93;MBQ=39,32;MFRL=182,186;MMQ=60,60;MPOS=10;NALOD=1.67;NLOD=32.54;POPAF=6.00;TLOD=24.83 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:99,13:0.116:112:40,4:55,8:39,60,7,6 0/0:125,1:0.015:126:67,1:55,0:52,73,1,0
    2 181354075 . G T . PASS AS_FilterStatus=SITE;AS_SB_TABLE=86,75|9,5;DP=192;ECNT=2;GERMQ=93;MBQ=40,33;MFRL=192,183;MMQ=60,60;MPOS=16;NALOD=1.03;NLOD=19.43;POPAF=6.00;TLOD=27.56 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:84,13:0.136:97:44,5:39,7:49,35,8,5 0/0:77,1:0.024:78:36,1:38,0:37,40,1,0


    As for the strand bias of the data itself, my QC analysis did not show a presence of it. Nonetheless, I think that the biggest issue so far is not 'strandedness' of the variants but rather an absence of any coverage of the region under consideration in 100% mutect2 bam, while present in 90% bam, as, if I understood correctly, output to mutect2 bam is performed prior to any variant calling.

    Best regards,
    Maria.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Maria,

    Could you run Mutect2 again with the option --debug-graph-transformation for the 90% and 100% BAMs restricting the analysis to just around the 2:26484311 site? The --debug-graph-transformation option will create many .dot dot plot files and we can use these to more closely examine the assembly engine at this location. Then we will be able to figure out why this site appears in the 90% and not the 100% sample. You can upload these files into a folder to our FTP site: https://gatk.broadinstitute.org/hc/en-us/articles/360035889671. Along with the dot plot files, we would also like to see the bamouts from this run. Please also verify that you are seeing the same results once you have ran with this option. 

    When you said "an absence of any coverage of the region under consideration in 100% mutect2 bam", does that mean there is no coverage in the 100% BAM in general, or just at these sites of interest?

    Please let me know when you have uploaded those files.

    Best,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Maria Litovchenko

    Hi Genevieve,

    I think, there was a little bit of miscommunication above. Eight sites I listed above (26484311, 28679027, 28679029 (Indel actually), 49946234, 65506118, 130101294, 173227936, 181008619 and 181134924) were not found to have a variant at 100% bam, but were found to have a variant at 90% bam. There could be 2 reasons for that: 1) 100% mutect2 bam does have coverage at the site, but still mutect2 does not see a variant at those positions; 2) 100% mutect2 bam does not have coverage. In both cases I imply that 100% original bam (after alignment and post processing) does have a substantial coverage of the site. For 26484311, 65506118 and 181008619 the coverage in 100% mutect2 bam is present which makes your choice of 26484311 as a site to base debug on unfortunate. Rest of the variants (28679027, 28679029, 49946234, 130101294, 173227936, 181134924) do not have coverage in 100% Mutect2 bam. Therefore, I re-run Mutect2 with --debug-graph-transformations option on regions encompassing variants 181134924 and 28679027 (2 separate runs) and uploaded them to the server. I also run Mutect2 on 26484311 just in case.

    Here are screenshots for 181134924 and 28679027 showing that the absence of coverage issue at 100% bam persists:

    As for your question regarding "an absence of any coverage of the region under consideration in 100% mutect2 bam" I meant that coverage is absent for those specific sites. There are variants which are detected in both 90% and 100% Mutect2 bam, and obviously coverage is present for those sites in both 90% and 100% Mutect2 bam.

    Best regards,

    Maria.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Thanks Maria for the clarifications, we'll take a look and get back to you as soon as we can.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Maria,

    We made some changes to the debug graph transformations option since the version you are using so it would help if we can see the graphs from the latest version. Could you upload the files again with version 4.2.2.0? 

    We were seeing some strange behavior in the assembly graph and also we noticed that the 25mer graph was missing. Hopefully the newer graphs will help us solve the issue.

    I had another question: do you know if there are problems with the reference you are using that could cause assembly graph issues? Or potentially Ns and IUPAC bases in the genome?

    Let me know when you have uploaded those files, thank you for helping us look into this with so much detailed information!

    Best,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Maria Litovchenko

    Hi Genevieve,

    I’m glad to hear that we’re getting closer to the resolution of the problem!

    As requested, I run exactly the same code but with GATK v4.2.2.0 and unfortunately I don’t think it will help us, as MUTECT2 bams produced by v4.2.2.0 are empty. I still uploaded the results together with the results from v.4.1.8.1 and all GATK messages printed during the runs in one zip named marialitovchenko_debugGraphTrans.zip. I also tried to upload reference genome and input files I used (only chr2), but unfortunately they were too big.

    I also looked closely at the genome regions of interest and I didn't notice any N symbols or IUPAC ones in them.

    I hope the provided files help!
    Best regards,
    Maria.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Thanks for uploading the files, Maria. And just to confirm, you uploaded all the graphs? You did not see a 25mer graph with this GATK version either?

    0
    Comment actions Permalink
  • Avatar
    Maria Litovchenko

    Yes, I uploaded all of the files produced. There was indeed no 25mer graphs.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Just wanted to give you a heads up that we are still looking into this, thank you for your patience. 

    0
    Comment actions Permalink
  • Avatar
    Maria Litovchenko

    Hello Genevieve,

    Thank you very much for an update!

    Best regards,

    Maria.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    One other update: we have determined that we will need to use your original BAM files to do a full thorough debugging because here are strange things going on in your assembly graphs and bamouts. This analysis will take some time and we are not able to schedule it for a few weeks. We will get back to you in about a month with our conclusions.

    Is this standard Illumina data?

    0
    Comment actions Permalink
  • Avatar
    Maria Litovchenko

    Sure, I'll be waiting for the results of your debug!

    Yes, it's standard Illumina data.

    Best regards,

    Maria.

    0
    Comment actions Permalink
  • Avatar
    Maria Litovchenko

    Hello Genevieve,

    Could you tell me please if there is an update regarding the issue?

    Best regards,

    Maria.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Maria,

    Thank you so much for bringing this back up, Maria. I have checked in with the developer assigned with this issue and they have not been able to take a closer look. Since it is a complicated issue, I want the expert to really be able to dive into the code.

    It is on their radar again and I'm hoping they can look before our office closes for the holiday break (Dec 23). Could you confirm that these downsampled bams are the ones to look at that recreate the issue? Which site should we look at first?

    Thank you,

    Genevieve

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk