Mutect2 issue with --force-active: not all regions are active
AnsweredHello GATK team,
I've been working on detection of tumor somatic variants on mouse WES data (tumor – germline pairs) and noticed an unexpected behaviour of Mutect2. Originally, I was asked by the collaborators to perform a read depth check of the samples and see if saturation was reached and samples' read depth could be reduced to cut the costs. To answer that question I sequentially downsampled tumor bam to 25%, 50%, 75% and 90% using samtools. By sequential downsampling I mean that I first downsampled full bam to 90% and then I downsampled obtained 90% bam to 83.3% which would result into 75% coverage from 100% bam. This was done to avoid a situation then some reads, just by chance are present in, for example, 50% downsampled bam but are not present in 75% bam. Next, I run Mutect2 on all of my downsampled bams. Knowing that Mutect2 can be rather conservative, I used set up arguments –tumor-lod-to-emit to 0.0 and --force-active to true. I expected that sets of mutations detected in downsampled bams will be nested, i.e. all variants detected in 25% bam will be found in 50% bam, all variants detected in 50% bam will be found in 75% bam, etc. However, this is not what I saw during manual exception of raw Mutect2 calls:
if one zoomes in to chr2:124,659,282-124,662,971:
There are some variants which are detected in 50% downsampled bam, but not in 100% bam! This situation can also be found in relation to other percentages of downsampling. Consequently I examined mutect2 bams:
So, variants were not detected because it seems that some of regions which have significant coverage in 75% mutect2 bam (or other % of downsampling) are not covered at all at 100% mutect2 bam. This is especially unexpected taking into account that I set --force-active to true. Basically, --force-active true somehow does not mean that all regions will be set to active in contrary to the option description. One may argue that the issue stems from downsampling and that reads covering those regions just were not included, but it's not true as we can see from IGV view of downsampled bams before Mutect2:
Just to clarify, my IGV was set not to downsample bams while loading.
Since I have WES I naturally used -L option in Mutect2 and gave it probes locations. I decided to try to run Mutect2 without this option:
As you can see, the situation somewhat improved, but also not, as regions with coverage at lower percentage of downsampling yet with absence of it at higher percentage still exist:
I decided to track the destiny of some reads to double check that they indeed disappear from higher % of downsampling mutect2 bam. I took a read named
A01065:54:HMHGLDRXX:1:1175:8169:3583 which maps originally with CIGAR string 98M3S (in input downsampled bams) to 124664309. This read is present in all downsampled bams:
Then I checked if it's present in mutect2 bams:
It was only found in 50% downsampled bam and 75% downsampled bam, but not in the rest!
Could you please help me to resolve the issues outlined above? It's especially important because if my estimations are correct and if this behaviour is indeed erroneous it could be that we only detect about 50% of somatic variants from what we could have detected.
In order to ease out the information transfer and possible error correction I created github folder which provides a fully reproducible example of the reported behaviour: https://github.com/McGranahanLab/mutect2_forceActive_behaviour . It contains both input data and well documented code.
Short summary of the used software:
- GATK v4.1.8.1
- openjdk version 1.8.0_265
- Preprocessing: reads were trimmed with cutadapt to minimum quality of 20 and minimum length of 20, then mapped with BWA mem, duplicates were removed with sambamba, BaseRecalibrator followed by ApplyBQSR was applied.
- Mutect2 command: gatk Mutect2 --java-options "-Xmx4g -Xms4g -XX:ParallelGCThreads=1" \
-R $refGenDir$refGenName -L $PROBES \
-I $tumorBAM -I $glBAM \
--tumor-sample $TUMOR_ID --normal-sample $GERMLINE_ID \
--tumor-lod-to-emit 0.0 --force-active true \
-bamout $MUTECT2_DIR'/'$TUMOR_ID"."$fract".m2.bam" \
-O $MUTECT2_DIR'/'$TUMOR_ID"."$fract".m2.vcf"
-
Thank you for this detailed explanation! We are looking into this closely to determine what is causing this pattern you are seeing.
Could you give more details regarding how you downsampled your reads and in what step of your pipeline?
It would also be helpful to see these comparison variants after FilterMutectCalls. The raw output of Mutect2 contains many false positives, and so I am wondering if these variants would be filtered or not.
Best,
Genevieve
-
Hi Genevieve,
Thank you for coming back to me!
Regarding the downsampling, I posted the code in the GitHub folder mentioned in the original message: https://github.com/McGranahanLab/mutect2_forceActive_behaviour , the script name is 1_downsample_consequent.sh. In my pipeline I downsample the bam file just after ApplyBQSR, so the bam is essentially ready for further processing such as variant calling. Shortly, I perform downsampling with samtools view -bs command, with seed fixed at 42. I do it sequentially to avoid non-overlapping between the different percentages of downsampling reads. I think the sequential downsampling is easier to grasp with the code so I also put the code from 1_downsample_consequent.sh below:
#!/bin/bash
# -----------------------------------------------------------------------------
# DESCRIPTION
# Script to perform CONSEQUITIVE downsampling of bam files to 25%, 50%, 75%,
# 90% of total reads. Consequitive downsampling means that to create, for
# example, 75% downsampled bam, we will downsample 90% bam, rather than
# original 100% bam. This techniques removes a factor of random downsampling
# and therefore presence/absence of some variants due to change of sampling or
# not sampling alternative allele. Downsampling is performed with samtools.
#
# HOW TO RUN: chmod +x 1_downsample_consequent.sh
# ./1_downsample_consequent.sh
#
# INPUT FILE: S6.bam
# INPUT VALUES: FRACT - fractions (percentages) to downsample bam to and SEED
# seed to be used in samtools
#
# OUTPUT FILE(S): folder downsampled_bams will be created. It will contain
# downsampled bam files. Names fill follow pattern:
# S6.downsampling_percentage.bam
#
# ATTENTION: MAKE SURE SAMTOOLS v1.9 IS IN YOUR PATH
# -----------------------------------------------------------------------------
# -----------------------------------------------------------------------------
# LOAD ALL NESSECARY SOFTWARE
# -----------------------------------------------------------------------------
# MAKE SURE SAMTOOLS v1.9 IS IN YOUR PATH
# -----------------------------------------------------------------------------
# Inputs
# -----------------------------------------------------------------------------
BASE_DIR=`pwd`
FULL_BAM='S6.bam'
# fractions we want to downsample to. Always in descending order! Always
# include 100 as first element
FRACT=( 100 90 75 50 25 )
# seed to be used in samtools
SEED=42
# -----------------------------------------------------------------------------
# Outputs
# -----------------------------------------------------------------------------
DOWNSAMP_DIR=$BASE_DIR"/downsampled_bams/"
mkdir -p $DOWNSAMP_DIR
# -----------------------------------------------------------------------------
# Perform CONSEQUITIVE downsampling
# -----------------------------------------------------------------------------
sample=`basename $FULL_BAM`
sample=`echo $sample | sed 's/.bam//g'`
for i in "${!FRACT[@]}"; do
if [ "$i" -gt 0 ]; then
fract="${FRACT[$i]}"
# calculate updated fraction: a % which is needed to take from bam
# from the previous step in order to achieve original desired fraction
prevIdx=$(expr $i - 1)
prevFract="${FRACT[$prevIdx]}"
updFract=$(expr "${FRACT[$i]}" \* 100 / $prevFract)
updFract=$(bc -l <<<$updFract)
# get bam to downsample from: either original, or
if [ "$i" -eq 1 ]; then
bamToDowns=$FULL_BAM
else
bamToDowns=$DOWNSAMP_DIR$sample"."$prevFract".bam"
fi
downsampledBAM=$DOWNSAMP_DIR$sample"."$fract".bam"
CURR_TIME=`date`
echo '[' $CURR_TIME '] Started downsampling to '$fract' from '$bamToDowns
echo '[' $CURR_TIME '] Updated fraction: '$updFract
samtools view -bs $SEED"."$updFract $bamToDowns > $downsampledBAM
samtools index $downsampledBAM
CURR_TIME=`date`
echo '[' $CURR_TIME '] Finished downsampling to '$fract' from '$bamToDowns
fi
done
# copy full bam to downsampling folder to have everything in one place
cp $FULL_BAM $DOWNSAMP_DIR$sample'.100.bam'
samtools index $DOWNSAMP_DIR$sample'.100.bam'Also to improve the reproducibility of the issue I uploaded my original downsampled bam files here: https://github.com/McGranahanLab/mutect2_forceActive_behaviour/tree/master/downsampled_bams .
As for the filtering, here things get even more interesting. The command to perform filtering was:
gatk FilterMutectCalls --java-options "-Xmx4g -Xms4g -XX:ParallelGCThreads=1" \
--reference $refGenDir$refGenName \
-V $MUTECT2_DIR'/'$TUMOR_ID"."$fract".m2.vcf" \
-O $MUTECT2_DIR'/'$TUMOR_ID"."$fract".m2.filt.vcf"and the command to extract passed variants:
grep '^#' $MUTECT2_DIR'/'$TUMOR_ID"."$fract".m2.filt.vcf" > $MUTECT2_DIR'/'$TUMOR_ID"."$fract".m2.passFilt.vcf"
grep -w PASS $MUTECT2_DIR'/'$TUMOR_ID"."$fract".m2.filt.vcf" | grep -v '^#' >> $MUTECT2_DIR'/'$TUMOR_ID"."$fract".m2.passFilt.vcf"Next, I uploaded vcf files containing variants which passed the filters into IGV:
As you can see, sets of passed filter variants between different percentages of downsampling are almost mutually exclusive. Variants located inside the pink rectangle are mostly indels, and for now I would like not to consider them. Yet, variants in the blue box are SNPs, and I zoomed in on them:
Somehow, despite that variants were identified in 25% and 75% downsampled bam, they were not picked up in 50%, 90% and full (100%) bam. According to the vcf file (and to the IGV screenshot of the downsampled bams I posted in the original message) coverage should be sufficient:VCF containing passed filter variants for 25% downsampled bam (positions 124664488 and 124664491 from the blue box only):
2 124664488 . T A . PASS AS_FilterStatus=SITE;AS_SB_TABLE=162,89|3,2;DP=257;ECNT=2;GERMQ=93;MBQ=40,25;MFRL=169,208;MMQ=60,60;MPOS=4;NALOD=2.18;NLOD=43.87;POPAF=6.00;TLOD=5.38 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:76,5:0.068:81:38,3:32,0:0|1:124664488_T_A:124664488:47,29,3,2 0|0:175,0:6.694e-03:175:82,0:82,0:0|1:124664488_T_A:124664488:115,60,0,0
2 124664491 . A G . PASS AS_FilterStatus=SITE;AS_SB_TABLE=161,91|3,2;DP=262;ECNT=2;GERMQ=93;MBQ=38,23;MFRL=169,208;MMQ=60,60;MPOS=1;NALOD=2.17;NLOD=43.56;POPAF=6.00;TLOD=5.34 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:77,5:0.067:82:41,4:34,0:0|1:124664488_T_A:124664488:47,30,3,2 0|0:175,0:6.694e-03:175:82,0:87,0:0|1:124664488_T_A:124664488:114,61,0,0VCF containing passed filter variants for 75% downsampled bam (positions 124664488 and 124664491 from the blue box only):
2 124664488 . T A . PASS AS_FilterStatus=SITE;AS_SB_TABLE=53,112|3,4;DP=172;ECNT=2;GERMQ=93;MBQ=39,25;MFRL=175,208;MMQ=60,60;MPOS=4;NALOD=1.92;NLOD=24.53;POPAF=6.00;TLOD=5.13 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:78,6:0.061:84:37,3:32,0:0|1:124664488_T_A:124664488:25,53,3,3 0|0:87,1:0.012:88:42,0:40,1:0|1:124664488_T_A:124664488:28,59,0,1
2 124664491 . A G . PASS AS_FilterStatus=SITE;AS_SB_TABLE=53,114|3,4;DP=181;ECNT=2;GERMQ=93;MBQ=38,23;MFRL=173,208;MMQ=60,60;MPOS=1;NALOD=1.93;NLOD=24.53;POPAF=6.00;TLOD=5.10 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:79,6:0.060:85:42,5:36,0:0|1:124664488_T_A:124664488:25,54,3,3 0|0:88,1:0.012:89:42,0:46,0:0|1:124664488_T_A:124664488:28,60,0,1
I also updated the variant calling script (https://github.com/McGranahanLab/mutect2_forceActive_behaviour/blob/master/2_mutect2_downsample.sh ) in the GitHub folder with the FilterMutectCalls command.I hope this information helps your investigation!
Best regards,
Maria.
-
Hi Maria,
Yes, thank you! I took a look at previous examples where users were facing similar issues. There can be issues at certain levels of coverage with the assembly engine, causing variants at some sites to not be called. I'm not particularly familiar of why this occurs, so maybe David Benjamin can comment.
I do know some things you could try to fix this issue with the assembly engine. First, I would recommend the options --recover-all-dangling-branches and --linked-de-bruijn-graph. You can try these separately and together. These troubleshooting options are covered in a very helpful article here: When HaplotypeCaller and Mutect2 do not call an expected variant. That article is also just generally helpful to look into why certain variants are called.
Another Mutect2 option that you may want to look into is to adjust the downsampling in Mutect2. Mutect2 performs downsampling, which can affect which variants are called. If one loci has way too high of coverage, it may be thrown out as a possible error. The option you can move around is --max-reads-per-alignment-start.
Finally, I would recommend verifying that your downsampling algorithm is downsampling as you want it to and that you aren't splitting up pairs. There is a helpful article on how to downsample with GATK from one of our devs here: Downsampling in GATK.
Here are the other posts I read that might be helpful:
- https://sites.google.com/a/broadinstitute.org/legacy-gatk-forum-discussions/2020-01-07-2019-07-10/24507-Mutect2-repeatedly-not-detecting-somatic-variant-IDH2-R172K-with-solid-read-support-and-5-AF
- https://gatk.broadinstitute.org/hc/en-us/community/posts/360059696811-Mutect2-not-calling-a-4-bp-deletion-in-BRCA1-with-50-AF
- https://www.biostars.org/p/448808/
Let me know if you have further questions and if you find any of these options specifically help!
Best,
Genevieve
-
Hi Genevieve,
Thank you for the suggestions! I read the articles and posts you’ve suggested as well as I tried all of the options in various combinations and I’m sorry to report that they unfortunately didn’t resolve my issue.
First of all, I tried --recover-all-dangling-branches and --linked-de-bruijn-graph solo as well as combined. Here and everywhere below --force-active option is set to true just as in all analysis. I looked directly at created mutect2 bam because if there is a region which for some reason has no coverage just as shown in my original post, there is no way MUTECT2 could call a variant in that region.
--linked-de-bruijn-graph solo:
--recover-all-dangling-branches solo:
--linked-de-bruijn-graph and --recover-all-dangling-branches together:
As you can see from the screenshots above, these options did not do much of difference and coverage of whole regions present in lower percentage of downsampling is still lost in higher percentage of downsampling.
I was aware of the --max-reads-per-alignment-start option and it's possible influence at the time I first posted. However, I read before about it and, according to this post by GATK team https://github.com/broadinstitute/gatk/issues/3808 it should only kick in with coverages > 5000, which is not the case for my data. Nonetheless, I tried to run Mutect2 with --max-reads-per-alignment-start set to 0 too:
The situation unfortunately is not resolved by this option either.
In one of the posts and articles you pointed out an issue was resolved by using --mitochondria mode which essentially sets --initial-tumor-lod to 0 and --tumor-lod-to-emit to 0. In my analysis I already had --tumor-lod-to-emit set to 0.0, so I Run Mutect2 with --initial-tumor-lod set to 0 and --max-reads-per-alignment-start set to 0 too:
This also did not lead to resolution.
In my attempts to fix the issue, I also tried various combinations of --recover-all-dangling-branches, --linked-de-bruijn-graph, --max-reads-per-alignment-start and --initial-tumor-lod options but they all resulted into nothing. I did not post IGV screenshots of my attempts to keep this post brief.
Thank you pointing out the new DownsampleByDuplicateSet tool. However, I don't think it suits my goal as it is designed to answer to question "what if we have less material for sequencing?". In contrary, my original question was way simpler: "What if we sequence less?". In the same articled it's noticed that one can answer this question via downsampling with Picard tool DownsampleSam or samtools view. It's also noted that both Picard and samtools do take into consideration that reads come in pairs and treat them fairly. Therefore, I concluded that there is no problem with the downsampling with samtools.
I really appreciate your help in solving this issue! Please let me know if I can provide any other data in addition to those on GitHub.
Best regards,
Maria.
-
Thanks for the update Maria and for providing these screenshots. I'm continuing to look into possible explanations. From the examples you chose, are there any specific coverage levels that you think have real variants and are the most accurate?
-
Hi Genevieve,
I would recommend variants chr2:124664488 T|A and chr2:124664491 A|G for a deeper check. They were detected at 25% downsampling and 75% downsampling and even passed the filters, but not at 50% downsampling. Moreover, reads covering that region were completely lost at 90% and 100% Mutect2 bams which in my opinion is the main problem here. Also, one could notice that the coverage of the positions declines in Mutect2 bams with increase in the percentage of downsampling which I also find disturbing because it steadily increases in pre-mutect2 bams.
IGV view of Mutect2 bams, position 124664488:
IGV view of Mutect2 bams, position 124664491:
IGV view of pre-Mutect2 bams, position 124664488:Best regards,
Maria. -
Hi Maria,
Thank you so much for the update and the screenshots! I took a close look at all your information with our developer team to get their thoughts and they looked into the code to see why you are seeing these discrepancies.
On the --force-active option:
In the code, Mutect2 first distinguishes between nonactive and active regions. If --force-active is true, all regions are kept as active. Next, the assembly engine performs local re-assembly and finds all possible haplotypes in the active regions. This would be in all of your regions since you have force active true. If there is no sign of variation present, then the region is not printed to the bamout and these haplotypes are not kept for the final algorithm step: Evaluating the evidence for haplotypes and variant alleles.
Force active only changes whether or not the assembly engine double checks for possible variation at all sites. However, the code that distinguishes between active and nonactive regions is actually quite sensitive so we generally don't find that --force-active helps to find real variants. It's not usually worth the extra compute time.
We think that the reason you are not seeing any reads in the bamout for some sites even though you are using --force-active is because there is not enough evidence for true variation at those coverage levels.
Downsampling:
It doesn't look like there are any issues with your downsampling code. Regarding Mutect2's downsampling, there are three arguments: --max-reads-per-alignment-start (which I mentioned before), --max-suspicious-reads-per-stride, and --downsampling-stride. Suspicious reads are reads with bad mapping quality. We agreed with what you wrote earlier, that changing any of these parameters wouldn't affect your results but I wanted to make sure I mentioned those options.
Why are some variants only appearing at lower coverage?
We took a look at the variants that you shared and for the chr2:124664488 T|A variant, it looks like this variant had low enough coverage of the A that it could be from random effects of downsampling. If the downsampling removed enough Ts by chance, then this could arise. At 100% coverage and with all the reads present, the reads supporting A could be below the calling threshold. In addition, there seems to be some strand bias, which could result in the variant not seen as a true variant when there is more read coverage.
This also seems to be the case with the chr2:124664491 A|G variant. There can be variants that pass read filters but when more evidence is present, they do not hold up. This is just the case with downsampling. There is a chance that these are real variants that do not show up in the 100% coverage sample but they do seem like they are false positives.
I hope this helps you understand Mutect2 better and please let me know if you have further questions.
Best,
Genevieve
-
Hi Genevieve,
Thank you for the detailed response, I think now I understand Mutect2 workflow better! I fully agree with you that some cases of variants' detection at lower percentages of downsampling but not at the higher ones can be caused by downsampling introducing bias and just by chance picking up alternative allele more often than reference.However, I would expect this to have very little to zero effect while comparing bam file downsampled to 90% versus not downsampled, 100% bam. Yet, while performing comparison of the variants detected at 90% bam vs 100% bam on complete chr2 I found 24 variants which were detected at 90% bam, but not 100% bam. As previously, I would like to concentrate on SNPs rather than indels as they are easier to debug. Eight out of 24 variants were SNPs. One of the variants in the region chr2:181,134,751-181,135,048 caught my attention (2:181134924_G/A)
As you can see from IGV screenshot, the same situation as I described in my previous posts takes place: this regions has coverage at 90% Mutect2 bam, however, it has no coverage at 100% Mutect2 bam. According to your previous post that would be due to Mutect2 detecting "no sign of variation present, then the region is not printed to the bamout and these haplotypes are not kept for the final algorithm step". But, if we zoom in we'll see that there is plenty of variation detected at 90% bam: variant at position 2:181134924 G/A has 53 "A" reads and 74 "G" reads, there is a clear 1bp deletion left to it (181134923), there is also a variant at position 181134910 G to T, which also has quite a decent ratio between alternative and reference allele (G:79, T:53). Same goes for the variants at positions 181134909 and 181134907. Somehow, it was enough of variation in order to preserve this region in 90% bam, but not at 100% bam. My question now is why addition of just 10% reads changes so much the behavior of the algorithm? Variable positions I mentioned above would not disappear with the additional 10% of the reads even if all of those reads would bring in reference allele because there is clearly enough reads supporting alternative allele (>50). Another question is why variants at positions 181134909 and 181134907 were not detected at all? I also don't want to bring in the discussion here the argument that mutations at positions 181134907 and 181134909 might later be considered as clustered events because filtering is performed at the later stages and we might consider relaxed filtering as well.Because this time I run Mutect2 on the whole chr2, the input files are too big to put them on GitHub, so I uploaded them to my drive:
Downsampled bams: https://drive.google.com/file/d/1gjX4_I86xWKMZIrvP_C-Qxs1Hf6GJZ3A/view?usp=sharing
Reference genome: https://drive.google.com/file/d/1zeV9pCLAOl6kbAAIgA2vQyCI1aMFHKz5/view?usp=sharing
Command used to run Mutect2 was:
downsampleBamDir="downsampled_bams_chr2full/"
TUMOR_ID='S6'
GERMLINE_ID='S6_GL'
fract=90
COMMON_SUFFIX="chr2full"
tumorBAM=$downsampleBamDir$TUMOR_ID"."$fract"."$COMMON_SUFFIX".bam"
glBAM=$downsampleBamDir$GERMLINE_ID"."$COMMON_SUFFIX".bam"
refGenDir="reference_genome/"
refGenName="GRCm38_68_chr2.fa"
PROBES=$refGenDir"probes_chr2.bed"
gatk Mutect2 --java-options "-Xmx8g -Xms8g -XX:ParallelGCThreads=1" \
-R $refGenDir$refGenName \
-L $PROBES \
-I $tumorBAM -I $glBAM \
--tumor-sample $TUMOR_ID --normal-sample $GERMLINE_ID \
--initial-tumor-lod 0.0 --tumor-lod-to-emit 0.0 --force-active true \
--max-reads-per-alignment-start 0 \
--linked-de-bruijn-graph \
-bamout $TUMOR_ID"."$fract".m2."$COMMON_SUFFIX".chr2full.bam" \
-O $TUMOR_ID"."$fract".m2."$COMMON_SUFFIX".chr2full.vcf"Best regards,
Maria
-
Hi Maria,
Thanks for the update so that we can continue to get to the bottom of this issue. The variant in the screenshot you shared looks strand biased since the reads are only on the negative strand. Is there something about your data that would cause more strand bias? All the variants we have seen have had some strand bias.
I was also wondering if you could check these 8 SNPs to see how many pass filtering. If you could also check the other 7 SNPs to see if any are not strand biased, then we can look into the assembly graph closer at those sites.
Best,
Genevieve
-
Hi Genevieve,
Thank you for coming back to me so fast! All of the 8 variants I talked about earlier do pass FilterMutectCalls. Here is the corresponding lines from the vcf obtained after FilterMutectCalls:
2 26484311 . G A . PASS AS_FilterStatus=SITE;AS_SB_TABLE=119,107|2,2;DP=236;ECNT=1;GERMQ=93;MBQ=39,28;MFRL=172,124;MMQ=60,60;MPOS=27;NALOD=1.99;NLOD=28.48;POPAF=6.00;TLOD=6.93 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:113,4:0.031:117:54,4:59,0:60,53,2,2 0/0:113,0:0.010:113:43,0:68,0:59,54,0,0
2 28679027 . G A . PASS AS_FilterStatus=SITE;AS_SB_TABLE=73,44|2,3;DP=126;ECNT=2;GERMQ=93;MBQ=34,20;MFRL=194,100;MMQ=60,60;MPOS=3;NALOD=1.84;NLOD=18.86;POPAF=6.00;TLOD=9.91 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:52,5:0.103:57:22,0:29,4:0|1:28679027_G_A:28679027:30,22,2,3 0|0:65,0:0.015:65:31,0:32,0:0|1:28679027_G_A:28679027:43,22,0,0
2 28679029 . TTCAAG T . PASS AS_FilterStatus=SITE;AS_SB_TABLE=74,49|2,3;DP=130;ECNT=2;GERMQ=93;MBQ=38,23;MFRL=194,100;MMQ=60,60;MPOS=2;NALOD=1.84;NLOD=18.87;POPAF=6.00;TLOD=9.86 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:58,5:0.102:63:20,0:26,5:0|1:28679027_G_A:28679027:31,27,2,3 0|0:65,0:0.015:65:31,0:33,0:0|1:28679027_G_A:28679027:43,22,0,0
2 49946234 . T G . PASS AS_FilterStatus=SITE;AS_SB_TABLE=104,181|1,3;DP=294;ECNT=2;GERMQ=93;MBQ=37,23;MFRL=184,89;MMQ=60,60;MPOS=3;NALOD=2.16;NLOD=44.27;POPAF=6.00;TLOD=4.97 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:128,3:0.032:131:60,2:62,1:0|1:49946225_TGCTTGGA_T:49946225:47,81,1,2 0|0:157,1:7.365e-03:158:65,1:85,0:0|1:49946225_TGCTTGGA_T:49946225:57,100,0,1
2 65506118 . G A . PASS AS_FilterStatus=SITE;AS_SB_TABLE=127,52|2,2;DP=196;ECNT=2;GERMQ=93;MBQ=39,26;MFRL=178,135;MMQ=60,60;MPOS=4;NALOD=2.02;NLOD=28.25;POPAF=6.00;TLOD=4.85 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:78,4:0.041:82:41,3:37,1:53,25,2,2 0/0:101,0:9.768e-03:101:50,0:51,0:74,27,0,0
2 130101294 . C A . PASS AS_FilterStatus=SITE;AS_SB_TABLE=391,324|2,1;DP=735;ECNT=1;GERMQ=93;MBQ=39,40;MFRL=184,291;MMQ=60,60;MPOS=34;NALOD=2.50;NLOD=92.85;POPAF=6.00;TLOD=4.26 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:345,3:0.013:348:168,2:176,1:195,150,2,1 0/0:370,0:3.174e-03:370:194,0:173,0:196,174,0,0
2 173227936 . C T . PASS AS_FilterStatus=SITE;AS_SB_TABLE=16,7|4,1;DP=30;ECNT=1;GERMQ=43;MBQ=41,25;MFRL=214,145;MMQ=60,60;MPOS=4;NALOD=1.18;NLOD=3.90;POPAF=6.00;TLOD=9.22 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:10,5:0.351:15:4,1:6,3:6,4,4,1 0/0:13,0:0.062:13:6,0:7,0:10,3,0,0
2 181008619 . T C . PASS AS_FilterStatus=SITE;AS_SB_TABLE=8,91|1,5;DP=107;ECNT=2;GERMQ=93;MBQ=40,22;MFRL=198,107;MMQ=60,60;MPOS=1;NALOD=1.79;NLOD=16.53;POPAF=6.00;TLOD=10.01 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:43,6:0.107:49:18,6:21,0:0|1:181008614_CA_C:181008614:5,38,1,5 0|0:56,0:0.016:56:25,0:30,0:0|1:181008614_CA_C:181008614:3,53,0,0
2 181134924 . G A . PASS AS_FilterStatus=SITE;AS_SB_TABLE=2,24|0,5;DP=33;ECNT=2;GERMQ=48;MBQ=41,24;MFRL=195,159;MMQ=60,60;MPOS=3;NALOD=1.16;NLOD=3.30;POPAF=6.00;TLOD=11.22 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:15,5:0.277:20:8,2:7,2:0|1:181134922_CG_C:181134922:1,14,0,5 0|0:11,0:0.067:11:7,0:4,0:0|1:181134922_CG_C:181134922:1,10,0,0I selected the variant I showed previously on IGV because it had the biggest BAF among the 8 variants from above. Also, as you may notice not all of the variants display strand bias, i.e. 130101294, 173227936 and 26484311 don't, while the rest have a varying degree of it.
I also selected 7 variants without strand bias which do pass the FilterMutectCalls upon your request. I thought that text would be more brief and informative way to present them rather than IGV:
2 19452953 . G T . PASS AS_FilterStatus=SITE;AS_SB_TABLE=125,334|7,30;DP=522;ECNT=1;GERMQ=93;MBQ=34,38;MFRL=173,162;MMQ=60,60;MPOS=31;NALOD=2.35;NLOD=69.30;POPAF=6.00;TLOD=84.94 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:195,35:0.157:230:102,11:91,23:47,148,7,28 0/0:264,2:4.751e-03:266:140,1:121,1:78,186,0,2
2 19452953 . G T . PASS AS_FilterStatus=SITE;AS_SB_TABLE=125,334|7,30;DP=522;ECNT=1;GERMQ=93;MBQ=34,38;MFRL=173,162;MMQ=60,60;MPOS=31;NALOD=2.35;NLOD=69.30;POPAF=6.00;TLOD=84.94 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:195,35:0.157:230:102,11:91,23:47,148,7,28 0/0:264,2:4.751e-03:266:140,1:121,1:78,186,0,2
2 35720098 . A T . PASS AS_FilterStatus=SITE;AS_SB_TABLE=57,66|8,7;DP=138;ECNT=1;GERMQ=93;MBQ=38,30;MFRL=208,156;MMQ=60,60;MPOS=34;NALOD=1.77;NLOD=17.37;POPAF=6.00;TLOD=37.73 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:55,15:0.182:70:24,11:30,4:28,27,8,7 0/0:68,0:0.017:68:32,0:34,0:29,39,0,0
2 104256229 . C A . PASS AS_FilterStatus=SITE;AS_SB_TABLE=214,343|15,35;DP=624;ECNT=1;GERMQ=93;MBQ=41,35;MFRL=187,192;MMQ=60,60;MPOS=31;NALOD=2.46;NLOD=82.56;POPAF=6.00;TLOD=120.66 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:241,50:0.172:291:111,19:127,30:107,134,15,35 0/0:316,0:3.583e-03:316:172,0:140,0:107,209,0,0
2 112668039 . A T . PASS AS_FilterStatus=SITE;AS_SB_TABLE=446,357|18,17;DP=865;ECNT=1;GERMQ=93;MBQ=37,36;MFRL=164,173;MMQ=60,60;MPOS=24;NALOD=2.57;NLOD=110.52;POPAF=6.00;TLOD=75.68 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:325,33:0.098:358:175,14:146,19:175,150,17,16 0/0:478,2:2.712e-03:480:258,1:218,0:271,207,1,1
2 112833882 . A T . PASS AS_FilterStatus=SITE;AS_SB_TABLE=91,133|8,6;DP=258;ECNT=2;GERMQ=93;MBQ=39,32;MFRL=182,186;MMQ=60,60;MPOS=10;NALOD=1.67;NLOD=32.54;POPAF=6.00;TLOD=24.83 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:99,13:0.116:112:40,4:55,8:39,60,7,6 0/0:125,1:0.015:126:67,1:55,0:52,73,1,0
2 181354075 . G T . PASS AS_FilterStatus=SITE;AS_SB_TABLE=86,75|9,5;DP=192;ECNT=2;GERMQ=93;MBQ=40,33;MFRL=192,183;MMQ=60,60;MPOS=16;NALOD=1.03;NLOD=19.43;POPAF=6.00;TLOD=27.56 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:84,13:0.136:97:44,5:39,7:49,35,8,5 0/0:77,1:0.024:78:36,1:38,0:37,40,1,0
As for the strand bias of the data itself, my QC analysis did not show a presence of it. Nonetheless, I think that the biggest issue so far is not 'strandedness' of the variants but rather an absence of any coverage of the region under consideration in 100% mutect2 bam, while present in 90% bam, as, if I understood correctly, output to mutect2 bam is performed prior to any variant calling.Best regards,
Maria. -
Hi Maria,
Could you run Mutect2 again with the option --debug-graph-transformation for the 90% and 100% BAMs restricting the analysis to just around the 2:26484311 site? The --debug-graph-transformation option will create many .dot dot plot files and we can use these to more closely examine the assembly engine at this location. Then we will be able to figure out why this site appears in the 90% and not the 100% sample. You can upload these files into a folder to our FTP site: https://gatk.broadinstitute.org/hc/en-us/articles/360035889671. Along with the dot plot files, we would also like to see the bamouts from this run. Please also verify that you are seeing the same results once you have ran with this option.
When you said "an absence of any coverage of the region under consideration in 100% mutect2 bam", does that mean there is no coverage in the 100% BAM in general, or just at these sites of interest?
Please let me know when you have uploaded those files.
Best,
Genevieve
-
Hi Genevieve,
I think, there was a little bit of miscommunication above. Eight sites I listed above (26484311, 28679027, 28679029 (Indel actually), 49946234, 65506118, 130101294, 173227936, 181008619 and 181134924) were not found to have a variant at 100% bam, but were found to have a variant at 90% bam. There could be 2 reasons for that: 1) 100% mutect2 bam does have coverage at the site, but still mutect2 does not see a variant at those positions; 2) 100% mutect2 bam does not have coverage. In both cases I imply that 100% original bam (after alignment and post processing) does have a substantial coverage of the site. For 26484311, 65506118 and 181008619 the coverage in 100% mutect2 bam is present which makes your choice of 26484311 as a site to base debug on unfortunate. Rest of the variants (28679027, 28679029, 49946234, 130101294, 173227936, 181134924) do not have coverage in 100% Mutect2 bam. Therefore, I re-run Mutect2 with --debug-graph-transformations option on regions encompassing variants 181134924 and 28679027 (2 separate runs) and uploaded them to the server. I also run Mutect2 on 26484311 just in case.
Here are screenshots for 181134924 and 28679027 showing that the absence of coverage issue at 100% bam persists:
As for your question regarding "an absence of any coverage of the region under consideration in 100% mutect2 bam" I meant that coverage is absent for those specific sites. There are variants which are detected in both 90% and 100% Mutect2 bam, and obviously coverage is present for those sites in both 90% and 100% Mutect2 bam.
Best regards,
Maria.
-
Thanks Maria for the clarifications, we'll take a look and get back to you as soon as we can.
-
Hi Maria,
We made some changes to the debug graph transformations option since the version you are using so it would help if we can see the graphs from the latest version. Could you upload the files again with version 4.2.2.0?
We were seeing some strange behavior in the assembly graph and also we noticed that the 25mer graph was missing. Hopefully the newer graphs will help us solve the issue.
I had another question: do you know if there are problems with the reference you are using that could cause assembly graph issues? Or potentially Ns and IUPAC bases in the genome?
Let me know when you have uploaded those files, thank you for helping us look into this with so much detailed information!
Best,
Genevieve
-
Hi Genevieve,
I’m glad to hear that we’re getting closer to the resolution of the problem!
As requested, I run exactly the same code but with GATK v4.2.2.0 and unfortunately I don’t think it will help us, as MUTECT2 bams produced by v4.2.2.0 are empty. I still uploaded the results together with the results from v.4.1.8.1 and all GATK messages printed during the runs in one zip named marialitovchenko_debugGraphTrans.zip. I also tried to upload reference genome and input files I used (only chr2), but unfortunately they were too big.
I also looked closely at the genome regions of interest and I didn't notice any N symbols or IUPAC ones in them.
I hope the provided files help!
Best regards,
Maria. -
Thanks for uploading the files, Maria. And just to confirm, you uploaded all the graphs? You did not see a 25mer graph with this GATK version either?
-
Yes, I uploaded all of the files produced. There was indeed no 25mer graphs.
-
Just wanted to give you a heads up that we are still looking into this, thank you for your patience.
-
Hello Genevieve,
Thank you very much for an update!
Best regards,
Maria.
-
One other update: we have determined that we will need to use your original BAM files to do a full thorough debugging because here are strange things going on in your assembly graphs and bamouts. This analysis will take some time and we are not able to schedule it for a few weeks. We will get back to you in about a month with our conclusions.
Is this standard Illumina data?
-
Sure, I'll be waiting for the results of your debug!
Yes, it's standard Illumina data.
Best regards,
Maria.
-
Hello Genevieve,
Could you tell me please if there is an update regarding the issue?
Best regards,
Maria.
-
Hi Maria,
Thank you so much for bringing this back up, Maria. I have checked in with the developer assigned with this issue and they have not been able to take a closer look. Since it is a complicated issue, I want the expert to really be able to dive into the code.
It is on their radar again and I'm hoping they can look before our office closes for the holiday break (Dec 23). Could you confirm that these downsampled bams are the ones to look at that recreate the issue? Which site should we look at first?
Thank you,
Genevieve
-
Hi Genevieve,
Sorry for the late reply! Yes, those bams are original ones which I used in the very first post of this thread and they fit very well for the purpose. The bams are actually very very small to ensure that MUTECT2 runs very fast. They contain just one region: chr2:124,659,282-124,662,971.
Best regards,
Maria.
-
Thank you for your patience while we looked into this. I know it's not ideal to have to wait so long for answers. To find the underlying issue, we had to get the insight of our developer team and give this issue the time it needed. I finally have an update regarding why you are seeing these strange results! The good news is that we don't think there are any major Mutect2 bugs and we think the problem is totally fixable.
In your input BAM on IGV, we saw soft clipped UMI or adapter sequences on the ends of many of your reads. These sequences seem to be causing issues with the assembly engine in Mutect2 and probably also caused issues in your original assembly.
If this is UMI data, we would recommend going back and correcting your UMI preprocessing steps. We have a tool to mark duplicates for UMI data here: UmiAwareMarkDuplicatesWithMateCigar. If these are adapter sequences, you'll want to go back to your adapter trimming step and make sure that you are correctly removing the adapter sequences from your sequencing. You could also just roughly chop off 5 bases from the 5' end of your reads. Removing these UMI/adapter sequences is very important to ensuring you have quality downstream results.
If for some reason you cannot trim these adapter sequences, another alternative would be to run Mutect2 with --dont-use-soft-clipped-bases. This argument was actually broken prior to GATK 4.2.0.0, so you'll need to use at least GATK 4.2.0.0.
We took a look at the variant calls you specified to verify that this adapter trimming problem was actually causing the problems you saw. We saw two new variants that popped up from the 90 to 75 downsampled reads, however, when looking at the input BAM, these variants are not actually supported in the pileups. They seem to be mirages from the adapter sequences that Mutect2 picked up when performing local assembly.
We probably missed this underlying issue because we were mainly looking at the bamouts in IGV. My colleague recommended that the best way to troubleshoot calls is to go back to the input bam in IGV and take a look at the pileup there. It seems like the calls you were finding in the downsampled bams were unfortunately not real signs of variance.
Please let me know if this makes sense to you and if you have questions regarding the solutions we proposed.
Best regards,
Genevieve
Please sign in to leave a comment.
25 comments