Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

Option "--dont-use-soft-clipped-bases" does not seem to work in Mutect2

0

11 comments

  • Avatar
    Bhanu Gandham

    Hi janrehker

     

    The purpose of that argument is to exclude soft clipped bases from analysis and seems like that is what is happening here. Can you post a screenshot of the region with --dont-use-soft-clipped-bases false to compare.

    0
    Comment actions Permalink
  • Avatar
    janrehker

    Hi Bhanu Gandham,

    Thanks a bunch for the quick answer and here is the screenshot with --dont-use-soft-clipped-bases false:

     

     

    The reason for my concern is the variant in the middle of the screen. As it is sitting inside a primer site (the soft clipped reads in the first screenshot), realigning soft clipped reads should affect the allelic fraction (AF) of this variant by diluting this variant by the ~100% acurate reference matching primer sequences.

    I am expecting around 22-23% AF (close to half of the sequence on this position are soft clipped primers), but I get around 13% with GATK. This number is expected when I do not softclip my reads on the primer site which led me to the assumption that the parameter stayed inactive.

    Here is now what I get for this variant with --dont-use-soft-clipped-bases true (second screenshot):

    chr19 10602835 . G C . . DP=1260;ECNT=2;MBQ=20,20;MFRL=134,159;MMQ=60,60;MPOS=30;PON;POPAF=4.49;TLOD=309.75 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:1061,168:0.134:1229:8,1:1045,165:570,491,113,55

    and --dont-use-soft-clipped-bases false

    chr19 10602835 . G C . . DP=1260;ECNT=2;MBQ=20,20;MFRL=134,162;MMQ=60,60;MPOS=30;PON;POPAF=4.49;TLOD=318.98 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:1061,168:0.135:1229:8,1:1045,165:569,492,113,55

     

    edit: I am able to spot softclipped reads from the original file that have been realigned in both bam-output files from GATK when switching the option or off.

     

    Best regards,

    Jan

    0
    Comment actions Permalink
  • Avatar
    Bhanu Gandham

    Hi,

     

    1. can you please confirm if the soft clipped reads in the bamout cover the variant sites? 
    2. We think that the reason the AF doesn't change is because all the evidence of C(alt) allele is from reads other than the soft clipped reads.Which is why with/without --dont-use-soft-clipped-bases argument is giving the same results. If you see the IGV screenshot of the soft clipped bases, they support G allele not C.
    0
    Comment actions Permalink
  • Avatar
    janrehker

    Thanks for your reply.

     

    1. Yes, the variant is indeed centered on all the screenshots and the soft clipped bases are primers that match the reference sequence.

    2. AF is (or should be) calculated as (alternate) / (alternate + reference). If you remove half of the reference reads by softclipping them (and this is the whole point of the soft clipping procedure), the denominator and AF should change, when the number of alternate reads stays the same. If I take the same data and do not apply softclipping at all before running GATK, I still get 13-14% AF with Mutect2.

     

    So I guess, despite setting the option to false, GATK will still try to realign the soft clipped bases and take them into account for variant calling and analysis?

    It seems I am still struggeling to get my point accross..
    The thing is that I actually don't want this soft clipped part of the sequence to be used, no matter if it aligns correctly. Neither for analysis of alternate reads, nor for reference. It is artificial sequence. Is there an option to prohibit realignment or calling? Or maybe a more drastic measure to adjust the basequality of primers to 0? The latter would be a dirty hack and I would have to construct two alignments that would have to be called seperately. Hardclipping would be an option but also change mapping quality and so I am trying to avoid this.

    While realignment of primers totaly makes sense for indel calling it actually dilutes the AF on SNPs and indels in primer regions that are also covered by overlapping valid sequence. There has been a paper about this issue (PMID: 28484262). They used samtools pilup for variant calling which does not realign the soft clipped primers. When I run freebayes on my soft clipped sample I also get the expected VAF of around 23%. If I run it on the sample before I soft clip it, I get 13%, similar to the GATK result.

     

    Best regards,

    Jan

    0
    Comment actions Permalink
  • Avatar
    Bhanu Gandham

    Hi janrehker

     

    We looked at the code again and we think that the reason you are seeing no difference with the --dont-use-soft-clipped-bases argument on or off is because those reads are not used in the realignment either way due to other reasons such as base qual, mapq, etc. 

    0
    Comment actions Permalink
  • Avatar
    janrehker

    Hi Bhanu Gandham,

    Thanks a bunch for digging into this, I really appreciate it and would like to give a small update.

    I extracted all the soft clipped reads on this specific primer position (Pic.1). I then passed this file to GATK, switching --dont-use-soft-clipped-bases to false. Pic.2 shows the bam output from GATK properly aligning the reads. Alignment looks good with mapping quality usually 60, base quality in the primer region between 31-45.

    I can also call a different variant inside the primer region.

    chr19 10602815 . A G . . AS_SB_TABLE=236,594|1,1;DP=832;ECNT=3;MBQ=20,20;MFRL=138,103;MMQ=60,60;MPOS=21;POPAF=4.49;TLOD=4.26 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:830,2:3.306e-03:832:0,0:818,2:236,594,1,1

    Raw coverage in this region is around 900, so I would expect most of the reads to be OK. I also get two variants inside the sample DNA, one of them is 12 bp away from the primer site. Maybe this first variant is just to close to the end of the reads?

    I am going to try again by just masking the Primers with N's, so I can be sure they cannot be realigned. If I then still get the same numbers and AF like in my first approach, I will be certain that those reads were never taken into assumption (at least for this first variant).

     

    Once again, Thanks for your time!

     

    Jan

     

    Pic.1

     

    Pic.2

    0
    Comment actions Permalink
  • Avatar
    janrehker

    After making absolutely sure to remove the primers from the equation by N-masking them I compared to just soft clipping combined with employing the <--dont-use-soft-clipped-bases true> option as shown above, I now get the expected VAF.

     

    N-masked:

    chr19 10602835 . G C . . AS_SB_TABLE=101,385|37,103;DP=632;ECNT=4;MBQ=31,20;MFRL=169,137;MMQ=60,60;MPOS=35;PON;POPAF=4.49;TLOD=362.49 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:486,140:0.211:626:0,0:482,138:101,385,37,103

     

    soft clipping + --dont-use-soft-clipped-bases true

    (chr19 10602835 . G C . . DP=1260;ECNT=2;MBQ=20,20;MFRL=134,159;MMQ=60,60;MPOS=30;PON;POPAF=4.49;TLOD=309.75 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:1061,168:0.134:1229:8,1:1045,165:570,491,113,55)

     

    Therefore calling on two bam-files (N-masked/hard clipped versus non altered) and then merging calls from inside and outside of primer regions might be the best strategy here. Not sure why the --dont-use-soft-clipped-bases option did not work as intended for me though.

     

    Anyways, thanks for reading about my struggle with GATK and variant calling in general! ;-)

     

    Sincerely,

    Jan

    0
    Comment actions Permalink
  • Avatar
    Bhanu Gandham

    HI janrehker

     

    A few questions for you:

    1. is this single-end reads data?
    2. do the reads have tlen field in the bam file? This is to determine if these reads have weird fragment lengths. This could be one reason why these reads are being filtered out by Mutect2.
    3. are reads flagged as properly paired in the sam file?
    4. Can you please try to run with latest version of Mutect2 v4.1.7.0 and see if the issue persist?
    5. In addition to answers to questions above, can you also please provide a bam snippet of this region. You can find info about how to share data with us here: https://gatk.zendesk.com/hc/en-us/articles/360035889671
    0
    Comment actions Permalink
  • Avatar
    janrehker

    Hello Bhanu Gandham,

    1. This is paired end data

    2. Short survey by eyeballing shows +/- 30-230 bp in the length field of the SAM records. Removing them from the analysis is actually, what I want. But this happens only by N-masking the primers, not by just soft clipping combined with any setting for --dont-use-soft-clipped-bases (true /  false).

    3. Once again a quick survey: Most of them have 83 or163 as samflag which should be properly paired according to https://broadinstitute.github.io/picard/explain-flags.html

    4. I ran the file on GATK Mutect2 v4.1.7.0 and the issue still persists. As those are Primers which naturally have the same starting point, I also tried setting the --max-reads-per-alignment-start to 10000000 but this did not help either.

    5. bam snippet including command line, message log and resulting vcf (can be replicated with the snippet) is in the uploaded zip file posts_360067732252.zip - The relevant position would be chr19 10602835

     

    Probably I am missing something really obvious here, so once again a big Thank you in advance.

     

    Best,

    Jan

    0
    Comment actions Permalink
  • Avatar
    Bhanu Gandham

    Hi janrehker

     

    We finally found the bug!!! Thank you for your patience:)

    We are working on it and you can follow its progress here: https://github.com/broadinstitute/gatk/issues/6686

    Thanks for bringing this to our attention.

    0
    Comment actions Permalink
  • Avatar
    janrehker

    Awesome, thanks a bunch!

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk