Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

Problems in subsampling reads for Mutect2 with --max-reads-per-alignment-start

Answered
1

13 comments

  • Avatar
    Genevieve Brandt (she/her)

    Hi yqiu,

    Thanks for writing into the forum! I hope we can figure out what is going on with this site!

    I'm wondering what other troubleshooting steps you have tried in order to call this site? We have a resource with all our troubleshooting recommendations located here: https://gatk.broadinstitute.org/hc/en-us/articles/360043491652-When-HaplotypeCaller-and-Mutect2-do-not-call-an-expected-variant

    If those ideas don't help out with calling this site, could you share what this site looks like when it is called, for example with --max-reads-per-alignment-start 4000? It would be helpful to see the entire VCF line to be able to see the INFO field as well. I'm also curious if this site passes filtering with FilterMutectCalls.

    Could you also share what this site looks like in the input BAM? For example, an IGV screenshot with the depth.

    Best,

    Genevieve

    1
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi yqiu,

    I discussed this with the developer team to get more specific reasons why you are seeing this variant called at certain values of --max-reads-per-alignment-start and not others. 

    First, we have some more recommendations regarding how to get high quality results with the data you have.

    1. We do not recommend running MarkDuplicates with amplicon data
    2. Use a GATK version newer than 4.2.0.0 and turn off using soft clipped bases with --dont-use-soft-clipped-bases.
    3. Check how many reads are filtered in the GATK program log
    4. What's the DP when you turn off the down sampler?
    5. Check out our troubleshooting document here: https://gatk.broadinstitute.org/hc/en-us/articles/360043491652-When-HaplotypeCaller-and-Mutect2-do-not-call-an-expected-variant

    Second, we think that this variant is being called at some downsampling levels and not others because of assembly issues. At high depths, the assembly graph becomes less like a graph and more like a crazy mess which can mess with the pruning filter. This results in luck in terms of which variants are being called when you are using the downsampler at all with amplicon data.

    Let me know if you have further questions.

    Best,

    Genevieve

    1
    Comment actions Permalink
  • Avatar
    yqiu

    Hi Genevieve,

    Thanks for the prompt response!

    The troubleshooting doc is super helpful. We noticed that turning off downsampling was suggested for amplicon data without UMIs. My understanding is that applying downsampling on amplicon data without UMIs may lead to data loss, which was exactly what we have observed in our data. Could you please elaborate a bit more about the downsampling algorithm? Does the algorithm try to avoid duplicates while downsampling?

    Below is the vcf entry that Mutect2 reported when we set "--max-reads-per-alignment-start 4000".

    CHR POS . C G . . AS_SB_TABLE=1337,1243|697,626;DP=3907;ECNT=33;MBQ=38,38;MFRL=121,121;MMQ=60,60;MPOS=29;POPAF=7.30;TLOD=4450.00 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:2580,1323:0.335:3903:2576,1320:0,0:1337,1243,697,626

    Also, here is the IGV screen shot for full bam in the target region.

    • There are two amplicons covering the target region. The target variant is highlighted by an arrow, with >30,000 coverage and VAF=0.33.
    • Please note that IGV was set to show the soft-clipped bases (i.e. amplicon primers in our use case).

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    yqiu Do you have amplicon data without UMIs? I'm confused about what you mean here:

    We noticed that turning off downsampling was suggested for amplicon data without UMIs. My understanding is that applying downsampling on amplicon data without UMIs may lead to data loss, which was exactly what we have observed in our data. Could you please elaborate a bit more about the downsampling algorithm? Does the algorithm try to avoid duplicates while downsampling?

    0
    Comment actions Permalink
  • Avatar
    yqiu

    Sorry for the confusion.

    Yes. All the data we showed are amplicon data without UMIs.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Ok, I see. To get the best results with your data, I would recommend turning off the downsampling argument with Mutect2. 

    I'll try look into the details of the downsampling algorithm but it may take me some time to get the specifics to you.

    0
    Comment actions Permalink
  • Avatar
    yqiu

    Thanks a lot for the suggestion! We will disable downsampling for now as you suggested.

    But it would be great to learn about the reason underlying downsampling though we do understand it is non-trivial. Looking forward to your input and appreciate it.  

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi yqiu,

    The positional downsampler in both HaplotypeCaller and Mutect2 downsamples reads based on the total number of reads that start at a particular base on the reference. Since with amplicon data there are many reads with the same start and end position, most of the amplicon depth will be lost to downsampling. 

    Let me know if you have further questions about this.

    Best,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    yqiu

    Hi Genevieve,

    Thanks a lot for the suggestion and explanation with downsampling. We will disable downsampling for amplicon data. 

    But it still unclear to me why there are reads with --max-reads-per-alignment-start 4000 but not --max-reads-per-alignment-start 12000. In theory, the reads with --max-reads-per-alignment-start 4000 should be a subset of reads with --max-reads-per-alignment-start 12000. But for the position, we see 3,903 reads under 4000 but only 5 reads under 12000. 

    Is it a bug in GATK or something in downsampling I don't get to yet?

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    This does not sound like a bug in GATK since you are able to solve the problem once you turn off downsampling, as we recommend. We don't have the capacity at this moment to further look into what is happening with your specific alignments, since you were able to solve the issue.

    Not necessarily 4000 would be a subset of 12000. This looks like just a strange effect from the amplicon data and the alignment start positions.

    0
    Comment actions Permalink
  • Avatar
    yqiu

    Hi Genevieve,

    Thanks a lot for the prompt reply. 

    Could you help explain a little bit more about why "Not necessarily 4000 would be a subset of 12000"? There might be something about downsampling that I may have missed.

    My understanding is that Mutect2 will look at reads starting at each position and downsample to certain number if number of exceed threshold at the position. If so, 12000 should have at least the same read or more reads starting at each position than 4000. But what we observed is that 12000 has lower coverage than 4000. 

    0
    Comment actions Permalink
  • Avatar
    yqiu

    Hi Genevieve,

    Sorry for the delay in response. Thanks for the suggestion. We have no problems when disable subsampling as you suggested.

    Best,

    Yunjiang

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Great to hear, thanks for the update!

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk