Genome Analysis Toolkit

HaplotypeCaller fails at regions of high depth

10 comments

  • Genevieve Brandt (she/her)

    Thank you for your post, ulitskyi! I want to let you know we have received your question. We'll get back to you if we have any updates or follow-up questions.

    Please see our Support Policy for more details about how we prioritize responding to questions. 

  • ulitskyi

    Hi, any news on this issue?

    Thanks,

    Igor.

  • Genevieve Brandt (she/her)

    Thanks for checking in, Igor. We do not have any updates at this time. However, when we do take a closer look with our development team, it would be helpful if we can see specific examples of this issue (IGV screenshots, variants of interest) so we can determine why they were not called. 

    Also, have you seen the article "When HaplotypeCaller and Mutect2 do not call an expected variant"? It is very helpful for troubleshooting this situation.

  • Genevieve Brandt (she/her)

    Hi ulitskyi,

    I noticed that you already uploaded data. We require that you wait until a support team member asks you before you upload data.

    With that being said, I am wondering what happens when you use the default --max-reads-per-alignment-start of 50? I also noticed that you turned off the NotDuplicateReadFilter. Is this because of your targeted sequencing?

    Let me know what you find and we can continue to look into possible solutions.

    Best,

    Genevieve

  • ulitskyi

    Hi Genevieve,

    Yes, I uploaded the data upfront. We don't use NotDuplicateReadFilter, since the data is from targeted multiplex amplicon sequencing. We experimented a lot with different max-reads-per-alignment-start settings. In this case I think 50 works, but it creates problems in other amplicons. In general, it seems that setting different max-reads-per-alignment-start values can sometimes overcome this problem, but it is impossible to pinpoint a particular read that is causing problems. It occurs almost always at regions with high depth. The same SNPs are called very well with UnifiedGenotyper.

    Please advise - Thanks,

    Igor.

  • Genevieve Brandt (she/her)

    Thanks for the update, Igor. I did determine that this is a known issue with HaplotypeCaller: https://github.com/broadinstitute/gatk/issues/7567. However, there hasn't been much discussion of possible solutions on that issue ticket. I will follow up with the developer team and let them know that you are seeing this too.

    Would you also be able to post your example on that GitHub issue thread so that there is more information about the issue you are seeing?

    Thank you,

    Genevieve

  • ulitskyi

    Thank you, Genevieve. I can post the example there. Note that downsampling does not seem to work consistently well as a solution: some regions behave better with downsampling while others become worse. So it would be better to understand the root cause of why the positions with the mutations are not reported at all.

  • Genevieve Brandt (she/her)

    Yes, that is the goal of the issue ticket I shared. They would like to understand why the current strategy of downsampling does not consistently work to call all variants with amplicon data. We don't see this issue with non-amplicon data.

  • Genevieve Brandt (she/her)

    Hi ulitskyi,

    I brought this up with our developer team to see if there was anything else we recommended for amplicon data. They recommended that you fully turn off the downsampler by setting --max-reads-per-alignment-start to 0.

    Then, we were wondering if what you are seeing are borderline calls or easy/obvious calls being dropped when you change the downsampling parameters. If they are borderline calls, then adding more reads could make the evidence turn out differently. However, if they are easy and obvious calls, we have seen the assembly fail after more cycles in the assembly graph. When you add more reads, you add more chances to have a cycle in the assembly graph. 

    There are certain options we would recommend changing for these two different cases, could you let us know which you are seeing in your data?

    We also wanted to note that we added a pileup caller within HaplotypeCaller. To turn it on, use the option --pileup-detection.

    Best,

    Genevieve
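
As a rough sketch of how the options mentioned in this thread fit together on one command line, the Python snippet below assembles a HaplotypeCaller invocation. The file names (ref.fasta, amplicon.bam, targets.interval_list, calls.vcf.gz) and the helper name build_haplotypecaller_cmd are placeholders invented for illustration, and --pileup-detection is only available in GATK releases that include the pileup caller. The snippet only builds and prints the command; it does not run GATK.

```python
import shlex


def build_haplotypecaller_cmd(reference, bam, output_vcf, intervals=None,
                              disable_downsampling=True,
                              keep_duplicates=True,
                              pileup_detection=False):
    """Return a HaplotypeCaller argument list for amplicon-style data.

    All paths are supplied by the caller; nothing here is specific to
    the data discussed in the thread.
    """
    cmd = ["gatk", "HaplotypeCaller",
           "-R", reference,
           "-I", bam,
           "-O", output_vcf]
    if intervals:
        # Restrict calling to the targeted regions.
        cmd += ["-L", intervals]
    if disable_downsampling:
        # 0 switches the per-alignment-start downsampler off entirely.
        cmd += ["--max-reads-per-alignment-start", "0"]
    if keep_duplicates:
        # Amplicon reads share start positions, so duplicates are kept.
        cmd += ["--disable-read-filter", "NotDuplicateReadFilter"]
    if pileup_detection:
        # Turns on the pileup caller inside HaplotypeCaller.
        cmd += ["--pileup-detection"]
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = build_haplotypecaller_cmd("ref.fasta", "amplicon.bam",
                                    "calls.vcf.gz",
                                    intervals="targets.interval_list",
                                    pileup_detection=True)
    print(shlex.join(cmd))
```

Building the command as a list makes it easy to rerun the same region with one option toggled at a time and compare the resulting call sets.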

  • ulitskyi

    Dear Genevieve,

    Thank you for checking. We indeed use "--max-reads-per-alignment-start 0", but that does not help with the occasional miscalls mentioned in this thread. I'm referring to very clear calls, sequenced at high depth, that appear homozygous in IGV, and yet are just not called at all by HaplotypeCaller. I think these are indeed cases where some additional reads cause the assembly to fail. What can be done in this case?

    I will try the pileup caller - Thanks.

    Igor.
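
To keep track of which clearly visible variants go uncalled as parameters change, one option is to compare a list of expected sites against the positions present in each output VCF. The snippet below is a minimal sketch of that check in plain Python. The function names and the example coordinates are hypothetical, and it assumes an ordinary (non-GVCF) VCF in which every record is a called variant site.

```python
import gzip


def called_sites(vcf_path):
    """Return the set of (chrom, pos) pairs that have a record in a VCF."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    sites = set()
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            # Skip header lines and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            chrom, pos = line.split("\t", 2)[:2]
            sites.add((chrom, int(pos)))
    return sites


def missing_sites(expected, vcf_path):
    """Return the expected (chrom, pos) pairs with no record in the VCF."""
    called = called_sites(vcf_path)
    return sorted(site for site in expected if site not in called)


# Example use (hypothetical coordinates and file name):
#   expected = [("chr7", 55191822), ("chr12", 25245350)]
#   print(missing_sites(expected, "calls.vcf.gz"))
```

Running this against the outputs of several parameter settings shows whether the same sites drop out each time or whether the missing set shifts between runs.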

