Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

HaplotypeCaller fails at regions of high depth

0

15 comments

  • Avatar
    Genevieve Brandt (she/her)

    Thank you for your post, ulitskyi! I want to let you know we have received your question. We'll get back to you if we have any updates or follow up questions. 

    Please see our Support Policy for more details about how we prioritize responding to questions. 

    0
    Comment actions Permalink
  • Avatar
    ulitskyi

    Hi, any news on this issue?

    Thanks,

    Igor.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Thanks for checking in, Igor. We do not have any updates at this time. However, when we do take a closer look with our development team, it would be helpful if we can see specific examples of this issue (IGV screenshots, variants of interest) so we can determine why they were not called. 

    Also, have you seen this article? It is very helpful for troubleshooting this situation. When HaplotypeCaller and Mutect2 do not call an expected variant

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi ulitskyi,

    I noticed that you already uploaded data. We require that you wait until a support team member asks you before you upload data.

    With that being said, I am wondering what happens when you use the default --max-reads-per-alignment-start of 50? I also noticed that you turned off the NotDuplicateReadFilter. Is this because of your targeted sequencing?

    Let me know what you find and we can continue to look into possible solutions.

    Best,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    ulitskyi

    Hi Genevieve,

    Yes, I uploaded the data upfront. We don't use NotDuplicateReadFilter, since the data is for targeted multiplex amplicon sequencing. We experimented a lot with different max-reads-per-alignment-start settings. In this case I think 50 works, but it creates problems in other amplicons. In general, it seems that setting different max-reads-per-alignment-start values can somet doimes overcome this problem, but it is impossible to pinpoint a particular read that is causing problems. It does occur almost always at regions with high depth. The same SNPs are called very well with UnifiedGenotyper. 

    Please advise - Thanks,

    Igor.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Thanks Igor for the update. I did determine that this is a known issue with HaplotypeCaller: https://github.com/broadinstitute/gatk/issues/7567. However, there hasn't been much discussion on that issue ticket for any solutions. I will follow up with the developer team and let them know that you are seeing this too. 

    Would you also be able to post your example on that github issue thread so that there is more information about the issue you are seeing?

    Thank you,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    ulitskyi

    Thank you Genevieve. I can post the example there. Note that downsampling is not really a solution that seems to work consistently well, as some regions behave better with downsampling while others become worse. So it would be better if the root cause of why the positions with the mutations are not reported at all is understood.

    1
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Yes, that is the goal of the issue ticket I shared. They would like to understand why the current strategy of downsampling does not consistently work to call all variants with amplicon data. We don't see this issue with non-amplicon data.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi ulitskyi,

    I brought this up with our developer team to see if there was anything else we recommended for amplicon data. They recommended that you fully turned off the downsampler off by setting --max-reads-per-alignment-start to 0. 

    Then, we were wondering if what you are seeing are borderline calls or easy/obvious calls being dropped when you change the downsampling parameters. If they are borderline calls, then adding more reads could make the evidence turn out differently. However, if they are easy and obvious calls, we have seen the assembly fail after more cycles in the assembly graph. When you add more reads, you add more chances to have a cycle in the assembly graph. 

    There are certain options we would recommend changing for these two different cases, could you let us know which you are seeing in your data?

    We also wanted to note that we added a pileup caller within HaplotypeCaller, to turn it on, use the option --pileup-detection

    Best,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    ulitskyi

    Dear Genevieve,

    Thank you for checking. We indeed use "--max-reads-per-alignment-start 0", but that does not help with the occasional miscalls mentioned in this thread. I'm referring to very clear calls, sequenced at high depth, that appear homozygous in IGV, and yet are just not called at all by HaplotypeCaller. I think these are indeed cases where some additional reads cause the assembly to fail. What can be done in this case?

    I will try the pileup caller - Thanks.

    Igor.

    0
    Comment actions Permalink
  • Avatar
    ulitskyi

    Hi Genevieve,

    I've uploaded a detailed bug report with three other cases of the same problem (along with the bam slices and the commands, all in problems_ulitskyi.zip). All of these are obvious calls that should be easily made based on the read inspection in IGV.  There are two cases where when a larger number of reads is admitted HC stops making the call, and one case where the reverse happens - the call is not made when as many as 200 reads per alignment start are admitted, but is made when the parameter is raised to 400. setting max-reads-per-alignment-start to 0 causes the first 2 calls to fail, and only the third is made. So there doesn't seem to be a max-reads-per-alignment-start threshold that would consistently allow to call all three of these. Changing other parameters doesn't seem to help. I hope these examples can be used to figure out when HC fails and to fix the bug.

    Thanks,

    Igor.

    0
    Comment actions Permalink
  • Avatar
    Eve C
    Hi Genevieve and GATK team, 

    Just wondering whether you were able to find a solution/recommendations for dealing with ulitskyi's issue? We have had an almost identical problem, where we have fairly high coverage germline amplicon data, ~8-16K depth. For 3 samples we have an 1-bp del variant that should be a very easy and obvious call when reviewed in IGV (and confirmed by orthogonal method), that is not being called by the Haplotype caller. We played around with --max-reads-per-alignment-start and found that we can make a particular variant call if we set it to 1000, but are worried that we may be missing other variants as ulitskyi described here. We are using gatk version 4.2.6.1. Happy to provide bams/logs/details, but our case looks super similar to that of ulitskyi's.

    I also note you recommended the --pileup-detection option, but can also see that there is a disclaimer on that saying it is a beta release and not fully tested, so we are reluctant to use this as we do need the calls to be as reliable as possible for a diagnostic setting. 

    Many thanks in advance!
    0
    Comment actions Permalink
  • Avatar
    James Emery

    Hello Eve C. We generally don't test on and support Amplicon data so your results might vary.  However, we do have some general wisdom for debugging some of the obvious issues you might encounter while using Amplicon data.

    First make sure to disable MarkDuplicates from your pipeline (as start position based duplicates marking is dangerous with amplicons). Second we generally recommend removing softclipped bases from the input reads with:

    --dont-use-soft-clipped-bases true

    It is true that the PileupCaller is in beta in the version you are using, however in the upcoming release of GATK it is leaving beta and will be producing results equivalent to DRAGEN v3.7.8 for pileup-calling.

    As for MaxReadsPerAlignmentStart, you might consider disabling the downsampling entirely (by using the value 0). The standard downsampling based on start position causes problems with amplicon data and is likely to end up over-downsampling your reads. We also have some more general advice to draw upon for trying to debug why variants are missing https://gatk.broadinstitute.org/hc/en-us/articles/360043491652-When-HaplotypeCaller-and-Mutect2-do-not-call-an-expected-variant. 

    0
    Comment actions Permalink
  • Avatar
    ulitskyi

    Thanks James. We've actually tried all of these varying parameter changes, and yet as Eve indicates as well, none of them solve the problem, as at some number of reads (which varies by region/sample) HaplotypeCaller stops calling the variant (or even mentioning it in the debug output). The PileupCaller is indeed calling them, but it seems to have some other issues (false positive calls in noisy regions), which we are trying to figure out.

    0
    Comment actions Permalink
  • Avatar
    David Roazen

    Hi ulitskyi,

    If the PileupCaller recovers the missing sensitivity, but introduces false positives, you can always use downstream variant scoring/filtration tools such as CNNScoreVariants or VariantRecalibrator to try to filter out the false positives at a later stage of your pipeline. 

    Regards,

    David

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk