HaplotypeCaller fails at regions of high depth
I'm looking a region from a targeted sequecing run. It is sequenced at high depth (~60K). If only 10,000 or even 40,000 reads per alignment start are used, there is no problem, and the region is called with high confidence. However, if I allow all the reads (which is beneficial at other positions), no call is made at all. Tweaking other parameters doesn't seem to help at all.
The files have been uploaded to ulitsky.for_gatk.zip
REQUIRED for all errors and issues:
a) GATK version used: 4.2.6.1
b) Exact command used:
This doesn't work:
java -Xmx6g -jar ../gatk-4.2.6.1/gatk-package-4.2.6.1-local.jar HaplotypeCaller -R ~/data/human/sequence/hg19.fa -L dummy.bed -G StandardAnnotation --mapping-quality-threshold-for-genotyping 10 -mbq 17 --standard-min-confidence-threshold-for-calling 25 --disable-read-filter NotDuplicateReadFilter --kmer-size 30 --kmer-size 10 --max-reads-per-alignment-start 50000 -I focused.focused.bam -O focused.50000.vcf
This works fine:
java -Xmx6g -jar ../gatk-4.2.6.1/gatk-package-4.2.6.1-local.jar HaplotypeCaller -R ~/data/human/sequence/hg19.fa -L dummy.bed -G StandardAnnotation --mapping-quality-threshold-for-genotyping 10 -mbq 17 --standard-min-confidence-threshold-for-calling 25 --disable-read-filter NotDuplicateReadFilter --kmer-size 30 --kmer-size 10 --max-reads-per-alignment-start 10000 -I focused.focused.bam -O focused.10000.vcf
c) Entire program log:
15:14:47.208 WARN GATKAnnotationPluginDescriptor - Redundant enabled annotation group (StandardAnnotation) is enabled for this tool by default
15:14:47.241 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/labs/ulitsky/igoru/gatk/gatk-4.2.6.1/gatk-package-4.2.6.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
15:14:47.337 INFO HaplotypeCaller - ------------------------------------------------------------
15:14:47.337 INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.2.6.1
15:14:47.337 INFO HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
15:14:47.337 INFO HaplotypeCaller - Executing as igoru@pho.wexac.weizmann.ac.il on Linux v3.10.0-1160.59.1.el7.x86_64 amd64
15:14:47.338 INFO HaplotypeCaller - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_121-b13
15:14:47.338 INFO HaplotypeCaller - Start Date/Time: September 26, 2022 3:14:47 PM IDT
15:14:47.338 INFO HaplotypeCaller - ------------------------------------------------------------
15:14:47.338 INFO HaplotypeCaller - ------------------------------------------------------------
15:14:47.338 INFO HaplotypeCaller - HTSJDK Version: 2.24.1
15:14:47.338 INFO HaplotypeCaller - Picard Version: 2.27.1
15:14:47.338 INFO HaplotypeCaller - Built for Spark Version: 2.4.5
15:14:47.338 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
15:14:47.338 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
15:14:47.338 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
15:14:47.338 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
15:14:47.338 INFO HaplotypeCaller - Deflater: IntelDeflater
15:14:47.338 INFO HaplotypeCaller - Inflater: IntelInflater
15:14:47.339 INFO HaplotypeCaller - GCS max retries/reopens: 20
15:14:47.339 INFO HaplotypeCaller - Requester pays: disabled
15:14:47.339 INFO HaplotypeCaller - Initializing engine
15:14:47.631 INFO FeatureManager - Using codec BEDCodec to read file file:///home/labs/ulitsky/igoru/gatk/agl/dummy.bed
15:14:47.637 INFO IntervalArgumentCollection - Processing 50 bp from intervals
15:14:47.641 INFO HaplotypeCaller - Done initializing engine
15:14:47.647 INFO HaplotypeCallerEngine - Disabling physical phasing, which is supported only for reference-model confidence output
15:14:47.658 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/home/labs/ulitsky/igoru/gatk/gatk-4.2.6.1/gatk-package-4.2.6.1-local.jar!/com/intel/gkl/native/libgkl_utils.so
15:14:47.660 INFO NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/home/labs/ulitsky/igoru/gatk/gatk-4.2.6.1/gatk-package-4.2.6.1-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
15:14:47.691 INFO IntelPairHmm - Using CPU-supported AVX-512 instructions
15:14:47.691 INFO IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
15:14:47.692 INFO IntelPairHmm - Available threads: 40
15:14:47.692 INFO IntelPairHmm - Requested threads: 4
15:14:47.692 INFO PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
15:14:47.710 INFO ProgressMeter - Starting traversal
15:14:47.710 INFO ProgressMeter - Current Locus Elapsed Minutes Regions Processed Regions/Minute
15:14:51.848 INFO HaplotypeCaller - 33 read(s) filtered by: MappingQualityReadFilter
0 read(s) filtered by: MappingQualityAvailableReadFilter
0 read(s) filtered by: MappedReadFilter
0 read(s) filtered by: NotSecondaryAlignmentReadFilter
0 read(s) filtered by: PassesVendorQualityCheckReadFilter
0 read(s) filtered by: NonZeroReferenceLengthAlignmentReadFilter
0 read(s) filtered by: GoodCigarReadFilter
0 read(s) filtered by: WellformedReadFilter
33 total reads filtered
15:14:51.849 INFO ProgressMeter - unmapped 0.1 1 14.5
15:14:51.850 INFO ProgressMeter - Traversal complete. Processed 1 total regions in 0.1 minutes.
15:14:51.854 INFO VectorLoglessPairHMM - Time spent in setup for JNI call : 0.04473419
15:14:51.855 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 0.21058462500000003
15:14:51.856 INFO SmithWatermanAligner - Total compute time in java Smith-Waterman : 0.14 sec
15:14:51.856 INFO HaplotypeCaller - Shutting down engine
[September 26, 2022 3:14:51 PM IDT] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.08 minutes.
Runtime.totalMemory()=2594177024
-
Thank you Genevieve. I can post the example there. Note that downsampling is not really a solution that seems to work consistently well, as some regions behave better with downsampling while others become worse. So it would be better if the root cause of why the positions with the mutations are not reported at all is understood.
-
Thank you for your post, ulitskyi! I want to let you know we have received your question. We'll get back to you if we have any updates or follow up questions.
Please see our Support Policy for more details about how we prioritize responding to questions.
-
Hi, any news on this issue?
Thanks,
Igor.
-
Thanks for checking in, Igor. We do not have any updates at this time. However, when we do take a closer look with our development team, it would be helpful if we can see specific examples of this issue (IGV screenshots, variants of interest) so we can determine why they were not called.
Also, have you seen this article? It is very helpful for troubleshooting this situation. When HaplotypeCaller and Mutect2 do not call an expected variant
-
Hi ulitskyi,
I noticed that you already uploaded data. We require that you wait until a support team member asks you before you upload data.
With that being said, I am wondering what happens when you use the default --max-reads-per-alignment-start of 50? I also noticed that you turned off the NotDuplicateReadFilter. Is this because of your targeted sequencing?
Let me know what you find and we can continue to look into possible solutions.
Best,
Genevieve
-
Hi Genevieve,
Yes, I uploaded the data upfront. We don't use NotDuplicateReadFilter, since the data is for targeted multiplex amplicon sequencing. We experimented a lot with different max-reads-per-alignment-start settings. In this case I think 50 works, but it creates problems in other amplicons. In general, it seems that setting different max-reads-per-alignment-start values can somet doimes overcome this problem, but it is impossible to pinpoint a particular read that is causing problems. It does occur almost always at regions with high depth. The same SNPs are called very well with UnifiedGenotyper.
Please advise - Thanks,
Igor.
-
Thanks Igor for the update. I did determine that this is a known issue with HaplotypeCaller: https://github.com/broadinstitute/gatk/issues/7567. However, there hasn't been much discussion on that issue ticket for any solutions. I will follow up with the developer team and let them know that you are seeing this too.
Would you also be able to post your example on that github issue thread so that there is more information about the issue you are seeing?
Thank you,
Genevieve
-
Yes, that is the goal of the issue ticket I shared. They would like to understand why the current strategy of downsampling does not consistently work to call all variants with amplicon data. We don't see this issue with non-amplicon data.
-
Hi ulitskyi,
I brought this up with our developer team to see if there was anything else we recommended for amplicon data. They recommended that you fully turned off the downsampler off by setting --max-reads-per-alignment-start to 0.
Then, we were wondering if what you are seeing are borderline calls or easy/obvious calls being dropped when you change the downsampling parameters. If they are borderline calls, then adding more reads could make the evidence turn out differently. However, if they are easy and obvious calls, we have seen the assembly fail after more cycles in the assembly graph. When you add more reads, you add more chances to have a cycle in the assembly graph.
There are certain options we would recommend changing for these two different cases, could you let us know which you are seeing in your data?
We also wanted to note that we added a pileup caller within HaplotypeCaller, to turn it on, use the option --pileup-detection.
Best,
Genevieve
-
Dear Genevieve,
Thank you for checking. We indeed use "--max-reads-per-alignment-start 0", but that does not help with the occasional miscalls mentioned in this thread. I'm referring to very clear calls, sequenced at high depth, that appear homozygous in IGV, and yet are just not called at all by HaplotypeCaller. I think these are indeed cases where some additional reads cause the assembly to fail. What can be done in this case?
I will try the pileup caller - Thanks.
Igor.
-
Hi Genevieve,
I've uploaded a detailed bug report with three other cases of the same problem (along with the bam slices and the commands, all in problems_ulitskyi.zip). All of these are obvious calls that should be easily made based on the read inspection in IGV. There are two cases where when a larger number of reads is admitted HC stops making the call, and one case where the reverse happens - the call is not made when as many as 200 reads per alignment start are admitted, but is made when the parameter is raised to 400. setting max-reads-per-alignment-start to 0 causes the first 2 calls to fail, and only the third is made. So there doesn't seem to be a max-reads-per-alignment-start threshold that would consistently allow to call all three of these. Changing other parameters doesn't seem to help. I hope these examples can be used to figure out when HC fails and to fix the bug.
Thanks,
Igor.
-
Hi Genevieve and GATK team,
Just wondering whether you were able to find a solution/recommendations for dealing with ulitskyi's issue? We have had an almost identical problem, where we have fairly high coverage germline amplicon data, ~8-16K depth. For 3 samples we have an 1-bp del variant that should be a very easy and obvious call when reviewed in IGV (and confirmed by orthogonal method), that is not being called by the Haplotype caller. We played around with --max-reads-per-alignment-start and found that we can make a particular variant call if we set it to 1000, but are worried that we may be missing other variants as ulitskyi described here. We are using gatk version 4.2.6.1. Happy to provide bams/logs/details, but our case looks super similar to that of ulitskyi's.
I also note you recommended the --pileup-detection option, but can also see that there is a disclaimer on that saying it is a beta release and not fully tested, so we are reluctant to use this as we do need the calls to be as reliable as possible for a diagnostic setting.
Many thanks in advance! -
Hello Eve C. We generally don't test on and support Amplicon data so your results might vary. However, we do have some general wisdom for debugging some of the obvious issues you might encounter while using Amplicon data.
First make sure to disable MarkDuplicates from your pipeline (as start position based duplicates marking is dangerous with amplicons). Second we generally recommend removing softclipped bases from the input reads with:--dont-use-soft-clipped-bases true
It is true that the PileupCaller is in beta in the version you are using, however in the upcoming release of GATK it is leaving beta and will be producing results equivalent to DRAGEN v3.7.8 for pileup-calling.
As for MaxReadsPerAlignmentStart, you might consider disabling the downsampling entirely (by using the value 0). The standard downsampling based on start position causes problems with amplicon data and is likely to end up over-downsampling your reads. We also have some more general advice to draw upon for trying to debug why variants are missing https://gatk.broadinstitute.org/hc/en-us/articles/360043491652-When-HaplotypeCaller-and-Mutect2-do-not-call-an-expected-variant. -
Thanks James. We've actually tried all of these varying parameter changes, and yet as Eve indicates as well, none of them solve the problem, as at some number of reads (which varies by region/sample) HaplotypeCaller stops calling the variant (or even mentioning it in the debug output). The PileupCaller is indeed calling them, but it seems to have some other issues (false positive calls in noisy regions), which we are trying to figure out.
-
Hi ulitskyi,
If the PileupCaller recovers the missing sensitivity, but introduces false positives, you can always use downstream variant scoring/filtration tools such as CNNScoreVariants or VariantRecalibrator to try to filter out the false positives at a later stage of your pipeline.
Regards,
David
Please sign in to leave a comment.
15 comments