Haplotype calls deletion followed by insertion instead of indel
Hi,
I am using the HaplotypeCaller (GATK 4.4.0.0). When I look at the input BAM file in IGV, I expect the variant NC_000015.9:g.48760182_48760185delinsGGGT
. However, HaplotypeCaller reports NC_000015.9:g.48760182_48760185del
as well as an insertion NC_000015.9:g.48760184_48760185insGGGT
(i.e. two distinct variants instead of a single indel). In the bamout
, one can clearly see that the local realignment suggests the deletion + insertion and not the indel.
I specified my issue in detail previously on GitHub. There, I was told that "the cost for an insertion + deletion is likely to cost less than 4 adjacent SNPs". For my understanding: *If* the cost of 4 adjacent SNPs would be less than insertion + deletion (i.e. the opposite of what seems to be the case), would HaplotypeCaller then call 4 SNPs or a single insertion (i.e. 4 or 1 line in the resulting VCF)?
James also wrote on GitHub: "Fortunately most variant evaluation scripts re-align the haplotypes locally which should exclude this from counting against the tool in most cases." I don't understand at all what is meant by this. Could anyone explains this a little bit better?
Everything is detailed in the GitHub post. I'm happy to copy & paste if you want me to (but I rather have as single source of truth) . Here is the required information as per the forum template:
a) GATK version used: 4.4.0.0
b) Exact command used:
java -jar gatk-package-4.4.0.0-local.jar HaplotypeCaller -R GRCh37.fa -I ./in.bam --output out.vcf -L in.bed -bamout bamout.bam
c) Entire program log:
13:29:09.524 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/rollf/Downloads/gatk-4.4.0.0/gatk-package-4.4.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
13:29:09.562 INFO HaplotypeCaller - ------------------------------------------------------------
13:29:09.564 INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.4.0.0
13:29:09.564 INFO HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
13:29:09.565 INFO HaplotypeCaller - Executing as rollf@heulSUSE on Linux v5.14.21-150400.24.55-default amd64
13:29:09.565 INFO HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v17.0.6+10-LTS
13:29:09.565 INFO HaplotypeCaller - Start Date/Time: April 11, 2023 at 1:29:09 PM CEST
13:29:09.565 INFO HaplotypeCaller - ------------------------------------------------------------
13:29:09.565 INFO HaplotypeCaller - ------------------------------------------------------------
13:29:09.566 INFO HaplotypeCaller - HTSJDK Version: 3.0.5
13:29:09.566 INFO HaplotypeCaller - Picard Version: 3.0.0
13:29:09.566 INFO HaplotypeCaller - Built for Spark Version: 3.3.1
13:29:09.567 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
13:29:09.567 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
13:29:09.567 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
13:29:09.567 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
13:29:09.567 INFO HaplotypeCaller - Deflater: IntelDeflater
13:29:09.568 INFO HaplotypeCaller - Inflater: IntelInflater
13:29:09.568 INFO HaplotypeCaller - GCS max retries/reopens: 20
13:29:09.568 INFO HaplotypeCaller - Requester pays: disabled
13:29:09.569 INFO HaplotypeCaller - Initializing engine
13:29:09.723 INFO FeatureManager - Using codec BEDCodec to read file file:///home/rollf/tickets/OPS-3211/github/in.bed
13:29:09.728 INFO IntervalArgumentCollection - Processing 20240 bp from intervals
13:29:09.733 INFO HaplotypeCaller - Done initializing engine
13:29:09.743 INFO HaplotypeCallerEngine - Disabling physical phasing, which is supported only for reference-model confidence output
13:29:09.775 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/home/rollf/Downloads/gatk-4.4.0.0/gatk-package-4.4.0.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
13:29:09.776 INFO NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/home/rollf/Downloads/gatk-4.4.0.0/gatk-package-4.4.0.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
13:29:09.794 INFO IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
13:29:09.794 INFO IntelPairHmm - Available threads: 8
13:29:09.794 INFO IntelPairHmm - Requested threads: 4
13:29:09.794 INFO PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
13:29:09.827 INFO ProgressMeter - Starting traversal
13:29:09.827 INFO ProgressMeter - Current Locus Elapsed Minutes Regions Processed Regions/Minute
13:29:10.335 WARN InbreedingCoeff - InbreedingCoeff will not be calculated at position 15:48760180 and possibly subsequent; at least 10 samples must have called genotypes
13:29:10.377 INFO HaplotypeCaller - 0 read(s) filtered by: MappingQualityReadFilter
0 read(s) filtered by: MappingQualityAvailableReadFilter
0 read(s) filtered by: MappedReadFilter
0 read(s) filtered by: NotSecondaryAlignmentReadFilter
7 read(s) filtered by: NotDuplicateReadFilter
0 read(s) filtered by: PassesVendorQualityCheckReadFilter
0 read(s) filtered by: NonZeroReferenceLengthAlignmentReadFilter
0 read(s) filtered by: GoodCigarReadFilter
0 read(s) filtered by: WellformedReadFilter
7 total reads filtered out of 126 reads processed
13:29:10.378 INFO ProgressMeter - 15:48767430 0.0 69 7527.3
13:29:10.379 INFO ProgressMeter - Traversal complete. Processed 69 total regions in 0.0 minutes.
13:29:10.405 INFO VectorLoglessPairHMM - Time spent in setup for JNI call : 0.001990354
13:29:10.406 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 0.011560668000000001
13:29:10.406 INFO SmithWatermanAligner - Total compute time in java Smith-Waterman : 0.03 sec
13:29:10.483 INFO HaplotypeCaller - Shutting down engine
[April 11, 2023 at 1:29:10 PM CEST] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=327155712
-
Unless I'm misunderstanding, I think this is essentially a representation issue choosing between several different essentially equivalent variants. As you say, if HaplotypeCaller's scoring indicated that 4 adjacent SNPs are more likely than a deletion + insertion than it would potentially be written out that way (either as 4 different variants or 1 single MNP depending on the exact details and parameters.). There are many different possible ways to represent the same variants in a VCF and we do our best to output a coherent one but we may not always pick the ideal version.
What James is saying about local realignment, is that many tools that you would use downstream to compare variants against each other will not look at the vcf's representation in isolation but will attempt to realign them in order to produce a representation which is less representation dependent.
If your downstream analysis needs your variants to be expressed in a specific format I would recommend using a standardization tool which can put it into the format you expect. I don't have a great suggestion for what tools are available but I'm sure there are a number that produce different formats.
-
Hey Louis,
thanks a lot. This does help a lot for my understanding!
Please sign in to leave a comment.
2 comments