GQ=0 upstream of regions with no mapped reads
Hi,
I've come across some puzzling results with HaplotypeCaller, and I cannot figure out if this is intended.
I've noticed that for roughly 10-12bp upstream of regions with no coverage GQ values are zero (despite looking normal). This behaviour is only upstream of missing regions, and not downstream.
Below I've included an examples VCF file displaying this behaviour. From position 28 to position 39 GQ is for some reason zero.
#VCF file:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT test
NC_003143.1 1 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0:1,0:1:45:0,45
NC_003143.1 2 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0:1,0:1:45:0,45
NC_003143.1 3 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0:1,0:1:45:0,45
NC_003143.1 4 . G <NON_REF> . . . GT:AD:DP:GQ:PL 0:1,0:1:45:0,45
NC_003143.1 5 . G <NON_REF> . . . GT:AD:DP:GQ:PL 0:2,0:2:90:0,90
NC_003143.1 6 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0:3,0:3:99:0,135
NC_003143.1 7 . G <NON_REF> . . . GT:AD:DP:GQ:PL 0:3,0:3:99:0,135
NC_003143.1 8 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0:3,0:3:99:0,135
NC_003143.1 9 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0:4,0:4:99:0,180
NC_003143.1 10 . G <NON_REF> . . . GT:AD:DP:GQ:PL 0:4,0:4:99:0,180
NC_003143.1 11 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0:4,0:4:99:0,180
NC_003143.1 12 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0:4,0:4:99:0,180
NC_003143.1 13 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0:4,0:4:99:0,180
NC_003143.1 14 . G <NON_REF> . . . GT:AD:DP:GQ:PL 0:4,0:4:99:0,180
NC_003143.1 15 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0:4,0:4:99:0,180
NC_003143.1 16 . G <NON_REF> . . . GT:AD:DP:GQ:PL 0:4,0:4:99:0,180
NC_003143.1 17 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0:4,0:4:99:0,180
NC_003143.1 18 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0:4,0:4:99:0,180
NC_003143.1 19 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0:4,0:4:99:0,180
NC_003143.1 20 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0:4,0:4:99:0,180
NC_003143.1 21 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0:4,0:4:99:0,180
NC_003143.1 22 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0:4,0:4:99:0,180
NC_003143.1 23 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0:4,0:4:99:0,180
NC_003143.1 24 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0:4,0:4:99:0,180
NC_003143.1 25 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0:4,0:4:99:0,180
NC_003143.1 26 . G <NON_REF> . . . GT:AD:DP:GQ:PL 0:4,0:4:99:0,180
NC_003143.1 27 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0:4,0:4:99:0,135
NC_003143.1 28 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0:4,0:4:0:0,0
NC_003143.1 29 . G <NON_REF> . . . GT:AD:DP:GQ:PL 0:4,0:4:0:0,0
NC_003143.1 30 . G <NON_REF> . . . GT:AD:DP:GQ:PL 0:4,0:4:0:0,0
NC_003143.1 31 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0:4,0:4:0:0,0
NC_003143.1 32 . G <NON_REF> . . . GT:AD:DP:GQ:PL 0:4,0:4:0:0,0
NC_003143.1 33 . G <NON_REF> . . . GT:AD:DP:GQ:PL 0:4,0:4:0:0,0
NC_003143.1 34 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0:3,1:4:0:0,0
NC_003143.1 35 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0:4,0:4:0:0,0
NC_003143.1 36 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0:4,0:4:0:0,0
NC_003143.1 37 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0:3,0:3:0:0,0
NC_003143.1 38 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0:3,0:3:0:0,0
NC_003143.1 39 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0:1,0:1:0:0,0
NC_003143.1 40 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0:0,0:0:0:0,0
NC_003143.1 41 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0:0,0:0:0:0,0
NC_003143.1 42 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0:0,0:0:0:0,0
NC_003143.1 43 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0:0,0:0:0:0,0
NC_003143.1 44 . G <NON_REF> . . . GT:AD:DP:GQ:PL 0:0,0:0:0:0,0
NC_003143.1 45 . G <NON_REF> . . . GT:AD:DP:GQ:PL 0:0,0:0:0:0,0
NC_003143.1 46 . G <NON_REF> . . . GT:AD:DP:GQ:PL 0:0,0:0:0:0,0
NC_003143.1 47 . G <NON_REF> . . . GT:AD:DP:GQ:PL 0:0,0:0:0:0,0
NC_003143.1 48 . G <NON_REF> . . . GT:AD:DP:GQ:PL 0:0,0:0:0:0,0
NC_003143.1 49 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0:0,0:0:0:0,0
NC_003143.1 50 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0:0,0:0:0:0,0
NC_003143.1 51 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0:0,0:0:0:0,0
NC_003143.1 52 . G <NON_REF> . . . GT:AD:DP:GQ:PL 0:0,0:0:0:0,0
NC_003143.1 53 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0:0,0:0:0:0,0
NC_003143.1 54 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0:0,0:0:0:0,0
NC_003143.1 55 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0:0,0:0:0:0,0
NC_003143.1 56 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0:0,0:0:0:0,0
NC_003143.1 57 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0:0,0:0:0:0,0
NC_003143.1 58 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0:0,0:0:0:0,0
NC_003143.1 59 . G <NON_REF> . . . GT:AD:DP:GQ:PL 0:0,0:0:0:0,0
NC_003143.1 60 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0:0,0:0:0:0,0
NC_003143.1 61 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0:0,0:0:0:0,0
########################
REQUIRED for all errors and issues:
a) GATK version used:
gatk-4.3.0.0
b) Exact command used:
$GATK HaplotypeCaller -I ${bam} -O test.vcf.gz -R $ref -ERC BP_RESOLUTION -ploidy 1 -bamout bamout.bam
c) Entire program log:
Using GATK jar /maps/projects/lundbeck/people/cqr376/software/gatk/gatk-4.3.0.0/gatk-package-4.3.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /maps/projects/lundbeck/people/cqr376/software/gatk/gatk-4.3.0.0/gatk-package-4.3.0.0-local.jar HaplotypeCaller -I ../3_mapping/bam/test/test.lib.data.ref.filter.bam -O test.vcf.gz -R ../1_testing/ref.fasta -ERC BP_RESOLUTION -ploidy 1 -bamout bamout.bam
12:52:47.461 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/maps/projects/lundbeck/people/cqr376/software/gatk/gatk-4.3.0.0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
12:52:47.711 INFO HaplotypeCaller - ------------------------------------------------------------
12:52:47.712 INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.3.0.0
12:52:47.712 INFO HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
12:52:47.712 INFO HaplotypeCaller - Executing as cqr376@dandycomp03fl.unicph.domain on Linux v4.18.0-477.27.1.el8_8.x86_64 amd64
12:52:47.712 INFO HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_92-b15
12:52:47.713 INFO HaplotypeCaller - Start Date/Time: October 27, 2023 12:52:47 PM CEST
12:52:47.713 INFO HaplotypeCaller - ------------------------------------------------------------
12:52:47.713 INFO HaplotypeCaller - ------------------------------------------------------------
12:52:47.714 INFO HaplotypeCaller - HTSJDK Version: 3.0.1
12:52:47.714 INFO HaplotypeCaller - Picard Version: 2.27.5
12:52:47.714 INFO HaplotypeCaller - Built for Spark Version: 2.4.5
12:52:47.714 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
12:52:47.715 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
12:52:47.715 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
12:52:47.715 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
12:52:47.715 INFO HaplotypeCaller - Deflater: IntelDeflater
12:52:47.715 INFO HaplotypeCaller - Inflater: IntelInflater
12:52:47.715 INFO HaplotypeCaller - GCS max retries/reopens: 20
12:52:47.715 INFO HaplotypeCaller - Requester pays: disabled
12:52:47.715 INFO HaplotypeCaller - Initializing engine
12:52:48.312 INFO HaplotypeCaller - Done initializing engine
12:52:48.317 INFO HaplotypeCallerEngine - Tool is in reference confidence mode and the annotation, the following changes will be made to any specified annotations: 'StrandBiasBySample' will be enabled. 'ChromosomeCounts', 'FisherStrand', 'StrandOddsRatio' and 'QualByDepth' annotations have been disabled
12:52:48.340 INFO HaplotypeCallerEngine - Currently, physical phasing is only available for diploid samples.
12:52:48.341 INFO HaplotypeCallerEngine - Standard Emitting and Calling confidence set to -0.0 for reference-model confidence output
12:52:48.341 INFO HaplotypeCallerEngine - All sites annotated with PLs forced to true for reference-model confidence output
12:52:48.422 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/maps/projects/lundbeck/people/cqr376/software/gatk/gatk-4.3.0.0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
12:52:48.428 INFO NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/maps/projects/lundbeck/people/cqr376/software/gatk/gatk-4.3.0.0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
12:52:48.459 INFO IntelPairHmm - Using CPU-supported AVX-512 instructions
12:52:48.460 INFO IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
12:52:48.461 INFO IntelPairHmm - Available threads: 96
12:52:48.462 INFO IntelPairHmm - Requested threads: 4
12:52:48.462 INFO PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
12:52:48.506 INFO ProgressMeter - Starting traversal
12:52:48.507 INFO ProgressMeter - Current Locus Elapsed Minutes Regions Processed Regions/Minute
12:52:48.795 INFO HaplotypeCaller - 0 read(s) filtered by: MappingQualityReadFilter
0 read(s) filtered by: MappingQualityAvailableReadFilter
0 read(s) filtered by: MappedReadFilter
0 read(s) filtered by: NotSecondaryAlignmentReadFilter
0 read(s) filtered by: NotDuplicateReadFilter
0 read(s) filtered by: PassesVendorQualityCheckReadFilter
0 read(s) filtered by: NonZeroReferenceLengthAlignmentReadFilter
0 read(s) filtered by: GoodCigarReadFilter
0 read(s) filtered by: WellformedReadFilter
0 total reads filtered
12:52:48.798 INFO ProgressMeter - unmapped 0.0 1 206.9
12:52:48.799 INFO ProgressMeter - Traversal complete. Processed 1 total regions in 0.0 minutes.
12:52:48.810 INFO VectorLoglessPairHMM - Time spent in setup for JNI call : 0.0
12:52:48.811 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 0.0
12:52:48.811 INFO SmithWatermanAligner - Total compute time in java Smith-Waterman : 0.00 sec
12:52:48.819 INFO HaplotypeCaller - Shutting down engine
[October 27, 2023 12:52:48 PM CEST] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=2366636032
-
Does this behavior happen with all your samples or is it specific to a single sample?
Can you provide more details of your bam file, images of the problematic region or maybe share a snippet of your bam file?
-
Hi Gökalp Çelik,
Thanks for your quick reply. It seems to happen across samples (at least in the three I checked).
Here's a link to a very small bam and fasta that should replicate the problem:
https://drive.google.com/drive/folders/1nZ6eewdXvL04OhQ9UnCNZ4ukV3fzNRWx?usp=sharing
I've also attached an IGV overview.
Best,
Frederik
-
Hi Gökalp Çelik,
Did you find time to look at this?
Thanks!
-
The INDEL algorithm of HaplotypeCaller has a behavior with repetitive and high complexity regions which results in GQ=0 when assembly cannot decide for a large spanning indel vs an immediate short indel depending on the final assembly produced. Since HaplotypeCaller engine decides its genotypes by looking at the final assembly but not the actual reads mapped directly this behavior is almost similar to having reads mapping multiple regions equally with MAPQ=0.
The whole region you have seem to have polymers of Ts and Gs surrounded with palindromes or partly repetitive nucleotides it is highly likely that your case ends with GQ=0 due to this particular behavior.
If all your data are small like this I would recommend not using HaplotypeCaller but instead you may want to stick with bcftools mpileup or GATK3 UnifiedGenotyper tools all of which uses the mapped reads directly instead of a graph assembly.
I hope this helps.
Please sign in to leave a comment.
4 comments