HaplotypeCaller inconsistency with different interval lists
GATK version used: The Genome Analysis Toolkit (GATK) v4.2.0.0-SNAPSHO
Exact command used:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/gatk-package-4.2.0.0-SNAPSHOT-local.jar HaplotypeCaller -I BAM.bam -L all.list -R GRCh37.fa --lenient --output-mode EMIT_ALL_ACTIVE_SITES -ip 100 --sample-ploidy 2 -O OUT.vcf --interval-merging-rule OVERLAPPING_ONLY
Why does Haplotype Caller give different results for the same position with different list provided. In my case I am using two different lists:
- All genome positions (all.list)
- Only chromosome 11 positions (same as in all.list I just use grep "^11:" to get positions)
In the end HaplotypeCaller calls variant with chr11.list but do not calls variant with whole genome list provided. Do you have any recommendations how to make HaplotypeCaller consistent at same positions but with different lists?
-
Are there no variants in the VCF from the all.list HaplotypeCaller command, or no VCF at all? Could you share the complete program log from your HaplotypeCaller command with the all.list?
Thank you,
Genevieve
-
Hi,
Log with chr11.list:06:05:49.339 INFO HaplotypeCaller - ------------------------------------------------------------
06:05:49.339 INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.2.0.0-SNAPSHOT
06:05:49.339 INFO HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
06:05:49.340 INFO HaplotypeCaller - Executing as root on Linux v4.9.0-8-amd64 amd64
06:05:49.340 INFO HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v11.0.11+9-post-Debian-1deb10u1
06:05:49.340 INFO HaplotypeCaller - Start Date/Time: September 10, 2021 at 6:05:49 AM UTC
06:05:49.340 INFO HaplotypeCaller - ------------------------------------------------------------
06:05:49.340 INFO HaplotypeCaller - ------------------------------------------------------------
06:05:49.340 INFO HaplotypeCaller - HTSJDK Version: 2.24.0
06:05:49.340 INFO HaplotypeCaller - Picard Version: 2.25.0
06:05:49.340 INFO HaplotypeCaller - Built for Spark Version: 2.4.5
06:05:49.340 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
06:05:49.340 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
06:05:49.340 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
06:05:49.340 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
06:05:49.341 INFO HaplotypeCaller - Deflater: IntelDeflater
06:05:49.341 INFO HaplotypeCaller - Inflater: IntelInflater
06:05:49.341 INFO HaplotypeCaller - GCS max retries/reopens: 20
06:05:49.341 INFO HaplotypeCaller - Requester pays: disabled
06:05:49.341 INFO HaplotypeCaller - Initializing engine
06:05:49.492 INFO IntervalArgumentCollection - Processing 11669 bp from intervals
06:05:49.495 INFO HaplotypeCaller - Done initializing engine
06:05:49.508 INFO HaplotypeCallerEngine - Disabling physical phasing, which is supported only for reference-model confidence output
06:05:49.516 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/opt/gatk-package-4.2.0.0-SNAPSHOT-local.jar!/com/intel/gkl/native/libgkl_utils.so
06:05:49.517 INFO NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/opt/gatk-package-4.2.0.0-SNAPSHOT-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
06:05:49.544 INFO IntelPairHmm - Using CPU-supported AVX-512 instructions
06:05:49.544 INFO IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
06:05:49.544 INFO IntelPairHmm - Available threads: 48
06:05:49.545 INFO IntelPairHmm - Requested threads: 4
06:05:49.545 INFO PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
06:05:49.570 INFO ProgressMeter - Starting traversal
06:05:49.570 INFO ProgressMeter - Current Locus Elapsed Minutes Regions Processed Regions/Minute
06:05:51.486 WARN InbreedingCoeff - InbreedingCoeff will not be calculated at position 11:1016704 and possibly subsequent; at least 10 samples must have called genotypes
06:06:04.649 INFO ProgressMeter - 11:1018020 0.3 10 39.8
06:06:09.216 INFO HaplotypeCaller - 6171 read(s) filtered by: MappingQualityReadFilter
0 read(s) filtered by: MappingQualityAvailableReadFilter
0 read(s) filtered by: MappedReadFilter
10 read(s) filtered by: NotSecondaryAlignmentReadFilter
0 read(s) filtered by: NotDuplicateReadFilter
0 read(s) filtered by: PassesVendorQualityCheckReadFilter
0 read(s) filtered by: NonZeroReferenceLengthAlignmentReadFilter
0 read(s) filtered by: GoodCigarReadFilter
0 read(s) filtered by: WellformedReadFilter
6181 total reads filtered
06:06:09.216 INFO ProgressMeter - 11:121028481 0.3 129 394.0
06:06:09.216 INFO ProgressMeter - Traversal complete. Processed 129 total regions in 0.3 minutes.
06:06:09.263 INFO VectorLoglessPairHMM - Time spent in setup for JNI call : 0.048460615000000005
06:06:09.263 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 9.203253743000001
06:06:09.263 INFO SmithWatermanAligner - Total compute time in java Smith-Waterman : 2.89 sec
06:06:09.263 INFO HaplotypeCaller - Shutting down engineI get variant:
11 1016944 . G T 308.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=1.080;DP=1217;ExcessHet=3.0103;FS=16.489;MLEAC=1;MLEAF=0.500;MQ=48.21;MQRankSum=-4.107;QD=0.26;ReadPosRankSum=1.175;SOR=3.438 GT:AD:DP:GQ:PL 0/1:1120,88:1208:99:316,0,46731
Log with all.list:
06:13:46.821 INFO HaplotypeCaller - ------------------------------------------------------------
06:13:46.821 INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.2.0.0-SNAPSHOT
06:13:46.821 INFO HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
06:13:46.821 INFO HaplotypeCaller - Executing as root on Linux v4.9.0-8-amd64 amd64
06:13:46.821 INFO HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v11.0.11+9-post-Debian-1deb10u1
06:13:46.821 INFO HaplotypeCaller - Start Date/Time: September 10, 2021 at 6:13:46 AM UTC
06:13:46.822 INFO HaplotypeCaller - ------------------------------------------------------------
06:13:46.822 INFO HaplotypeCaller - ------------------------------------------------------------
06:13:46.822 INFO HaplotypeCaller - HTSJDK Version: 2.24.0
06:13:46.822 INFO HaplotypeCaller - Picard Version: 2.25.0
06:13:46.822 INFO HaplotypeCaller - Built for Spark Version: 2.4.5
06:13:46.822 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
06:13:46.822 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
06:13:46.822 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
06:13:46.822 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
06:13:46.822 INFO HaplotypeCaller - Deflater: IntelDeflater
06:13:46.822 INFO HaplotypeCaller - Inflater: IntelInflater
06:13:46.822 INFO HaplotypeCaller - GCS max retries/reopens: 20
06:13:46.822 INFO HaplotypeCaller - Requester pays: disabled
06:13:46.822 INFO HaplotypeCaller - Initializing engine
06:13:46.986 INFO IntervalArgumentCollection - Processing 202370 bp from intervals
06:13:46.994 INFO HaplotypeCaller - Done initializing engine
06:13:47.007 INFO HaplotypeCallerEngine - Disabling physical phasing, which is supported only for reference-model confidence output
06:13:47.015 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/opt/gatk-package-4.2.0.0-SNAPSHOT-local.jar!/com/intel/gkl/native/libgkl_utils.so
06:13:47.016 INFO NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/opt/gatk-package-4.2.0.0-SNAPSHOT-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
06:13:47.043 INFO IntelPairHmm - Using CPU-supported AVX-512 instructions
06:13:47.043 INFO IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
06:13:47.044 INFO IntelPairHmm - Available threads: 48
06:13:47.044 INFO IntelPairHmm - Requested threads: 4
06:13:47.044 INFO PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
06:13:47.079 INFO ProgressMeter - Starting traversal
06:13:47.079 INFO ProgressMeter - Current Locus Elapsed Minutes Regions Processed Regions/Minute
06:13:47.487 WARN InbreedingCoeff - InbreedingCoeff will not be calculated at position 1:1290276 and possibly subsequent; at least 10 samples must have called genotypes
06:13:57.113 INFO ProgressMeter - 2:179441424 0.2 340 2033.1
06:14:07.128 INFO ProgressMeter - 6:132891656 0.3 790 2364.2
06:14:17.763 INFO ProgressMeter - 9:33797817 0.5 1030 2014.1
06:14:34.030 INFO ProgressMeter - 11:1018020 0.8 1180 1508.0
06:14:44.031 INFO ProgressMeter - 14:106329355 0.9 1560 1643.5
06:14:54.031 INFO ProgressMeter - 19:39221730 1.1 2020 1810.3
06:14:56.195 INFO HaplotypeCaller - 64088 read(s) filtered by: MappingQualityReadFilter
0 read(s) filtered by: MappingQualityAvailableReadFilter
0 read(s) filtered by: MappedReadFilter
59 read(s) filtered by: NotSecondaryAlignmentReadFilter
0 read(s) filtered by: NotDuplicateReadFilter
0 read(s) filtered by: PassesVendorQualityCheckReadFilter
2 read(s) filtered by: NonZeroReferenceLengthAlignmentReadFilter
0 read(s) filtered by: GoodCigarReadFilter
0 read(s) filtered by: WellformedReadFilter
64149 total reads filtered
06:14:56.196 INFO ProgressMeter - 22:45278966 1.2 2229 1935.0
06:14:56.196 INFO ProgressMeter - Traversal complete. Processed 2229 total regions in 1.2 minutes.
06:14:56.305 INFO VectorLoglessPairHMM - Time spent in setup for JNI call : 0.109933602
06:14:56.305 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 26.370827184000003
06:14:56.305 INFO SmithWatermanAligner - Total compute time in java Smith-Waterman : 15.93 sec
06:14:56.305 INFO HaplotypeCaller - Shutting down engine
No variant at position 11 1016944 is detected.
It looks like information about other chromosomes affect results for one specific chromosome. -
Thanks for sharing those logs, it doesn't look like are any specific error messages.
Are there any variants at all in your output VCF when using the all.list?
Best,
Genevieve
-
Yes, there are. Output of bcftools stats:
# SN [2]id [3]key [4]value
SN 0 number of samples: 1
SN 0 number of records: 2289
SN 0 number of no-ALTs: 904
SN 0 number of SNPs: 1310
SN 0 number of MNPs: 0
SN 0 number of indels: 74
SN 0 number of others: 1
SN 0 number of multiallelic sites: 2
SN 0 number of multiallelic SNP sites: 2The problem is inconsistency at same position (11 1016944) with two lists. With longer list there is no variant called, with shorter there is variant. And shorter list is subset of longer (Every position in shorter list exists in longer list). It looks like GATK4 HC takes information from other positions (even from other chromosomes) while doing local assembly, is there any way to solve this inconsistency?
-
Thanks for updating with this description, this problem is more clear to me now! HaplotypeCaller should not be using information from other positions when doing local reassembly. If you can point me to some data of why you think that is occurring here, I can parse through what is going on.
I think it would be helpful to take a closer look at the position 11:1016944 and work through this troubleshooting document: When HaplotypeCaller and Mutect2 do not call an expected variant. The document was written by our developers for the case you are describing, which is trying to figure out why there is confusing behavior with HaplotypeCaller. I think you will find it very helpful!
Let me know what you find from looking into that site.
Best,
Genevieve
Please sign in to leave a comment.
5 comments