Abnormal Results from GATK HaplotypeCaller --different variant Call from the fastq and its downsample fastq from the same fastqdata
AnsweredDear Genevieve,
I have a PE fastq from a multiprimer Panel. When I use the same commandline(see the code) and parameters analysising the raw fastq and downsample fastq. I get different results seeing the picture (GATK haplotypcaller out bam ). Check the input bam, they are both T/C (AF ~50%, standard heterozygous). The real result from sanger seqence is T/C, as the same as the result from downsample fastq. It's beyond understanding.
I can not look for the reason. Why do I see that the reads with ALT are filtered by GATK Haplotypecaller when using the raw fastq?
I also output the dot file. If you need please let me know. Thank you.
/storage/data_sh/software/gatk-4.2.3.0/gatk --java-options "-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Xmx16G -Djava.io.tmpdir=./" HaplotypeCaller \
-R $ref_dir/ucsc.hg19.fasta -I $bam_dir/${sample}.bq20.bam -L $bedfile_SNP \
--max-reads-per-alignment-start 0 \
--annotate-with-num-discovered-alleles true \
--output-mode EMIT_ALL_CONFIDENT_SITES \
-emit-ref-confidence BP_RESOLUTION \
--recover-all-dangling-branches true \
--debug-graph-transformations \
--bam-output $var_dir/${sample}.GATK_rawdata.output.bam \
-O $var_dir/${sample}_1_EMIT_ALL_SITES_annotate-with-num-discovered-alleles.bq20.gvcf
Best wishes,
Amy
-
Hi Amy zhang,
We have a troubleshooting article for this type of question: When HaplotypeCaller and Mutect2 do not call an expected variant. Please take a look and see if any of the suggestions help with the call in the original bam file.
If that doesn't work, I'm going to need more information. Could you send the VCF lines for the downsample and the original BAM? Could you also share what this site looks like in each of the input BAMs in IGV? Please click on the site of interest so we can see the depth of each allele.
Thank you,
Genevieve
-
Hi Genevieve,
Thanks for your reply. Before writing this post, I have red the troubleshooting article for this type of question: When HaplotypeCaller and Mutect2 do not call an expected variant. I tried some parameters but it didn't work well. I'm still working on.
You need more information as fallowing : ("Could you send the VCF lines for the downsample and the original BAM? Could you also share what this site looks like in each of the input BAMs in IGV? Please click on the site of interest so we can see the depth of each allele. " )
1. information from downsample fastq:
2. information from rawdata fastq:
thanks,
Amy
-
Hi Amy zhang,
Thank you for those images! It is very helpful to illuminate the issue. I have more follow up questions:
- Have you run GenotypeGVCFs on these GVCFs? Please share the VCF lines after GenotypeGVCFs.
- Try turning off HaplotypeCaller's internal downsampling with --max-reads-per-alignment-start 0
- I would also like to see the complete program log from HaplotypeCaller both from the original and downsampled bams.
Thank you,
Genevieve
-
Hi Genevieve,
The second point which you mentioned was used. The other two points will provide later.
thanks a lot,
Amy
-
Hi Genevieve,
I'm trying modify the parameters --kmer-size([20-35]) and --max-num-haplotypes-in-population(1024). It works on this site. But I don't know if it cause other sites different. I know the meaning of --kmer-size, but I'm not sure about the meaning of --max-num-haplotypes-in-population from the document. Could you explain it specifically? Thanks a lot.
Amy
-
By increasing the --max-num-haplotypes-in-population, you are increasing the number of haplotypes considered at each active region: https://gatk.broadinstitute.org/hc/en-us/articles/360036227632-Evaluating-the-evidence-for-haplotypes-and-variant-alleles-HaplotypeCaller-and-Mutect2-.
This could add extra run time for your analysis because the algorithm doing more math. If you are having luck when adding this, though, maybe it is because there are many possible haplotypes in your samples?
It would still be interesting to see the program log from HaplotypeCaller from the original and downsampled bams when the variant is not called. I wanted to see how many reads are being filtered.
-
Hi Genevieve,
The program log what you need from HaplotypeCaller from the original (not called) and downsampled bams(called) as fallowing:
1.The program log from HaplotypeCaller from the original bam:
1.1 rawdata_bam_GATK_log:
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Xmx50G -Djava.io.tmpdir=./ -jar /storage/data_sh/software/gatk-4.2.3.0/gatk-package-4.2.3.0-local.jar HaplotypeCaller -R /data/database/GATK_resource_bundle/broad_bundle_hg19_v2.8//ucsc.hg19.fasta -I /storage/data_sh/Project_2020Q3_Q4/Project_run_log/test_by_zyy/RD08-43/Project_Result_XK-055_V3.2/20211226_E100034076_U1039_WQP_pooling_20211224_1/Project_0002_20211224/rawdata/HBA/E100034076_L01_P_012_20211223/analysis_3.2/bamfiles//E100034076_L01_P_012_20211223.bq20.bam -L /storage/data_sh/home/zhangyuangyuan/project/MultiPrimerDesign/STR_software/pipline/HBA_V6.0_SNP.bed --max-reads-per-alignment-start 0 --annotate-with-num-discovered-alleles true --output-mode EMIT_ALL_CONFIDENT_SITES -emit-ref-confidence BP_RESOLUTION --recover-all-dangling-branches true --bam-output /storage/data_sh/Project_2020Q3_Q4/Project_run_log/test_by_zyy/RD08-43/Project_Result_XK-055_V3.2/20211226_E100034076_U1039_WQP_pooling_20211224_1/Project_0002_20211224/rawdata/HBA/E100034076_L01_P_012_20211223/analysis_3.2/bamfiles//E100034076_L01_P_012_20211223.GATK.output.bam -O /storage/data_sh/Project_2020Q3_Q4/Project_run_log/test_by_zyy/RD08-43/Project_Result_XK-055_V3.2/20211226_E100034076_U1039_WQP_pooling_20211224_1/Project_0002_20211224/rawdata/HBA/E100034076_L01_P_012_20211223/analysis_3.2/bamfiles//E100034076_L01_P_012_20211223_1_EMIT_ALL_SITES_annotate-with-num-discovered-alleles_old.bq20.gvcf
18:00:05.062 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/storage/data_sh/software/gatk-4.2.3.0/gatk-package-4.2.3.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Jan 17, 2022 6:00:05 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
18:00:05.547 INFO HaplotypeCaller - ------------------------------------------------------------
18:00:05.548 INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.2.3.0
18:00:05.548 INFO HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
18:00:05.548 INFO HaplotypeCaller - Executing as zhangyuanyuan@yikong-sz7 on Linux v3.10.0-957.el7.x86_64 amd64
18:00:05.549 INFO HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_181-b13
18:00:05.549 INFO HaplotypeCaller - Start Date/Time: January 17, 2022 6:00:05 PM CST
18:00:05.549 INFO HaplotypeCaller - ------------------------------------------------------------
18:00:05.549 INFO HaplotypeCaller - ------------------------------------------------------------
18:00:05.550 INFO HaplotypeCaller - HTSJDK Version: 2.24.1
18:00:05.550 INFO HaplotypeCaller - Picard Version: 2.25.4
18:00:05.550 INFO HaplotypeCaller - Built for Spark Version: 2.4.5
18:00:05.550 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
18:00:05.550 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
18:00:05.551 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
18:00:05.551 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
18:00:05.551 INFO HaplotypeCaller - Deflater: IntelDeflater
18:00:05.551 INFO HaplotypeCaller - Inflater: IntelInflater
18:00:05.551 INFO HaplotypeCaller - GCS max retries/reopens: 20
18:00:05.551 INFO HaplotypeCaller - Requester pays: disabled
18:00:05.551 INFO HaplotypeCaller - Initializing engine
18:00:06.744 INFO FeatureManager - Using codec BEDCodec to read file file:///storage/data_sh/home/zhangyuangyuan/project/MultiPrimerDesign/STR_software/pipline/HBA_V6.0_SNP.bed
18:00:06.786 INFO IntervalArgumentCollection - Processing 129 bp from intervals
18:00:06.796 INFO HaplotypeCaller - Done initializing engine
18:00:06.798 INFO HaplotypeCallerEngine - Tool is in reference confidence mode and the annotation, the following changes will be made to any specified annotations: 'StrandBiasBySample' will be enabled. 'ChromosomeCounts', 'FisherStrand', 'StrandOddsRatio' and 'QualByDepth' annotations have been disabled
18:00:06.869 INFO HaplotypeCallerEngine - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output
18:00:06.870 INFO HaplotypeCallerEngine - All sites annotated with PLs forced to true for reference-model confidence output
18:00:06.977 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/storage/data_sh/software/gatk-4.2.3.0/gatk-package-4.2.3.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
18:00:07.150 INFO NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/storage/data_sh/software/gatk-4.2.3.0/gatk-package-4.2.3.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
18:00:07.347 INFO IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
18:00:07.348 INFO IntelPairHmm - Available threads: 128
18:00:07.348 INFO IntelPairHmm - Requested threads: 4
18:00:07.348 INFO PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
18:00:07.517 WARN GATKVariantContextUtils - Can't determine output variant file format from output file extension "gvcf". Defaulting to VCF.
18:00:07.582 INFO ProgressMeter - Starting traversal
18:00:07.583 INFO ProgressMeter - Current Locus Elapsed Minutes Regions Processed Regions/Minute
18:00:12.329 WARN InbreedingCoeff - InbreedingCoeff will not be calculated at position chr16:89279 and possibly subsequent; at least 10 samples must have called genotypes
18:00:18.330 INFO ProgressMeter - chr16:103423 0.2 10 55.8
18:00:28.724 INFO ProgressMeter - chr16:139601 0.4 30 85.1
18:00:48.422 INFO ProgressMeter - chr16:207731 0.7 60 88.2
18:01:01.611 INFO ProgressMeter - chr16:240280 0.9 70 77.7
18:01:16.699 INFO ProgressMeter - chr16:931686 1.2 100 86.8
18:01:24.050 INFO HaplotypeCaller - 0 read(s) filtered by: MappingQualityReadFilter
0 read(s) filtered by: MappingQualityAvailableReadFilter
0 read(s) filtered by: MappedReadFilter
0 read(s) filtered by: NotSecondaryAlignmentReadFilter
0 read(s) filtered by: NotDuplicateReadFilter
0 read(s) filtered by: PassesVendorQualityCheckReadFilter
0 read(s) filtered by: NonZeroReferenceLengthAlignmentReadFilter
0 read(s) filtered by: GoodCigarReadFilter
0 read(s) filtered by: WellformedReadFilter
0 total reads filtered
18:01:24.050 INFO ProgressMeter - chr16:1273982 1.3 128 100.4
18:01:24.051 INFO ProgressMeter - Traversal complete. Processed 128 total regions in 1.3 minutes.
18:01:24.096 INFO VectorLoglessPairHMM - Time spent in setup for JNI call : 0.21496658800000001
18:01:24.096 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 21.589894392
18:01:24.096 INFO SmithWatermanAligner - Total compute time in java Smith-Waterman : 8.68 sec
18:01:25.659 INFO HaplotypeCaller - Shutting down engine
[January 17, 2022 6:01:25 PM CST] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 1.34 minutes.
Runtime.totalMemory()=7057440768
1.2 rawdata_bam_result
[user bamfiles]$ cat *old*.gvcf |grep 223597
chr16 223597 . T C,<NON_REF> 0 . BaseQRankSum=-3.195;DP=1552;ExcessHet=3.0103;MLEAC=0,0;MLEAF=0.00,0.00;MQRankSum=-3.969;NDA=2;RAW_MQandDP=4587200,1552;ReadPosRankSum=-5.655 GT:AD:DP:GQ:PL:SB 0/0:1530,15,0:1545:99:0,4160,47768,4570,47812,48222:488,1042,12,32. The program log from HaplotypeCaller from the downsampled bam:
2.1 downsample_bam_GATK_log:
Using GATK jar /storage/data_sh/software/gatk-4.2.3.0/gatk-package-4.2.3.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Xmx50G -Djava.io.tmpdir=./ -jar /storage/data_sh/software/gatk-4.2.3.0/gatk-package-4.2.3.0-local.jar HaplotypeCaller -R /data/database/GATK_resource_bundle/broad_bundle_hg19_v2.8//ucsc.hg19.fasta -I /storage/data_sh/Project_2020Q3_Q4/Project_run_log/test_by_zyy/RD08-43/Project_Result_XK-055_V3.2/20211226_E100034076_U1039_WQP_pooling_20211224_1/Project_0002_20211224/downsample/HBA/E100034076_L01_P_012_20211223/analysis_3.2/bamfiles//E100034076_L01_P_012_20211223.bq20.bam -L /storage/data_sh/home/zhangyuangyuan/project/MultiPrimerDesign/STR_software/pipline/HBA_V6.0_SNP.bed --max-reads-per-alignment-start 0 --annotate-with-num-discovered-alleles true --output-mode EMIT_ALL_CONFIDENT_SITES -emit-ref-confidence BP_RESOLUTION --recover-all-dangling-branches true --bam-output /storage/data_sh/Project_2020Q3_Q4/Project_run_log/test_by_zyy/RD08-43/Project_Result_XK-055_V3.2/20211226_E100034076_U1039_WQP_pooling_20211224_1/Project_0002_20211224/downsample/HBA/E100034076_L01_P_012_20211223/analysis_3.2/bamfiles//E100034076_L01_P_012_20211223.GATK.output.bam -O /storage/data_sh/Project_2020Q3_Q4/Project_run_log/test_by_zyy/RD08-43/Project_Result_XK-055_V3.2/20211226_E100034076_U1039_WQP_pooling_20211224_1/Project_0002_20211224/rawdata/HBA/E100034076_L01_P_012_20211223/analysis_3.2/bamfiles//E100034076_L01_P_012_20211223_1_EMIT_ALL_SITES_annotate-with-num-discovered-alleles_downsample.bq20.gvcf
18:10:09.016 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/storage/data_sh/software/gatk-4.2.3.0/gatk-package-4.2.3.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Jan 17, 2022 6:10:09 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
18:10:09.355 INFO HaplotypeCaller - ------------------------------------------------------------
18:10:09.356 INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.2.3.0
18:10:09.356 INFO HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
18:10:09.356 INFO HaplotypeCaller - Executing as zhangyuanyuan@yikong-sz7 on Linux v3.10.0-957.el7.x86_64 amd64
18:10:09.356 INFO HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_181-b13
18:10:09.357 INFO HaplotypeCaller - Start Date/Time: January 17, 2022 6:10:08 PM CST
18:10:09.357 INFO HaplotypeCaller - ------------------------------------------------------------
18:10:09.357 INFO HaplotypeCaller - ------------------------------------------------------------
18:10:09.358 INFO HaplotypeCaller - HTSJDK Version: 2.24.1
18:10:09.358 INFO HaplotypeCaller - Picard Version: 2.25.4
18:10:09.358 INFO HaplotypeCaller - Built for Spark Version: 2.4.5
18:10:09.358 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
18:10:09.358 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
18:10:09.359 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
18:10:09.359 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
18:10:09.359 INFO HaplotypeCaller - Deflater: IntelDeflater
18:10:09.359 INFO HaplotypeCaller - Inflater: IntelInflater
18:10:09.359 INFO HaplotypeCaller - GCS max retries/reopens: 20
18:10:09.359 INFO HaplotypeCaller - Requester pays: disabled
18:10:09.360 INFO HaplotypeCaller - Initializing engine
18:10:10.253 INFO FeatureManager - Using codec BEDCodec to read file file:///storage/data_sh/home/zhangyuangyuan/project/MultiPrimerDesign/STR_software/pipline/HBA_V6.0_SNP.bed
18:10:10.294 INFO IntervalArgumentCollection - Processing 129 bp from intervals
18:10:10.309 INFO HaplotypeCaller - Done initializing engine
18:10:10.313 INFO HaplotypeCallerEngine - Tool is in reference confidence mode and the annotation, the following changes will be made to any specified annotations: 'StrandBiasBySample' will be enabled. 'ChromosomeCounts', 'FisherStrand', 'StrandOddsRatio' and 'QualByDepth' annotations have been disabled
18:10:10.359 INFO HaplotypeCallerEngine - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output
18:10:10.359 INFO HaplotypeCallerEngine - All sites annotated with PLs forced to true for reference-model confidence output
18:10:10.477 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/storage/data_sh/software/gatk-4.2.3.0/gatk-package-4.2.3.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
18:10:10.573 INFO NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/storage/data_sh/software/gatk-4.2.3.0/gatk-package-4.2.3.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
18:10:10.710 INFO IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
18:10:10.711 INFO IntelPairHmm - Available threads: 128
18:10:10.711 INFO IntelPairHmm - Requested threads: 4
18:10:10.711 INFO PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
18:10:10.805 WARN GATKVariantContextUtils - Can't determine output variant file format from output file extension "gvcf". Defaulting to VCF.
18:10:10.858 INFO ProgressMeter - Starting traversal
18:10:10.859 INFO ProgressMeter - Current Locus Elapsed Minutes Regions Processed Regions/Minute
18:10:12.965 WARN InbreedingCoeff - InbreedingCoeff will not be calculated at position chr16:89279 and possibly subsequent; at least 10 samples must have called genotypes
18:10:22.111 INFO ProgressMeter - chr16:127230 0.2 20 106.7
18:10:38.667 INFO ProgressMeter - chr16:240280 0.5 70 151.0
18:10:48.740 INFO ProgressMeter - chr16:1071366 0.6 110 174.2
18:10:51.690 INFO HaplotypeCaller - 0 read(s) filtered by: MappingQualityReadFilter
0 read(s) filtered by: MappingQualityAvailableReadFilter
0 read(s) filtered by: MappedReadFilter
0 read(s) filtered by: NotSecondaryAlignmentReadFilter
0 read(s) filtered by: NotDuplicateReadFilter
0 read(s) filtered by: PassesVendorQualityCheckReadFilter
0 read(s) filtered by: NonZeroReferenceLengthAlignmentReadFilter
0 read(s) filtered by: GoodCigarReadFilter
0 read(s) filtered by: WellformedReadFilter
0 total reads filtered
18:10:51.691 INFO ProgressMeter - chr16:1273982 0.7 128 188.1
18:10:51.691 INFO ProgressMeter - Traversal complete. Processed 128 total regions in 0.7 minutes.
18:10:51.727 INFO VectorLoglessPairHMM - Time spent in setup for JNI call : 0.101591923
18:10:51.727 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 11.953453107000001
18:10:51.728 INFO SmithWatermanAligner - Total compute time in java Smith-Waterman : 7.25 sec
18:10:52.493 INFO HaplotypeCaller - Shutting down engine
[January 17, 2022 6:10:52 PM CST] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.73 minutes.
Runtime.totalMemory()=55978229762.2 downsample_bam_result:
[user bamfiles]$ cat *downsample*.gvcf |grep 223597
chr16 223597 . T C,<NON_REF> 8394.64 . BaseQRankSum=3.569;DP=957;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=-2.090;NDA=2;RAW_MQandDP=2497200,957;ReadPosRankSum=1.447 GT:AD:DP:GQ:PL:SB 0/1:229,249,0:478:99:8402,0,7364,9090,8113,17203:226,3,249,0Thanks a lot,
Amy
-
Thanks Amy. I'm confused about your raw data result:
chr16 223597 . T C,<NON_REF> 0 . BaseQRankSum=-3.195;DP=1552;ExcessHet=3.0103;MLEAC=0,0;MLEAF=0.00,0.00;MQRankSum=-3.969;NDA=2;RAW_MQandDP=4587200,1552;ReadPosRankSum=-5.655 GT:AD:DP:GQ:PL:SB 0/0:1530,15,0:1545:99:0,4160,47768,4570,47812,48222:488,1042,12,3
In the image you shared earlier, there were only 1086 reads in the raw data bam.
Please sign in to leave a comment.
8 comments