Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

Abnormal Results from GATK HaplotypeCaller --different variant Call from the fastq and its downsample fastq from the same fastqdata

Answered
0

8 comments

  • Avatar
    Genevieve Brandt (she/her)

    Hi Amy zhang,

    We have a troubleshooting article for this type of question: When HaplotypeCaller and Mutect2 do not call an expected variant. Please take a look and see if any of the suggestions help with the call in the original bam file.

    If that doesn't work, I'm going to need more information. Could you send the VCF lines for the downsample and the original BAM? Could you also share what this site looks like in each of the input BAMs in IGV? Please click on the site of interest so we can see the depth of each allele. 

    Thank you,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Amy zhang

    Hi Genevieve,

    Thanks for your reply. Before writing this post, I have red the  troubleshooting article for this type of question: When HaplotypeCaller and Mutect2 do not call an expected variant. I tried some parameters but it didn't work well. I'm still working on.

    You need more information as fallowing : ("Could you send the VCF lines for the downsample and the original BAM? Could you also share what this site looks like in each of the input BAMs in IGV? Please click on the site of interest so we can see the depth of each allele. " )

    1. information from downsample fastq:

    2. information from rawdata fastq:

    thanks,

    Amy

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Amy zhang,

    Thank you for those images! It is very helpful to illuminate the issue. I have more follow up questions:

    • Have you run GenotypeGVCFs on these GVCFs? Please share the VCF lines after GenotypeGVCFs.
    • Try turning off HaplotypeCaller's internal downsampling with --max-reads-per-alignment-start 0
    • I would also like to see the complete program log from HaplotypeCaller both from the original and downsampled bams.

    Thank you,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Amy zhang

    Hi Genevieve,

    The second point which you mentioned was used. The other two points will provide later.

    thanks a lot,

    Amy

     

    0
    Comment actions Permalink
  • Avatar
    Amy zhang

    Hi Genevieve,

    I'm trying modify the parameters --kmer-size([20-35]) and --max-num-haplotypes-in-population(1024). It works on this site. But I don't know if it cause other sites different. I know the meaning of --kmer-size, but I'm not sure about the meaning of --max-num-haplotypes-in-population from the document. Could you explain it specifically? Thanks a lot.

    Amy

     

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    By increasing the --max-num-haplotypes-in-population, you are increasing the number of haplotypes considered at each active region: https://gatk.broadinstitute.org/hc/en-us/articles/360036227632-Evaluating-the-evidence-for-haplotypes-and-variant-alleles-HaplotypeCaller-and-Mutect2-.

    This could add extra run time for your analysis because the algorithm doing more math. If you are having luck when adding this, though, maybe it is because there are many possible haplotypes in your samples?

    It would still be interesting to see the program log from HaplotypeCaller from the original and downsampled bams when the variant is not called. I wanted to see how many reads are being filtered.

    0
    Comment actions Permalink
  • Avatar
    Amy zhang

    Hi Genevieve,

    The program log what you need from HaplotypeCaller from the original (not called) and downsampled bams(called) as fallowing:

    1.The program log from HaplotypeCaller from the original bam:

    1.1 rawdata_bam_GATK_log:
    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Xmx50G -Djava.io.tmpdir=./ -jar /storage/data_sh/software/gatk-4.2.3.0/gatk-package-4.2.3.0-local.jar HaplotypeCaller -R /data/database/GATK_resource_bundle/broad_bundle_hg19_v2.8//ucsc.hg19.fasta -I /storage/data_sh/Project_2020Q3_Q4/Project_run_log/test_by_zyy/RD08-43/Project_Result_XK-055_V3.2/20211226_E100034076_U1039_WQP_pooling_20211224_1/Project_0002_20211224/rawdata/HBA/E100034076_L01_P_012_20211223/analysis_3.2/bamfiles//E100034076_L01_P_012_20211223.bq20.bam -L /storage/data_sh/home/zhangyuangyuan/project/MultiPrimerDesign/STR_software/pipline/HBA_V6.0_SNP.bed --max-reads-per-alignment-start 0 --annotate-with-num-discovered-alleles true --output-mode EMIT_ALL_CONFIDENT_SITES -emit-ref-confidence BP_RESOLUTION --recover-all-dangling-branches true --bam-output /storage/data_sh/Project_2020Q3_Q4/Project_run_log/test_by_zyy/RD08-43/Project_Result_XK-055_V3.2/20211226_E100034076_U1039_WQP_pooling_20211224_1/Project_0002_20211224/rawdata/HBA/E100034076_L01_P_012_20211223/analysis_3.2/bamfiles//E100034076_L01_P_012_20211223.GATK.output.bam -O /storage/data_sh/Project_2020Q3_Q4/Project_run_log/test_by_zyy/RD08-43/Project_Result_XK-055_V3.2/20211226_E100034076_U1039_WQP_pooling_20211224_1/Project_0002_20211224/rawdata/HBA/E100034076_L01_P_012_20211223/analysis_3.2/bamfiles//E100034076_L01_P_012_20211223_1_EMIT_ALL_SITES_annotate-with-num-discovered-alleles_old.bq20.gvcf
    18:00:05.062 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/storage/data_sh/software/gatk-4.2.3.0/gatk-package-4.2.3.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    Jan 17, 2022 6:00:05 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
    INFO: Failed to detect whether we are running on Google Compute Engine.
    18:00:05.547 INFO HaplotypeCaller - ------------------------------------------------------------
    18:00:05.548 INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.2.3.0
    18:00:05.548 INFO HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
    18:00:05.548 INFO HaplotypeCaller - Executing as zhangyuanyuan@yikong-sz7 on Linux v3.10.0-957.el7.x86_64 amd64
    18:00:05.549 INFO HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_181-b13
    18:00:05.549 INFO HaplotypeCaller - Start Date/Time: January 17, 2022 6:00:05 PM CST
    18:00:05.549 INFO HaplotypeCaller - ------------------------------------------------------------
    18:00:05.549 INFO HaplotypeCaller - ------------------------------------------------------------
    18:00:05.550 INFO HaplotypeCaller - HTSJDK Version: 2.24.1
    18:00:05.550 INFO HaplotypeCaller - Picard Version: 2.25.4
    18:00:05.550 INFO HaplotypeCaller - Built for Spark Version: 2.4.5
    18:00:05.550 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    18:00:05.550 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    18:00:05.551 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    18:00:05.551 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    18:00:05.551 INFO HaplotypeCaller - Deflater: IntelDeflater
    18:00:05.551 INFO HaplotypeCaller - Inflater: IntelInflater
    18:00:05.551 INFO HaplotypeCaller - GCS max retries/reopens: 20
    18:00:05.551 INFO HaplotypeCaller - Requester pays: disabled
    18:00:05.551 INFO HaplotypeCaller - Initializing engine
    18:00:06.744 INFO FeatureManager - Using codec BEDCodec to read file file:///storage/data_sh/home/zhangyuangyuan/project/MultiPrimerDesign/STR_software/pipline/HBA_V6.0_SNP.bed
    18:00:06.786 INFO IntervalArgumentCollection - Processing 129 bp from intervals
    18:00:06.796 INFO HaplotypeCaller - Done initializing engine
    18:00:06.798 INFO HaplotypeCallerEngine - Tool is in reference confidence mode and the annotation, the following changes will be made to any specified annotations: 'StrandBiasBySample' will be enabled. 'ChromosomeCounts', 'FisherStrand', 'StrandOddsRatio' and 'QualByDepth' annotations have been disabled
    18:00:06.869 INFO HaplotypeCallerEngine - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output
    18:00:06.870 INFO HaplotypeCallerEngine - All sites annotated with PLs forced to true for reference-model confidence output
    18:00:06.977 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/storage/data_sh/software/gatk-4.2.3.0/gatk-package-4.2.3.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
    18:00:07.150 INFO NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/storage/data_sh/software/gatk-4.2.3.0/gatk-package-4.2.3.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
    18:00:07.347 INFO IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
    18:00:07.348 INFO IntelPairHmm - Available threads: 128
    18:00:07.348 INFO IntelPairHmm - Requested threads: 4
    18:00:07.348 INFO PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
    18:00:07.517 WARN GATKVariantContextUtils - Can't determine output variant file format from output file extension "gvcf". Defaulting to VCF.
    18:00:07.582 INFO ProgressMeter - Starting traversal
    18:00:07.583 INFO ProgressMeter - Current Locus Elapsed Minutes Regions Processed Regions/Minute
    18:00:12.329 WARN InbreedingCoeff - InbreedingCoeff will not be calculated at position chr16:89279 and possibly subsequent; at least 10 samples must have called genotypes
    18:00:18.330 INFO ProgressMeter - chr16:103423 0.2 10 55.8
    18:00:28.724 INFO ProgressMeter - chr16:139601 0.4 30 85.1
    18:00:48.422 INFO ProgressMeter - chr16:207731 0.7 60 88.2
    18:01:01.611 INFO ProgressMeter - chr16:240280 0.9 70 77.7
    18:01:16.699 INFO ProgressMeter - chr16:931686 1.2 100 86.8
    18:01:24.050 INFO HaplotypeCaller - 0 read(s) filtered by: MappingQualityReadFilter
    0 read(s) filtered by: MappingQualityAvailableReadFilter
    0 read(s) filtered by: MappedReadFilter
    0 read(s) filtered by: NotSecondaryAlignmentReadFilter
    0 read(s) filtered by: NotDuplicateReadFilter
    0 read(s) filtered by: PassesVendorQualityCheckReadFilter
    0 read(s) filtered by: NonZeroReferenceLengthAlignmentReadFilter
    0 read(s) filtered by: GoodCigarReadFilter
    0 read(s) filtered by: WellformedReadFilter
    0 total reads filtered
    18:01:24.050 INFO ProgressMeter - chr16:1273982 1.3 128 100.4
    18:01:24.051 INFO ProgressMeter - Traversal complete. Processed 128 total regions in 1.3 minutes.
    18:01:24.096 INFO VectorLoglessPairHMM - Time spent in setup for JNI call : 0.21496658800000001
    18:01:24.096 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 21.589894392
    18:01:24.096 INFO SmithWatermanAligner - Total compute time in java Smith-Waterman : 8.68 sec
    18:01:25.659 INFO HaplotypeCaller - Shutting down engine
    [January 17, 2022 6:01:25 PM CST] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 1.34 minutes.
    Runtime.totalMemory()=7057440768


    1.2 rawdata_bam_result
    [user bamfiles]$ cat *old*.gvcf |grep 223597
    chr16 223597 . T C,<NON_REF> 0 . BaseQRankSum=-3.195;DP=1552;ExcessHet=3.0103;MLEAC=0,0;MLEAF=0.00,0.00;MQRankSum=-3.969;NDA=2;RAW_MQandDP=4587200,1552;ReadPosRankSum=-5.655 GT:AD:DP:GQ:PL:SB 0/0:1530,15,0:1545:99:0,4160,47768,4570,47812,48222:488,1042,12,3

    2. The program log from HaplotypeCaller from the downsampled bam:

    2.1 downsample_bam_GATK_log:

    Using GATK jar /storage/data_sh/software/gatk-4.2.3.0/gatk-package-4.2.3.0-local.jar
    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Xmx50G -Djava.io.tmpdir=./ -jar /storage/data_sh/software/gatk-4.2.3.0/gatk-package-4.2.3.0-local.jar HaplotypeCaller -R /data/database/GATK_resource_bundle/broad_bundle_hg19_v2.8//ucsc.hg19.fasta -I /storage/data_sh/Project_2020Q3_Q4/Project_run_log/test_by_zyy/RD08-43/Project_Result_XK-055_V3.2/20211226_E100034076_U1039_WQP_pooling_20211224_1/Project_0002_20211224/downsample/HBA/E100034076_L01_P_012_20211223/analysis_3.2/bamfiles//E100034076_L01_P_012_20211223.bq20.bam -L /storage/data_sh/home/zhangyuangyuan/project/MultiPrimerDesign/STR_software/pipline/HBA_V6.0_SNP.bed --max-reads-per-alignment-start 0 --annotate-with-num-discovered-alleles true --output-mode EMIT_ALL_CONFIDENT_SITES -emit-ref-confidence BP_RESOLUTION --recover-all-dangling-branches true --bam-output /storage/data_sh/Project_2020Q3_Q4/Project_run_log/test_by_zyy/RD08-43/Project_Result_XK-055_V3.2/20211226_E100034076_U1039_WQP_pooling_20211224_1/Project_0002_20211224/downsample/HBA/E100034076_L01_P_012_20211223/analysis_3.2/bamfiles//E100034076_L01_P_012_20211223.GATK.output.bam -O /storage/data_sh/Project_2020Q3_Q4/Project_run_log/test_by_zyy/RD08-43/Project_Result_XK-055_V3.2/20211226_E100034076_U1039_WQP_pooling_20211224_1/Project_0002_20211224/rawdata/HBA/E100034076_L01_P_012_20211223/analysis_3.2/bamfiles//E100034076_L01_P_012_20211223_1_EMIT_ALL_SITES_annotate-with-num-discovered-alleles_downsample.bq20.gvcf
    18:10:09.016 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/storage/data_sh/software/gatk-4.2.3.0/gatk-package-4.2.3.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    Jan 17, 2022 6:10:09 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
    INFO: Failed to detect whether we are running on Google Compute Engine.
    18:10:09.355 INFO HaplotypeCaller - ------------------------------------------------------------
    18:10:09.356 INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.2.3.0
    18:10:09.356 INFO HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
    18:10:09.356 INFO HaplotypeCaller - Executing as zhangyuanyuan@yikong-sz7 on Linux v3.10.0-957.el7.x86_64 amd64
    18:10:09.356 INFO HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_181-b13
    18:10:09.357 INFO HaplotypeCaller - Start Date/Time: January 17, 2022 6:10:08 PM CST
    18:10:09.357 INFO HaplotypeCaller - ------------------------------------------------------------
    18:10:09.357 INFO HaplotypeCaller - ------------------------------------------------------------
    18:10:09.358 INFO HaplotypeCaller - HTSJDK Version: 2.24.1
    18:10:09.358 INFO HaplotypeCaller - Picard Version: 2.25.4
    18:10:09.358 INFO HaplotypeCaller - Built for Spark Version: 2.4.5
    18:10:09.358 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    18:10:09.358 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    18:10:09.359 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    18:10:09.359 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    18:10:09.359 INFO HaplotypeCaller - Deflater: IntelDeflater
    18:10:09.359 INFO HaplotypeCaller - Inflater: IntelInflater
    18:10:09.359 INFO HaplotypeCaller - GCS max retries/reopens: 20
    18:10:09.359 INFO HaplotypeCaller - Requester pays: disabled
    18:10:09.360 INFO HaplotypeCaller - Initializing engine
    18:10:10.253 INFO FeatureManager - Using codec BEDCodec to read file file:///storage/data_sh/home/zhangyuangyuan/project/MultiPrimerDesign/STR_software/pipline/HBA_V6.0_SNP.bed
    18:10:10.294 INFO IntervalArgumentCollection - Processing 129 bp from intervals
    18:10:10.309 INFO HaplotypeCaller - Done initializing engine
    18:10:10.313 INFO HaplotypeCallerEngine - Tool is in reference confidence mode and the annotation, the following changes will be made to any specified annotations: 'StrandBiasBySample' will be enabled. 'ChromosomeCounts', 'FisherStrand', 'StrandOddsRatio' and 'QualByDepth' annotations have been disabled
    18:10:10.359 INFO HaplotypeCallerEngine - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output
    18:10:10.359 INFO HaplotypeCallerEngine - All sites annotated with PLs forced to true for reference-model confidence output
    18:10:10.477 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/storage/data_sh/software/gatk-4.2.3.0/gatk-package-4.2.3.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
    18:10:10.573 INFO NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/storage/data_sh/software/gatk-4.2.3.0/gatk-package-4.2.3.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
    18:10:10.710 INFO IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
    18:10:10.711 INFO IntelPairHmm - Available threads: 128
    18:10:10.711 INFO IntelPairHmm - Requested threads: 4
    18:10:10.711 INFO PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
    18:10:10.805 WARN GATKVariantContextUtils - Can't determine output variant file format from output file extension "gvcf". Defaulting to VCF.
    18:10:10.858 INFO ProgressMeter - Starting traversal
    18:10:10.859 INFO ProgressMeter - Current Locus Elapsed Minutes Regions Processed Regions/Minute
    18:10:12.965 WARN InbreedingCoeff - InbreedingCoeff will not be calculated at position chr16:89279 and possibly subsequent; at least 10 samples must have called genotypes
    18:10:22.111 INFO ProgressMeter - chr16:127230 0.2 20 106.7
    18:10:38.667 INFO ProgressMeter - chr16:240280 0.5 70 151.0
    18:10:48.740 INFO ProgressMeter - chr16:1071366 0.6 110 174.2
    18:10:51.690 INFO HaplotypeCaller - 0 read(s) filtered by: MappingQualityReadFilter
    0 read(s) filtered by: MappingQualityAvailableReadFilter
    0 read(s) filtered by: MappedReadFilter
    0 read(s) filtered by: NotSecondaryAlignmentReadFilter
    0 read(s) filtered by: NotDuplicateReadFilter
    0 read(s) filtered by: PassesVendorQualityCheckReadFilter
    0 read(s) filtered by: NonZeroReferenceLengthAlignmentReadFilter
    0 read(s) filtered by: GoodCigarReadFilter
    0 read(s) filtered by: WellformedReadFilter
    0 total reads filtered
    18:10:51.691 INFO ProgressMeter - chr16:1273982 0.7 128 188.1
    18:10:51.691 INFO ProgressMeter - Traversal complete. Processed 128 total regions in 0.7 minutes.
    18:10:51.727 INFO VectorLoglessPairHMM - Time spent in setup for JNI call : 0.101591923
    18:10:51.727 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 11.953453107000001
    18:10:51.728 INFO SmithWatermanAligner - Total compute time in java Smith-Waterman : 7.25 sec
    18:10:52.493 INFO HaplotypeCaller - Shutting down engine
    [January 17, 2022 6:10:52 PM CST] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.73 minutes.
    Runtime.totalMemory()=5597822976

    2.2 downsample_bam_result:
    [user bamfiles]$ cat *downsample*.gvcf |grep 223597
    chr16 223597 . T C,<NON_REF> 8394.64 . BaseQRankSum=3.569;DP=957;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=-2.090;NDA=2;RAW_MQandDP=2497200,957;ReadPosRankSum=1.447 GT:AD:DP:GQ:PL:SB 0/1:229,249,0:478:99:8402,0,7364,9090,8113,17203:226,3,249,0

     

    Thanks a lot,

    Amy

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Thanks Amy. I'm confused about your raw data result:

    chr16 223597 . T C,<NON_REF> 0 . BaseQRankSum=-3.195;DP=1552;ExcessHet=3.0103;MLEAC=0,0;MLEAF=0.00,0.00;MQRankSum=-3.969;NDA=2;RAW_MQandDP=4587200,1552;ReadPosRankSum=-5.655 GT:AD:DP:GQ:PL:SB 0/0:1530,15,0:1545:99:0,4160,47768,4570,47812,48222:488,1042,12,3

    In the image you shared earlier, there were only 1086 reads in the raw data bam. 

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk