Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

SNP VCF file and bamout not the same

Answered
0

11 comments

  • Avatar
    Pamela Bretscher

    Hi Wiebe Kooistra,

    Here are a few troubleshooting resources and explanations for when HaplotypeCaller misses an expected variant:

    https://gatk.broadinstitute.org/hc/en-us/articles/360035891111-Expected-variant-at-a-specific-site-was-not-called 

    https://gatk.broadinstitute.org/hc/en-us/articles/360043491652-When-HaplotypeCaller-and-Mutect2-do-not-call-an-expected-variant 

    Please let me know if this does not answer your question.

    Kind regards,

    Pamela

    0
    Comment actions Permalink
  • Avatar
    Wiebe Kooistra

    Hi Pamela Bretscher,

    Thank you for you answer.

    I looked at all suggestions in both links. I found that if I turn off the downsampling, the GQ value becomes 0: Ls_98839 100 . A <NON_REF> . . END=100 GT:DP:GQ:MIN_DP:PL 0/0:2808:0:2808:0,0,0

    The SNP does not show up in the bamout file when I run this command: 

    gatk HaplotypeCaller -R ref.fasta -ERC GVCF --dont-use-soft-clipped-bases true --do-not-run-physical-phasing true -bamout sample.bruijn.bam --linked-de-bruijn-graph true --max-assembly-region-size 1000 --max-reads-per-alignment-start 0 -I sample.sorted.bam -O sample.bruijn.gvcf

    Is there a way to get gatk to call this SNP, because when we look at the BWA bam file, we do see a SNP in this location.

    With kind regards,

    Wiebe

    0
    Comment actions Permalink
  • Avatar
    Pamela Bretscher

    Hi Wiebe Kooistra,

    Here is a link to a forum discussion about a similar problem that may provide some useful information: https://gatk.broadinstitute.org/hc/en-us/community/posts/360077647812-Why-do-a-clear-expected-variant-not-show-up-in-the-Mutect2-vcf-file 

    Have you tried the argument --pruning-lod-threshold to lower the LOD threshold score to see if the variant can be called?

    Kind regards,

    Pamela

     

    0
    Comment actions Permalink
  • Avatar
    Wiebe Kooistra

    Hey Pamela Bretscher,

    I am using HaplotypeCaller and not Mutect2, but I did change --pruning-lod-threshold to the same levels as the gentlemen did (1.3, 0.5 and 0.1).

    The genotype and the GQ score stayed the same (homozygous for ref) as without the lod threshold argument.

    Are there other options I can try to change?

    Regards,

    Wiebe

    0
    Comment actions Permalink
  • Avatar
    Pamela Bretscher

    Hi Wiebe Kooistra,

    I've been looking more at the commands you used and your BAM files to try to determine the problem. Could you provide the full GVCF and VCF lines as well as the full stack trace? It looks like you are losing a lot of reads and I'd like to look further into why the SNP is not being called but I need more information.

    Kind regards,

    Pamela

    0
    Comment actions Permalink
  • Avatar
    Wiebe Kooistra

    Hey Pamela,

    For the GVCF file do you want to see multiple lines or just of the SNP location that I am expecting?

    For the full stack trace? Do you mean the output HaplotypeCaller is giving when running? and if so do you want debug on?

    Regards,

    Wiebe

    0
    Comment actions Permalink
  • Avatar
    Pamela Bretscher

    Hey Wiebe Kooistra,

    Yes, the output while Haplotype Caller is running, and please include the full GVCF lines. You can include the stack trace without debug on.

    Thanks,

    Pamela 

    0
    Comment actions Permalink
  • Avatar
    Wiebe Kooistra

    Hey Pamela Bretscher

    The full GVCF line is:

    Ls_98839 100 . A <NON_REF> . . END=100 GT:DP:GQ:MIN_DP:PL 0/0:2007:0:2007:0,0,0

    The stack trace is:

    15:11:08.430 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/data/ekkaia/home_btr/wiebek/miniconda3/envs/nextflow_env/share/gatk4-4.2.0.0-1/gatk-package-4.2.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    Jul 01, 2021 3:11:08 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
    INFO: Failed to detect whether we are running on Google Compute Engine.
    15:11:08.968 INFO HaplotypeCaller - ------------------------------------------------------------
    15:11:08.969 INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.2.0.0
    15:11:08.969 INFO HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
    15:11:08.969 INFO HaplotypeCaller - Executing as wiebek@saruman on Linux v4.19.2-arch1-1-ARCH amd64
    15:11:08.969 INFO HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_152-release-1056-b12
    15:11:08.969 INFO HaplotypeCaller - Start Date/Time: July 1, 2021 3:11:08 PM CEST
    15:11:08.969 INFO HaplotypeCaller - ------------------------------------------------------------
    15:11:08.969 INFO HaplotypeCaller - ------------------------------------------------------------
    15:11:08.970 INFO HaplotypeCaller - HTSJDK Version: 2.24.0
    15:11:08.970 INFO HaplotypeCaller - Picard Version: 2.25.0
    15:11:08.970 INFO HaplotypeCaller - Built for Spark Version: 2.4.5
    15:11:08.970 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    15:11:08.970 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    15:11:08.970 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    15:11:08.970 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    15:11:08.970 INFO HaplotypeCaller - Deflater: IntelDeflater
    15:11:08.970 INFO HaplotypeCaller - Inflater: IntelInflater
    15:11:08.970 INFO HaplotypeCaller - GCS max retries/reopens: 20
    15:11:08.970 INFO HaplotypeCaller - Requester pays: disabled
    15:11:08.970 INFO HaplotypeCaller - Initializing engine
    15:11:09.740 INFO HaplotypeCaller - Done initializing engine
    15:11:09.743 INFO HaplotypeCallerEngine - Tool is in reference confidence mode and the annotation, the following changes will be made to any specified annotations: 'StrandBiasBySample' will be enabled. 'ChromosomeCounts', 'FisherStrand', 'StrandOddsRatio' and 'QualByDepth' annotations have been disabled
    15:11:09.810 INFO HaplotypeCallerEngine - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output
    15:11:09.810 INFO HaplotypeCallerEngine - All sites annotated with PLs forced to true for reference-model confidence output
    15:11:09.884 ERROR ReadThreadingAssembler - JunctionTreeLinkedDeBruijnGraph is enabled.
    This is an experimental assembly graph mode that has not been fully validated


    15:11:09.886 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/data/ekkaia/home_btr/wiebek/miniconda3/envs/nextflow_env/share/gatk4-4.2.0.0-1/gatk-package-4.2.0.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
    15:11:09.951 INFO NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/data/ekkaia/home_btr/wiebek/miniconda3/envs/nextflow_env/share/gatk4-4.2.0.0-1/gatk-package-4.2.0.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
    15:11:09.991 INFO IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
    15:11:09.992 INFO IntelPairHmm - Available threads: 16
    15:11:09.992 INFO IntelPairHmm - Requested threads: 4
    15:11:09.992 INFO PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
    15:11:09.995 WARN GATKVariantContextUtils - Can't determine output variant file format from output file extension "gvcf". Defaulting to VCF.
    15:11:10.085 INFO ProgressMeter - Starting traversal
    15:11:10.085 INFO ProgressMeter - Current Locus Elapsed Minutes Regions Processed Regions/Minute
    15:11:18.965 WARN InbreedingCoeff - InbreedingCoeff will not be calculated at position V8_LG_1_123270968:36 and possibly subsequent; at least 10 samples must have called genotypes
    15:11:23.869 INFO ProgressMeter - V8_LG_2_1458139:1 0.2 30 130.6
    15:11:35.654 INFO ProgressMeter - V8_LG3_200695773:190 0.4 60 140.8
    15:11:45.779 INFO ProgressMeter - Ls_104977:174 0.6 110 184.9
    15:11:56.410 INFO ProgressMeter - Ls_114008:1 0.8 150 194.3
    15:12:07.537 INFO ProgressMeter - Ls_115962:113 1.0 180 188.0
    15:12:17.682 INFO ProgressMeter - Ls_119232:161 1.1 210 186.4
    15:12:28.019 INFO ProgressMeter - Ls_128870:1 1.3 260 200.2
    15:12:40.276 INFO ProgressMeter - Ls_134906:47 1.5 310 206.2
    15:12:52.160 INFO ProgressMeter - Ls_142754:65 1.7 350 205.7
    15:13:03.947 INFO ProgressMeter - Ls_150531:1 1.9 390 205.5
    15:13:15.790 INFO ProgressMeter - Ls_158076:56 2.1 440 210.0
    15:13:26.200 INFO ProgressMeter - Ls_166264:1 2.3 490 216.0
    15:13:39.629 INFO ProgressMeter - Ls_175078:174 2.5 540 216.7
    15:13:54.097 INFO ProgressMeter - Ls_180729:174 2.7 600 219.5
    15:14:04.444 INFO ProgressMeter - Ls_195362:99 2.9 650 223.7
    15:14:16.149 INFO ProgressMeter - Ls_199100:99 3.1 700 225.7
    15:14:27.079 INFO ProgressMeter - Ls_201638:174 3.3 740 225.4
    15:14:37.105 INFO ProgressMeter - Ls_202133:1 3.5 770 223.2
    15:14:47.479 INFO ProgressMeter - Ls_204325:174 3.6 810 223.6
    15:15:01.000 INFO ProgressMeter - Ls_208908:174 3.8 860 223.5
    15:15:11.156 INFO ProgressMeter - Ls_210531:1 4.0 890 221.5
    15:15:23.610 INFO ProgressMeter - Ls_214227:1 4.2 940 222.5
    15:15:35.344 INFO ProgressMeter - Ls_215154:1 4.4 970 219.4
    15:15:48.212 INFO ProgressMeter - Ls_217534:174 4.6 1010 217.9
    15:16:00.037 INFO ProgressMeter - Ls_220635:1 4.8 1050 217.3
    15:16:11.297 INFO ProgressMeter - Ls_224099:174 5.0 1100 219.1
    15:16:22.674 INFO ProgressMeter - Ls_234507:1 5.2 1170 224.6
    15:16:33.752 INFO ProgressMeter - Ls_238002:1 5.4 1210 224.3
    15:16:44.709 INFO ProgressMeter - Ls_240850:174 5.6 1250 224.1
    15:16:56.408 INFO ProgressMeter - Ls_244015:1 5.8 1280 221.8
    15:17:10.056 INFO ProgressMeter - Ls_250232:99 6.0 1350 225.0
    15:17:22.588 INFO ProgressMeter - Ls_255280:1 6.2 1400 225.5
    15:17:33.085 INFO ProgressMeter - Ls_263897:1 6.4 1460 228.7
    15:17:44.366 INFO ProgressMeter - Ls_271056:174 6.6 1520 231.3
    15:17:56.532 INFO ProgressMeter - Ls_271941:99 6.8 1560 230.3
    15:18:07.829 INFO ProgressMeter - Ls_275212:174 7.0 1610 231.2
    15:18:19.096 INFO ProgressMeter - Ls_280233:174 7.2 1650 230.8
    15:18:30.177 INFO ProgressMeter - Ls_286821:174 7.3 1690 230.4
    15:18:43.528 INFO ProgressMeter - Ls_290637:99 7.6 1730 228.9
    15:18:57.886 INFO ProgressMeter - Ls_293905:174 7.8 1780 228.3
    15:19:10.878 INFO ProgressMeter - Ls_301783:31 8.0 1830 228.4
    15:19:23.928 INFO ProgressMeter - Ls_304381:46 8.2 1870 227.2
    15:19:34.027 INFO ProgressMeter - Ls_311649:1 8.4 1910 227.4
    15:19:46.126 INFO ProgressMeter - Ls_314735:1 8.6 1950 226.7
    15:19:57.148 INFO ProgressMeter - Ls_317001:99 8.8 2000 227.7
    15:20:07.651 INFO ProgressMeter - Ls_318628:99 9.0 2040 227.7
    15:20:21.404 INFO ProgressMeter - Ls_323623:88 9.2 2100 228.5
    15:20:32.341 INFO ProgressMeter - Ls_328464:1 9.4 2150 229.4
    15:20:42.576 INFO ProgressMeter - Ls_330686:174 9.5 2190 229.5
    15:20:52.664 INFO ProgressMeter - Ls_33461:198 9.7 2230 229.7
    15:21:03.059 INFO ProgressMeter - Ls_339215:174 9.9 2280 230.7
    15:21:15.055 INFO ProgressMeter - Ls_344697:1 10.1 2330 231.1
    15:21:25.616 INFO ProgressMeter - Ls_347626:1 10.3 2360 230.0
    15:21:38.798 INFO ProgressMeter - Ls_355434:1 10.5 2430 231.9
    15:21:49.222 WARN DepthPerSampleHC - Annotation will not be calculated at position Ls_357561:69 and possibly subsequent; genotype for sample sample is not called
    15:21:49.223 WARN StrandBiasBySample - Annotation will not be calculated at position Ls_357561:69 and possibly subsequent; genotype for sample sample is not called
    15:21:49.224 WARN DepthPerSampleHC - Annotation will not be calculated at position Ls_357561:71 and possibly subsequent; genotype for sample sample is not called
    15:21:49.224 WARN StrandBiasBySample - Annotation will not be calculated at position Ls_357561:71 and possibly subsequent; genotype for sample sample is not called
    15:21:49.355 INFO ProgressMeter - Ls_35830:1 10.7 2460 230.9
    15:22:01.132 INFO ProgressMeter - Ls_363311:99 10.9 2500 230.4
    15:22:12.934 INFO ProgressMeter - Ls_370321:1 11.0 2550 230.8
    15:22:23.305 INFO ProgressMeter - Ls_3944:99 11.2 2580 229.9
    15:22:35.256 INFO ProgressMeter - Ls_46803:66 11.4 2620 229.4
    15:22:47.212 INFO ProgressMeter - Ls_55927:22 11.6 2680 230.7
    15:22:57.557 INFO ProgressMeter - Ls_84663:174 11.8 2740 232.4
    15:23:00.439 WARN DepthPerSampleHC - Annotation will not be calculated at position Ls_8858:69 and possibly subsequent; genotype for sample sample is not called
    15:23:00.439 WARN StrandBiasBySample - Annotation will not be calculated at position Ls_8858:69 and possibly subsequent; genotype for sample sample is not called
    15:23:08.744 INFO ProgressMeter - Ls_99271:149 12.0 2780 232.1
    15:23:20.731 INFO ProgressMeter - LSM0001900:24 12.2 2820 231.6
    15:23:31.930 INFO ProgressMeter - LSM0001929:24 12.4 2860 231.3
    15:23:42.506 INFO ProgressMeter - Ls_201047:99 12.5 2890 230.5
    15:23:47.716 INFO HaplotypeCaller - 0 read(s) filtered by: MappingQualityReadFilter
    0 read(s) filtered by: MappingQualityAvailableReadFilter
    0 read(s) filtered by: MappedReadFilter
    0 read(s) filtered by: NotSecondaryAlignmentReadFilter
    0 read(s) filtered by: NotDuplicateReadFilter
    0 read(s) filtered by: PassesVendorQualityCheckReadFilter
    0 read(s) filtered by: NonZeroReferenceLengthAlignmentReadFilter
    0 read(s) filtered by: GoodCigarReadFilter
    0 read(s) filtered by: WellformedReadFilter
    0 total reads filtered
    15:23:47.716 INFO ProgressMeter - LSM0159726:1 12.6 2916 230.9
    15:23:47.716 INFO ProgressMeter - Traversal complete. Processed 2916 total regions in 12.6 minutes.
    15:23:47.734 INFO VectorLoglessPairHMM - Time spent in setup for JNI call : 0.601760336
    15:23:47.734 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 508.015057328
    15:23:47.734 INFO SmithWatermanAligner - Total compute time in java Smith-Waterman : 57.14 sec
    15:23:51.258 INFO HaplotypeCaller - Shutting down engine
    [July 1, 2021 3:23:51 PM CEST] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 12.72 minutes.
    Runtime.totalMemory()=4689231872
    Using GATK jar /data/ekkaia/home_btr/wiebek/miniconda3/envs/nextflow_env/share/gatk4-4.2.0.0-1/gatk-package-4.2.0.0-local.jar
    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /data/ekkaia/home_btr/wiebek/miniconda3/envs/nextflow_env/share/gatk4-4.2.0.0-1/gatk-package-4.2.0.0-local.jar HaplotypeCaller -R ref.fasta -ERC GVCF --dont-use-soft-clipped-bases true --do-not-run-physical-phasing true -bamout sample.bruijn.bam --linked-de-bruijn-graph true -I sample.sorted.bam -O sample.bruijn.gvcf --max-assembly-region-size 1000 --max-reads-per-alignment-start 1000

    I hope this is able to help. If there is info missing please let me know.

    Regards,

    Wiebe

     

    0
    Comment actions Permalink
  • Avatar
    Pamela Bretscher

    Hi Wiebe Kooistra,

    Thank you for providing this information. Which VCF file and bamout image does this correspond to? It looks like the DP is 2007 in your GVCF which doesn't match the VCF you shared. Can you please include the complete VCF line and the commands you used to genotype?

    Thank you,

    Pamela

    0
    Comment actions Permalink
  • Avatar
    Wiebe Kooistra

    Hi Pamela Bretscher,

    I do not have a bamout image for this.

    The command that was run:

    gatk HaplotypeCaller -R ref.fasta -ERC GVCF --dont-use-soft-clipped-bases true --do-not-run-physical-phasing true -bamout sample.bruijn.bam --linked-de-bruijn-graph true -I sample.sorted.bam -O sample.bruijn.gvcf --max-assembly-region-size 1000 --max-reads-per-alignment-start 1000

    The SNP I am interested in does not show up in the bamout file. I think this is because

    Ls_98839 100 . A <NON_REF> . . END=100 GT:DP:GQ:MIN_DP:PL 0/0:2007:0:2007:0,0,0

    The GQ score is 0 and because gatk says there is no SNP, it also does not come up in the VCF file.

    To create the VCF file I run:

    gatk GenotypeGVCFs -R ref.fasta -V sample.bruijn.gvcf -O sample.bruijn.vcf

    This is also why I am confused, because when looking at the BWA bam file, we can see 99% G (while ref is A), but for some reason gatk is unable to see this SNP.

    Regards,

    Wiebe

    0
    Comment actions Permalink
  • Avatar
    Pamela Bretscher

    Hi Wiebe Kooistra,

    Because HaplotypeCaller does local realignment, the results will not always match what is expected. The GQ=0 indicates that the tool detected something happening in this region, but was unable to call it correctly. Therefore, HaplotypeCaller is having trouble identifying what is going on in the region of your expected variant. If you would like to continue to look into this to see if the tool can call your SNP, I would suggest running HaplotypeCaller on the region using all default parameters and providing the full stack trace, gvcf, and bamout image if possible. I apologize that there is not a clear-cut solution or option you can use for calling an SNP that HaplotypeCaller does not recognize.

    Kind regards,

    Pamela

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk