Abnormal result from Haplotype caller using reduced representation bisulphite sequencing data
Hello all!
I was trying to call variants from reduced representation bisulphite sequencing data (RRBS) (it's from a cattle btw) by using the double masking method mentioned in this paper, Manipulating base quality scores enables variant calling from bisulfite sequencing alignments using conventional bayesian approaches, which the bisulphite sequencing bam file is adapted to run in conventional variant calling software such as GATK. I followed the steps mentioned in the paper to mask the RRBS bam files and ran the haplotype caller and genotypegvcf.
The output is very weird, there are only two types of SNP substitutions (A -> G and T-> C) present in the variant called, and their proportion are approximately equal (see image). Moreover, this does not occur if I am running with cattle whole genome bisulphite sequencing data (The paper used WGBS too).
Does anyone happen to know the reason for this weird result in RRBS?
Please let me know if you want any further information and thank you for reading this message
Regards,
Jing Qi
**The color represent different samples, each representing an RRBS sample sequenced from a cattle.
REQUIRED for all errors and issues:
a) GATK version used: 4.4.0, Java: 18.0.2
b) Exact command used: gatk HaplotypeCaller --input Holstein1_masked_recal.bam --reference GCF_002263795.1_ARS-UCD1.2_with_y_refseq_chrids.fa -output Holstein1.g.vcf.gz -ERC GVCF --standard-min-confidence-threshold-for-calling 1 --min-base-quality-score 1 --intervals 1 --intervals 2 --intervals 3 --intervals 4 --intervals 5 --intervals 6 --intervals 7 --intervals 8 --intervals 9 --intervals 10 --intervals 11 --intervals 12 --intervals 13 --intervals 14 --intervals 15 --intervals 16 --intervals 17 --intervals 18 --intervals 19 --intervals 20 --intervals 21 --intervals 22 --intervals 23 --intervals 24 --intervals 25 --intervals 26 --intervals 27 --intervals 28 --intervals 29 --intervals "X" --intervals "Y" --base-quality-score-threshold 6 --tmp-dir . --java-options "-Xmx104853M"
c) Entire program log:
INFO: Converting SIF file to temporary sandbox...
Using GATK jar /usr/local/share/gatk4-4.4.0.0-0/gatk-package-4.4.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx104853M -jar /usr/local/share/gatk4-4.4.0.0-0/gatk-package-4.4.0.0-local.jar HaplotypeCaller --input Holstein1_masked_recal.bam --reference GCF_002263795.1_ARS-UCD1.2_with_y_refseq_chrids.fa -output Holstein1.g.vcf.gz -ERC GVCF --standard-min-confidence-threshold-for-calling 1 --min-base-quality-score 1 --intervals 1 --intervals 2 --intervals 3 --intervals 4 --intervals 5 --intervals 6 --intervals 7 --intervals 8 --intervals 9 --intervals 10 --intervals 11 --intervals 12 --intervals 13 --intervals 14 --intervals 15 --intervals 16 --intervals 17 --intervals 18 --intervals 19 --intervals 20 --intervals 21 --intervals 22 --intervals 23 --intervals 24 --intervals 25 --intervals 26 --intervals 27 --intervals 28 --intervals 29 --intervals X --intervals Y --base-quality-score-threshold 6 --tmp-dir .
12:02:23.604 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/usr/local/share/gatk4-4.4.0.0-0/gatk-package-4.4.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
12:02:23.639 INFO HaplotypeCaller - ------------------------------------------------------------
12:02:23.641 INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.4.0.0
12:02:23.641 INFO HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
12:02:23.641 INFO HaplotypeCaller - Executing as jchong@node2f11.ecdf.ed.ac.uk on Linux v3.10.0-1160.88.1.el7.x86_64 amd64
12:02:23.642 INFO HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v17.0.3-internal+0-adhoc..src
12:02:23.642 INFO HaplotypeCaller - Start Date/Time: October 23, 2023 at 12:02:23 PM GMT
12:02:23.642 INFO HaplotypeCaller - ------------------------------------------------------------
12:02:23.642 INFO HaplotypeCaller - ------------------------------------------------------------
12:02:23.643 INFO HaplotypeCaller - HTSJDK Version: 3.0.5
12:02:23.643 INFO HaplotypeCaller - Picard Version: 3.0.0
12:02:23.643 INFO HaplotypeCaller - Built for Spark Version: 3.3.1
12:02:23.643 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
12:02:23.643 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
12:02:23.643 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
12:02:23.643 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
12:02:23.643 INFO HaplotypeCaller - Deflater: IntelDeflater
12:02:23.644 INFO HaplotypeCaller - Inflater: IntelInflater
12:02:23.644 INFO HaplotypeCaller - GCS max retries/reopens: 20
12:02:23.644 INFO HaplotypeCaller - Requester pays: disabled
12:02:23.644 INFO HaplotypeCaller - Initializing engine
12:02:23.992 INFO IntervalArgumentCollection - Processing 2671695104 bp from intervals
12:02:24.003 INFO HaplotypeCaller - Done initializing engine
12:02:24.005 INFO HaplotypeCallerEngine - Tool is in reference confidence mode and the annotation, the following changes will be made to any specified annotations: 'StrandBiasBySample' will be enabled. 'ChromosomeCounts', 'FisherStrand', 'StrandOddsRatio' and 'QualByDepth' annotations have been disabled
12:02:24.068 INFO HaplotypeCallerEngine - Standard Emitting and Calling confidence set to -0.0 for reference-model confidence output
12:02:24.068 INFO HaplotypeCallerEngine - All sites annotated with PLs forced to true for reference-model confidence output
12:02:24.079 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/usr/local/share/gatk4-4.4.0.0-0/gatk-package-4.4.0.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
12:02:24.082 INFO NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/usr/local/share/gatk4-4.4.0.0-0/gatk-package-4.4.0.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
12:02:24.108 INFO IntelPairHmm - Using CPU-supported AVX-512 instructions
12:02:24.108 INFO IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
12:02:24.109 INFO IntelPairHmm - Available threads: 12
12:02:24.109 INFO IntelPairHmm - Requested threads: 4
12:02:24.109 INFO PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
12:02:24.169 INFO ProgressMeter - Starting traversal
12:02:24.170 INFO ProgressMeter - Current Locus Elapsed Minutes Regions Processed Regions/Minute
ProgressMeter log, not included here
12:53:32.026 INFO HaplotypeCaller - 9954598 read(s) filtered by: MappingQualityReadFilter
0 read(s) filtered by: MappingQualityAvailableReadFilter
0 read(s) filtered by: MappedReadFilter
0 read(s) filtered by: NotSecondaryAlignmentReadFilter
0 read(s) filtered by: NotDuplicateReadFilter
0 read(s) filtered by: PassesVendorQualityCheckReadFilter
0 read(s) filtered by: NonZeroReferenceLengthAlignmentReadFilter
0 read(s) filtered by: GoodCigarReadFilter
0 read(s) filtered by: WellformedReadFilter
9954598 total reads filtered out of 94267858 reads processed
12:53:32.027 INFO ProgressMeter - Y:43298923 51.1 8952744 175094.5
12:53:32.027 INFO ProgressMeter - Traversal complete. Processed 8952744 total regions in 51.1 minutes.
12:53:32.136 INFO VectorLoglessPairHMM - Time spent in setup for JNI call : 0.10995963
12:53:32.137 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 3.193785182
12:53:32.137 INFO SmithWatermanAligner - Total compute time in java Smith-Waterman : 10.11 sec
12:53:32.138 INFO HaplotypeCaller - Shutting down engine
[October 23, 2023 at 12:53:32 PM GMT] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 51.14 minutes.
Runtime.totalMemory()=3590324224
INFO: Cleaning up image...
-
The article specifically states that variant callers using local reassembly such as HaplotypeCaller will have difficulties in adding zero base quality scores into the assembly therefore forcing the engine to add those bases will increase the false positives and may not produce all reliable results. Variant callers that are recommended for the workflow are mostly alignment based callers like samtools or UnifiedGenotyper or haplotype based callers such as freebayes.
This statement already tells us that HaplotypeCaller with default parameters may not be suitable for WGBS yet for RRBS. GATK has a beta tool MethylationTypeCaller which may be more suitable for this purpose.
Also due to the nature of the regions collected in the RRBS data it may skew the number and types of SNPs present in the output data since WGBS is more complete but RRBS is only a subset of the whole picture.
I will relay this question to our team members to get a more in depth response.
Regards.
-
Thank you so much for your reply, Gökalp Çelik !
This is helpful, I will try MethylationTypeCaller.Not sure if this could be useful but I also ran Platypus (excluding zero base quality score) which used the local reassembly method as well on the RRBS data and the output result has all types of SNPs substitutions.
Once again thank you for your response and I hope to hear from you soon.Regards,
Jing Qi
Please sign in to leave a comment.
2 comments