Difference between Germline short variant calling pipeline of lpWGS and WGS
REQUIRED for all errors and issues:
a) GATK version used:
GATK version 4.3.0
b) Exact command used:
gatk HaplotypeCaller
c) Entire program log:
None
My question is that, for low pass WGS (1~10x) and WGS (30~40x), is there any difference between their pipeline of doing haplotype calling and joint sample calling?
If so, what points should we take care of? If not much difference, can I apply the germline short variant calling pipeline here?(Germline short variant discovery (SNPs + Indels) – GATK (broadinstitute.org)
Many thanks!!!
-
Hi Yifei Dai,
I'd recommend the same pipeline for low pass WGS. That's what we've been running since the low pass 1000 Genomes days. The main caveat is that you'll want to NOT run ReblockGVCF or GnarlyGenotyper, just the classic GenotypeGVCFs joint calling. The other caveat is that you should avoid mixing low coverage and high coverage WGS or VQSR will tend to be biased against the variants only occurring in low coverage samples.
-
Hi Laura Gauthier,
Thank you so much for the reply. But after I tried to run the recalibrated bam file with gatk HaplotypeCaller command, the following warning shows up:
Using GATK jar /opt/miniconda3/share/gatk4-4.0.5.1-0/gatk-package-4.0.5.1-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx60g -jar /opt/miniconda3/share/gatk4-4.0.5.1-0/gatk-package-4.0.5.1-local.jar HaplotypeCaller -R /minicubio/yifei/germline/GRCh38.chr.fa -I /minicubio/yifei/germline/Germline-short-variant-calling/WGS/result/pre_data_WGS/90-470151026-FR01_1600729045S01/90-470151026-FR01_1600729045S01_recalibrated.bam --emit-ref-confidence GVCF --active-probability-threshold 0.0015 --assembly-region-padding 100 -G StandardAnnotation -G AS_StandardAnnotation -G StandardHCAnnotation -OBI true -O /minicubio/yifei/germline/Germline-short-variant-calling/WGS/result/HC_call_WGS/90-470151026-FR01_1600729045S01-90-470151026-FR01_1600729045S01/90-470151026-FR01/90-470151026-FR01_chr1_HC_calls.g.vcf.gz -L chr1
21:41:42.860 WARN GATKAnnotationPluginDescriptor - Redundant enabled annotation group (StandardAnnotation) is enabled for this tool by default
21:41:42.861 WARN GATKAnnotationPluginDescriptor - Redundant enabled annotation group (StandardHCAnnotation) is enabled for this tool by default
21:41:42.907 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/miniconda3/share/gatk4-4.0.5.1-0/gatk-package-4.0.5.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
21:41:43.132 INFO HaplotypeCaller - ------------------------------------------------------------
21:41:43.133 INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.0.5.1
21:41:43.133 INFO HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
21:41:43.133 INFO HaplotypeCaller - Executing as daiyifei@cubio on Linux v5.4.0-137-generic amd64
21:41:43.133 INFO HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_152-release-1056-b12
21:41:43.133 INFO HaplotypeCaller - Start Date/Time: June 5, 2023 9:41:42 PM EDT
21:41:43.133 INFO HaplotypeCaller - ------------------------------------------------------------
21:41:43.133 INFO HaplotypeCaller - ------------------------------------------------------------
21:41:43.133 INFO HaplotypeCaller - HTSJDK Version: 2.15.1
21:41:43.133 INFO HaplotypeCaller - Picard Version: 2.18.2
21:41:43.134 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
21:41:43.134 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
21:41:43.134 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
21:41:43.134 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
21:41:43.134 INFO HaplotypeCaller - Deflater: IntelDeflater
21:41:43.134 INFO HaplotypeCaller - Inflater: IntelInflater
21:41:43.134 INFO HaplotypeCaller - GCS max retries/reopens: 20
21:41:43.134 INFO HaplotypeCaller - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
21:41:43.134 INFO HaplotypeCaller - Initializing engine
21:41:43.489 INFO IntervalArgumentCollection - Processing 248956422 bp from intervals
21:41:43.494 INFO HaplotypeCaller - Done initializing engine
21:41:43.503 INFO HaplotypeCallerEngine - Tool is in reference confidence mode and the annotation, the following changes will be made to any specified annotations: 'StrandBiasBySample' will be enabled. 'ChromosomeCounts', 'FisherStrand', 'StrandOddsRatio' and 'QualByDepth' annotations have been disabled
21:41:43.506 INFO HaplotypeCallerEngine - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output
21:41:43.506 INFO HaplotypeCallerEngine - All sites annotated with PLs forced to true for reference-model confidence output
21:41:43.515 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/opt/miniconda3/share/gatk4-4.0.5.1-0/gatk-package-4.0.5.1-local.jar!/com/intel/gkl/native/libgkl_utils.so
21:41:43.570 INFO NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/opt/miniconda3/share/gatk4-4.0.5.1-0/gatk-package-4.0.5.1-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
21:41:43.609 INFO IntelPairHmm - Using CPU-supported AVX-512 instructions
21:41:43.609 WARN IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
21:41:43.609 INFO IntelPairHmm - Available threads: 80
21:41:43.610 INFO IntelPairHmm - Requested threads: 4
21:41:43.610 INFO PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
21:41:43.911 INFO ProgressMeter - Starting traversal
21:41:43.912 INFO ProgressMeter - Current Locus Elapsed Minutes Regions Processed Regions/Minute
21:41:53.913 INFO ProgressMeter - chr1:2176630 0.2 8140 48835.1
21:41:55.961 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
21:41:55.961 WARN StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
21:41:55.962 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
21:41:55.962 WARN StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
21:41:55.962 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
21:41:55.962 WARN StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
21:42:03.921 INFO ProgressMeter - chr1:4526787 0.3 17290 51846.7
21:42:04.220 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
21:42:04.223 WARN StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
21:42:04.224 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
21:42:04.224 WARN StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
21:42:13.921 INFO ProgressMeter - chr1:6874028 0.5 26440 52864.1
21:42:15.976 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
21:42:15.977 WARN StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
21:42:16.711 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
21:42:16.712 WARN StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
21:42:17.394 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
21:42:17.394 WARN StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
21:42:17.395 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
21:42:17.395 WARN StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
21:42:17.395 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
21:42:17.395 WARN StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
21:42:19.746 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
21:42:19.746 WARN StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
21:42:19.980 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
21:42:19.980 WARN StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
Is this because of the low coverage issue for lpWGS? If so how to address it?
Here is the command I used to run gatk HaplotypeCaller
gatk --java-options '-Xmx60g' HaplotypeCaller \
-R "${REFERENCE}" \
-I "${BAM}" \
--emit-ref-confidence GVCF \
-O "${OUTPUT_GVCF_DIR}/${SAMPLE_NAME}_${CHROM}_HC_calls.g.vcf.gz" \
-L "${CHROM}"
Many thanks!! -
Hi Yifei Dai,
I think those warnings are because you have complete coverage drop out at some positions, which is to be expected for low coverage WGS data. GATK can't calculate the annotations for AD and SB because there are no reads to count. You could run DepthOfCoverage to confirm, but I'm honestly not that familiar with that particular tool. I would probably open the bam in IGV and navigate to the first position reported in the log and look around. To be clear, this isn't a problem, it's just a warning that at some positions there are no reads.
-
Hi Laura Gauthier ,
Thanks for your reply! But I'm a bit confused that if it is no coverage and if it is done on a single sample, how do we know there is a genetic variant in the first place? As I do the haplotype calling for each lpWGS sample and parallel for different chromosomes. Could you please explain more?
Many thanks!
Best regards,
Yifei -
Hi Yifei Dai,
This is admittedly a problem, and one that's rather difficult to address when the rate of heterozygosity is about 1/1000 bases (as in humans). GenotypeGVCFs has an option to output all sites (with `-all-sites`), but that makes for some very large output files. If you're only concerned about samples with no coverage where other samples in your cohort have variants, then this is what the GVCF format from HaplotypeCaller's reference confidence mode (the default in the single-sample pipeline) is meant to address: https://gatk.broadinstitute.org/hc/en-us/articles/360035531812-GVCF-Genomic-Variant-Call-Format
-Laura
-
Hi Laura Gauthier,
Thank you so much for the clarification! I have another issue which also related to low-coverage data processing. Could you please elaborate on how the imputation step is done by using GATK? Is it finished while doing joint sample calling? Or should this step be done by using other tools? I looked it up on the forum but I didn't find any clue. So could you please tell me more about this or provide me some link that might be helpful? Thanks!
Best regards,
-Yifei -
Hi Yifei Dai,
We like the U Michigan imputation server. There's a Terra featured workspace here (https://app.terra.bio/#workspaces/warp-pipelines/Imputation) with some documentation and the corresponding WDL workflow. I believe you do have to upload your VCFs, so if this is clinical data where that's a concern, then this might not work for you.
-Laura
-
Hi Laura Gauthier,
Thanks for providing the information!
Best,
Yifei
Please sign in to leave a comment.
8 comments