Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

Difference between Germline short variant calling pipeline of lpWGS and WGS

1

8 comments

  • Avatar
    Laura Gauthier

    Hi Yifei Dai,

    I'd recommend the same pipeline for low pass WGS.  That's what we've been running since the low pass 1000 Genomes days.  The main caveat is that you'll want to NOT run ReblockGVCF or GnarlyGenotyper, just the classic GenotypeGVCFs joint calling. The other caveat is that you should avoid mixing low coverage and high coverage WGS or VQSR will tend to be biased against the variants only occurring in low coverage samples.

    1
    Comment actions Permalink
  • Avatar
    Yifei Dai

    Hi Laura Gauthier,

    Thank you so much for the reply. But after I tried to run the recalibrated bam file with gatk HaplotypeCaller command, the following warning shows up:

    Using GATK jar /opt/miniconda3/share/gatk4-4.0.5.1-0/gatk-package-4.0.5.1-local.jar
    Running:
        java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx60g -jar /opt/miniconda3/share/gatk4-4.0.5.1-0/gatk-package-4.0.5.1-local.jar HaplotypeCaller -R /minicubio/yifei/germline/GRCh38.chr.fa -I /minicubio/yifei/germline/Germline-short-variant-calling/WGS/result/pre_data_WGS/90-470151026-FR01_1600729045S01/90-470151026-FR01_1600729045S01_recalibrated.bam --emit-ref-confidence GVCF --active-probability-threshold 0.0015 --assembly-region-padding 100 -G StandardAnnotation -G AS_StandardAnnotation -G StandardHCAnnotation -OBI true -O /minicubio/yifei/germline/Germline-short-variant-calling/WGS/result/HC_call_WGS/90-470151026-FR01_1600729045S01-90-470151026-FR01_1600729045S01/90-470151026-FR01/90-470151026-FR01_chr1_HC_calls.g.vcf.gz -L chr1
    21:41:42.860 WARN  GATKAnnotationPluginDescriptor - Redundant enabled annotation group (StandardAnnotation) is enabled for this tool by default
    21:41:42.861 WARN  GATKAnnotationPluginDescriptor - Redundant enabled annotation group (StandardHCAnnotation) is enabled for this tool by default
    21:41:42.907 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/miniconda3/share/gatk4-4.0.5.1-0/gatk-package-4.0.5.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
    21:41:43.132 INFO  HaplotypeCaller - ------------------------------------------------------------
    21:41:43.133 INFO  HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.0.5.1
    21:41:43.133 INFO  HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
    21:41:43.133 INFO  HaplotypeCaller - Executing as daiyifei@cubio on Linux v5.4.0-137-generic amd64
    21:41:43.133 INFO  HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_152-release-1056-b12
    21:41:43.133 INFO  HaplotypeCaller - Start Date/Time: June 5, 2023 9:41:42 PM EDT
    21:41:43.133 INFO  HaplotypeCaller - ------------------------------------------------------------
    21:41:43.133 INFO  HaplotypeCaller - ------------------------------------------------------------
    21:41:43.133 INFO  HaplotypeCaller - HTSJDK Version: 2.15.1
    21:41:43.133 INFO  HaplotypeCaller - Picard Version: 2.18.2
    21:41:43.134 INFO  HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    21:41:43.134 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    21:41:43.134 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    21:41:43.134 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    21:41:43.134 INFO  HaplotypeCaller - Deflater: IntelDeflater
    21:41:43.134 INFO  HaplotypeCaller - Inflater: IntelInflater
    21:41:43.134 INFO  HaplotypeCaller - GCS max retries/reopens: 20
    21:41:43.134 INFO  HaplotypeCaller - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
    21:41:43.134 INFO  HaplotypeCaller - Initializing engine
    21:41:43.489 INFO  IntervalArgumentCollection - Processing 248956422 bp from intervals
    21:41:43.494 INFO  HaplotypeCaller - Done initializing engine
    21:41:43.503 INFO  HaplotypeCallerEngine - Tool is in reference confidence mode and the annotation, the following changes will be made to any specified annotations: 'StrandBiasBySample' will be enabled. 'ChromosomeCounts', 'FisherStrand', 'StrandOddsRatio' and 'QualByDepth' annotations have been disabled
    21:41:43.506 INFO  HaplotypeCallerEngine - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output
    21:41:43.506 INFO  HaplotypeCallerEngine - All sites annotated with PLs forced to true for reference-model confidence output
    21:41:43.515 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/opt/miniconda3/share/gatk4-4.0.5.1-0/gatk-package-4.0.5.1-local.jar!/com/intel/gkl/native/libgkl_utils.so
    21:41:43.570 INFO  NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/opt/miniconda3/share/gatk4-4.0.5.1-0/gatk-package-4.0.5.1-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
    21:41:43.609 INFO  IntelPairHmm - Using CPU-supported AVX-512 instructions
    21:41:43.609 WARN  IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
    21:41:43.609 INFO  IntelPairHmm - Available threads: 80
    21:41:43.610 INFO  IntelPairHmm - Requested threads: 4
    21:41:43.610 INFO  PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
    21:41:43.911 INFO  ProgressMeter - Starting traversal
    21:41:43.912 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Regions Processed   Regions/Minute
    21:41:53.913 INFO  ProgressMeter -         chr1:2176630              0.2                  8140          48835.1
    21:41:55.961 WARN  DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
    21:41:55.961 WARN  StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
    21:41:55.962 WARN  DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
    21:41:55.962 WARN  StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
    21:41:55.962 WARN  DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
    21:41:55.962 WARN  StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
    21:42:03.921 INFO  ProgressMeter -         chr1:4526787              0.3                 17290          51846.7
    21:42:04.220 WARN  DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
    21:42:04.223 WARN  StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
    21:42:04.224 WARN  DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
    21:42:04.224 WARN  StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
    21:42:13.921 INFO  ProgressMeter -         chr1:6874028              0.5                 26440          52864.1
    21:42:15.976 WARN  DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
    21:42:15.977 WARN  StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
    21:42:16.711 WARN  DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
    21:42:16.712 WARN  StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
    21:42:17.394 WARN  DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
    21:42:17.394 WARN  StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
    21:42:17.395 WARN  DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
    21:42:17.395 WARN  StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
    21:42:17.395 WARN  DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
    21:42:17.395 WARN  StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
    21:42:19.746 WARN  DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
    21:42:19.746 WARN  StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
    21:42:19.980 WARN  DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
    21:42:19.980 WARN  StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null

    Is this because of the low coverage issue for lpWGS? If so how to address it?
    Here is the command I used to run gatk HaplotypeCaller
        gatk --java-options '-Xmx60g' HaplotypeCaller \
            -R "${REFERENCE}" \
            -I "${BAM}" \
            --emit-ref-confidence GVCF \
            -O "${OUTPUT_GVCF_DIR}/${SAMPLE_NAME}_${CHROM}_HC_calls.g.vcf.gz" \
            -L "${CHROM}"

    Many thanks!!


    0
    Comment actions Permalink
  • Avatar
    Laura Gauthier

    Hi Yifei Dai,

    I think those warnings are because you have complete coverage drop out at some positions, which is to be expected for low coverage WGS data.  GATK can't calculate the annotations for AD and SB because there are no reads to count.  You could run DepthOfCoverage to confirm, but I'm honestly not that familiar with that particular tool.  I would probably open the bam in IGV and navigate to the first position reported in the log and look around.  To be clear, this isn't a problem, it's just a warning that at some positions there are no reads.

    0
    Comment actions Permalink
  • Avatar
    Yifei Dai

    Hi Laura Gauthier 

    Thanks for your reply! But I'm a bit confused that if it is no coverage and if it is done on a single sample, how do we know there is a genetic variant in the first place? As I do the haplotype calling for each lpWGS sample and parallel for different chromosomes. Could you please explain more? 

    Many thanks!

    Best regards,
    Yifei

    0
    Comment actions Permalink
  • Avatar
    Laura Gauthier

    Hi Yifei Dai,

    This is admittedly a problem, and one that's rather difficult to address when the rate of heterozygosity is about 1/1000 bases (as in humans).  GenotypeGVCFs has an option to output all sites (with `-all-sites`), but that makes for some very large output files.  If you're only concerned about samples with no coverage where other samples in your cohort have variants, then this is what the GVCF format from HaplotypeCaller's reference confidence mode (the default in the single-sample pipeline) is meant to address: https://gatk.broadinstitute.org/hc/en-us/articles/360035531812-GVCF-Genomic-Variant-Call-Format

    -Laura

    0
    Comment actions Permalink
  • Avatar
    Yifei Dai

    Hi Laura Gauthier,

    Thank you so much for the clarification! I have another issue which also related to low-coverage data processing. Could you please elaborate on how the imputation step is done by using GATK? Is it finished while doing joint sample calling? Or should this step be done by using other tools? I looked it up on the forum but I didn't find any clue. So could you please tell me more about this or provide me some link that might be helpful? Thanks!

    Best regards,
    -Yifei

    0
    Comment actions Permalink
  • Avatar
    Laura Gauthier

    Hi Yifei Dai,

    We like the U Michigan imputation server.  There's a Terra featured workspace here (https://app.terra.bio/#workspaces/warp-pipelines/Imputation) with some documentation and the corresponding WDL workflow.  I believe you do have to upload your VCFs, so if this is clinical data where that's a concern, then this might not work for you.

    -Laura

    0
    Comment actions Permalink
  • Avatar
    Yifei Dai

    Hi Laura Gauthier,

    Thanks for providing the information!

    Best, 
    Yifei

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk