Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

GenotypeGVCFs generates empty vcf.gz files

0

9 comments

  • Avatar
    Gökalp Çelik

    Hi  E Ra

    How were your bam files created? Which aligner did you use?

    0
    Comment actions Permalink
  • Avatar
    E Ra

    Hi Gokalp,

    I used BWA aligner. My code was:

    parallel -j $(CPU} bwa mem -t 32 "${ref_fasta}" {}_R1.fq.gz {}_R2.fq.gz "|" samtools view -b -S -h ">" {}.orig.bam ::: $(ls -1 *_R1.fq.gz | sed 's/_R1.fq.gz//')

    Thank you,

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi E Ra

    Looks like your imports to GenomicsDB is happening within a second which is highly unlikely to be true. Most certainly there is an issue with your GVCF files or calls. Can you share a couple variants by using 

    bcftools view -H <gvcffile> | head -n 4

    or

    zcat <gvcffile> | grep -v @ | head -n 4

     

    0
    Comment actions Permalink
  • Avatar
    E Ra

    Hi Gökalp,

    Below is the excerpt from a gvcfile

    bcftools view -H line1_BQSR_fromPass2_g.vcf.gz | head -n 4
    Chr1A   1       .       C       <NON_REF>       .       .       END=34  GT:DP:GQ:MIN_DP:PL      0/0:1:3:1:0,3,25
    Chr1A   35      .       C       <NON_REF>       .       .       END=81  GT:DP:GQ:MIN_DP:PL      0/0:2:6:2:0,6,65
    Chr1A   82      .       C       <NON_REF>       .       .       END=104 GT:DP:GQ:MIN_DP:PL      0/0:3:9:3:0,9,98
    Chr1A   105     .       C       <NON_REF>       .       .       END=181 GT:DP:GQ:MIN_DP:PL      0/0:2:6:2:0,6,65
    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    You seem to have all reference intervals. Do you have any variant sites present. Can you provide more lines for that?

    0
    Comment actions Permalink
  • Avatar
    E Ra

    Hi Gökalp,

    Here are two variant sites.

    Chr1A   15167   .       G       <NON_REF>       .       .       END=15224       GT:DP:GQ:MIN_DP:PL      0/0:1:3:1:0,3,32
    Chr1A   15225   .       C       <NON_REF>       .       .       END=15292       GT:DP:GQ:MIN_DP:PL      0/0:2:6:2:0,6,49
    Chr1A   15293   .       C       A,<NON_REF>     37.31   .       DP=2;ExcessHet=0;MLEAC=1,0;MLEAF=0.5,0;RAW_MQandDP=1458,2       GT:AD:DP:GQ:PL:SB       1/1:0,2,0:2:6:49,6,0,49,6,49:0,0,1,1
    Chr1A   15294   .       A       <NON_REF>       .       .       END=15314       GT:DP:GQ:MIN_DP:PL      0/0:2:6:2:0,6,49
    Chr1A   15315   .       T       <NON_REF>       .       .       END=15315       GT:DP:GQ:MIN_DP:PL      0/0:2:3:2:0,3,45
    Chr1A   15316   .       C       <NON_REF>       .       .       END=15372       GT:DP:GQ:MIN_DP:PL      0/0:2:6:2:0,6,64
    Chr1A   15373   .       G       <NON_REF>       .       .       END=15382       GT:DP:GQ:MIN_DP:PL      0/0:2:3:2:0,3,45
    Chr1A   15383   .       G       <NON_REF>       .       .       END=15461       GT:DP:GQ:MIN_DP:PL      0/0:2:6:2:0,6,61
    Chr1A   15462   .       A       <NON_REF>       .       .       END=15530       GT:DP:GQ:MIN_DP:PL      0/0:1:3:1:0,3,12
    Chr1A   15531   .       T       <NON_REF>       .       .       END=15707       GT:DP:GQ:MIN_DP:PL      0/0:0:0:0:0,0,0
    Chr1A   15708   .       C       <NON_REF>       .       .       END=15851       GT:DP:GQ:MIN_DP:PL      0/0:1:3:1:0,3,12
    Chr1A   15852   .       T       <NON_REF>       .       .       END=25438       GT:DP:GQ:MIN_DP:PL      0/0:0:0:0:0,0,0
    Chr1A   25439   .       C       <NON_REF>       .       .       END=25553       GT:DP:GQ:MIN_DP:PL      0/0:1:3:1:0,3,15
    Chr1A   25554   .       A       <NON_REF>       .       .       END=25555       GT:DP:GQ:MIN_DP:PL      0/0:2:6:2:0,6,49
    Chr1A   25556   .       G       <NON_REF>       .       .       END=25557       GT:DP:GQ:MIN_DP:PL      0/0:0:0:0:0,0,0
    Chr1A   25558   .       T       <NON_REF>       .       .       END=25568       GT:DP:GQ:MIN_DP:PL      0/0:2:6:2:0,6,36
    Chr1A   25569   .       G       <NON_REF>       .       .       END=25569       GT:DP:GQ:MIN_DP:PL      0/0:0:0:0:0,0,0
    Chr1A   25570   .       C       <NON_REF>       .       .       END=25579       GT:DP:GQ:MIN_DP:PL      0/0:2:6:2:0,6,36
    Chr1A   25580   .       C       <NON_REF>       .       .       END=25583       GT:DP:GQ:MIN_DP:PL      0/0:2:3:2:0,3,45
    Chr1A   25584   .       C       T,<NON_REF>     15.45   .       DP=1;ExcessHet=0;MLEAC=1,0;MLEAF=0.5,0;RAW_MQandDP=1600,1       GT:AD:DP:GQ:PL:SB       1/1:0,1,0:1:3:25,3,0,25,3,25:0,0,1,0
    Chr1A   25585   .       C       <NON_REF>       .       .       END=25590       GT:DP:GQ:MIN_DP:PL      0/0:2:3:2:0,3,45
    Chr1A   25591   .       G       <NON_REF>       .       .       END=25591       GT:DP:GQ:MIN_DP:PL      0/0:0:0:0:0,0,0
    Chr1A   25592   .       A       <NON_REF>       .       .       END=25596       GT:DP:GQ:MIN_DP:PL      0/0:2:3:1:0,3,31
    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi again.

    Looks like your import process is somewhat incomplete. 

    Can you try importing one set of samples without using any parallel code that may interfere with the gatk operation?

     

    0
    Comment actions Permalink
  • Avatar
    E Ra

    Hi Gökalp,

    I am not sure I understand what you meant by importing samples. I tried the codes below and I still do have no variants in the resulting vcf.gz file. 

    My codes were:

    java -DGATK_STACKTRACE_ON_USER_EXCEPTION=true \
        -jar ${prg}/gatk-package-4.5.0.0-local.jar GenomicsDBImport \
        --genomicsdb-workspace-path ${gDB}_Chr1A \
        --batch-size 32 \
        -L Chr1A \
        --sample-name-map ${map} \
        --tmp-dir ${tmp} \
        --reader-threads 6 \
        2>${log_dir}/Chr1A.gDBImport.log

    java -Xmx16g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true \
           -jar ${prg}/gatk-package-4.5.0.0-local.jar GenotypeGVCFs \
           --reference ${ref_fasta} \
           -V gendb://${gDB}_Chr1A \
           --output ${dir}/allSamples_joint_Chr1A.vcf.gz \
           2>${log_dir}/Chr1A.joint_GenotypeGVCFs.log

     

    The log file is below, it seems not to display any error message.

    11:24:12.380 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/user1/Programs/gatk-4.5.0.0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    11:24:12.536 INFO  GenotypeGVCFs - ------------------------------------------------------------
    11:24:12.539 INFO  GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.5.0.0
    11:24:12.539 INFO  GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
    11:24:12.539 INFO  GenotypeGVCFs - Executing as user1@agriserver.ca

     on Linux v5.4.0-150-generic amd64
    11:24:12.539 INFO  GenotypeGVCFs - Java runtime: OpenJDK 64-Bit Server VM v17.0.3-internal+0-adhoc..src
    11:24:12.540 INFO  GenotypeGVCFs - Start Date/Time: April 8, 2024 at 11:24:12 a.m. CST
    11:24:12.540 INFO  GenotypeGVCFs - ------------------------------------------------------------
    11:24:12.540 INFO  GenotypeGVCFs - ------------------------------------------------------------
    11:24:12.541 INFO  GenotypeGVCFs - HTSJDK Version: 4.1.0
    11:24:12.541 INFO  GenotypeGVCFs - Picard Version: 3.1.1
    11:24:12.541 INFO  GenotypeGVCFs - Built for Spark Version: 3.5.0
    11:24:12.541 INFO  GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    11:24:12.542 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    11:24:12.542 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    11:24:12.542 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    11:24:12.542 INFO  GenotypeGVCFs - Deflater: IntelDeflater
    11:24:12.542 INFO  GenotypeGVCFs - Inflater: IntelInflater
    11:24:12.543 INFO  GenotypeGVCFs - GCS max retries/reopens: 20
    11:24:12.543 INFO  GenotypeGVCFs - Requester pays: disabled
    11:24:12.543 INFO  GenotypeGVCFs - Initializing engine
    11:24:12.849 INFO  GenomicsDBLibLoader - GenomicsDB native library version : 1.5.1-84e800e
    11:24:12.919 INFO  NativeGenomicsDB - pid=52042 tid=52043 No valid combination operation found for INFO field InbreedingCoeff  - the field will NOT be part of INFO fields in the generated VCF records
    11:24:12.919 INFO  NativeGenomicsDB - pid=52042 tid=52043 No valid combination operation found for INFO field MLEAC  - the field will NOT be part of INFO fields in the generated VCF records
    11:24:12.919 INFO  NativeGenomicsDB - pid=52042 tid=52043 No valid combination operation found for INFO field MLEAF  - the field will NOT be part of INFO fields in the generated VCF records
    11:24:12.951 INFO  GenotypeGVCFs - Done initializing engine
    11:24:13.000 INFO  ProgressMeter - Starting traversal
    11:24:13.001 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
    GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.0,Cpu time(s),0.0
    11:24:13.204 INFO  ProgressMeter -             unmapped              0.0                     0              0.0
    11:24:13.205 INFO  ProgressMeter - Traversal complete. Processed 0 total variants in 0.0 minutes.
    11:24:13.209 INFO  GenotypeGVCFs - Shutting down engine
    [April 8, 2024 at 11:24:13 a.m. CST] org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs done. Elapsed time: 0.01 minutes.
    Runtime.totalMemory()=1157627904
    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi E Ra

    I meant the GenomicsDBImport step. Do you get any error messages during that step?

    When I looked at your previous logs this is what got my attention.

    13:13:58.222 INFO  GenomicsDBImport - Starting batch input file preload
    13:13:58.378 INFO  GenomicsDBImport - Finished batch preload
    13:13:58.381 INFO  GenomicsDBImport - Importing batch 1 with 32 samples
    13:13:59.096 INFO  GenomicsDBImport - Done importing batch 1/2

    Import of 32 samples takes only about less than a second to complete which is quite impossible considering you have many samples in there. It is possible that your samples need a CSI style VCF index and therefore import function does not work properly. 

    Unfortunately we do not currently support reading CSI style VCF index files. There is a workaround that you may want to use which is using glnexus to jointly genotype your GVCF files. We cannot provide any support for glnexus therefore your mileage may vary. 

    Regards. 

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk