Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

Mutect2 PoN - GenomicsDBImport creates empty DB

Answered
0

15 comments

  • Avatar
    Pamela Bretscher

    Hi Enrico Cocchi,

    It looks like there may be a problem with your VCF files that you are inputting to GenomicsDBImport. Did you take the VCF files directly from Mutect2 or did you modify them at all? Could you please try running ValidateVariants on your files? 

    Kind regards,

    Pamela

    0
    Comment actions Permalink
  • Avatar
    Enrico Cocchi

    Hi Pamela Bretscher, thank you so much for your reply! The VCF files are direct output from Mutect2 command (first command in my original post above), I did not modify them at all.

    I also tried ValidateVariants on my Mutect2 output VCF files, as you suggested, and here is the full command and LOG (I never used it so I'm unable to judge it properly): 

    ### COMMAND ###
    REF='/nfs/seqscratch09/AZ-IPF/reference/hs37d5.fa'
    DBSNP='/nfs/projects/CNV_WGS/mutect2/dependencies_hg37/Homo_sapiens_assembly19.dbsnp138.vcf'
    VCF='/nfs/projects/CNV_WGS/Mutetc2-PON-OUT/Roche-M/fetal0217D.Roche-M.mutect2.vcf'
    gatk ValidateVariants \
         --java-options "-Djava.io.tmpdir=/nfs/projects/CNV_WGS/CHIP-PON-DB/TMP-DIR" \
         -R $REF \
         -V $VCF \
         --dbsnp $DBSNP

    ### OUTPUT ###
    Using GATK jar /home/ec3408/GATK-4.2.0.0/gatk-4.2.0.0/gatk-package-4.2.0.0-local.jar
    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Djava.io.tmpdir=/nfs/projects/CNV_WGS/CHIP-PON-DB/TMP-DIR -jar /home/ec3408/GATK-4.2.0.0/gatk-4.2.0.0/gatk-package-4.2.0.0-local.jar ValidateVariants -R /nfs/seqscratch09/AZ-IPF/reference/hs37d5.fa -V /nfs/projects/CNV_WGS/Mutetc2-PON-OUT/Roche-M/fetal0217D.Roche-M.mutect2.vcf --dbsnp /nfs/projects/CNV_WGS/mutect2/dependencies_hg37/Homo_sapiens_assembly19.dbsnp138.vcf
    09:42:45.493 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/ec3408/GATK-4.2.0.0/gatk-4.2.0.0/gatk-package-4.2.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    Jul 15, 2021 9:42:45 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
    INFO: Failed to detect whether we are running on Google Compute Engine.
    09:42:45.809 INFO ValidateVariants - ------------------------------------------------------------
    09:42:45.809 INFO ValidateVariants - The Genome Analysis Toolkit (GATK) v4.2.0.0
    09:42:45.809 INFO ValidateVariants - For support and documentation go to https://software.broadinstitute.org/gatk/
    09:42:45.810 INFO ValidateVariants - Executing as ec3408@dev2.igm.cumc.columbia.edu on Linux v3.10.0-957.27.2.el7.x86_64 amd64
    09:42:45.810 INFO ValidateVariants - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_222-b10
    09:42:45.810 INFO ValidateVariants - Start Date/Time: July 15, 2021 9:42:45 AM EDT
    09:42:45.810 INFO ValidateVariants - ------------------------------------------------------------
    09:42:45.810 INFO ValidateVariants - ------------------------------------------------------------
    09:42:45.811 INFO ValidateVariants - HTSJDK Version: 2.24.0
    09:42:45.811 INFO ValidateVariants - Picard Version: 2.25.0
    09:42:45.811 INFO ValidateVariants - Built for Spark Version: 2.4.5
    09:42:45.811 INFO ValidateVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    09:42:45.811 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    09:42:45.811 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    09:42:45.811 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    09:42:45.812 INFO ValidateVariants - Deflater: IntelDeflater
    09:42:45.812 INFO ValidateVariants - Inflater: IntelInflater
    09:42:45.812 INFO ValidateVariants - GCS max retries/reopens: 20
    09:42:45.812 INFO ValidateVariants - Requester pays: disabled
    09:42:45.812 INFO ValidateVariants - Initializing engine
    09:42:46.345 INFO FeatureManager - Using codec VCFCodec to read file file:///nfs/projects/CNV_WGS/mutect2/dependencies_hg37/Homo_sapiens_assembly19.dbsnp138.vcf
    09:42:46.602 INFO FeatureManager - Using codec VCFCodec to read file file:///nfs/projects/CNV_WGS/Mutetc2-PON-OUT/Roche-M/fetal0217D.Roche-M.mutect2.vcf
    09:42:46.641 INFO ValidateVariants - Done initializing engine
    09:42:46.642 INFO ProgressMeter - Starting traversal
    09:42:46.642 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
    09:42:58.543 INFO ProgressMeter - 1:112020224 0.2 5000 25218.6
    09:43:11.075 INFO ProgressMeter - 2:31571241 0.4 11000 27012.6
    09:43:21.590 INFO ProgressMeter - 2:176945012 0.6 15000 25753.3
    09:43:31.769 INFO ProgressMeter - 3:49004723 0.8 19000 25262.0
    09:43:43.291 INFO ProgressMeter - 3:196869518 0.9 23000 24361.0
    09:43:53.697 INFO ProgressMeter - 4:122824210 1.1 26000 23264.8
    09:44:06.431 INFO ProgressMeter - 5:107716774 1.3 29000 21807.8
    09:44:18.578 INFO ProgressMeter - 6:87796161 1.5 36000 23494.9
    09:44:29.349 INFO ProgressMeter - 7:44611077 1.7 39000 22783.3
    09:44:40.436 INFO ProgressMeter - 8:17930772 1.9 43000 22672.5
    09:44:51.111 INFO ProgressMeter - 9:18889544 2.1 46000 22174.2
    09:45:01.820 INFO ProgressMeter - 10:28945223 2.3 50000 22193.1
    09:45:14.084 INFO ProgressMeter - 11:57582908 2.5 55000 22381.8
    09:45:24.269 INFO ProgressMeter - 12:56726518 2.6 60000 22838.9
    09:45:36.038 INFO ProgressMeter - 14:23450548 2.8 64000 22668.8
    09:45:47.023 INFO ProgressMeter - 15:79058298 3.0 69000 22951.4
    09:45:58.477 INFO ProgressMeter - 17:39341473 3.2 76000 23770.4
    09:46:08.710 INFO ProgressMeter - 19:16339715 3.4 82000 24348.4
    09:46:19.635 INFO ProgressMeter - 21:46047633 3.5 89000 25071.2
    09:46:28.756 INFO ProgressMeter - X:139099962 3.7 93314 25207.2
    09:46:28.756 INFO ProgressMeter - Traversal complete. Processed 93314 total variants in 3.7 minutes.
    09:46:28.756 INFO ValidateVariants - Shutting down engine
    [July 15, 2021 9:46:28 AM EDT] org.broadinstitute.hellbender.tools.walkers.variantutils.ValidateVariants done. Elapsed time: 3.72 minutes.
    Runtime.totalMemory()=5972164608
    0
    Comment actions Permalink
  • Avatar
    Pamela Bretscher

    Hi Enrico Cocchi,

    There are some other members of the GATK team looking into why you are receiving this error. I will update you when they come to a solution.

    Kind regards,

    Pamela

    0
    Comment actions Permalink
  • Avatar
    Enrico Cocchi

    ok Pamela Bretscher, what error exactly are you talking about? The ValidateVariants output or the GenomicsDBImport empty DB?

    Looking forward to receive this answer to be able to proceed with my analysis

    0
    Comment actions Permalink
  • Avatar
    Pamela Bretscher

    Hi Enrico Cocchi,

    Sorry, I should have clarified. We are looking into the reason for the empty DB and output of GenomicsDBImport. The output from ValidateVariants looks good.

    0
    Comment actions Permalink
  • Avatar
    Pamela Bretscher

    Hi Enrico Cocchi,

    I spoke with some members of the GATK team today and they were looking at the "Invalid Character" errors when you run GenomicsDBImport. Could you check the VCF file and let me know what the type attribute for the AF field is in the VCF header? 

    Thanks,

    Pamela

    0
    Comment actions Permalink
  • Avatar
    Enrico Cocchi

    Hi Pamela Bretscher, here you go:

    ##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions of alternate alleles in the tumor">
    0
    Comment actions Permalink
  • Avatar
    Pamela Bretscher

    Hi Enrico Cocchi,

    Thank you for providing this information. After confirming that the AF is a float field and talking with the GATK team, we think this may be some sort of bug with the GenomicsDB tool. I have created a GitHub ticket for the GenomicsDB engineers to look into and hopefully help solve your issue. Here is a link to the ticket: https://github.com/broadinstitute/gatk/issues/7362. I will reach out when I have any updates and please let me know if there is anything else I can do to help.

    Kind regards,

    Pamela

    0
    Comment actions Permalink
  • Avatar
    Enrico Cocchi

    ok, thank you Pamela Bretscher, what can I do to have it working in the meanwhile? Use a different version?

    0
    Comment actions Permalink
  • Avatar
    Nalini Ganapati

    Hello Enrico Cocchi, would it be possible to include a small vcf file highlighting the issue?

    0
    Comment actions Permalink
  • Avatar
    Enrico Cocchi

    Hi Nalini Ganapati. You can see an example VCF here, but I'd like to ask again: this is taking much longer than expected, how can I solve this while you guys investigate the bug? Use a different GATK version etc ?

    0
    Comment actions Permalink
  • Avatar
    Nalini Ganapati

    Using GenomicsDBImport on the test vcf provided threw the following exception

    %:  ./gatk --java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true' GenomicsDBImport -L 1:883600-883650 -V ex001.mutect2-output.vcf --genomicsdb-workspace-path try
    Caused by: htsjdk.tribble.TribbleException: An index is required, but none found., for input source: file://ex001.mutect2-output.vcf

    Got the example vcf to work by first block compressing and indexing or indexing directly -

    %: ~/bcftools/bcftools view -Oz -o sample.vcf.gz ex001.mutect2-output.vcf
    %: ~/bcftools/bcftools index -t sample.vcf.gz
    OR
    %: ./gatk IndexFeatureFile -I ex001.mutect2-output.vcf

     

    And then running GenomicsDBImport/SelectVariants on the resultant sample/workspace:

    %: ./gatk --java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true' GenomicsDBImport -L 1:883600-883650 -V sample.vcf.gz --genomicsdb-workspace-path try
    %: ./gatk SelectVariants -V gendb://try -O test.vcf -R ~/vcfs/reference/hs37d5.fa

    Does this help?

    0
    Comment actions Permalink
  • Avatar
    Pamela Bretscher

    Hi Enrico Cocchi,

    Thank you for your patience while the GATK team has worked to solve your problem with GenomicsDB. I hope the solution posted by Nalini allows you to continue with your analysis and please let me know if you have any additional questions.

    Kind regards,

    Pamela

    0
    Comment actions Permalink
  • Avatar
    Enrico Cocchi

    Hi Pamela Bretscher and Nalini Ganapati, thank you for your suggestion but this is not the issue. GATK Mutect2 automatically creates both a .idx and a .stats file for the generated VCF. I never got an "index-missing" error.

    In fact, if I repeat your passages on the isolated VCF (1. indexing with GATK IndexFeatureFile, 2. GenomicsDBImport and 3. SelectVariants) I keep get an empty file.

    0
    Comment actions Permalink
  • Avatar
    Nalini Ganapati

    Enrico Cocchi, we are not able to reproduce the issue. We do not get the [E::vcf_parse_format]  errors and genomicsdb seems to be correctly generated.

     ./gatk --java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true' GenomicsDBImport -L 1:883600-883650 -V ~/gatk/mutect2_bug/ex001.mutect2-output.vcf --genomicsdb-workspace-path try
    Using GATK jar /nfs/home/nalini/gatk_4.2/gatk-4.2.0.0/gatk-package-4.2.0.0-local.jar
    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -DGATK_STACKTRACE_ON_USER_EXCEPTION=true -jar /nfs/home/nalini/gatk_4.2/gatk-4.2.0.0/gatk-package-4.2.0.0-local.jar GenomicsDBImport -L 1:883600-883650 -V /nfs/home/nalini/gatk/mutect2_bug/ex001.mutect2-output.vcf --genomicsdb-workspace-path try
    08:44:53.865 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/nfs/home/nalini/gatk_4.2/gatk-4.2.0.0/gatk-package-4.2.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    Jul 27, 2021 8:44:54 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
    INFO: Failed to detect whether we are running on Google Compute Engine.
    08:44:54.014 INFO GenomicsDBImport - ------------------------------------------------------------
    08:44:54.014 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.2.0.0
    08:44:54.014 INFO GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
    08:44:54.014 INFO GenomicsDBImport - Executing as nalini on Linux v3.10.0-1160.31.1.el7.x86_64 amd64
    08:44:54.015 INFO GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_292-b10
    08:44:54.015 INFO GenomicsDBImport - Start Date/Time: July 27, 2021 8:44:53 AM PDT
    08:44:54.015 INFO GenomicsDBImport - ------------------------------------------------------------
    08:44:54.015 INFO GenomicsDBImport - ------------------------------------------------------------
    08:44:54.016 INFO GenomicsDBImport - HTSJDK Version: 2.24.0
    08:44:54.016 INFO GenomicsDBImport - Picard Version: 2.25.0
    08:44:54.016 INFO GenomicsDBImport - Built for Spark Version: 2.4.5
    08:44:54.016 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    08:44:54.016 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    08:44:54.016 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    08:44:54.016 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    08:44:54.016 INFO GenomicsDBImport - Deflater: IntelDeflater
    08:44:54.017 INFO GenomicsDBImport - Inflater: IntelInflater
    08:44:54.017 INFO GenomicsDBImport - GCS max retries/reopens: 20
    08:44:54.017 INFO GenomicsDBImport - Requester pays: disabled
    08:44:54.017 INFO GenomicsDBImport - Initializing engine
    08:44:54.374 INFO IntervalArgumentCollection - Processing 51 bp from intervals
    08:44:54.375 INFO GenomicsDBImport - Done initializing engine
    08:44:54.630 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.3.2-e18fa63
    08:44:54.631 INFO GenomicsDBImport - Vid Map JSON file will be written to /nfs/home/nalini/gatk_4.2/gatk-4.2.0.0/try/vidmap.json
    08:44:54.631 INFO GenomicsDBImport - Callset Map JSON file will be written to /nfs/home/nalini/gatk_4.2/gatk-4.2.0.0/try/callset.json
    08:44:54.631 INFO GenomicsDBImport - Complete VCF Header will be written to /nfs/home/nalini/gatk_4.2/gatk-4.2.0.0/try/vcfheader.vcf
    08:44:54.631 INFO GenomicsDBImport - Importing to workspace - /nfs/home/nalini/gatk_4.2/gatk-4.2.0.0/try
    08:44:54.631 INFO ProgressMeter - Starting traversal
    08:44:54.631 INFO ProgressMeter - Current Locus Elapsed Minutes Batches Processed Batches/Minute
    08:44:54.782 INFO GenomicsDBImport - Importing batch 1 with 1 samples
    08:44:54.908 INFO GenomicsDBImport - Done importing batch 1/1
    08:44:54.909 INFO ProgressMeter - unmapped 0.0 1 216.6
    08:44:54.910 INFO ProgressMeter - Traversal complete. Processed 1 total batches in 0.0 minutes.
    08:44:54.910 INFO GenomicsDBImport - Import completed!
    08:44:54.910 INFO GenomicsDBImport - Shutting down engine
    [July 27, 2021 8:44:54 AM PDT] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.02 minutes.
    Runtime.totalMemory()=2311585792
    Tool returned:
    true

    Tail of output vcf from running SelectVariants:

    ##source=GenomicsDBImport
    ##source=SelectVariants
    ##tumor_sample=ex001
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ex001
    1 883625 . A G . . AS_SB_TABLE=0,0|12,41;DP=54 GT:AD:AF:F1R2:F2R1:SB:DP ./.:
    0,53:0.981:0,26:0,26:0,0,12,41:53

    The [E::vcf_parse_format]  errors that you encounter are from htslib, so wondering if you have a libhtslib.so installed and if that is interfering with GenomicsDB. Some quick ways to check -

    • ldconfig -p | grep libhts
    • locate libhts
    • Check $LD_LIBRARY_PATH for libhts.so

    If any of these yield results, you may have to try a fresh, docker image of gatk - https://hub.docker.com/r/broadinstitute/gatk/ instead of a simple zip download of gatk while we continue to look for some solutions.

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk