Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data


GenomicsDBImport doesn't work with >2 samples

Answered

6 comments

  • Genevieve Brandt (she/her)

    Hi Tomás Rodríguez,

    Thank you for writing into the GATK forum so that we can help you with this question! 

    The good news is that I think the error in your first command comes down to a simple typo: you have 1176.rm.vcf.gz instead of 1176.rn.vcf.gz. When you fix that, it should work!

    For your question about the intervals, yes, it is possible to specify all the chromosomes. There are multiple ways to do it and they are all explained in this article: Intervals and interval lists. Please let me know if you have any questions about that article and I can help out!
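
    For example, either of these patterns can cover all chromosomes (a rough sketch only; my_database, cohort.sample_map, and the chromosome names are placeholders that depend on your reference and data):

    # List each chromosome explicitly with repeated -L arguments
    gatk GenomicsDBImport \
        --genomicsdb-workspace-path my_database \
        --sample-name-map cohort.sample_map \
        -L chr1 -L chr2 -L chr3

    # Or put one interval per line in a file (.list, .bed, or .interval_list)
    # and pass that file to -L
    gatk GenomicsDBImport \
        --genomicsdb-workspace-path my_database \
        --sample-name-map cohort.sample_map \
        -L wgs_intervals.list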

    Best,

    Genevieve

  • Tomás Rodríguez

    Thank you for your quick reply!

    Embarrassingly, this was the issue. Apologies for that!

    I specified the correct sample names and intervals according to the support page you linked.

    Unfortunately, the job timed out even though I thought 8 hours would be more than enough time to run it. Is there a way to estimate the amount of time/resources I would need? Should I be concerned that combining 8 VCFs takes this long (maybe I haven't configured the batch option correctly)? Thanks again for the help.

     

    gatk --java-options "-Xmx40g -Xms40g" GenomicsDBImport --genomicsdb-workspace-path g1out -L allchr.bed  --reader-threads 15 --sample-name-map g1
    Using GATK jar /share/pkg/conda/2018-05-11/envs/gatk_4.1.8.1/share/gatk4-4.1.8.1-0/gatk-package-4.1.8.1-local.jar
    Running:
        java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx40g -Xms40g -jar /share/pkg/conda/2018-05-11/envs/gatk_4.1.8.1/share/gatk4-4.1.8.1-0/gatk-package-4.1.8.1-local.jar GenomicsDBImport --genomicsdb-workspace-path g1out -L allchr.bed --reader-threads 15 --sample-name-map g1
    21:42:06.834 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/share/pkg/conda/2018-05-11/envs/gatk_4.1.8.1/share/gatk4-4.1.8.1-0/gatk-package-4.1.8.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
    Nov 16, 2022 9:42:06 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
    INFO: Failed to detect whether we are running on Google Compute Engine.
    21:42:06.996 INFO  GenomicsDBImport - ------------------------------------------------------------
    21:42:06.996 INFO  GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.1.8.1
    21:42:06.996 INFO  GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
    21:42:06.996 INFO  GenomicsDBImport - Executing as tr52w@c40b08 on Linux v2.6.32-754.35.1.el6.x86_64 amd64
    21:42:06.996 INFO  GenomicsDBImport - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_77-b03
    21:42:06.996 INFO  GenomicsDBImport - Start Date/Time: November 16, 2022 9:42:06 PM EST
    21:42:06.996 INFO  GenomicsDBImport - ------------------------------------------------------------
    21:42:06.996 INFO  GenomicsDBImport - ------------------------------------------------------------
    21:42:06.997 INFO  GenomicsDBImport - HTSJDK Version: 2.23.0
    21:42:06.997 INFO  GenomicsDBImport - Picard Version: 2.22.8
    21:42:06.997 INFO  GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    21:42:06.997 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    21:42:06.997 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    21:42:06.997 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    21:42:06.997 INFO  GenomicsDBImport - Deflater: IntelDeflater
    21:42:06.997 INFO  GenomicsDBImport - Inflater: IntelInflater
    21:42:06.997 INFO  GenomicsDBImport - GCS max retries/reopens: 20
    21:42:06.997 INFO  GenomicsDBImport - Requester pays: disabled
    21:42:06.997 INFO  GenomicsDBImport - Initializing engine
    21:42:07.289 INFO  FeatureManager - Using codec BEDCodec to read file file:///project/umw/ATAC_output/bam/gatkout/allchr.bed
    21:42:07.294 INFO  IntervalArgumentCollection - Processing 3088269808 bp from intervals
    21:42:07.296 INFO  GenomicsDBImport - Done initializing engine
    21:42:07.519 INFO  GenomicsDBLibLoader - GenomicsDB native library version : 1.3.0-e701905
    21:42:07.523 INFO  GenomicsDBImport - Vid Map JSON file will be written to /project/umw/ATAC_output/bam/gatkout/g1out/vidmap.json
    21:42:07.523 INFO  GenomicsDBImport - Callset Map JSON file will be written to /project/umw/ATAC_output/bam/gatkout/g1out/callset.json
    21:42:07.523 INFO  GenomicsDBImport - Complete VCF Header will be written to /project/umw/ATAC_output/bam/gatkout/g1out/vcfheader.vcf
    21:42:07.524 INFO  GenomicsDBImport - Importing to workspace - /project/umw/ATAC_output/bam/gatkout/g1out
    21:42:07.524 WARN  GenomicsDBImport - GenomicsDBImport cannot use multiple VCF reader threads for initialization when the number of intervals is greater than 1. Falling back to serial VCF reader initialization.
    21:42:07.524 INFO  ProgressMeter - Starting traversal
    21:42:07.524 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Batches Processed   Batches/Minute
    21:42:07.757 INFO  GenomicsDBImport - Importing batch 1 with 8 samples

  • Genevieve Brandt (she/her)

    Great to hear that it's working! And yes, GenomicsDBImport can be a time-consuming step. We put together an article with all of our performance recommendations here: GenomicsDBImport usage and performance guidelines. Let me know if you have any questions about those!
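
    As a rough illustration of the kind of settings discussed there (the numbers are placeholders rather than a recommendation for your data), an import command might look like:

    gatk --java-options "-Xmx20g" GenomicsDBImport \
        --genomicsdb-workspace-path g1out \
        --sample-name-map g1 \
        -L allchr.bed \
        --merge-input-intervals \
        --batch-size 4 \
        --reader-threads 4

    # --batch-size limits how many sample readers are held open at once,
    # --merge-input-intervals can help when the -L file contains many intervals,
    # and leaving some memory outside the Java heap (e.g. -Xmx20g on a 40 GB node)
    # gives the native GenomicsDB library room to work.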

  • Tomás Rodríguez

    Hi Genevieve,

    Thanks again for your replies. Unfortunately, my job could not complete after 36 hours. I'm new to using GATK and am getting the sense that I'm not using the workflow for its intended purpose. Was this designed to process very small intervals on single-replicate VCFs only? The extensive runtime and excessive memory usage seem like a red flag that I'm doing something wrong. I'd like to detect SNVs genome-wide across multiple replicates. 

  • Genevieve Brandt (she/her)

    Was the job paused at 36 hours, or was it still in the middle of the run? You can share your program log and I can take a look to see if there were any issues. If it never started (like the program log you shared), there is a chance you do not have all of the GenomicsDBImport requirements installed correctly. GenomicsDBImport is definitely meant for many VCFs, and we have a lot of parameters to optimize it for your usage. Here are some options I recommend:

    1. If you want to run the import all at once, you can import your intervals separately to create multiple genomicsdb workspaces. After genotyping, these VCFs can be combined with MergeVCFs.
    2. If you want to run the import serially, you can import a few samples at a time using the option --genomicsdb-update-workspace-path.
    3. If you are running on a shared filesystem, I recommend setting the argument --genomicsdb-shared-posixfs-optimizations to true. A rough sketch of these options follows below.
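
    A rough sketch of options 1 and 2 (workspace and file names are placeholders; please double-check the flags against your GATK version):

    # Option 1: one workspace per chromosome, run as separate jobs
    gatk GenomicsDBImport \
        --genomicsdb-workspace-path chr1_db \
        --sample-name-map g1 \
        -L chr1 \
        --genomicsdb-shared-posixfs-optimizations true
    # ...then genotype each workspace (GenotypeGVCFs -V gendb://chr1_db)
    # and combine the per-chromosome VCFs with MergeVcfs.

    # Option 2: add more samples later to an existing workspace
    # (no -L here; the intervals are taken from the workspace itself)
    gatk GenomicsDBImport \
        --genomicsdb-update-workspace-path chr1_db \
        --sample-name-map more_samples.sample_map
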
  • Genevieve Brandt (she/her)

    Hi Tomás,

    We haven't heard from you in a while, so we're going to close out this ticket in our system. If you still require assistance, simply respond to this thread and we'll be happy to pick up where we left off!

    Kind regards,

    Genevieve

