Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data


GenomicsDBImport not importing all the batches


18 comments

  • Genevieve Brandt (she/her)

    Hi Vinod Kumar, is there any chance you ran out of space in the temp directory while this was running? 

  • Vinod Kumar

    Hi Genevieve Brandt (she/her),

    Actually, I am working on a server, so overall space is not a problem. When I ran the script, I had around 2 TB of space remaining. Still, I don't know whether the temp directory is consuming a lot of space. Do you think this much space is sufficient for 850 samples (genome size 250 MB)?

    I also ran the same script twice, and each time it imported only 14 out of 17 batches, meaning it stopped at the same point in both runs.

    Can I solve this issue by importing fewer intervals in parallel (--max-num-intervals-to-import-in-parallel 2 instead of 8) in my script?

    Is it necessary to specify the temp directory option in the script?

    Thanks,
  • Genevieve Brandt (she/her)

    Hi Vinod,

    Thanks for that information. See if decreasing --max-num-intervals-to-import-in-parallel works. Your server may have a limit on the number of files that can be open at once, which could be interfering with GenomicsDBImport.
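
    For reference, here is a quick way to check that limit on a Linux server (a generic sketch; how the limit is raised depends on your system configuration):

    # Soft limit on the number of files the current shell/process may hold open
    ulimit -Sn
    # Hard limit (the ceiling the soft limit can be raised to without root)
    ulimit -Hn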

    Let me know if that works and if not we can look into other options.

    Yes, the temp directory should be specified with the --tmp-dir option in the GATK command line for optimization.
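
    A minimal sketch of how those two options fit into the command (all paths here are placeholders, not your actual files):

    gatk GenomicsDBImport \
        --genomicsdb-workspace-path /path/to/my_database \
        --sample-name-map /path/to/cohort.sample_map \
        --batch-size 50 \
        -L /path/to/intervals.interval_list \
        --max-num-intervals-to-import-in-parallel 2 \
        --tmp-dir /path/to/large_tmp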

    Genevieve

  • Vinod Kumar

    Hi Genevieve Brandt (she/her),

    I repeated my analysis with --max-num-intervals-to-import-in-parallel 2 instead of 8, and it stopped at the same point: only 14 out of 17 batches were imported, just like my previous analyses with 8 parallel intervals. This is really frustrating, as each run takes a lot of time and in the end I do not get the final data store.

    I was also checking the temp space using this:

    df -h /tmp
    Filesystem      Size  Used  Avail  Use%  Mounted on
    tmpfs           126G  3.6M   126G    1%  /tmp

    I don't know what to do. I tried many things and could not find a way to solve the issue.

    How can I see the list of samples that were actually imported into the GenomicsDB workspace, so I can find where it is stopping every time? I looked into callset.json, but it contains all the samples provided in the sample_map file.
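
    (A side note for reference: callset.json is written up front from the sample map, as the log above shows, so it lists every intended sample rather than only the imported ones. Assuming jq is installed and assuming the callsets/sample_name layout that this GenomicsDB version writes, the recorded names can be listed like this; the workspace path is a placeholder:)

    jq -r '.callsets[].sample_name' /path/to/my_database/callset.json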

    I also posted the error log file in my first post; could you please look at the lines at the end of the file? It looks like there are some errors, but I am not understanding them correctly.

    Do you need something from me to see what's going on here?

    Thanks,
  • Genevieve Brandt (she/her)

    Hi Vinod Kumar,

    I am sorry you are frustrated; thank you for providing me with all of this information. I am looking into this on my end as quickly as possible so that we can find a solution.

    I have been using the error log from your first post to try to figure out the problem. What is going on here is that htsjdk's BlockGunzipper is unable to fully inflate one of your compressed files, which is why I initially suspected an issue with the temp directory space.

    htsjdk.samtools.SAMFormatException: Did not inflate expected amount
    at htsjdk.samtools.util.BlockGunzipper.unzipBlock(BlockGunzipper.java:147)

    I am going to reach out to my colleagues to determine if there is a way for us to find which specific file is causing this issue. There could also be an issue where one of your files is malformed. I will get more information and get back to you. Thank you for your patience.

    Genevieve

  • Genevieve Brandt (she/her)

    Hi Vinod Kumar - I heard back from my team. We think this looks like a GKL error. Could you try running your command with --use-jdk-inflater? If that doesn't work, also try --use-jdk-deflater.

    Let me know if this is successful.

    Another note to keep in mind with GenomicsDBImport, which I haven't mentioned yet: make sure you set the Java Xmx/Xms values to no more than 80-90% of the available physical memory, to leave room for the C/C++ libraries. I don't think this is a problem in your case, but I wanted to mention it in case you had not considered it.
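
    A minimal sketch of how the memory and inflater options are passed (the 128 GB machine and 100 GB heap are placeholder numbers, as is the workspace path):

    # On a machine with ~128 GB RAM, cap the Java heap at roughly 80%
    # so the GenomicsDB native C/C++ libraries have memory left over
    gatk --java-options "-Xmx100g -Xms100g" GenomicsDBImport \
        --genomicsdb-workspace-path /path/to/my_database \
        --sample-name-map /path/to/cohort.sample_map \
        --use-jdk-inflater \
        --use-jdk-deflater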

  • Vinod Kumar

    Hi Genevieve Brandt (she/her),

    Thank you very much for the responses. It looks like our server is overloaded now; once I have the results, I'll come back to you.

    Thanks,
  • Vinod Kumar

    Hi Genevieve Brandt (she/her),

    I can't import all the chromosomes in one go, so I divided the entire analysis by chromosome and ran separate scripts for the different chromosomes.

    Now I have almost finished the analysis by using -L with one chromosome at a time. But it was trial and error for me: some chromosomes worked without --use-jdk-inflater or --use-jdk-deflater, some with --use-jdk-inflater, and some with --use-jdk-deflater. Mostly, though, the analysis stopped after 14 or 15 of the 17 batches (50 samples per batch). I also compared the sizes of the chromosomes and could not correlate size with failed versus passed (all batches imported) analyses.
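
    A minimal sketch of this per-chromosome setup (chromosome names and paths are placeholders; the --use-jdk-inflater/--use-jdk-deflater flags were toggled per chromosome as described above):

    # One import per chromosome, each into its own workspace
    for CHR in chr1 chr2 chr3; do
        gatk GenomicsDBImport \
            --genomicsdb-workspace-path /path/to/genomicsDB/allsamples_${CHR} \
            --sample-name-map /path/to/cohort.sample_map \
            --batch-size 50 \
            -L ${CHR} \
            --tmp-dir /path/to/large_tmp
    done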

    This is really a lot of work, and I could never be sure whether a given run was going to work or not.

    These separate analyses show that the failed scripts produce two kinds of errors:

    First error: I couldn't solve this one even after trying all three options. It occurs for just one chromosome.

    Using GATK jar /vol/biotools/share/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar
    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx50g -jar /vol/biotools/share/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar GenomicsDBImport --genomicsdb-workspace-path /prj/pflaphy-robot/genomicsDB/allsamples_chr1 --batch-size 50 -L /prj/pflaphy-robot/Felix_chr1.interval_list --use-jdk-inflater --sample-name-map /prj/pflaphy-robot/genoDB_allplates.sample_map --tmp-dir /prj/pflaphy-robot/genomicsDB/allsamples_temp --genomicsdb-shared-posixfs-optimizations true
    08:56:13.576 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/vol/biotools/share/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    Feb 08, 2021 8:56:13 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
    INFO: Failed to detect whether we are running on Google Compute Engine.
    08:56:13.868 INFO GenomicsDBImport - ------------------------------------------------------------
    08:56:13.869 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.1.9.0
    08:56:13.869 INFO GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
    08:56:13.869 INFO GenomicsDBImport - Executing as vkumar@suc01001 on Linux v5.4.0-47-generic amd64
    08:56:13.869 INFO GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v11.0.8+10-post-Ubuntu-0ubuntu120.04
    08:56:13.869 INFO GenomicsDBImport - Start Date/Time: February 8, 2021 at 8:56:13 AM CET
    08:56:13.870 INFO GenomicsDBImport - ------------------------------------------------------------
    08:56:13.870 INFO GenomicsDBImport - ------------------------------------------------------------
    08:56:13.871 INFO GenomicsDBImport - HTSJDK Version: 2.23.0
    08:56:13.871 INFO GenomicsDBImport - Picard Version: 2.23.3
    08:56:13.871 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    08:56:13.871 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    08:56:13.871 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    08:56:13.871 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    08:56:13.872 INFO GenomicsDBImport - Deflater: IntelDeflater
    08:56:13.872 INFO GenomicsDBImport - Inflater: JdkInflater
    08:56:13.873 INFO GenomicsDBImport - GCS max retries/reopens: 20
    08:56:13.873 INFO GenomicsDBImport - Requester pays: disabled
    08:56:13.873 INFO GenomicsDBImport - Initializing engine
    08:56:14.356 INFO FeatureManager - Using codec IntervalListCodec to read file file:///prj/pflaphy-robot/Felix_chr1.interval_list
    08:56:14.395 INFO IntervalArgumentCollection - Processing 27806075 bp from intervals
    08:56:14.398 INFO GenomicsDBImport - Done initializing engine
    08:56:15.319 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.3.2-e18fa63
    08:56:15.334 INFO GenomicsDBImport - Vid Map JSON file will be written to /prj/pflaphy-robot/genomicsDB/allsamples_chr1/vidmap.json
    08:56:15.334 INFO GenomicsDBImport - Callset Map JSON file will be written to /prj/pflaphy-robot/genomicsDB/allsamples_chr1/callset.json
    08:56:15.335 INFO GenomicsDBImport - Complete VCF Header will be written to /prj/pflaphy-robot/genomicsDB/allsamples_chr1/vcfheader.vcf
    08:56:15.335 INFO GenomicsDBImport - Importing to workspace - /prj/pflaphy-robot/genomicsDB/allsamples_chr1
    08:56:15.335 INFO ProgressMeter - Starting traversal
    08:56:15.336 INFO ProgressMeter - Current Locus Elapsed Minutes Batches Processed Batches/Minute
    08:56:18.769 INFO GenomicsDBImport - Importing batch 1 with 50 samples
    12:27:21.492 INFO ProgressMeter - chr1:1 211.1 1 0.0
    12:27:21.493 INFO GenomicsDBImport - Done importing batch 1/17
    12:27:29.438 INFO GenomicsDBImport - Importing batch 2 with 50 samples
    15:59:54.324 INFO ProgressMeter - chr1:1 423.6 2 0.0
    15:59:54.325 INFO GenomicsDBImport - Done importing batch 2/17
    16:00:16.191 INFO GenomicsDBImport - Importing batch 3 with 50 samples
    03:07:15.928 INFO ProgressMeter - chr1:1 1091.0 3 0.0
    03:07:15.930 INFO GenomicsDBImport - Done importing batch 3/17
    03:07:26.651 INFO GenomicsDBImport - Importing batch 4 with 50 samples
    06:38:30.390 INFO ProgressMeter - chr1:1 1302.3 4 0.0
    06:38:30.391 INFO GenomicsDBImport - Done importing batch 4/17
    06:38:41.336 INFO GenomicsDBImport - Importing batch 5 with 50 samples
    10:16:39.429 INFO ProgressMeter - chr1:1 1520.4 5 0.0
    10:16:39.430 INFO GenomicsDBImport - Done importing batch 5/17
    10:16:47.547 INFO GenomicsDBImport - Importing batch 6 with 50 samples
    13:49:39.219 INFO ProgressMeter - chr1:1 1733.4 6 0.0
    13:49:39.220 INFO GenomicsDBImport - Done importing batch 6/17
    13:49:49.366 INFO GenomicsDBImport - Importing batch 7 with 50 samples
    17:29:35.040 INFO ProgressMeter - chr1:1 1953.3 7 0.0
    17:29:35.041 INFO GenomicsDBImport - Done importing batch 7/17
    17:29:45.156 INFO GenomicsDBImport - Importing batch 8 with 50 samples
    20:59:14.608 INFO ProgressMeter - chr1:1 2163.0 8 0.0
    20:59:14.609 INFO GenomicsDBImport - Done importing batch 8/17
    20:59:24.900 INFO GenomicsDBImport - Importing batch 9 with 50 samples
    00:31:45.349 INFO ProgressMeter - chr1:1 2375.5 9 0.0
    00:31:45.351 INFO GenomicsDBImport - Done importing batch 9/17
    00:31:47.132 INFO GenomicsDBImport - Importing batch 10 with 50 samples
    04:04:47.652 INFO ProgressMeter - chr1:1 2588.5 10 0.0
    04:04:47.653 INFO GenomicsDBImport - Done importing batch 10/17
    04:04:57.703 INFO GenomicsDBImport - Importing batch 11 with 50 samples
    07:32:53.234 INFO ProgressMeter - chr1:1 2796.6 11 0.0
    07:32:53.235 INFO GenomicsDBImport - Done importing batch 11/17
    07:33:03.058 INFO GenomicsDBImport - Importing batch 12 with 50 samples
    11:12:01.061 INFO ProgressMeter - chr1:1 3015.8 12 0.0
    11:12:01.062 INFO GenomicsDBImport - Done importing batch 12/17
    11:12:12.095 INFO GenomicsDBImport - Importing batch 13 with 50 samples
    14:40:58.513 INFO ProgressMeter - chr1:1 3224.7 13 0.0
    14:40:58.514 INFO GenomicsDBImport - Done importing batch 13/17
    14:41:09.444 INFO GenomicsDBImport - Importing batch 14 with 50 samples
    18:26:09.530 INFO ProgressMeter - chr1:1 3449.9 14 0.0
    18:26:09.531 INFO GenomicsDBImport - Done importing batch 14/17
    18:26:20.343 INFO GenomicsDBImport - Importing batch 15 with 50 samples
    22:23:19.723 INFO ProgressMeter - chr1:1 3687.1 15 0.0
    22:23:19.724 INFO GenomicsDBImport - Done importing batch 15/17
    22:23:29.623 INFO GenomicsDBImport - Importing batch 16 with 50 samples
    01:22:19.590 INFO GenomicsDBImport - Shutting down engine
    [February 11, 2021 at 1:22:19 AM CET] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 3,866.10 minutes.
    Runtime.totalMemory()=4873781248
    htsjdk.samtools.util.RuntimeIOException: java.util.zip.DataFormatException: invalid stored block lengths
    at htsjdk.samtools.util.BlockGunzipper.unzipBlock(BlockGunzipper.java:161)
    at htsjdk.samtools.util.BlockGunzipper.unzipBlock(BlockGunzipper.java:96)
    at htsjdk.samtools.util.BlockCompressedInputStream.inflateBlock(BlockCompressedInputStream.java:550)
    at htsjdk.samtools.util.BlockCompressedInputStream.processNextBlock(BlockCompressedInputStream.java:532)
    at htsjdk.samtools.util.BlockCompressedInputStream.nextBlock(BlockCompressedInputStream.java:468)
    at htsjdk.samtools.util.BlockCompressedInputStream.readBlock(BlockCompressedInputStream.java:458)
    at htsjdk.samtools.util.BlockCompressedInputStream.available(BlockCompressedInputStream.java:196)
    at htsjdk.samtools.util.BlockCompressedInputStream.read(BlockCompressedInputStream.java:241)
    at htsjdk.tribble.readers.TabixReader.readLine(TabixReader.java:215)
    at htsjdk.tribble.readers.TabixReader.access$300(TabixReader.java:48)
    at htsjdk.tribble.readers.TabixReader$IteratorImpl.next(TabixReader.java:434)
    at htsjdk.tribble.readers.TabixIteratorLineReader.readLine(TabixIteratorLineReader.java:46)
    at htsjdk.tribble.TabixFeatureReader$FeatureIterator.readNextRecord(TabixFeatureReader.java:170)
    at htsjdk.tribble.TabixFeatureReader$FeatureIterator.next(TabixFeatureReader.java:205)
    at htsjdk.tribble.TabixFeatureReader$FeatureIterator.next(TabixFeatureReader.java:149)
    at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport$1$NoMnpIterator.next(GenomicsDBImport.java:851)
    at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport$1$NoMnpIterator.next(GenomicsDBImport.java:842)
    at org.genomicsdb.importer.GenomicsDBImporterStreamWrapper.next(GenomicsDBImporterStreamWrapper.java:110)
    at org.genomicsdb.importer.GenomicsDBImporter.doSingleImport(GenomicsDBImporter.java:580)
    at org.genomicsdb.importer.GenomicsDBImporter.lambda$null$2(GenomicsDBImporter.java:703)
    at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
    Caused by: java.util.zip.DataFormatException: invalid stored block lengths
    at java.base/java.util.zip.Inflater.inflateBytesBytes(Native Method)
    at java.base/java.util.zip.Inflater.inflate(Inflater.java:385)
    at htsjdk.samtools.util.BlockGunzipper.unzipBlock(BlockGunzipper.java:145)
    ... 23 more

    Second error: I could solve this one by using the inflater and deflater options you suggested.

    ...just the end of the error log:

    01:18:32.301 INFO GenomicsDBImport - Done importing batch 13/17
    01:18:37.772 INFO GenomicsDBImport - Importing batch 14 with 50 samples
    01:24:07.663 INFO GenomicsDBImport - Shutting down engine
    [February 11, 2021 at 1:24:07 AM CET] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 752.13 minutes.
    Runtime.totalMemory()=3271557120
    htsjdk.samtools.SAMFormatException: Did not inflate expected amount
    at htsjdk.samtools.util.BlockGunzipper.unzipBlock(BlockGunzipper.java:147)
    at htsjdk.samtools.util.BlockGunzipper.unzipBlock(BlockGunzipper.java:96)
    at htsjdk.samtools.util.BlockCompressedInputStream.inflateBlock(BlockCompressedInputStream.java:550)
    at htsjdk.samtools.util.BlockCompressedInputStream.processNextBlock(BlockCompressedInputStream.java:532)
    at htsjdk.samtools.util.BlockCompressedInputStream.nextBlock(BlockCompressedInputStream.java:468)
    at htsjdk.samtools.util.BlockCompressedInputStream.readBlock(BlockCompressedInputStream.java:458)
    at htsjdk.samtools.util.BlockCompressedInputStream.available(BlockCompressedInputStream.java:196)
    at htsjdk.samtools.util.BlockCompressedInputStream.read(BlockCompressedInputStream.java:241)
    at htsjdk.tribble.readers.TabixReader.readLine(TabixReader.java:215)
    at htsjdk.tribble.readers.TabixReader.access$300(TabixReader.java:48)
    at htsjdk.tribble.readers.TabixReader$IteratorImpl.next(TabixReader.java:434)
    at htsjdk.tribble.readers.TabixIteratorLineReader.readLine(TabixIteratorLineReader.java:46)
    at htsjdk.tribble.TabixFeatureReader$FeatureIterator.readNextRecord(TabixFeatureReader.java:170)
    at htsjdk.tribble.TabixFeatureReader$FeatureIterator.next(TabixFeatureReader.java:205)
    at htsjdk.tribble.TabixFeatureReader$FeatureIterator.next(TabixFeatureReader.java:149)
    at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport$1$NoMnpIterator.next(GenomicsDBImport.java:851)
    at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport$1$NoMnpIterator.next(GenomicsDBImport.java:842)
    at org.genomicsdb.importer.GenomicsDBImporterStreamWrapper.next(GenomicsDBImporterStreamWrapper.java:110)
    at org.genomicsdb.importer.GenomicsDBImporter.doSingleImport(GenomicsDBImporter.java:580)
    at org.genomicsdb.importer.GenomicsDBImporter.lambda$null$2(GenomicsDBImporter.java:703)
    at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)

    Do you have any idea how we can avoid this problem in the future?

    Thanks,
  • Genevieve Brandt (she/her)

    Thanks Vinod Kumar for your updates. I am so sorry this issue has been causing so much frustration. I will get you more information as soon as possible about 1) why this issue came up so that you can avoid it in the future and 2) how to solve the first error.

    Let me know if you have any other questions I can address.

  • Genevieve Brandt (she/her)

    Hi Vinod Kumar,

    For your first error (the one that is not solved), there is a possibility that the failed chromosome has a problem in its VCFs. Could you run ValidateVariants, just to check adherence to the VCF format, with gatk ValidateVariants -V cohort.vcf.gz? Let us know what you find, and we will have more information about the issue.
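
    A minimal sketch for checking every per-sample GVCF listed in a sample map (the map path is a placeholder; the two-column, tab-separated name/path format is the one GenomicsDBImport reads, and --validation-type-to-exclude ALL restricts the run to basic format checks):

    # Validate the format of each GVCF in the sample map
    while IFS=$'\t' read -r SAMPLE GVCF; do
        echo "Validating ${SAMPLE}"
        gatk ValidateVariants -V "${GVCF}" --validation-type-to-exclude ALL
    done < /path/to/cohort.sample_map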

    For the second error (solved with --use-jdk-inflater or --use-jdk-deflater), this could be a bug in GKL. We don't maintain GKL, but we can reach out to the people who do. Would you be able to submit a bug report with all the files necessary to reproduce the problem for one of the affected chromosomes? Please let me know the file folder name once you have uploaded it.

    Thank you for your patience as we look into this.

    Best,

    Genevieve

  • Jacob Wang

    Hi Genevieve Brandt (she/her),

    I encountered the same error when using HaplotypeCaller. As Pamela Bretscher suggested, I tried those two options, but they didn't seem to work in my case. Do you have any other solutions for this bug?

    Thank you very much in advance!

    Best regards!

    WANG

  • Genevieve Brandt (she/her)

    Hi Jacob Wang,

    I am working with Pamela on your other post, so I don't have any alternative suggestions at this time. Please let us know if you have other questions you want us to take a look at as well.

    Best,

    Genevieve

  • Jacob Wang

    Hi Genevieve Brandt (she/her),

    Thank you all the same. I hope it can be solved in a future version.

  • Jacob Wang

    Hi, Genevieve Brandt (she/her)

    I think Pamela Bretscher and I have solved the problem. I re-ran both the BQSR and HaplotypeCaller steps with the --use-jdk-inflater and --use-jdk-deflater options and got GVCF files of the correct size, with all the chromosomes called. (Re-running only HaplotypeCaller with the options did not work.)

    It seems that with the default Intel inflater/deflater, BQSR can generate a complete-looking BAM file in which some blocks actually have compression errors. When HaplotypeCaller later tries to read those blocks, the program terminates.
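
    A minimal sketch of the re-run (file and reference names are placeholders; --use-jdk-inflater and --use-jdk-deflater are engine-level arguments accepted by every GATK tool):

    # Re-apply BQSR, writing the BAM with the pure-Java deflater so no
    # corrupt compressed blocks are produced
    gatk ApplyBQSR \
        -I sample.bam \
        --bqsr-recal-file sample.recal.table \
        -O sample.recal.bam \
        --use-jdk-deflater

    # Call variants, reading the BAM with the pure-Java inflater
    gatk HaplotypeCaller \
        -R reference.fasta \
        -I sample.recal.bam \
        -O sample.g.vcf.gz \
        -ERC GVCF \
        --use-jdk-inflater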

    I think some other people have also encountered the same problem without finding a solution. Shall we report this bug somewhere?

    Best regards!

    WANG

  • Pamela Bretscher

    I'll also post the issue ticket here for reference, Jacob Wang: https://github.com/broadinstitute/gatk/issues/7582

  • Genevieve Brandt (she/her)

    Thank you for the follow-up and for reporting this bug, Jacob Wang!

  • Jacob Wang

    Hi, Genevieve Brandt (she/her)

    It's my pleasure to work with the GATK team and to be able to contribute. Over the past weekend I ran the downstream steps (from merging GVCFs to annotation) using those GVCFs, and no problems occurred. I think for now I can say the problem has been fixed. I will post my solution on the GitHub issue for others to reference.

  • Genevieve Brandt (she/her)

    Great news! I'm glad you were able to find the solution!

