Java large integer formatting error in GnarlyGenotyper

7 comments

  • Gökalp Çelik

    Hi Ryan Collins,

    The value the tool throws the error on is well outside Java's integer bounds (-2147483648 to 2147483647), so it cannot be parsed as an integer. Unfortunately Java has no unsigned integer type, so this value somehow ended up in your GenomicsDB from a particular variant context. It may be worth extracting chr21 from the GenomicsDB as a GVCF with the SelectVariants tool to see where this value actually comes from; judging by the logs, a subset around base 8220033 should do. Exporting a full GVCF for that many samples may be very problematic, but since the value appears to be in the INFO field, you can extract sites-only data instead. Once you pinpoint the location, you should be able to skip that site, continue genotyping the rest of chr21 normally, and deal with this particular site separately.
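
    Something along these lines should work as a starting point (untested; the interval window around base 8220033 is illustrative, and --sites-only-vcf-output drops the per-sample genotype columns):

    gatk SelectVariants \
      -R Homo_sapiens_assembly38.fasta \
      -V gendb://genomicsdb \
      -L chr21:8210000-8230000 \
      --sites-only-vcf-output \
      -O chr21_subset.sites_only.vcf.gz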

    I hope this helps.

    Regards. 

  • Ryan Collins

    Hi Gökalp Çelik,

    Thanks so much for your reply and detailed suggestions!

    I really like your proposed strategy and want to implement it, but I am having some trouble getting SelectVariants to run on a local copy of the GenomicsDB.

    Any chance you can point out what I'm doing wrong? This is running under GATK 4.3.0.0 on the NIH All of Us Researcher Workbench. I don't have root permissions to install a different version of GATK in this environment and cannot move data outside of it, so I'm hoping the version difference (4.3 here vs. 4.6 in my original issue) isn't a huge problem.

    Here's my invocation of SelectVariants:

    gatk SelectVariants \
      --java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Xmx3000m' \
      -R Homo_sapiens_assembly38.fasta \
      -V gendb://genomicsdb \
      --intervals "chr21:8220030-8220688" \
      -O problem_region.vcf.gz

    And here's the stack trace:

    Using GATK jar /etc/gatk-4.3.0.0/gatk-package-4.3.0.0-local.jar
    Running:
        java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Xmx3000m -jar /etc/gatk-4.3.0.0/gatk-package-4.3.0.0-local.jar SelectVariants -R Homo_sapiens_assembly38.fasta -V gendb://genomicsdb --intervals chr21:8220030-8220688 -O problem_region.vcf.gz
    20:42:33.548 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/etc/gatk-4.3.0.0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    20:42:33.707 INFO  SelectVariants - ------------------------------------------------------------
    20:42:33.708 INFO  SelectVariants - The Genome Analysis Toolkit (GATK) v4.3.0.0
    20:42:33.708 INFO  SelectVariants - For support and documentation go to https://software.broadinstitute.org/gatk/
    20:42:33.708 INFO  SelectVariants - Executing as jupyter@47c18480bb61 on Linux v6.1.112+ amd64
    20:42:33.709 INFO  SelectVariants - Java runtime: OpenJDK 64-Bit Server VM v11.0.20.1+1-post-Ubuntu-0ubuntu120.04
    20:42:33.709 INFO  SelectVariants - Start Date/Time: March 24, 2025 at 8:42:33 PM UTC
    20:42:33.709 INFO  SelectVariants - ------------------------------------------------------------
    20:42:33.709 INFO  SelectVariants - ------------------------------------------------------------
    20:42:33.710 INFO  SelectVariants - HTSJDK Version: 3.0.1
    20:42:33.711 INFO  SelectVariants - Picard Version: 2.27.5
    20:42:33.711 INFO  SelectVariants - Built for Spark Version: 2.4.5
    20:42:33.711 INFO  SelectVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    20:42:33.711 INFO  SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    20:42:33.711 INFO  SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    20:42:33.711 INFO  SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    20:42:33.711 INFO  SelectVariants - Deflater: IntelDeflater
    20:42:33.712 INFO  SelectVariants - Inflater: IntelInflater
    20:42:33.712 INFO  SelectVariants - GCS max retries/reopens: 20
    20:42:33.712 INFO  SelectVariants - Requester pays: disabled
    20:42:33.712 INFO  SelectVariants - Initializing engine
    20:42:34.439 INFO  GenomicsDBLibLoader - GenomicsDB native library version : 1.4.3-6069e4a
    20:42:34.482 INFO  SelectVariants - Shutting down engine
    [March 24, 2025 at 8:42:34 PM UTC] org.broadinstitute.hellbender.tools.walkers.variantutils.SelectVariants done. Elapsed time: 0.02 minutes.
    Runtime.totalMemory()=274726912
    ***********************************************************************

    A USER ERROR has occurred: Couldn't create GenomicsDBFeatureReader

    ***********************************************************************
    org.broadinstitute.hellbender.exceptions.UserException: Couldn't create GenomicsDBFeatureReader
            at org.broadinstitute.hellbender.engine.FeatureDataSource.getGenomicsDBFeatureReader(FeatureDataSource.java:463)
            at org.broadinstitute.hellbender.engine.FeatureDataSource.getFeatureReader(FeatureDataSource.java:365)
            at org.broadinstitute.hellbender.engine.FeatureDataSource.<init>(FeatureDataSource.java:319)
            at org.broadinstitute.hellbender.engine.FeatureDataSource.<init>(FeatureDataSource.java:291)
            at org.broadinstitute.hellbender.engine.VariantWalker.initializeDrivingVariants(VariantWalker.java:58)
            at org.broadinstitute.hellbender.engine.VariantWalkerBase.initializeFeatures(VariantWalkerBase.java:67)
            at org.broadinstitute.hellbender.engine.GATKTool.onStartup(GATKTool.java:726)
            at org.broadinstitute.hellbender.engine.VariantWalker.onStartup(VariantWalker.java:45)
            at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:138)
            at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
            at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
            at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
            at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
            at org.broadinstitute.hellbender.Main.main(Main.java:289)
    Caused by: java.io.IOException: GenomicsDB JNI Error: FileBasedVidMapperException : contig_info_dict.HasMember("tiledb_column_offset") && contig_info_dict["tiledb_column_offset"].IsInt64()
            at org.genomicsdb.reader.GenomicsDBQueryStream.jniGenomicsDBInit(Native Method)
            at org.genomicsdb.reader.GenomicsDBQueryStream.<init>(GenomicsDBQueryStream.java:209)
            at org.genomicsdb.reader.GenomicsDBQueryStream.<init>(GenomicsDBQueryStream.java:182)
            at org.genomicsdb.reader.GenomicsDBQueryStream.<init>(GenomicsDBQueryStream.java:91)
            at org.genomicsdb.reader.GenomicsDBFeatureReader.generateHeadersForQuery(GenomicsDBFeatureReader.java:200)
            at org.genomicsdb.reader.GenomicsDBFeatureReader.<init>(GenomicsDBFeatureReader.java:85)
            at org.broadinstitute.hellbender.engine.FeatureDataSource.getGenomicsDBFeatureReader(FeatureDataSource.java:460)
            ... 13 more

    Is insufficient memory the issue? I can retry this on a beefier VM if you think that might help. Or does this tool need write permissions to the root directory (which I also don't have in this environment)?

    Any input would be greatly appreciated!

    Thanks again,

    Ryan

  • Gökalp Çelik

    Hi again,

    The native library versions are different, so I may need to relay this to the team to see whether there are changes that could break this export. In the meantime, if you can get your hands on 4.6.1.0 somehow, that would be great. I am not familiar with the AoU workbench, so I will try to get an opinion from someone with more hands-on experience in that space.

    Regards. 

  • Laura Gauthier

    Hi Ryan Collins!

    The secondary JNI error you're seeing might be due to the GenomicsDB version mismatch. Can you try running SelectVariants as a workflow with version 4.6.1.0?

    55K samples isn't *that* many... Are these 30X samples? Where did you get those GVCFs? Did you run reblocking as in the best practices? That's a lot of alleles. We definitely recommend running ReblockGVCF on each GVCF to get rid of low-confidence indel alleles and pare down the total alternate alleles, which should speed things up. Our version of the ReblockGVCF workflow is here: https://github.com/broadinstitute/warp/blob/develop/pipelines/broad/dna_seq/germline/joint_genotyping/reblocking/ReblockGVCF.wdl. That may not fix the MAX_INT error, though; mapping quality has been the historic offender there. You could try specifying `-XA RMSMappingQuality` in the Gnarly command, although that would mean you don't have that annotation available for filtering. I know there was a fix for the MQ MAX_INT issue, but I thought it was in a version well before 4.6. I'll do a little digging.
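
    For reference, excluding the annotation would look roughly like this (untested sketch, paths illustrative):

    gatk GnarlyGenotyper \
      -R Homo_sapiens_assembly38.fasta \
      -V gendb://genomicsdb \
      -XA RMSMappingQuality \
      -O chr21.genotyped.vcf.gz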


    Best,

    Laura

  • Laura Gauthier

    OMG it's the QUAL!  Your stack trace leads me to https://github.com/broadinstitute/gatk/blob/ebd8a275bea63b34e55f29e89a0a34aea78cce43/src/main/java/org/broadinstitute/hellbender/tools/walkers/gnarlyGenotyper/GnarlyGenotyperEngine.java#L86. This must be a super-high-depth site for some reason. For SNPs, QUAL is roughly 30 x the number of alt reads, so to overflow MAX_INT (~2.1 billion) at ~30 alt reads per 30X sample you'd need on the order of 2-3 million samples, which is a lot more than 55K obviously. The QUAL can be elevated for indels because that 30 (which comes from the base quality of a good SNP) can be higher, but more than 10X higher would surprise me. So I suspect it's depth. I can't think of a better way around this than excluding the site like Gökalp suggested. It seems like it's tough to figure out exactly where, but that region of the genome looks pretty low complexity in IGV. I'm having trouble finding our tried-and-true low-complexity BED, but you might have your own. Ours comes from Heng Li, according to the following:

    #The file was created using the symmetric DUST algorithm:
    #~hengli/minimap/sdust -t30 hs38DH.fa > sdust30.pre.bed
    #awk '{print $1"\t"($2-1)"\t"$3}' sdust30.pre.bed > sdust30.bed
    #The second command adds 1bp to the 5'-end as in VCF, an insertion is usually placed 1bp ahead of LCR.

    I would just exclude the union of the LCR regions and the sites in the log, maybe padded by 500bp or so; something like the sketch below. Hand-wavy, but it will get you unstuck.
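
    Assuming you've collected the failing positions from the logs into a BED file (failing_sites.bed and the hs38.genome chromosome-sizes file below are illustrative names), the bookkeeping could look like:

    # pad the failing sites by 500bp on each side
    bedtools slop -i failing_sites.bed -g hs38.genome -b 500 > failing_sites.pad500.bed
    # take the union with the LCR BED and merge overlapping intervals
    cat sdust30.bed failing_sites.pad500.bed | sort -k1,1 -k2,2n > combined.bed
    bedtools merge -i combined.bed > exclude.bed
    # hand the merged BED to GnarlyGenotyper as excluded intervals
    gatk GnarlyGenotyper \
      -R Homo_sapiens_assembly38.fasta \
      -V gendb://genomicsdb \
      -L chr21 \
      -XL exclude.bed \
      -O chr21.genotyped.vcf.gz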

    Good luck,

    Laura

  • Ryan Collins

    Amazing, you are my hero Laura Gauthier!!! I nearly Slacked you about this but didn't want to bother you unnecessarily. That said, I am glad this route linked me up to the original source herself after all!

    Re: your original question about the input data, these were genomes we processed as part of a cancer-focused case-control WGS project drawing from All of Us (~37k), Hartwig Medical Foundation (~3.3k), various NCI-related WGS projects (~2.8k), some Dana-Farber patient populations (~2.1k), and controls from TOPMed/other misc dbGaP (~9.4k). We realigned nearly everything using the same pipeline (based roughly on WARP/functional equivalence, with some tweaks), generated GVCFs using the same GATK version, and reblocked everything with the following arguments:

    -drop-low-quals \
    -rgq-threshold 10 \
    -do-qual-approx \
    --floor-blocks -GQB 20 -GQB 30 -GQB 40
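
    (i.e., per sample, something like the following, with file names illustrative:)

    gatk ReblockGVCF \
      -R Homo_sapiens_assembly38.fasta \
      -V sample.g.vcf.gz \
      -drop-low-quals \
      -rgq-threshold 10 \
      -do-qual-approx \
      --floor-blocks -GQB 20 -GQB 30 -GQB 40 \
      -O sample.reblocked.g.vcf.gz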

    Re: the culprit: interesting! I was wondering if it could be some metric like QUAL that gets aggregated across a bunch of samples, leading to an unreasonably large integer.

    Good to know that LCR filtering is a recommended option here; for this specific issue I was able to just excise a 100bp interval around the problematic record, as Gökalp suggested. I didn't encounter this issue on chr19, chr21, or chrY, but if I hit it again on other chromosomes I might just default to the LCR filter strategy genome-wide.

    Thanks again for your help & input! I really appreciate it!

    Hope all's well with you,

    Ryan

  • Laura Gauthier

    So happy to hear you were successful!  We've solved a surprising number of issues over the years by throwing away poorly behaved LCR regions. Always a good tool to have in one's back pocket.

