GetPileupSummaries and Mutect2 fail with symlinked large files (>150 GB)
REQUIRED for all errors and issues:
a) GATK version used:
- 4.5.0.0 (Docker image)
b) Exact command used:
- gatk GetPileupSummaries -I /PATH/TO/SYMLINK/SAMPLE_NAME.bam -V /PATH/external_data/small_exac_common_3.vcf -L /PATH/external_data/small_exac_common_3.vcf -O /PATH/TO/Pileup/SAMPLE_NAME.pileup.table
c) Entire program log:
09:01:24.271 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
09:01:24.417 INFO GetPileupSummaries - ------------------------------------------------------------
09:01:24.421 INFO GetPileupSummaries - The Genome Analysis Toolkit (GATK) v4.5.0.0
09:01:24.421 INFO GetPileupSummaries - For support and documentation go to https://software.broadinstitute.org/gatk/
09:01:24.421 INFO GetPileupSummaries - Executing as cluster@cluster on Linux v4.18.0-372.9.1.el8.x86_64 amd64
09:01:24.421 INFO GetPileupSummaries - Java runtime: OpenJDK 64-Bit Server VM v17.0.9+9-Ubuntu-122.04
09:01:24.422 INFO GetPileupSummaries - Start Date/Time: April 2, 2024 at 9:01:24 AM GMT
09:01:24.422 INFO GetPileupSummaries - ------------------------------------------------------------
09:01:24.422 INFO GetPileupSummaries - ------------------------------------------------------------
09:01:24.422 INFO GetPileupSummaries - HTSJDK Version: 4.1.0
09:01:24.423 INFO GetPileupSummaries - Picard Version: 3.1.1
09:01:24.423 INFO GetPileupSummaries - Built for Spark Version: 3.5.0
09:01:24.423 INFO GetPileupSummaries - HTSJDK Defaults.COMPRESSION_LEVEL : 2
09:01:24.423 INFO GetPileupSummaries - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
09:01:24.423 INFO GetPileupSummaries - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
09:01:24.424 INFO GetPileupSummaries - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
09:01:24.424 INFO GetPileupSummaries - Deflater: IntelDeflater
09:01:24.424 INFO GetPileupSummaries - Inflater: IntelInflater
09:01:24.424 INFO GetPileupSummaries - GCS max retries/reopens: 20
09:01:24.424 INFO GetPileupSummaries - Requester pays: disabled
09:01:24.425 INFO GetPileupSummaries - Initializing engine
09:01:24.599 INFO FeatureManager - Using codec VCFCodec to read file file:///PATH/external_data/small_exac_common_3.vcf
09:01:24.681 INFO FeatureManager - Using codec VCFCodec to read file file:///PATH/external_data/small_exac_common_3.vcf
09:01:25.252 INFO IntervalArgumentCollection - Processing 60040 bp from intervals
09:01:25.291 INFO GetPileupSummaries - Done initializing engine
09:01:25.321 INFO ProgressMeter - Starting traversal
09:01:25.322 INFO ProgressMeter - Current Locus Elapsed Minutes Loci Processed Loci/Minute
09:01:25.375 INFO GetPileupSummaries - Shutting down engine
[April 2, 2024 at 9:01:25 AM GMT] org.broadinstitute.hellbender.tools.walkers.contamination.GetPileupSummaries done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=161480704
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at java.base/sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:1029)
at htsjdk.samtools.MemoryMappedFileBuffer.<init>(MemoryMappedFileBuffer.java:23)
at htsjdk.samtools.AbstractBAMFileIndex.<init>(AbstractBAMFileIndex.java:64)
at htsjdk.samtools.CachingBAMFileIndex.<init>(CachingBAMFileIndex.java:56)
at htsjdk.samtools.BAMFileReader.getIndex(BAMFileReader.java:420)
at htsjdk.samtools.BAMFileReader.createIndexIterator(BAMFileReader.java:947)
at htsjdk.samtools.BAMFileReader.query(BAMFileReader.java:628)
at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.query(SamReader.java:550)
at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.queryOverlapping(SamReader.java:417)
at org.broadinstitute.hellbender.utils.iterators.SamReaderQueryingIterator.loadNextIterator(SamReaderQueryingIterator.java:130)
at org.broadinstitute.hellbender.utils.iterators.SamReaderQueryingIterator.<init>(SamReaderQueryingIterator.java:69)
at org.broadinstitute.hellbender.engine.ReadsPathDataSource.prepareIteratorsForTraversal(ReadsPathDataSource.java:413)
at org.broadinstitute.hellbender.engine.ReadsPathDataSource.iterator(ReadsPathDataSource.java:336)
at java.base/java.lang.Iterable.spliterator(Iterable.java:101)
at org.broadinstitute.hellbender.utils.Utils.stream(Utils.java:1182)
at org.broadinstitute.hellbender.engine.GATKTool.getTransformedReadStream(GATKTool.java:384)
at org.broadinstitute.hellbender.engine.LocusWalker.getAlignmentContextIterator(LocusWalker.java:174)
at org.broadinstitute.hellbender.engine.LocusWalker.traverse(LocusWalker.java:149)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1098)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:149)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:217)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:166)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:209)
at org.broadinstitute.hellbender.Main.main(Main.java:306)
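(For context: the exception above is thrown by java.nio, not by GATK code. `FileChannel.map()` refuses to map a region larger than `Integer.MAX_VALUE` bytes, i.e. about 2 GiB, and the trace shows htsjdk memory-mapping the BAM index in one piece. A minimal Python sketch of that size check follows; the helper name is hypothetical:)

```python
import os

JAVA_INT_MAX = 2**31 - 1  # Integer.MAX_VALUE, the upper bound FileChannel.map() accepts

def would_exceed_mmap_limit(path):
    """True if Java's FileChannel.map() would raise 'Size exceeds Integer.MAX_VALUE'
    when asked to map this file in a single region (as htsjdk does for a BAM index)."""
    return os.path.getsize(os.path.realpath(path)) > JAVA_INT_MAX
```

A normal `.bai` index is far below this limit, which is why a size error at this point in the trace is suspicious.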
I tried this with:
- a) a symlinked large file, >150 GB (fails)
- a.2) a different tool (GetSampleName) on the same symlink (works)
- b) the same large file via its absolute path (works)
- c) a smaller symlinked file, <100 GB (works)
I also checked different inputs and passing the index explicitly with --read-index.
From my point of view this looks like a size issue: the samples are sequenced and processed identically, and file size is the only difference.
Physically moving these files is not viable for the full cohort (a couple of TB).
The new folder structure is required by our workflow management system.
A GATK-agnostic workaround is to always pass the real path instead of the symlink, but that is cumbersome to integrate at every necessary step, so a fix would be appreciated ;)
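(In Snakemake terms, this workaround amounts to resolving the link in an input function before GATK ever sees the path; a minimal sketch, with hypothetical directory and rule names:)

```python
import os

def real_bam(wildcards):
    """Resolve the symlink so GATK receives the real path, not the link."""
    return os.path.realpath(f"links/{wildcards.sample}.bam")

# In a Snakefile this could then be used as, e.g.:
# rule pileup:
#     input: bam=real_bam
#     ...
```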
Best,
Daniel
-
Hi Daniel
Are you using Docker to run your GATK workflows? If so, mounting the folder containing the input files should solve the problem. Why do you need symlinks? File systems handle symlinks differently, and they sometimes don't work as expected with various tools.
Can you elaborate on the need for the symlinks?
-
Hi Gökalp,
Yes and no: I use the Docker image provided on Docker Hub, but I run it with Apptainer.
Mounting the necessary directories does not seem to be the issue: the same mounts and the same symlinking strategy work for the smaller files but not for the larger ones.
I need the files as inputs for a workflow I have written in Snakemake, which uses a different file structure.
I will check with my admins whether something "weird" happens to the symlinks beyond a certain file size.
As mentioned before, I can just grab the real path (os.path.realpath) in Snakemake to circumvent this issue, but I was still wondering why this happens.
-
After some digging and help, I figured out that changing the path to something shorter actually allowed me to run the tool!
Could you comment on a maximum path length, in number of characters?
-
Hi again. Our code does not have a limit per se; however, file systems and OS kernels usually do limit how long paths and symlink targets can be. That could be why your command runs with shorter symlinks but not with longer ones.
The original error message comes from Java itself, not GATK. It looks like NIO tries to memory-map a file but is handed the wrong size, or a bunch of junk bytes through the link, and therefore throws this error.
In short, this does not look like a GATK issue but most likely a filesystem and/or Java issue.
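(For what it's worth, these limits can be queried directly from Python; on Linux the usual defaults are 4096 bytes for a full path and 255 for a single component, but the exact values depend on the filesystem the path lives on:)

```python
import os

# Kernel/filesystem limits that apply when resolving a path or symlink.
print(os.pathconf("/", "PC_PATH_MAX"))  # max bytes in a full path (often 4096)
print(os.pathconf("/", "PC_NAME_MAX"))  # max bytes in one path component (often 255)
```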
I hope this helps.
-
Thank you for your help!
This makes sense and I now know how to avoid it ;)
Cheers!
Daniel