GatherVcfs - File number limit?
If you are seeing an error, please provide (REQUIRED):
a) GATK version used:
Picard version: Version:4.2.2.0
b) Exact command used:
picard GatherVcfs --VERBOSITY DEBUG -I file_list.txt -O combined.vcf.gz |& tee gathervcfs.log
c) Entire error log:
There was a problem with gathering the INPUT.htsjdk.tribble.TribbleException$MalformedFeatureFile: Unable to create BasicFeatureReader using feature file , for input source: file:///files/vcf_4079.vcf.gz
The issue:
I have run HaplotypeCaller for each linkage group separately and wish to merge the outputs back together using GatherVcfs, but I keep receiving the above error.
There appears to be nothing wrong with the file itself, and the failure doesn't seem to be related to memory availability or to the size of the files, but very specifically to the number of files (somewhere around 4080).
Say I have
1.vcf.gz
2.vcf.gz
3.vcf.gz
...
4080.vcf.gz
4081.vcf.gz
4082.vcf.gz
...
When I run GatherVcfs I receive the error for 4080.vcf.gz, but if I remove a random file earlier in the list, for example 3.vcf.gz (a large file corresponding to a whole chromosome), I instead receive the error for 4081.vcf.gz, and so on.
If I remove 3.vcf.gz and everything after 4080.vcf.gz, it runs fine and produces the merged file.
Is this a known issue? I can see nothing in the documentation stating that it only works for a limited number of files, but it would have been really useful to know before getting to this stage of the analysis.
-
Hi rokkineste,
It does appear that there is some sort of limit on the number of files you are allowed to combine, but this is not a limit that is present in GatherVcfs. It's possible that this is a limit on file handles set on your machine. Could you provide the full stack trace/output error message? Alternatively, a workaround for this would be to combine the files in groups (maybe a few hundred at a time) and then combine those groups together.
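A rough, untested sketch of that batching approach (assuming file_list.txt holds the VCF paths in genomic order, as in your command, and that GNU split is available; the batch size, prefixes and output names here are just placeholders):

#!/bin/bash
set -euo pipefail
BATCH_SIZE=500

# Split the master list into chunks of BATCH_SIZE paths, keeping the original order.
split -l "$BATCH_SIZE" -d --additional-suffix=.txt file_list.txt chunk_

# Gather each chunk into an intermediate VCF and record the outputs in order.
: > batch_outputs.txt
for list in chunk_*.txt; do
    out="${list%.txt}.vcf.gz"
    gatk GatherVcfs -I "$list" -O "$out"
    echo "$out" >> batch_outputs.txt
done

# Final gather over the intermediate files (still in genomic order).
gatk GatherVcfs -I batch_outputs.txt -O combined.vcf.gz

This mirrors the -I <list file> usage from your own command, and each GatherVcfs call then only opens a few hundred files at a time.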
Kind regards,
Pamela
-
Hi Pamela,
Thanks for the response, a limit related to file handles would make sense (although the error message is very unhelpful). Unfortunately, I'm running these on a cluster, so I don't know in advance what the file handle limits are; they can differ between nodes.
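I can at least print what a given node actually allows from inside the job itself with standard shell built-ins like these (nothing GATK-specific, and the numbers will differ between nodes):

ulimit -Sn                            # soft limit on open file descriptors (what a process actually hits)
ulimit -Hn                            # hard limit (the ceiling an unprivileged user can raise the soft limit to)
grep 'open files' /proc/self/limits   # the same information, as reported by the kernel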
Strangely, though, even if I set the --MAX_RECORDS_IN_RAM argument to a higher value I get the same error at exactly the same file, which doesn't seem to make much sense. I've pasted the complete error message below; the "WARN IntelInflater - Zero Bytes Written : 0" line repeats about 8000 times, but I'm presuming it isn't important, as it doesn't appear at all if I invoke GatherVcfs directly through Picard rather than through GATK.
Merging in smaller groups, a couple of thousand files at a time, and then doing a final merge does work, but if there is a more elegant solution it would be nice to find it.
Using GATK jar /user/bin/gatk-4.2.2.0/gatk-package-4.2.2.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Xmx4g -jar /user/bin/gatk-4.2.2.0/gatk-package-4.2.2.0-local.jar GatherVcfs --VERBOSITY DEBUG --MAX_RECORDS_IN_RAM 10000000 -I file_list.txt -O combined.vcf.gz
09:52:21.728 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/user/bin/gatk-4.2.2.0/gatk-package-4.2.2.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Tue Nov 23 09:52:21 GMT 2021] GatherVcfs --INPUT /user/file_list.txt --OUTPUT /user/combined.vcf.gz --VERBOSITY DEBUG --MAX_RECORDS_IN_RAM 10000000 --REORDER_INPUT_BY_FIRST_VARIANT false --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --CREATE_INDEX true --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
Nov 23, 2021 9:52:21 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
[Tue Nov 23 09:52:21 GMT 2021] Executing as user@hpc on Linux 3.10.0-1160.42.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_302-b08; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.2.2.0
INFO 2021-11-23 09:52:21 GatherVcfs Checking inputs.
INFO 2021-11-23 09:53:23 GatherVcfs Checking file headers and first records to ensure compatibility.
09:53:31.516 WARN IntelInflater - Zero Bytes Written : 0
09:53:31.761 WARN IntelInflater - Zero Bytes Written : 0
...
ERROR 2021-11-23 10:13:36 GatherVcfs There was a problem with gathering the INPUT.htsjdk.tribble.TribbleException$MalformedFeatureFile: Unable to create BasicFeatureReader using feature file , for input source: file:///files/vcf_4079.vcf.gz
[Tue Nov 23 10:13:36 GMT 2021] picard.vcf.GatherVcfs done. Elapsed time: 21.25 minutes.
Runtime.totalMemory()=1970274304
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Tool returned:
1
-
Hi rokkineste,
Thank you for providing this output. It does seem that this may be due to some sort of limit on your machine. Unfortunately, I can't really advise on how to increase the file handle limit on your cluster, as this may not be something you can control and isn't an issue with the GATK tool itself. I would say that the best workaround is still to do the merging in groups. If this is too tedious or you would like a better solution, the GATK team can take a closer look at this error message to see whether it can be resolved on your machine.
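For what it's worth, on most Linux systems an unprivileged user can raise the soft open-file limit up to the hard limit from within the job script itself, so something along these lines before the gatk call may be worth a try (whether the hard limit on a given node is high enough is cluster-specific and may not be something you can change):

# Raise the soft open-file limit to the hard limit for this shell and its children.
# If the node does not permit the requested value, ulimit prints an error and the
# old limit stays in place, in which case batched merging remains the safer route.
ulimit -n "$(ulimit -Hn)"
gatk GatherVcfs --VERBOSITY DEBUG -I file_list.txt -O combined.vcf.gz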
Kind regards,
Pamela