Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

GatherVcfs - File number limit?

3 comments

  • Pamela Bretscher

    Hi rokkineste,

    There does appear to be some limit on the number of files you can combine, but it is not a limit imposed by GatherVcfs itself. It may be a limit on open file handles set on your machine. Could you provide the full stack trace or error output? In the meantime, a workaround is to combine the files in groups (perhaps a few hundred at a time) and then combine those groups.

    Kind regards,

    Pamela

  • rokkineste

    Hi Pamela,

    Thanks for the response. A limit related to file handles would make sense (although the error message is very unhelpful). Unfortunately, I'm running these on a cluster, so I don't know in advance what the file handle limits are; they can differ between nodes.

    Strangely, though, even if I set the --MAX_RECORDS_IN_RAM argument to a higher value, I get the same error at exactly the same file, which doesn't seem to make much sense. I've pasted the complete error message below. The "WARN IntelInflater - Zero Bytes Written : 0" line repeats about 8000 times, but I presume it isn't important, as it doesn't appear at all if I invoke GatherVcfs directly using Picard rather than through GATK.

    Merging in smaller groups, a couple of thousand files at a time, and then doing a final merge does work, but it would be nice to find a more elegant solution if there is one.

     


    Using GATK jar /user/bin/gatk-4.2.2.0/gatk-package-4.2.2.0-local.jar
    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Xmx4g -jar /user/bin/gatk-4.2.2.0/gatk-package-4.2.2.0-local.jar GatherVcfs --VERBOSITY DEBUG --MAX_RECORDS_IN_RAM 10000000 -I file_list.txt -O combined.vcf.gz
    09:52:21.728 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/user/bin/gatk-4.2.2.0/gatk-package-4.2.2.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    [Tue Nov 23 09:52:21 GMT 2021] GatherVcfs --INPUT /user/file_list.txt --OUTPUT /usercombined.vcf.gz --VERBOSITY DEBUG --MAX_RECORDS_IN_RAM 10000000 --REORDER_INPUT_BY_FIRST_VARIANT false --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --CREATE_INDEX true --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
    Nov 23, 2021 9:52:21 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
    INFO: Failed to detect whether we are running on Google Compute Engine.
    [Tue Nov 23 09:52:21 GMT 2021] Executing as user@hpc on Linux 3.10.0-1160.42.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_302-b08; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.2.2.0
    INFO 2021-11-23 09:52:21 GatherVcfs Checking inputs.
    INFO 2021-11-23 09:53:23 GatherVcfs Checking file headers and first records to ensure compatibility.
    09:53:31.516 WARN IntelInflater - Zero Bytes Written : 0
    09:53:31.761 WARN IntelInflater - Zero Bytes Written : 0
    ...
    ERROR 2021-11-23 10:13:36 GatherVcfs There was a problem with gathering the INPUT.htsjdk.tribble.TribbleException$MalformedFeatureFile: Unable to create BasicFeatureReader using feature file , for input source: file:///files/vcf_4079.vcf.gz
    [Tue Nov 23 10:13:36 GMT 2021] picard.vcf.GatherVcfs done. Elapsed time: 21.25 minutes.
    Runtime.totalMemory()=1970274304
    To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
    Tool returned:
    1

     

     

  • Pamela Bretscher

    Hi rokkineste,

    Thank you for providing this output. It does seem that this may be due to some sort of limit on your machine. Unfortunately, I can't really advise on how to raise the file handle limit, as it may not be something you can control and isn't an issue with the GATK tool itself. I would say the best workaround is still to merge in groups. If that is too tedious, or you would like a better solution, the GATK team can look further into this error message to see whether it can be resolved on your machine.

    Kind regards,

    Pamela
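
Because the limit may differ between cluster nodes, one way to see what a given node allows is to query the per-process limit on open file descriptors from inside the job itself. The sketch below is illustrative only: it uses Python's standard `resource` module (Unix only), the helper names `open_file_limits` and `describe` are hypothetical, and nothing in this thread confirms that a file handle limit is the actual cause of the GatherVcfs error.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft, hard


def describe(limit):
    """Render a limit value as text, treating RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


if __name__ == "__main__":
    soft, hard = open_file_limits()
    # The soft limit is the one enforced on the process; the hard limit is
    # the ceiling an unprivileged user may raise the soft limit to.
    print(f"soft limit: {describe(soft)}, hard limit: {describe(hard)}")
```

Running this at the start of the job on each node shows whether the soft limit sits below the number of input files. The shell command `ulimit -n` reports the same soft limit.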

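The grouped merge suggested in the thread can be sketched as a small two-stage script. This is a minimal sketch under stated assumptions, not a GATK-provided utility: the function names, the `gather_tmp` working directory, the batch file names, and the default batch size of 500 are all hypothetical, and the command mirrors the invocation shown in the log above (a text file of paths passed to `-I`). Batches are consecutive slices of the input list, so the original file order is preserved in both stages.

```python
import subprocess
from pathlib import Path


def chunk(items, size):
    """Split items into consecutive groups of at most `size`, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def plan_batches(vcf_paths, batch_size, work_dir):
    """Return (list_file, batch_output, members) for each first-stage batch."""
    work_dir = Path(work_dir)
    plan = []
    for n, members in enumerate(chunk(vcf_paths, batch_size)):
        list_file = work_dir / f"batch_{n:04d}.txt"    # hypothetical naming
        batch_out = work_dir / f"batch_{n:04d}.vcf.gz"  # hypothetical naming
        plan.append((list_file, batch_out, members))
    return plan


def gather_command(list_file, output):
    """Build a GatherVcfs command like the one in the thread (-I list, -O output)."""
    return ["gatk", "GatherVcfs", "-I", str(list_file), "-O", str(output)]


def gather_in_batches(input_list, final_output, batch_size=500,
                      work_dir="gather_tmp", run=subprocess.run):
    """Gather the VCFs named in `input_list` in two stages.

    Stage 1 gathers each consecutive batch into an intermediate file.
    Stage 2 gathers the intermediates, in order, into `final_output`.
    `run` is injectable so the plan can be exercised without GATK installed.
    """
    lines = Path(input_list).read_text().splitlines()
    paths = [line.strip() for line in lines if line.strip()]
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)

    plan = plan_batches(paths, batch_size, work)
    for list_file, batch_out, members in plan:
        list_file.write_text("\n".join(members) + "\n")
        run(gather_command(list_file, batch_out), check=True)

    final_list = work / "final.txt"
    final_list.write_text("\n".join(str(out) for _, out, _ in plan) + "\n")
    run(gather_command(final_list, final_output), check=True)
```

A call such as `gather_in_batches("file_list.txt", "combined.vcf.gz", batch_size=500)` would run one GatherVcfs job per batch plus one final job. Choosing a batch size comfortably below the node's open-file limit is a reasonable starting point if that limit does turn out to be the cause.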
