GATK tools, like much other bioinformatics software, rely on temporary files created during execution. However, unlike tools such as samtools or bcftools that may use the immediate output folder for temporary files, GATK and related tools rely directly on the Java VM's temporary folder setting (there are some exceptions, but hang on).
This guide is intended to show you how your analyses might be affected and what kind of warning/error messages you may encounter if your temporary folder is not set up properly for a local GATK execution.
Temporary Folder setup for the current Java VM
To see the common system properties of your locally installed java VM, you can run the following command.
java -XshowSettings:properties --version 2>&1 >/dev/null | grep tmpdir
Java will print the summary information of its installation details and some other useful information such as the default temporary folder to be used by any Java application.
java.io.tmpdir = /tmp
Bingo! You have located the default temporary folder for your java VM. But it isn't over. Depending on your local configuration this temporary folder may be at a different location; by default, however, it is located at the root of your system drive.
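If you want only the folder path, for example to reuse it in a script, the settings output can be filtered a little further. The sketch below is one possible way to do it; parse_java_tmpdir is a hypothetical helper name, and the usage line assumes java is on your PATH.

```shell
# Print only the value of java.io.tmpdir from the settings text on stdin.
# parse_java_tmpdir is a hypothetical helper name, not part of Java or GATK.
parse_java_tmpdir() {
    # Settings lines look like "    java.io.tmpdir = /tmp"; split on " = ".
    awk -F' = ' '/java\.io\.tmpdir/ { print $2; exit }'
}

# Usage (assumes java is on PATH; -version prints to stderr, hence 2>&1):
#   java -XshowSettings:properties -version 2>&1 | parse_java_tmpdir
```

The usage line uses -version rather than --version so that it also works on older Java releases, and it pipes everything because both the version text and the settings go to the error stream in that case.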
1. Can't I just use the default temporary folder?
The short answer is yes and no.
The long answer is that it depends on the job and on your local setup. Here are some of the show stoppers when using the default temporary folder:
- The default temporary folder location is full or close to full, so only small tasks that need few temporary files can run without trouble. The /tmp folder normally sits on the root partition of the boot drive, and you can check how much space is left there with the df command. The response should look like this:
Filesystem     1K-blocks      Used Available Use% Mounted on
tmpfs            6573684      2464   6571220   1% /run
/dev/nvme0n1p2 479082224 141299368 313373412  32% /
tmpfs           32868412    109264  32759148   1% /dev/shm
tmpfs               5120         4      5116   1% /run/lock
/dev/nvme0n1p1    523244      6216    517028   2% /boot/efi
As the result shows, the root partition, mounted at /, has 32% of its space used. Depending on the system installation the used share can be much higher, leaving you only a fraction of the whole drive. If you have to work with 80 GB of temporary data but your root partition has only 10 GB of space left, your analysis will fail with the Not enough space error.
- The sysadmin or the OS distribution defaults limit what can be done in the default temporary folder. If the default temporary folder only allows execution by root, or is completely barred from executing files by the noexec mount parameter, GATK and related tools cannot use this folder to extract and execute their dependent native libraries or scripts. To check the mount status of your /tmp folder, use the following command.
$ mount | grep noexec
If you see the following line as a part of the response from the command, then your `/tmp` folder is mounted as `noexec`, which means GATK and related tools cannot use that folder as a temporary folder.
/tmp on /tmp type none (rw,noexec,nosuid,bind)
Unless you are the administrator on the system you cannot change this option, so your only choice is to designate another folder for this purpose.
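Both checks above can be scripted so that they run before a long job starts. The following is a minimal sketch, not an official GATK utility: free_kb and mount_is_noexec are hypothetical helper names, and the noexec check assumes the Linux "device on /path type fstype (options)" line format shown above.

```shell
# Hypothetical helpers for checking a candidate temporary folder.

# Print the free space, in 1K blocks, of the filesystem holding directory $1.
# -P forces one output line per filesystem, -k forces 1K units.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Read `mount` output on stdin and succeed if mount point $1 carries noexec.
# Assumes the Linux "device on /path type fstype (options)" line format.
mount_is_noexec() {
    awk -v mp="$1" '
        $2 == "on" && $3 == mp && $NF ~ /[(,]noexec[,)]/ { found = 1 }
        END { exit !found }
    '
}

# Usage:
#   free_kb /tmp
#   mount | mount_is_noexec /tmp && echo "/tmp is mounted noexec"
```

Comparing the number printed by free_kb with the expected size of your temporary data (in the same 1K units) gives an early warning before the Not enough space error can occur.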
2. Examples of how to set up a proper/improper temporary folder for local execution
To make sure that you have permissions or ownership for a temporary folder, first check your UID by typing echo $UID. The response will be an integer value indicating the user ID of the current user.
$ echo $UID
1000
The current user's UID is 1000. Whenever you check a folder for ownership or permissions, this number helps you understand your relationship to that folder.
Before getting into the main issue here is a quick explanation of what folder permission marks are and how to interpret them.
r = Read permission
w = Write permission
x = Execute permission
- = Denial of permission

drwxrw-r--  2 root root ...
|[-][-][-]
| |  |  |
| |  |  +-- Other permissions (r--)
| |  +----- Group permissions (rw-)
| +-------- Owner permissions (rwx)
+---------- Type (- file or d folder)
These permission marks are attached to every item in the file system and describe how you can interact with it. Permissions are granted at 3 separate levels: owner permissions, group permissions and other permissions. Owner permissions apply to the owner of the file or folder. Group permissions apply to all users in the item's group and need not be the same as the owner permissions. Other permissions apply to all users who are neither the owner nor a member of that group. Next to the permissions you will see the name or UID of the owner and the name or GID of the user group that holds them. To view these permissions, type ls -l in the command line. Here is the result of the command at the root / folder (Redacted to show the temporary folder only).
total 2097240
...
drwxrwxrwt 26 root root 12288 Aug 27 13:22 tmp
...
As you can see, the /tmp folder grants read, write and execute permissions to everyone in this case. The trailing t is the sticky bit, which lets everyone create files in the folder but only lets a file's owner, the folder's owner or root delete or rename them. If you see anything more restrictive than drwxrwxrwt or drwxrwxrwx next to the /tmp folder, there is no use forcing GATK and related tools to use it. For the purpose of this article I have prepared a bunch of temporary folders outside of /tmp just to show you how GATK interacts with those folders.
$ ls -l
-rw-rw-r-- 1 user  47G Apr  9 00:24 bamfile.bam
-rw-rw-r-- 1 user 8.8M Apr  9 00:24 bamfile.bam.bai
drw------- 2 user 4.0K Aug 11 21:41 tmp1 --> read and write but no execute for the owner, no permissions for anyone else
drwxrwxr-x 2 user 4.0K Aug 11 21:46 tmp2 --> read, write and execute for the owner and group, only read and execute for others
d-w------- 2 user 4.0K Aug 12 12:37 tmp3 --> write but no read or execute for the owner, no permissions for anyone else
drwxrw-rw- 2 root 4.0K Aug 12 12:42 tmp4 --> read, write and execute only for root, no execute for the group and other users
d--x------ 2 user 4.0K Aug 12 12:46 tmp5 --> execute but no read or write
dr-------- 2 user 4.0K Aug 15 13:45 tmp6 --> read only
Folder tmp2 is the one where all read, write and execute permissions are available to UID=1000, which is the current user. Folder tmp4 also has all permissions set, but only for the root user. For the sake of simplicity we will focus on only some of the folders, but the results apply to pretty much every scenario with missing permissions.
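Instead of reading the permission marks by eye, you can ask the shell which permissions the current user actually holds on a folder. The sketch below uses the standard -r, -w and -x tests; check_tmp_perms is a hypothetical helper name. Note that these tests always pass when run as root, and that they say nothing about a noexec mount.

```shell
# Report which of read, write and execute the current user lacks on directory $1.
# check_tmp_perms is a hypothetical helper; it prints "ok" when nothing is missing.
check_tmp_perms() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir"
        return 1
    fi
    missing=""
    # Each test reflects the effective permissions of the user running the script.
    [ -r "$dir" ] || missing="$missing read"
    [ -w "$dir" ] || missing="$missing write"
    [ -x "$dir" ] || missing="$missing execute"
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "ok"
}

# Usage, with the example folders from the listing:
#   for d in tmp1 tmp2 tmp3 tmp4 tmp5 tmp6; do
#       printf '%s: ' "$d"; check_tmp_perms "$d"
#   done
```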
Any java application uses the default temporary folder set up by the JVM unless it is changed by the application or by the JVM parameters. The classic way of changing the default temporary folder for the java VM is to use the following parameter.
-Djava.io.tmpdir=/path/to/temporary/folder
For regular java applications such as picard this option can be set directly from the java parameters like below
java -Djava.io.tmpdir=/path/to/temporary/folder -jar picard.jar ....
For gatk, it can be added as a parameter to the --java-options switch:
gatk --java-options "-Djava.io.tmpdir=/path/to/temporary/folder" ....
Similarly, GATK native tools also have a command line parameter called --tmp-dir. This parameter can be used to set the temporary folder instead of modifying the java VM parameters. Picard tools under gatk do not currently use this command line parameter; instead they have the --TMP_DIR parameter. If you do not wish to fiddle with separate parameter settings between the 2 different sets of tools, you may go with the java VM parameter to modify the temporary folder, as it affects both tool sets equally. The table below summarizes how you may change the default temporary folder for GATK and related tools.
Tool | Way to set up temporary folder |
---|---|
gatk (any tool) | gatk AnyToolName --java-options "-Djava.io.tmpdir=/path/to/tmp" |
gatk (non-picard tools) | gatk ToolName --tmp-dir /path/to/tmp |
gatk (picard tools) | gatk PicardToolName --TMP_DIR /path/to/tmp |
picard | java -Djava.io.tmpdir=/path/to/tmp -jar picard.jar ToolName |
picard (alternate) | java -jar picard.jar ToolName --TMP_DIR /path/to/tmp |
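One way to combine these options in a script is to create a fresh, private temporary folder for each run and pass it to the tool. This is only a sketch under stated assumptions: make_run_tmpdir and the gatk_tmp prefix are hypothetical names, the scratch location is yours to choose, and the usage lines assume gatk is on your PATH.

```shell
# Create a private, per-run temporary folder under base directory $1 and print its path.
# make_run_tmpdir and the gatk_tmp prefix are hypothetical names chosen for this sketch.
make_run_tmpdir() {
    base=${1:-$PWD}
    # mktemp -d creates the folder with read, write and execute for the owner only.
    mktemp -d "$base/gatk_tmp.XXXXXX"
}

# Usage (the trailing "..." stands for the remaining tool arguments):
#   run_tmp=$(make_run_tmpdir /path/to/scratch) || exit 1
#   trap 'rm -rf "$run_tmp"' EXIT
#   gatk --java-options "-Djava.io.tmpdir=$run_tmp" HaplotypeCaller --tmp-dir "$run_tmp" ...
```

Because the folder is created by the current user, ownership and permissions are correct by construction; free space and the noexec mount option of the chosen scratch location still need to be checked separately.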
Let's say we want to run HaplotypeCaller on this sample bam file using any of the candidate temporary folders. Let's start with tmp2 using the command
gatk --java-options "-Djava.io.tmpdir=tmp2" HaplotypeCaller -I bamfile.bam -R reference.fasta -O vcffile.g.vcf -ERC GVCF ...
The log will fill with the usual INFO and WARNING messages for the tool, and the native libraries will be loaded properly.
...
14:18:10.779 INFO HaplotypeCaller - Deflater: IntelDeflater
14:18:10.779 INFO HaplotypeCaller - Inflater: IntelInflater
...
14:18:10.919 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:gatk-package-4.4.0.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
14:18:10.921 INFO NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:gatk-package-4.4.0.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
14:18:10.927 INFO IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
14:18:10.927 INFO IntelPairHmm - Available threads: 128
14:18:10.927 INFO IntelPairHmm - Requested threads: 4
14:18:10.927 INFO PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
...
Simply browsing the tmp2 folder will show a number of .so or .dylib files (depending on the platform) indicating the proper extraction of native libraries.
$ ls tmp2
libgkl_compression9143036155557837406.so
libgkl_pairhmm_omp2829763236087040659.so
libgkl_utils10106411819927432556.so
If you try using another folder without the execute permission this is what happens
gatk --java-options "-Djava.io.tmpdir=tmp1" HaplotypeCaller -I bamfile.bam -R reference.fasta -O vcffile.g.vcf -ERC GVCF ...
And here it goes...
...
15:22:27.273 INFO HaplotypeCaller - Deflater: JdkDeflater
15:22:27.273 INFO HaplotypeCaller - Inflater: JdkInflater
15:22:27.273 INFO HaplotypeCaller - GCS max retries/reopens: 20
15:22:27.273 INFO HaplotypeCaller - Requester pays: disabled
15:22:27.273 INFO HaplotypeCaller - Initializing engine
15:22:27.303 WARN IntelInflaterFactory - IntelInflater is not supported, using Java.util.zip.Inflater
15:22:27.304 WARN IntelInflaterFactory - IntelInflater is not supported, using Java.util.zip.Inflater
15:22:27.369 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:gatk-package-4.4.0.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
15:22:27.369 WARN NativeLibraryLoader - Unable to load libgkl_utils.so from native/libgkl_utils.so (Permission denied)
15:22:27.369 WARN IntelPairHmm - Intel GKL Utils not loaded
15:22:27.369 INFO PairHMM - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported
15:22:27.369 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:gatk-package-4.4.0.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
15:22:27.369 WARN NativeLibraryLoader - Unable to load libgkl_utils.so from native/libgkl_utils.so (Permission denied)
15:22:27.369 WARN IntelPairHmm - Intel GKL Utils not loaded
15:22:27.369 WARN PairHMM - ***WARNING: Machine does not have the AVX instruction set support needed for the accelerated AVX PairHmm. Falling back to the MUCH slower LOGLESS_CACHING implementation!
15:22:27.384 INFO ProgressMeter - Starting traversal
15:22:27.384 INFO ProgressMeter - Current Locus Elapsed Minutes Regions Processed Regions/Minute
...
As you can see, execution still goes on, but this time without any native libraries due to the lack of execute permission on folder tmp1. The same result occurs when you set a temporary folder that has all permissions set but not for your user, so be careful.
Finally, let's try another folder with missing permissions, this time one without write permission.
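Since the run does not stop in this situation, the slowdown is easy to miss. A saved log can be scanned for the two tell-tale messages shown above. This is a minimal sketch: native_libs_failed is a hypothetical helper name, and the search patterns are taken from the example log, so they may differ in other GATK versions. Keep in mind that the AVX warning also appears on machines that genuinely lack AVX support.

```shell
# Succeed if a saved GATK log shows that native libraries could not be loaded.
# native_libs_failed is a hypothetical helper; the patterns come from the example log.
native_libs_failed() {
    grep -Eq 'Unable to load libgkl|Falling back to the MUCH slower LOGLESS_CACHING' "$1"
}

# Usage (assumes the run's messages were saved to gatk.log):
#   native_libs_failed gatk.log && echo "native libraries not loaded, check the temporary folder"
```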
gatk --java-options "-Djava.io.tmpdir=tmp6" HaplotypeCaller -I bamfile.bam -R reference.fasta -O vcffile.g.vcf -ERC GVCF ...
And here is the result.
***********************************************************************
A USER ERROR has occurred: Failure working with the tmp directory tmp6/. Try changing the tmp dir with with --tmp-dir on the command line. Exact error was should exist and have read/write access
***********************************************************************
Since GATK cannot write to this folder, it cannot execute properly and therefore throws the message above. A similar error message is thrown if the designated folder does not exist.
3. Additional things to consider when running GATK locally from a local installation or docker
Everything discussed above also applies to gatk runs from the official docker image, with a few additions:
- Execution of GATK from the official docker image requires that the designated temporary folder is accessible to the container, either through a mount option such as -v /path/to/temporary/folder:/temporary/folder or by being located inside any other folder that is already mounted.
- If you must run the docker container under another (non-root) user, then a proper temporary folder setup is mandatory, as the temporary folder of the gatk docker image is only available to the root user. You will observe logs similar to the ones shown above unless a proper temporary folder is set. As an alternative you may try running docker in rootless mode (See https://docs.docker.com/engine/security/rootless/).
- External tools and scripts called by GATK, such as the CNV or CNN tools, may require additional settings for temporary folder assignments. You will observe a stacktrace similar to the one below if you don't have a proper temporary folder setup for those tools (Redacted).
Traceback (most recent call last):
File "/opt/miniconda/envs/gatk/lib/python3.6/site-packages/theano/configdefaults.py", line 1856, in filter_compiledir
os.makedirs(path, 0o770) # read-write-execute for user and group
File "/opt/miniconda/envs/gatk/lib/python3.6/os.py", line 210, in makedirs
makedirs(head, mode, exist_ok)
File "/opt/miniconda/envs/gatk/lib/python3.6/os.py", line 220, in makedirs
mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/root/.theano'
...
java.lang.RuntimeException: A required Python package ("gcnvkernel") could not be imported into the Python environment. This tool requires that the GATK Python environment is properly established and activated. Please refer to GATK README.md file for instructions on setting up the GATK Python environment.
The example above comes from the theano library used by the CNV analysis tools, which requires the THEANO_FLAGS environment variable to be set before execution.
export THEANO_FLAGS="base_compiledir=/path/to/temporary/folder"
The same setting should be added to the docker execution command via the -e switch.
docker run -e THEANO_FLAGS="base_compiledir=/path/to/temporary/folder" ...
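The docker-related points above can be combined into a single command line. The sketch below only builds and prints such a command so that you can inspect it before running it. Several things in it are assumptions rather than facts from this article: gatk_docker_cmd is a hypothetical helper name, /gatk_tmp is an arbitrary path inside the container, and broadinstitute/gatk is assumed to be the image you use. Mounts for input and output data are left out, and paths containing spaces would need extra quoting.

```shell
# Build (but do not run) a docker command line for a GATK tool with a usable temp folder.
# Arguments: host temporary folder, then the gatk tool name and its arguments.
# gatk_docker_cmd, /gatk_tmp and the image name are assumptions made for this sketch.
gatk_docker_cmd() {
    host_tmp=$1
    shift
    # printf repeats the format for every argument, so each word is followed by a space.
    printf '%s ' docker run --rm \
        -u "$(id -u):$(id -g)" \
        -v "$host_tmp:/gatk_tmp" \
        -e "THEANO_FLAGS=base_compiledir=/gatk_tmp" \
        broadinstitute/gatk \
        gatk --java-options "-Djava.io.tmpdir=/gatk_tmp" "$@"
    printf '\n'
}

# Usage: print the command, review it, then run it yourself.
#   gatk_docker_cmd /path/to/temporary/folder HaplotypeCaller ...
```

Here -u runs the container as the current (non-root) user, -v makes the host folder visible inside the container, and -e passes the theano setting, so the JVM and the Python tools both end up using the same mounted folder.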