PathSeqBuildKmers – GATK

Builds set of host reference k-mers

Category Metagenomics

Overview

Produce a set of k-mers from the given host reference. The output file from this tool is required to run the PathSeq pipeline.

The tool works by scanning the reference one position at a time. It takes the k-mer (k-base subsequence) starting at each consecutive position and adds it to a set. By default, the set is stored as a hash table.
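As a rough illustration of the scanning step described above, the following Python sketch collects the k-mer starting at each consecutive position of a sequence into a set. The function name `build_kmer_set` and the toy sequence are hypothetical. The sketch leaves out details the real tool handles (masking, spacing, the on-disk format) and is not the tool's implementation.

```python
def build_kmer_set(reference: str, k: int = 31) -> set:
    """Return the set of k-mers starting at each consecutive position."""
    # A sequence shorter than k yields an empty range, hence an empty set.
    return {reference[i:i + k] for i in range(len(reference) - k + 1)}


# Toy example with k = 5; a real host reference would use the default k = 31.
toy_kmers = build_kmer_set("ACGTACGTAC", k=5)
print(sorted(toy_kmers))
```

Because the result is a set, k-mers that occur more than once in the reference are stored only once.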

Users can instead represent the k-mer set as a Bloom filter by specifying a non-zero value for the --bloom-false-positive-probability parameter. A Bloom filter uses less memory than the default hash set but can produce false positives: when asked whether a non-host k-mer exists in the set, it will incorrectly say yes with probability p. The user can specify p so that the probability of incorrectly subtracting a non-host read is negligibly small. For p = 0.0001 and a read length of 151 bases, the probability of PathSeq incorrectly subtracting a non-host read is < 1.5%, while the amount of memory used is reduced 4-fold compared to a hash table. For this reason, Bloom filters are generally recommended.
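The relationship between the per-k-mer probability p and the per-read probability can be sketched as follows. The sketch assumes that a read is subtracted as soon as any one of its k-mers is reported as present, and that false positives for different k-mers are independent. Both are simplifying assumptions made for illustration, not a description of the tool's internals. The function name is hypothetical.

```python
def read_false_positive_probability(p: float, read_length: int, k: int = 31) -> float:
    """Chance that at least one k-mer of a non-host read is a false positive."""
    # A read of length L contains L - k + 1 overlapping k-mers.
    kmers_per_read = max(read_length - k + 1, 0)
    # Complement of "no k-mer in the read triggers a false positive".
    return 1.0 - (1.0 - p) ** kmers_per_read


prob = read_false_positive_probability(p=0.0001, read_length=151, k=31)
print(f"{prob:.4f}")
```

Under these assumptions the value for p = 0.0001 and 151-base reads comes out at roughly 1.2%, consistent with the < 1.5% figure quoted above.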

Note that the file formats used for storing these k-mer data structures are only readable by the PathSeq tools.

Input

An indexed host reference in FASTA format

Output

A set of the k-mers in the reference

Usage examples

Builds a hash table of every k-mer (k = 31) in the reference. Each k-mer is masked at base index 16 (indices start at 0).

 gatk PathSeqBuildKmers  \
   --reference host_reference.fasta \
   --output host_reference.hss \
   --kmer-mask 16 \
   --kmer-size 31

Builds a Bloom filter with false positive probability p < 0.001.

 gatk PathSeqBuildKmers  \
   --reference host_reference.fasta \
   --output host_reference.bfi \
   --bloom-false-positive-probability 0.001 \
   --kmer-mask 16 \
   --kmer-size 31

Notes

For most references, the Java VM will run out of memory with the default settings. The Java heap size limit should be set to at least 20x the size of the reference (less if building a Bloom filter). For example, for a 3 GB reference, set the limit to 60 GB by adding --java-options "-Xmx60g" to the command.
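The arithmetic behind this rule of thumb can be written as a small helper. This is only a convenience sketch: the function name `heap_option` is hypothetical, and the factor of 20 is the hash-set guideline from the note above (a Bloom filter needs less).

```python
import math


def heap_option(reference_gb: float, factor: int = 20) -> str:
    """Format a --java-options value from the reference size in GB."""
    # Round up so the heap limit is never below factor x reference size.
    heap_gb = math.ceil(reference_gb * factor)
    return f'--java-options "-Xmx{heap_gb}g"'


# For the 3 GB reference in the note this reproduces the -Xmx60g setting.
print(heap_option(3))
```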

PathSeqBuildKmers specific arguments

This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.

Argument name(s)	Default value	Summary
Required Arguments
--output -O	null	File for k-mer set output. Extension will be automatically added if not present (.hss for hash set or .bfi for Bloom filter)
--reference -R	null	Reference FASTA file path on local disk
Optional Tool Arguments
--arguments_file	[]	read one or more arguments files and add them to the command line
--bloom-false-positive-probability -P	0.0	If non-zero, creates a Bloom filter with this false positive probability
--gcs-max-retries -gcs-retries	20	If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection
--help -h	false	display the help message
--kmer-mask -M	""	Comma-delimited list of base indices (starting with 0) to mask in each k-mer
--kmer-size -SZ	31	K-mer size, must be odd and less than 32
--kmer-spacing -SP	1	Spacing between successive k-mers
--version	false	display the version number for this tool
Optional Common Arguments
--gatk-config-file	null	A configuration file to use with the GATK.
--QUIET	false	Whether to suppress job-summary info on System.err.
--TMP_DIR	[]	Undocumented option
--use-jdk-deflater -jdk-deflater	false	Whether to use the JdkDeflater (as opposed to IntelDeflater)
--use-jdk-inflater -jdk-inflater	false	Whether to use the JdkInflater (as opposed to IntelInflater)
--verbosity	INFO	Control verbosity of logging.
Advanced Arguments
--showHidden	false	display hidden arguments

Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.

--arguments_file / NA

read one or more arguments files and add them to the command line

List[File] []

--bloom-false-positive-probability / -P

If non-zero, creates a Bloom filter with this false positive probability

Note that the provided argument is used as an upper limit on the probability, and the actual false positive probability may be less.

double 0.0 [ [ 0 1 ] ]
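To give a feel for how the false positive probability trades off against memory, the sketch below applies the standard textbook Bloom filter sizing formulas (bits = -n ln p / (ln 2)^2, hash functions = (bits / n) ln 2). Whether PathSeqBuildKmers sizes its filter this way is an assumption. The function name and the example k-mer count are hypothetical.

```python
import math


def bloom_parameters(num_kmers: int, p: float):
    """Textbook bit count and hash-function count for a target probability p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    # Optimal number of bits for n items at false positive probability p.
    bits = math.ceil(-num_kmers * math.log(p) / (math.log(2) ** 2))
    # Optimal number of hash functions for that bits-per-item ratio.
    hashes = max(1, round(bits / num_kmers * math.log(2)))
    return bits, hashes


bits, hashes = bloom_parameters(num_kmers=1_000_000, p=0.001)
print(bits / 1_000_000, hashes)
```

With these formulas a target of p = 0.001 needs a little over 14 bits per stored k-mer. Each tenfold decrease in p adds about 4.8 bits per k-mer.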

--gatk-config-file / NA

A configuration file to use with the GATK.

String null

--gcs-max-retries / -gcs-retries

If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection

int 20 [ [ -∞ ∞ ] ]

--help / -h

display the help message

boolean false

--kmer-mask / -M

Comma-delimited list of base indices (starting with 0) to mask in each k-mer
K-mer masking allows mismatches to occur at one or more specified positions. Masking the middle base is recommended to enhance host read detection.

String ""
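The effect of masking can be illustrated with a short Python sketch. It parses a comma-delimited list of zero-based indices, as accepted by --kmer-mask, and overwrites the masked positions with a placeholder so that k-mers differing only at those positions compare equal. The function names and the placeholder character "N" are hypothetical choices for illustration. The tool's internal representation is not documented here.

```python
def parse_mask(mask_arg: str) -> list:
    """Parse a comma-delimited list of zero-based indices ('' means no mask)."""
    return [int(field) for field in mask_arg.split(",") if field.strip()]


def mask_kmer(kmer: str, mask_indices: list, placeholder: str = "N") -> str:
    """Overwrite the masked positions so they no longer affect comparisons."""
    bases = list(kmer)
    for index in mask_indices:
        bases[index] = placeholder
    return "".join(bases)


# Toy 5-mers that differ only at index 2, the middle base.
mask = parse_mask("2")
masked_a = mask_kmer("ACGTA", mask)
masked_b = mask_kmer("ACTTA", mask)
print(masked_a, masked_b, masked_a == masked_b)
```

After masking, both toy k-mers map to the same entry, which is how a single mismatch at the masked position is tolerated.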

--kmer-size / -SZ

K-mer size, must be odd and less than 32
Reducing the k-mer length will increase the number of host reads subtracted in the filtering phase of the pipeline, but it may also increase the number of non-host (i.e. microbial) reads that are incorrectly subtracted. Note that changing the length of the k-mer does not affect memory usage.

int 31 [ [ 1 31 ] ]

--kmer-spacing / -SP

Spacing between successive k-mers
The k-mer set size can be reduced by only storing k-mers starting at every n bases in the reference. By default every k-mer, starting at consecutive bases in the reference, is stored.

int 1 [ [ 1 ∞ ] ]
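The spacing option can be pictured as changing the step of the scan over the reference. The following Python sketch, with a hypothetical function name and toy sequence, stores only the k-mers starting at every n-th base. It illustrates the idea and is not the tool's code.

```python
def build_spaced_kmer_set(reference: str, k: int = 31, spacing: int = 1) -> set:
    """Return the k-mers starting at positions 0, spacing, 2 * spacing, ..."""
    if spacing < 1:
        raise ValueError("spacing must be at least 1")
    last_start = len(reference) - k + 1
    return {reference[i:i + k] for i in range(0, last_start, spacing)}


# Toy example with k = 5: every start position versus every second one.
dense = build_spaced_kmer_set("ACGTACGTAC", k=5, spacing=1)
sparse = build_spaced_kmer_set("ACGTACGTAC", k=5, spacing=2)
print(len(dense), len(sparse))
```

A larger spacing shrinks the set, and the spaced set is always a subset of the set built with spacing 1.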

--output / -O

File for k-mer set output. Extension will be automatically added if not present (.hss for hash set or .bfi for Bloom filter)

R String null

--QUIET / NA

Whether to suppress job-summary info on System.err.

Boolean false

--reference / -R

Reference FASTA file path on local disk

R String null

--showHidden / -showHidden

display hidden arguments

boolean false

--TMP_DIR / NA

Undocumented option

List[File] []

--use-jdk-deflater / -jdk-deflater

Whether to use the JdkDeflater (as opposed to IntelDeflater)

boolean false

--use-jdk-inflater / -jdk-inflater

Whether to use the JdkInflater (as opposed to IntelInflater)

boolean false

--verbosity / -verbosity

Control verbosity of logging.

The --verbosity argument is an enumerated type (LogLevel), which can have one of the following values:

ERROR
WARNING
INFO
DEBUG

LogLevel INFO

--version / NA

display the version number for this tool

boolean false

GATK version 4.0.8.1 built at 25-41-2019 10:41:16.
