Builds set of host reference k-mers
Overview
Produce a set of k-mers from the given host reference. The output file from this tool is required to run the PathSeq pipeline.
The tool works by scanning the reference one position at a time. It takes the k-mer (k-base subsequence) starting at each consecutive position and adds it to a set. By default, the set is stored as a hash table.
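The scan described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the tool's implementation: the function name `build_kmer_set` is hypothetical, the reference is treated as a plain string, and details such as how the tool encodes k-mers or handles ambiguous bases are not modeled.

```python
def build_kmer_set(reference: str, k: int = 31) -> set:
    """Collect every k-base subsequence of `reference` into a hash set."""
    kmers = set()
    # Slide a window of width k across the sequence, one position at a time.
    for start in range(len(reference) - k + 1):
        kmers.add(reference[start:start + k])
    return kmers


if __name__ == "__main__":
    # Toy sequence and a small k, chosen only to keep the output readable.
    toy_reference = "ACGTACGTTG"
    print(sorted(build_kmer_set(toy_reference, k=5)))
```

Because the result is a set, a k-mer that occurs at several positions in the reference is stored only once.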
Users can instead represent the k-mer set with a Bloom filter by specifying a non-zero value for the --bloom-false-positive-probability parameter. A Bloom filter uses less memory than the default hash set but can produce false positives: when asked whether a non-host k-mer exists in the set, it will incorrectly say yes with probability p. The user can choose p so that the probability of incorrectly subtracting a non-host read is negligibly small. For p = 0.0001 and a read length of 151 bases, the probability of PathSeq incorrectly subtracting a non-host read is < 1.5%, while memory use is reduced 4-fold compared to a hash table. For this reason, Bloom filters are generally recommended.
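The per-read figure quoted above can be checked with a short calculation. The sketch below rests on assumptions that this page does not state explicitly: a read is subtracted if any one of its k-mers is reported as present, k is the default of 31, and false positives on different k-mers are independent. The function name is hypothetical.

```python
def read_false_subtraction_probability(p: float, read_length: int, k: int = 31) -> float:
    """Probability that at least one k-mer of a non-host read is a false positive."""
    # A read of length L contains L - k + 1 overlapping k-mers.
    kmers_per_read = read_length - k + 1
    # Assumes each k-mer lookup is an independent false positive with probability p.
    return 1.0 - (1.0 - p) ** kmers_per_read


if __name__ == "__main__":
    prob = read_false_subtraction_probability(p=0.0001, read_length=151)
    print(f"{prob:.4%}")
```

Under these assumptions a 151-base read contains 121 k-mers and the result is about 1.2%, consistent with the < 1.5% bound given above.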
Note that the file formats used for storing these k-mer data structures are only readable by the PathSeq tools.
Input
- An indexed host reference in FASTA format
Output
- A set of the k-mers in the reference
Builds a hash table of every k-mer (k = 31) in the reference. Each k-mer is masked at base index 16 (indices start at 0).
gatk PathSeqBuildKmers \
    --reference host_reference.fasta \
    --output host_reference.hss \
    --kmer-mask 16 \
    --kmer-size 31
Builds a Bloom filter with false positive probability p < 0.001.
gatk PathSeqBuildKmers \
    --reference host_reference.fasta \
    --output host_reference.bfi \
    --bloom-false-positive-probability 0.001 \
    --kmer-mask 16 \
    --kmer-size 31
For most references, the Java VM will run out of memory with the default settings. The Java heap size limit should be set to at least 20x the size of the reference (less if building a Bloom filter). For example, for a 3 GB reference, set the limit to 60 GB by adding --java-options "-Xmx60g" to the command.
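The 20x rule of thumb is simple arithmetic. The helper below is a hypothetical convenience (its name and parameters are not part of GATK) that turns a reference size in gigabytes into the corresponding -Xmx value for the hash-set case.

```python
import math


def suggested_heap_option(reference_size_gb: float, factor: float = 20.0) -> str:
    """Return a Java heap option sized at `factor` times the reference size."""
    # Round up to a whole number of gigabytes so the limit is never below the target.
    heap_gb = math.ceil(reference_size_gb * factor)
    return f"-Xmx{heap_gb}g"


if __name__ == "__main__":
    # For a 3 GB reference this prints the option used in the example above.
    print(suggested_heap_option(3))
```

The returned string is what would be passed inside --java-options. When building a Bloom filter, a smaller factor may be enough, as noted above.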
PathSeqBuildKmers specific arguments
This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list below the table.
| Argument name(s) | Default value | Summary |
|---|---|---|
| --output / -O | null | File for k-mer set output. Extension will be automatically added if not present (.hss for hash set or .bfi for Bloom filter) |
| --reference / -R | null | Reference FASTA file path on local disk |
| Optional Tool Arguments | | |
| --arguments_file | | read one or more arguments files and add them to the command line |
| --bloom-false-positive-probability | 0.0 | If non-zero, creates a Bloom filter with this false positive probability |
| --gcs-max-retries / -gcs-retries | 20 | If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection |
| --help / -h | false | display the help message |
| --kmer-mask / -M | "" | Comma-delimited list of base indices (starting with 0) to mask in each k-mer |
| --kmer-size / -SZ | 31 | K-mer size, must be odd and less than 32 |
| --kmer-spacing / -SP | 1 | Spacing between successive k-mers |
| --version | false | display the version number for this tool |
| Optional Common Arguments | | |
| --gatk-config-file | null | A configuration file to use with the GATK. |
| --QUIET | false | Whether to suppress job-summary info on System.err. |
| --use-jdk-deflater / -jdk-deflater | false | Whether to use the JdkDeflater (as opposed to IntelDeflater) |
| --use-jdk-inflater / -jdk-inflater | false | Whether to use the JdkInflater (as opposed to IntelInflater) |
| --verbosity / -verbosity | INFO | Control verbosity of logging. |
| --showHidden / -showHidden | false | display hidden arguments |
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
--arguments_file / NA
read one or more arguments files and add them to the command line
--bloom-false-positive-probability
If non-zero, creates a Bloom filter with this false positive probability
Note that the provided argument is used as an upper limit on the probability, and the actual false positive probability may be less.
double 0.0 [ [ 0 0.001 ] 1 ]
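To make the trade-off concrete, here is a toy Bloom filter in Python. It is an illustrative sketch only and not the data structure GATK uses: the class name, the SHA-256-based double hashing, and the textbook sizing formulas are all assumptions chosen for clarity. It shows the two properties that matter here: an item that was added is always reported as present, and an item that was never added may occasionally be reported as present.

```python
import hashlib
import math


class ToyBloomFilter:
    """Illustrative Bloom filter; not the implementation used by PathSeq."""

    def __init__(self, expected_items: int, false_positive_probability: float):
        n, p = expected_items, false_positive_probability
        # Textbook sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hash functions.
        self.num_bits = max(1, math.ceil(-n * math.log(p) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive all bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # Present only if every bit for this item is set; this can be a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


if __name__ == "__main__":
    host_kmers = ["ACGTA", "CGTAC", "GTACG"]
    bloom = ToyBloomFilter(expected_items=len(host_kmers), false_positive_probability=0.001)
    for kmer in host_kmers:
        bloom.add(kmer)
    print(all(kmer in bloom for kmer in host_kmers))
```

The filter stores only bits, never the k-mers themselves, which is where the memory saving over a hash set comes from.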
--gatk-config-file / NA
A configuration file to use with the GATK.
--gcs-max-retries / -gcs-retries
If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection
int 20 [ [ -∞ ∞ ] ]
--help / -h
display the help message
--kmer-mask / -M
Comma-delimited list of base indices (starting with 0) to mask in each k-mer
K-mer masking allows mismatches to occur at one or more specified positions. Masking the middle base is recommended to enhance host read detection.
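The effect of masking can be illustrated with a short Python sketch. Both function names are hypothetical, and replacing a masked base with a placeholder character is an assumption made for readability; this page does not describe how the tool represents masked positions internally.

```python
def parse_mask(mask_argument: str) -> list:
    """Parse a comma-delimited list of 0-based base indices, e.g. "0,16"."""
    return [int(field) for field in mask_argument.split(",") if field.strip()]


def mask_kmer(kmer: str, mask_indices: list, placeholder: str = "N") -> str:
    """Replace the bases at the given 0-based indices with a placeholder."""
    bases = list(kmer)
    for index in mask_indices:
        bases[index] = placeholder
    return "".join(bases)


if __name__ == "__main__":
    mask = parse_mask("2")
    # These two k-mers differ only at index 2, so they collapse to the same masked k-mer.
    print(mask_kmer("ACGTA", mask), mask_kmer("ACTTA", mask))
```

Because k-mers that differ only at a masked index map to the same entry, a read with a mismatch at that position still matches the host set.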
--kmer-size / -SZ
K-mer size, must be odd and less than 32
Reducing the k-mer length will increase the number of host reads subtracted in the filtering phase of the pipeline, but it may also increase the number of non-host (i.e. microbial) reads that are incorrectly subtracted. Note that changing the length of the k-mer does not affect memory usage.
int 31 [ [ 1 31 ] ]
--kmer-spacing / -SP
Spacing between successive k-mers
The k-mer set size can be reduced by storing only the k-mers that start at every n-th base in the reference. By default, every k-mer, starting at each consecutive base in the reference, is stored.
int 1 [ [ 1 ∞ ] ]
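A minimal Python sketch of the spacing idea follows. The function name is hypothetical, and the sketch assumes spacing is counted from the first base of the sequence; it is not the tool's implementation.

```python
def build_spaced_kmer_set(reference: str, k: int = 31, spacing: int = 1) -> set:
    """Collect the k-mers that start at every `spacing`-th position of `reference`."""
    # With spacing=1 every consecutive start position is used (the default behavior).
    return {
        reference[start:start + k]
        for start in range(0, len(reference) - k + 1, spacing)
    }


if __name__ == "__main__":
    toy_reference = "ACGTACGTTG"
    for spacing in (1, 2):
        kmers = build_spaced_kmer_set(toy_reference, k=5, spacing=spacing)
        print(spacing, len(kmers))
```

Larger spacing values store fewer k-mers and therefore shrink the set.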
--output / -O
File for k-mer set output. Extension will be automatically added if not present (.hss for hash set or .bfi for Bloom filter)
R String null
--QUIET / NA
Whether to suppress job-summary info on System.err.
--reference / -R
Reference FASTA file path on local disk
R String null
--showHidden / -showHidden
display hidden arguments
--TMP_DIR / NA
--use-jdk-deflater / -jdk-deflater
Whether to use the JdkDeflater (as opposed to IntelDeflater)
--use-jdk-inflater / -jdk-inflater
Whether to use the JdkInflater (as opposed to IntelInflater)
--verbosity / -verbosity
Control verbosity of logging.
The --verbosity argument is an enumerated type (LogLevel), which can have one of the following values: ERROR, WARNING, INFO, DEBUG.
--version / NA
display the version number for this tool
GATK version 188.8.131.52 built at 25-41-2019 10:41:16.