Train a CNN model for filtering variants
Category Variant Filtering
Overview
Train a Convolutional Neural Network (CNN) for filtering variants. This tool requires training data generated by CNNVariantWriteTensors.
Inputs
- input-tensor-dir The training data created by CNNVariantWriteTensors.
- The --tensor-type argument determines what types of tensors the model will expect. Set it to "reference" for 1D tensors or "read_tensor" for 2D tensors.
Outputs
- output-dir The model weights file and semantic configuration JSON are saved here. This defaults to the current working directory.
- model-name The name for your model.
Usage example
Train a 1D CNN on Reference Tensors
gatk CNNVariantTrain \
  -tensor-type reference \
  -input-tensor-dir my_tensor_folder \
  -model-name my_1d_model
Train a 2D CNN on Read Tensors
gatk CNNVariantTrain \
  -input-tensor-dir my_tensor_folder \
  -tensor-type read_tensor \
  -model-name my_2d_model
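The two invocations above rely on defaults for everything except the tensor type, input directory, and model name. As a minimal sketch of how several of the optional arguments listed in the table below might be combined, the following script composes a command line and prints it rather than running it. The helper name `build_train_cmd`, the directory and model names, and the numeric values are illustrative assumptions, not recommended settings.

```shell
#!/bin/sh
# Sketch: compose a CNNVariantTrain command line with a few optional arguments.
# All names and numeric values below are placeholders chosen for illustration.
build_train_cmd() {
  # $1 = tensor directory, $2 = model name, $3 = output directory
  echo "gatk CNNVariantTrain" \
    "--tensor-type read_tensor" \
    "--input-tensor-dir $1" \
    "--model-name $2" \
    "--output-dir $3" \
    "--epochs 20" \
    "--training-steps 100" \
    "--validation-steps 10"
}

# Dry run: print the command for review instead of executing it.
build_train_cmd my_tensor_folder my_2d_model trained_models
```

Printing the command first makes it easy to check the argument spelling against the table before committing to a long training run.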
CNNVariantTrain specific arguments
This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the Argument details list below the table.
Argument name(s) | Default value | Summary
---|---|---
Required Arguments | |
--input-tensor-dir | | Directory of training tensors created by CNNVariantWriteTensors.
Optional Tool Arguments | |
--annotation-shortcut | false | Shortcut connections on the annotation layers.
--annotation-units | 16 | Number of units connected to the annotation input layer
--arguments_file | | read one or more arguments files and add them to the command line
--conv-batch-normalize | false | Batch normalize convolution layers
--conv-dropout | 0.0 | Dropout rate in convolution layers
--conv-height | 5 | Height of convolution kernels
--conv-layers | | List of number of filters to use in each convolutional layer
--conv-width | 5 | Width of convolution kernels
--epochs | 10 | Maximum number of training epochs.
--fc-batch-normalize | false | Batch normalize fully-connected layers
--fc-dropout | 0.0 | Dropout rate in fully-connected layers
--fc-layers | | List of number of filters to use in each fully-connected layer
--gcs-max-retries / -gcs-retries | 20 | If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection
--gcs-project-for-requester-pays | | Project to bill when accessing "requester pays" buckets. If unset, these buckets cannot be accessed. User must have storage.buckets.get permission on the bucket being accessed.
--help / -h | false | display the help message
--image-dir | | Path where plots and figures are saved.
--model-name | variant_filter_model | Name of the model to be trained.
--output-dir | ./ | Directory where models will be saved, defaults to current working directory.
--padding | valid | Padding for convolution layers, valid or same
--spatial-dropout | false | Spatial dropout on convolution layers
--tensor-type | reference | Type of tensors to use as input: reference for 1D reference tensors and read_tensor for 2D tensors.
--training-steps | 10 | Number of training steps per epoch.
--validation-steps | 2 | Number of validation steps per epoch.
--version | false | display the version number for this tool
Optional Common Arguments | |
--gatk-config-file | | A configuration file to use with the GATK.
--QUIET | false | Whether to suppress job-summary info on System.err.
--tmp-dir | | Temp directory to use.
--use-jdk-deflater / -jdk-deflater | false | Whether to use the JdkDeflater (as opposed to IntelDeflater)
--use-jdk-inflater / -jdk-inflater | false | Whether to use the JdkInflater (as opposed to IntelInflater)
--verbosity | INFO | Control verbosity of logging.
Advanced Arguments | |
--annotation-set | best_practices | Which set of annotations to use.
--channels-last | true | Store the channels in the last axis of tensors, tensorflow->true, theano->false
--showHidden | false | display hidden arguments
Argument details
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
--annotation-set / -annotation-set
Which set of annotations to use.
String best_practices
--annotation-shortcut / -annotation-shortcut
Shortcut connections on the annotation layers.
boolean false
--annotation-units / -annotation-units
Number of units connected to the annotation input layer
int 16 [ [ -∞ ∞ ] ]
--arguments_file
read one or more arguments files and add them to the command line
List[File] []
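As a hedged illustration of how --arguments_file might be used to keep a set of training options in one place, the sketch below writes a small arguments file and prints a command that references it. It assumes the file holds arguments in the same form they take on the command line, which this page does not spell out; the helper `write_args_file`, the file name, and the values are hypothetical.

```shell
#!/bin/sh
# write_args_file PATH: store a few training arguments for reuse.
# Assumption: the file lists arguments exactly as they would be typed on the command line.
write_args_file() {
  cat > "$1" <<'EOF'
--tensor-type reference
--epochs 20
--training-steps 100
EOF
}

args_file="${TMPDIR:-/tmp}/cnn_train.args"   # hypothetical file name
write_args_file "$args_file"

# Dry run: show the command that would pick up the stored arguments.
echo "gatk CNNVariantTrain --input-tensor-dir my_tensor_folder --arguments_file $args_file"
```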
--channels-last / -channels-last
Store the channels in the last axis of tensors, tensorflow->true, theano->false
boolean true
--conv-batch-normalize / -conv-batch-normalize
Batch normalize convolution layers
boolean false
--conv-dropout / -conv-dropout
Dropout rate in convolution layers
float 0.0 [ [ -∞ ∞ ] ]
--conv-height / -conv-height
Height of convolution kernels
int 5 [ [ -∞ ∞ ] ]
--conv-layers / -conv-layers
List of number of filters to use in each convolutional layer
List[Integer] []
--conv-width / -conv-width
Width of convolution kernels
int 5 [ [ -∞ ∞ ] ]
--epochs / -epochs
Maximum number of training epochs.
int 10 [ [ 0 ∞ ] ]
--fc-batch-normalize / -fc-batch-normalize
Batch normalize fully-connected layers
boolean false
--fc-dropout / -fc-dropout
Dropout rate in fully-connected layers
float 0.0 [ [ -∞ ∞ ] ]
--fc-layers / -fc-layers
List of number of filters to use in each fully-connected layer
List[Integer] []
--gatk-config-file
A configuration file to use with the GATK.
String null
--gcs-max-retries / -gcs-retries
If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection
int 20 [ [ -∞ ∞ ] ]
--gcs-project-for-requester-pays
Project to bill when accessing "requester pays" buckets. If unset, these buckets cannot be accessed. User must have storage.buckets.get permission on the bucket being accessed.
String ""
--help / -h
display the help message
boolean false
--image-dir / -image-dir
Path where plots and figures are saved.
String null
--input-tensor-dir / -input-tensor-dir
Directory of training tensors created by CNNVariantWriteTensors.
R String null
--model-name / -model-name
Name of the model to be trained.
String variant_filter_model
--output-dir / -output-dir
Directory where models will be saved, defaults to current working directory.
String ./
--padding / -padding
Padding for convolution layers, valid or same
String valid
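To make the difference between the two --padding values concrete, here is a small arithmetic sketch. It assumes the conventional meaning of the two modes (valid applies no padding, same pads so the output keeps the input size) and a stride of 1, neither of which is stated on this page; the input width of 128 is a hypothetical value.

```shell
#!/bin/sh
# conv_out SIZE KERNEL PADDING -> output size along one axis, assuming stride 1.
conv_out() {
  case "$3" in
    valid) echo $(( $1 - $2 + 1 )) ;;   # no padding: the kernel must fit entirely inside the input
    same)  echo "$1" ;;                 # input is padded so the size is preserved
    *)     echo "unknown padding: $3" >&2; return 1 ;;
  esac
}

width=128   # hypothetical input width; not a value from this page
kernel=5    # default of --conv-width and --conv-height

echo "valid: $(conv_out "$width" "$kernel" valid)"
echo "same:  $(conv_out "$width" "$kernel" same)"
```

Under these assumptions each valid-padded layer with the default kernel size trims four positions from the axis, so stacking many layers in --conv-layers shrinks the feature map steadily, while same padding leaves it unchanged.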
--QUIET
Whether to suppress job-summary info on System.err.
Boolean false
--showHidden / -showHidden
display hidden arguments
boolean false
--spatial-dropout / -spatial-dropout
Spatial dropout on convolution layers
boolean false
--tensor-type / -tensor-type
Type of tensors to use as input: reference for 1D reference tensors and read_tensor for 2D tensors.
The --tensor-type argument is an enumerated type (TensorType), which can have one of the following values:
- reference
- One-hot encoding of a reference sequence.
- read_tensor
- Read tensors are 3D tensors spanning aligned reads, sites, and channels. The maximum number of reads is a hyper-parameter typically set to 128. There are 15 channels in the read tensor. They correspond to the reference sequence data (4), read sequence data (4), insertions and deletions (2), read flags (4), and mapping quality (1).
TensorType reference
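The channel breakdown given for read_tensor can be checked with a short calculation. The sketch below sums the five channel groups and, purely for scale, estimates the number of values in one tensor; the number of sites (128) is a hypothetical figure, since this page gives only the typical read count.

```shell
#!/bin/sh
# Channel groups as listed in the read_tensor description.
reference_channels=4
read_channels=4
indel_channels=2
flag_channels=4
mapq_channels=1
total_channels=$(( reference_channels + read_channels + indel_channels + flag_channels + mapq_channels ))

max_reads=128   # "typically set to 128" per the description
sites=128       # hypothetical window width; this page does not give one

echo "channels per position: $total_channels"
echo "values per read tensor: $(( max_reads * sites * total_channels ))"
```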
--tmp-dir
Temp directory to use.
GATKPath null
--training-steps / -training-steps
Number of training steps per epoch.
int 10 [ [ 0 ∞ ] ]
--use-jdk-deflater / -jdk-deflater
Whether to use the JdkDeflater (as opposed to IntelDeflater)
boolean false
--use-jdk-inflater / -jdk-inflater
Whether to use the JdkInflater (as opposed to IntelInflater)
boolean false
--validation-steps / -validation-steps
Number of validation steps per epoch.
int 2 [ [ 0 ∞ ] ]
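Because --training-steps and --validation-steps are counted per epoch, the overall size of a run is their product with --epochs. The following sketch works that out for the default values; since --epochs is described as a maximum, the results are upper bounds rather than guaranteed totals. The helper name `total_steps` is an illustrative assumption.

```shell
#!/bin/sh
# total_steps EPOCHS STEPS_PER_EPOCH -> upper bound on steps over a full run.
total_steps() {
  echo $(( $1 * $2 ))
}

epochs=10            # default of --epochs (a maximum)
training_steps=10    # default of --training-steps
validation_steps=2   # default of --validation-steps

echo "training steps (at most):   $(total_steps "$epochs" "$training_steps")"
echo "validation steps (at most): $(total_steps "$epochs" "$validation_steps")"
```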
--verbosity / -verbosity
Control verbosity of logging.
The --verbosity argument is an enumerated type (LogLevel), which can have one of the following values:
- ERROR
- WARNING
- INFO
- DEBUG
LogLevel INFO
--version
display the version number for this tool
boolean false
GATK version 4.2.2.0-SNAPSHOT built at Thu, 19 Aug 2021 09:49:28 -0700.