Trains a model for scoring variant calls based on site-level annotations
Category Variant Filtering
Overview
Trains a model for scoring variant calls based on site-level annotations. This tool is primarily intended to be used as the second step in a variant-filtering workflow that supersedes the {@link VariantRecalibrator} workflow. Given training (and optionally, calibration) sets of site-level annotations produced by {@link ExtractVariantAnnotations}, this tool can be used to train a model for scoring variant calls. For each variant type (i.e., SNP or INDEL) specified using the "--mode" argument, the tool outputs files that are either: 1) serialized scorers, each of which persists to disk a function for computing scores given subsequent annotations, or 2) HDF5 files containing a set of scores, each corresponding to training, calibration, and unlabeled sets, as appropriate.
The model files produced by this tool can in turn be provided along with a VCF file to the {@link ScoreVariantAnnotations} tool, which assigns a score to each call (with a lower score indicating that a call is more likely to be an artifact and should perhaps be filtered). Each score can also be converted to a corresponding sensitivity with respect to a calibration set, if the latter is available.
Modeling approaches
This tool can perform modeling using either a positive-only approach or a positive-unlabeled approach. In a positive-only approach, the annotation-space distribution of training sites is used to learn a function for converting annotations for subsequent sites into a score; typically, higher scores correspond to regions of annotation space that are more densely populated by training sites. In contrast, a positive-unlabeled approach attempts to additionally use unlabeled sites to better learn not only these regions of annotation space populated by training sites, but also those that are populated by sites that may be drawn from a different distribution.
A positive-only approach is likely to perform well in cases where a sufficient number of reliable training sites is available. In contrast, if 1) only a small number of reliable training sites is available, and/or 2) the reliability of the training sites is questionable (e.g., the sites may be contaminated by a non-negligible number of sequencing artifacts), then a positive-unlabeled approach may be beneficial. Further note that although {@link VariantRecalibrator} (which this tool supplants) has typically been used to implement a naive positive-unlabeled approach, a positive-only approach likely suffices in many use cases.
If a positive-only approach is specified and training sites of the variant type are available, then:
1. A positive model is trained using these training sites and is serialized to file,
2. Scores for these training sites are generated using the positive model and output to a file, and
3. If calibration sites of the variant type are available, scores for these calibration sites are generated using the positive model and output to a file.
Modeling backends
This tool allows the use of different backends for modeling and scoring. Instructions for using a custom, user-provided implementation are given below.
Python isolation-forest backend
This backend uses scikit-learn modules to train models and scoring functions using the isolation-forest method for anomaly detection. Median imputation of missing annotation values is performed before applying the method.
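The median-imputation step can be illustrated with a minimal sketch. This is not the tool's actual code (the real backend operates on numpy arrays inside isolation-forest.py); it only shows what per-column median imputation of missing annotation values means:

```python
import math
from statistics import median

def impute_medians(matrix):
    """Replace NaN entries in each column with that column's median
    (computed over the non-missing values), mirroring the median
    imputation applied before isolation-forest scoring."""
    n_cols = len(matrix[0])
    medians = []
    for j in range(n_cols):
        observed = [row[j] for row in matrix if not math.isnan(row[j])]
        medians.append(median(observed))
    return [
        [medians[j] if math.isnan(row[j]) else row[j] for j in range(n_cols)]
        for row in matrix
    ]

# Example: a tiny annotation matrix (rows = sites, columns = annotations).
annotations = [
    [2.0, float("nan")],
    [4.0, 10.0],
    [float("nan"), 30.0],
]
imputed = impute_medians(annotations)
print(imputed)  # NaNs replaced by the column medians 3.0 and 20.0
```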
This backend can be selected by specifying "--model-backend PYTHON_IFOREST" and is also currently the default backend. It is implemented by the script at src/main/resources/org/broadinstitute/hellbender/tools/walkers/vqsr/scalable/isolation-forest.py, which requires that the argparse, h5py, numpy, sklearn, and dill packages be present in the Python environment; users may wish to simply use the provided GATK conda environment to ensure that the correct versions of all packages are available. See the IsolationForest documentation here as appropriate for the version of scikit-learn used in your Python environment. The hyperparameters documented there can be specified using the "--hyperparameters-json" argument; see src/main/resources/org/broadinstitute/hellbender/tools/walkers/vqsr/scalable/isolation-forest-hyperparameters.json for an example and the default values.
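For orientation, a hyperparameters JSON might look like the following. The keys shown are standard scikit-learn IsolationForest constructor parameters; the values here are illustrative, and the defaults actually used by the tool are those in the shipped isolation-forest-hyperparameters.json file:

```json
{
    "n_estimators": 100,
    "max_samples": "auto",
    "contamination": "auto",
    "random_state": 0
}
```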
Note that HDF5 files may be viewed using hdfview or loaded in Python using PyTables or h5py.
Calibration sets
The choice of calibration set will determine the conversion between model scores and calibration-set sensitivities. Ideally, the calibration set should comprise an unbiased sample from the full distribution of true sites in annotation space; the score-sensitivity conversion can roughly be thought of as a mapping from sensitivities in [0, 1] to a contour of this annotation-space distribution. In practice, any biases in the calibration set (e.g., if it consists of high-quality, previously filtered calls, which may be biased towards the high-density regions of the full distribution) will be reflected in the conversion and should be taken into consideration when interpreting calibration-set sensitivities.
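The score-to-sensitivity conversion can be sketched as follows. This is an illustrative simplification, not the tool's implementation: the sensitivity associated with a score threshold is taken to be the fraction of calibration-set sites whose scores meet or exceed that threshold:

```python
def calibration_sensitivity(calibration_scores, score_threshold):
    """Fraction of calibration-set sites that would be retained if all
    calls scoring below score_threshold were filtered out."""
    retained = sum(1 for s in calibration_scores if s >= score_threshold)
    return retained / len(calibration_scores)

# Example: toy calibration-set scores (lower = more artifact-like).
scores = [-1.2, -0.8, -0.5, -0.3, 0.1]
print(calibration_sensitivity(scores, -0.5))  # 0.6: three of five sites retained
```

A biased calibration set shifts this empirical distribution, which is why the resulting sensitivities must be interpreted with the set's composition in mind.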
Inputs
- Labeled-annotations HDF5 file (.annot.hdf5). Annotation data and metadata for labeled sites are stored in the HDF5 directory structure given in the documentation for the {@link ExtractVariantAnnotations} tool. In typical usage, both the "training" and "calibration" labels would be available for non-empty sets of sites of the requested variant type.
- (Optional) Unlabeled-annotations HDF5 file (.unlabeled.annot.hdf5). Annotation data and metadata for unlabeled sites are stored in the HDF5 directory structure given in the documentation for the {@link ExtractVariantAnnotations} tool. If provided, a positive-unlabeled modeling approach will be used.
- Variant types (i.e., SNP and/or INDEL) for which to train models. Logic for determining variant type was retained from {@link VariantRecalibrator}; see {@link VariantType}. A separate model will be trained for each variant type and separate sets of outputs with corresponding tags in the filenames (i.e., "snp" or "indel") will be produced. Alternatively, the tool can be run twice, once for each variant type; this may be useful if one wishes to use different argument values or modeling approaches.
- (Optional) Model backend. The Python isolation-forest backend is currently the default backend. A custom backend can also be specified in conjunction with the "--python-script" argument.
- (Optional) Model hyperparameters JSON file. This file can be used to specify backend-specific hyperparameters in JSON format, which is to be consumed by the modeling script. This is required if a custom backend is used.
- (Optional) Calibration-set sensitivity threshold. The same threshold will be used for both SNP and INDEL variant types. If different thresholds are desired, the tool can be run twice, once for each variant type.
- Output prefix. This is used as the basename for output files.
Outputs
The following outputs are produced for each variant type specified by the "--mode" argument and are delineated by type-specific tags in the filename of each output, which take the form of {output-prefix}.{variant-type}.{file-suffix}. For example, scores for the SNP calibration set will be output to the {output-prefix}.snp.calibrationScores.hdf5 file.
- Training-set positive-model scores HDF5 file (.trainingScores.hdf5).
- Positive-model serialized scorer file. (.scorer.pkl for the default PYTHON_IFOREST model backend).
- (Optional) Calibration-set scores HDF5 file (.calibrationScores.hdf5). This is only output if a calibration set is provided.
Usage examples
Train SNP and INDEL models using the default Python IsolationForest model backend with a positive-only approach, given an input labeled-annotations HDF5 file generated by {@link ExtractVariantAnnotations} that contains labels for both training and calibration sets, producing the outputs 1) train.snp.scorer.pkl, 2) train.snp.trainingScores.hdf5, and 3) train.snp.calibrationScores.hdf5, as well as analogous files for the INDEL model. Note that the "--mode" arguments are made explicit here, although both SNP and INDEL modes are selected by default.
gatk TrainVariantAnnotationsModel \
    --annotations-hdf5 extract.annot.hdf5 \
    --mode SNP \
    --mode INDEL \
    -O train
Custom modeling/scoring backends (ADVANCED)
The primary modeling functionality performed by this tool is accomplished by a "modeling backend" whose fundamental contract is to take an input HDF5 file containing an annotation matrix for sites of a single variant type (i.e., SNP or INDEL) (as well as an analogous HDF5 file for unlabeled sites, if a positive-unlabeled modeling approach has been specified) and to output a serialized scorer for that variant type. Rather than using one of the available, implemented backends, advanced users may provide their own backend via the "--python-script" argument. See documentation in the modeling and scoring interfaces ({@link VariantAnnotationsModel} and {@link VariantAnnotationsScorer}, respectively), as well as the default Python IsolationForest implementation at {@link PythonVariantAnnotationsModel} and src/main/resources/org/broadinstitute/hellbender/tools/walkers/vqsr/scalable/isolation-forest.py.
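The shape of this contract can be sketched in plain Python. The real interfaces are defined by {@link VariantAnnotationsModel} and {@link VariantAnnotationsScorer}, and the reference backend reads HDF5 with h5py and serializes a scikit-learn IsolationForest with dill; the toy scorer and pickle serialization below are stand-ins used purely to illustrate "consume an annotation matrix, persist a serialized scoring function":

```python
import os
import pickle
import tempfile
from statistics import mean

class MeanDistanceScorer:
    """Toy scorer: less-negative scores for sites closer to the
    per-annotation means of the training data. Illustrative only."""
    def __init__(self, training_matrix):
        n_cols = len(training_matrix[0])
        self.means = [mean(row[j] for row in training_matrix) for j in range(n_cols)]

    def score(self, annotations):
        return -sum(abs(a - m) for a, m in zip(annotations, self.means))

def train_and_serialize(training_matrix, scorer_path):
    """Train on the annotation matrix and persist the scorer to disk,
    mimicking the backend's output of a serialized scorer file."""
    scorer = MeanDistanceScorer(training_matrix)
    with open(scorer_path, "wb") as f:
        pickle.dump(scorer, f)
    return scorer

# Train on a toy annotation matrix and round-trip the serialized scorer.
path = os.path.join(tempfile.gettempdir(), "toy.scorer.pkl")
train_and_serialize([[1.0, 2.0], [3.0, 4.0]], path)
with open(path, "rb") as f:
    reloaded = pickle.load(f)
print(reloaded.score([3.0, 4.0]))  # -2.0: distance 1.0 from each training mean
```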
Extremely advanced users could potentially substitute their own implementation for the entire {@link TrainVariantAnnotationsModel} tool, while still making use of the up/downstream {@link ExtractVariantAnnotations} and {@link ScoreVariantAnnotations} tools. To do so, one would additionally have to implement functionality for subsetting training/calibration sets by variant type, calling modeling backends as appropriate, and scoring calibration sets.
@author Samuel Lee <slee@broadinstitute.org>

TrainVariantAnnotationsModel specific arguments
This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.
Argument name(s) | Default value | Summary
---|---|---
Required Arguments | |
--annotations-hdf5 | | HDF5 file containing annotations extracted with ExtractVariantAnnotations.
--output, -O | | Output prefix.
Optional Tool Arguments | |
--arguments_file | | read one or more arguments files and add them to the command line
--gcs-max-retries, -gcs-retries | 20 | If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection
--gcs-project-for-requester-pays | | Project to bill when accessing "requester pays" buckets. If unset, these buckets cannot be accessed. User must have storage.buckets.get permission on the bucket being accessed.
--help, -h | false | display the help message
--hyperparameters-json | | JSON file containing hyperparameters. Optional if the PYTHON_IFOREST backend is used (if not specified, a default set of hyperparameters will be used); otherwise required.
--mode | [SNP, INDEL] | Variant types for which to train models. Duplicate values will be ignored.
--model-backend | PYTHON_IFOREST | Backend to use for training models. JAVA_BGMM will use a pure Java implementation (ported from Python scikit-learn) of the Bayesian Gaussian Mixture Model. PYTHON_IFOREST will use the Python scikit-learn implementation of the IsolationForest method and will require that the corresponding Python dependencies are present in the environment. PYTHON_SCRIPT will use the script specified by the python-script argument. See the tool documentation for more details.
--python-script | | Python script used for specifying a custom scoring backend. If provided, model-backend must also be set to PYTHON_SCRIPT.
--unlabeled-annotations-hdf5 | | HDF5 file containing annotations extracted with ExtractVariantAnnotations. If specified, a positive-unlabeled modeling approach will be used; otherwise, a positive-only modeling approach will be used.
--version | false | display the version number for this tool
Optional Common Arguments | |
--gatk-config-file | | A configuration file to use with the GATK.
--QUIET | false | Whether to suppress job-summary info on System.err.
--tmp-dir | | Temp directory to use.
--use-jdk-deflater, -jdk-deflater | false | Whether to use the JdkDeflater (as opposed to IntelDeflater)
--use-jdk-inflater, -jdk-inflater | false | Whether to use the JdkInflater (as opposed to IntelInflater)
--verbosity | INFO | Control verbosity of logging.
Advanced Arguments | |
--showHidden | false | display hidden arguments
Argument details
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
--annotations-hdf5
HDF5 file containing annotations extracted with ExtractVariantAnnotations.
R File null
--arguments_file
read one or more arguments files and add them to the command line
List[File] []
--gatk-config-file
A configuration file to use with the GATK.
String null
--gcs-max-retries / -gcs-retries
If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection
int 20 [-∞, ∞]
--gcs-project-for-requester-pays
Project to bill when accessing "requester pays" buckets. If unset, these buckets cannot be accessed. User must have storage.buckets.get permission on the bucket being accessed.
String ""
--help / -h
display the help message
boolean false
--hyperparameters-json
JSON file containing hyperparameters. Optional if the PYTHON_IFOREST backend is used (if not specified, a default set of hyperparameters will be used); otherwise required.
File null
--mode
Variant types for which to train models. Duplicate values will be ignored.
The --mode argument is an enumerated type (List[VariantType]), which can have one of the following values:
- SNP
- INDEL
List[VariantType] [SNP, INDEL]
--model-backend
Backend to use for training models. JAVA_BGMM will use a pure Java implementation (ported from Python scikit-learn) of the Bayesian Gaussian Mixture Model. PYTHON_IFOREST will use the Python scikit-learn implementation of the IsolationForest method and will require that the corresponding Python dependencies are present in the environment. PYTHON_SCRIPT will use the script specified by the python-script argument. See the tool documentation for more details.
The --model-backend argument is an enumerated type (VariantAnnotationsModelBackend), which can have one of the following values:
- JAVA_BGMM
- PYTHON_IFOREST
- Use the script at org/broadinstitute/hellbender/tools/walkers/vqsr/scalable/isolation-forest.py
- PYTHON_SCRIPT
- Use a user-provided script.
VariantAnnotationsModelBackend PYTHON_IFOREST
--output / -O
Output prefix.
R String null
--python-script
Python script used for specifying a custom scoring backend. If provided, model-backend must also be set to PYTHON_SCRIPT.
File null
--QUIET
Whether to suppress job-summary info on System.err.
Boolean false
--showHidden / -showHidden
display hidden arguments
boolean false
--tmp-dir
Temp directory to use.
GATKPath null
--unlabeled-annotations-hdf5
HDF5 file containing annotations extracted with ExtractVariantAnnotations. If specified, a positive-unlabeled modeling approach will be used; otherwise, a positive-only modeling approach will be used.
File null
--use-jdk-deflater / -jdk-deflater
Whether to use the JdkDeflater (as opposed to IntelDeflater)
boolean false
--use-jdk-inflater / -jdk-inflater
Whether to use the JdkInflater (as opposed to IntelInflater)
boolean false
--verbosity / -verbosity
Control verbosity of logging.
The --verbosity argument is an enumerated type (LogLevel), which can have one of the following values:
- ERROR
- WARNING
- INFO
- DEBUG
LogLevel INFO
--version
display the version number for this tool
boolean false
GATK version 4.6.0.0-33-gdffedfb built at Wed, 23 Oct 2024 21:44:48 -0400.