Trains a model for scoring variant calls based on site-level annotations
Category Variant Filtering
Overview
Trains a model for scoring variant calls based on site-level annotations. This tool is intended to be used as the second step in a variant-filtering workflow that supersedes the VariantRecalibrator workflow. Given training (and optionally, calibration) sets of site-level annotations produced by ExtractVariantAnnotations, this tool can be used to train a model for scoring variant calls. For each variant type (i.e., SNP or INDEL) specified using the MODE_LONG_NAME argument, the tool outputs files that are either: 1) serialized scorers, each of which persists to disk a function for computing scores given subsequent annotations, or 2) HDF5 files containing a set of scores, each corresponding to training, calibration, and unlabeled sets, as appropriate.
The model files produced by this tool can in turn be provided along with a VCF file to the ScoreVariantAnnotations tool, which assigns a score to each call (with a lower score indicating that a call is more likely to be an artifact and should perhaps be filtered). Each score can also be converted to a corresponding sensitivity with respect to a calibration set, if the latter is available.
Modeling approaches
This tool can perform modeling using either a positive-only approach or a positive-negative approach. In a positive-only approach, the annotation-space distribution of training sites is used to learn a function for converting annotations for subsequent sites into a score; typically, higher scores correspond to regions of annotation space that are more densely populated by training sites. In contrast, a positive-negative approach attempts to additionally use unlabeled sites to better identify regions of annotation space that correspond to low scores against the original, positive-only model (with the assumption being that unlabeled sites are more likely to populate such regions than are training sites). A second, negative model can then be trained, and the resulting scores (which are presumably higher in regions of annotation space that are less densely populated by the original training sites) can be subtracted from the original scores to produce a final score. (Note that this positive-negative approach could be considered as a single iteration of a more general approach typically referred to as positive-unlabeled learning.)
A positive-only approach is likely to perform well in cases where a sufficient number of reliable training sites is available. In contrast, if 1) only a small number of reliable training sites is available, and/or 2) the reliability of the training sites is questionable (e.g., the sites may be contaminated by a non-negligible number of sequencing artifacts), then a positive-negative approach may be beneficial. However, note that the positive-negative approach introduces an additional hyperparameter (the threshold that determines the selection of sites for training the negative model, controlled by the CALIBRATION_SENSITIVITY_THRESHOLD_LONG_NAME argument), which may require tuning. Further note that although VariantRecalibrator (which this tool supplants) has typically been used to implement a positive-negative approach, a positive-only approach likely suffices in many use cases.
If a positive-only approach has been specified, then if training sites of the variant type are available:
- 1) A positive model is trained using these training sites and is serialized to file,
- 2) Scores for these training sites are generated using the positive model and output to a file,
- 3) If calibration sites of the variant type are available, scores for these calibration sites are generated using the positive model and output to a file.
If a positive-negative approach has been specified, the steps above are performed, followed by the steps below (assuming that both training and unlabeled sites of the variant type are available):
- 4) The calibration scores generated from the positive model are used to convert the calibration-sensitivity threshold into a score threshold,
- 5) Training sites with scores below the score threshold are selected for training a negative model,
- 6) Scores for unlabeled sites are generated using the positive model and output to a file,
- 7) Unlabeled sites with scores below the score threshold are selected for training a negative model,
- 8) A negative model is trained using these selected training and unlabeled sites and is serialized to file,
- 9) Scores for calibration sites are generated using the positive-negative model and overwritten in the existing file.
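The following minimal Python sketch illustrates the flow of these steps using scikit-learn's IsolationForest, in the spirit of the default backend. The annotation matrices and the 0.95 calibration-sensitivity threshold are placeholder assumptions; this is a conceptual outline, not the tool's actual implementation.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Stand-ins for extracted annotation matrices (sites x annotations).
X_train = np.random.rand(1000, 4)
X_calibration = np.random.rand(200, 4)
X_unlabeled = np.random.rand(5000, 4)

# Steps 1-3, 6: train the positive model and score all sets.
positive_model = IsolationForest(random_state=0).fit(X_train)
train_scores = positive_model.score_samples(X_train)
calibration_scores = positive_model.score_samples(X_calibration)
unlabeled_scores = positive_model.score_samples(X_unlabeled)

# Step 4: convert an example calibration-sensitivity threshold (0.95) into
# a score threshold; a higher sensitivity yields a lower score threshold.
score_threshold = np.quantile(calibration_scores, 1.0 - 0.95)

# Steps 5, 7-8: train the negative model on low-scoring sites.
X_negative = np.vstack([X_train[train_scores < score_threshold],
                        X_unlabeled[unlabeled_scores < score_threshold]])
negative_model = IsolationForest(random_state=0).fit(X_negative)

# Step 9: final score = positive score minus negative score.
final_calibration_scores = (calibration_scores
                            - negative_model.score_samples(X_calibration))
```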
Modeling backends
This tool allows the use of different backends for modeling and scoring. See also the section on custom backends below for instructions on using a custom, user-provided implementation.
Python isolation-forest backend
This backend uses scikit-learn modules to train models and scoring functions using the isolation-forest method for anomaly detection. Median imputation of missing annotation values is performed before applying the method.
This backend can be selected by specifying PYTHON_IFOREST to the MODEL_BACKEND_LONG_NAME argument and is also currently the default backend. It is implemented by the script at src/main/resources/org/broadinstitute/hellbender/tools/walkers/vqsr/scalable/isolation-forest.py, which requires that the argparse, h5py, numpy, sklearn, and dill packages be present in the Python environment; users may wish to simply use the provided GATK conda environment to ensure that the correct versions of all packages are available. See the scikit-learn IsolationForest documentation appropriate to the version of scikit-learn in your Python environment. The hyperparameters documented there can be specified using the HYPERPARAMETERS_JSON_LONG_NAME argument; see src/main/resources/org/broadinstitute/hellbender/tools/walkers/vqsr/scalable/isolation-forest-hyperparameters.json for an example and the default values.
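As a rough sketch of the backend's core logic (not the actual script), the training step amounts to something like the following; the annotation matrix and the hyperparameters dictionary are placeholder assumptions, with the latter standing in for the contents of the JSON file.

```python
import dill
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.impute import SimpleImputer

hyperparameters = {"random_state": 0}  # stands in for the JSON file contents
X = np.array([[0.5, np.nan], [0.7, 1.2], [0.6, 1.1]])  # stand-in annotations

# Median imputation of missing annotation values precedes model fitting.
imputer = SimpleImputer(strategy="median").fit(X)
model = IsolationForest(**hyperparameters).fit(imputer.transform(X))

# Persist a scoring function to disk; dill (unlike pickle) can serialize
# the closure over imputer and model.
def score_samples(annotations):
    return model.score_samples(imputer.transform(annotations))

with open("scorer.pkl", "wb") as f:
    dill.dump(score_samples, f)
```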
Note that HDF5 files may be viewed using hdfview or loaded in Python using PyTables or h5py.
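For example, a quick way to inspect such a file with h5py (the file name here matches the usage examples below):

```python
import h5py

# Walk a scores HDF5 file produced by this tool and print its groups
# and datasets.
with h5py.File("train.snp.trainingScores.hdf5", "r") as f:
    f.visititems(lambda name, item: print(name, item))
```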
Calibration sets
The choice of calibration set will determine the conversion between model scores and calibration-set sensitivities. Ideally, the calibration set should consist of an unbiased sample from the full distribution of true sites in annotation space; the score-sensitivity conversion can roughly be thought of as a mapping from sensitivities in [0, 1] to a contour of this annotation-space distribution. In practice, any biases in the calibration set (e.g., if it consists of high-quality, previously filtered calls, which may be biased towards the high-density regions of the full distribution) will be reflected in the conversion and should be taken into consideration when interpreting calibration-set sensitivities.
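As a hedged illustration of this conversion (not the tool's implementation), assume an array of positive-model scores for calibration sites, where a lower score is more artifact-like:

```python
import numpy as np

calibration_scores = np.random.rand(1000)  # stand-in calibration scores

# A calibration-sensitivity threshold of 0.95 maps to the score below which
# only 5% of calibration sites fall; increasing the sensitivity threshold
# therefore decreases the score threshold.
score_threshold = np.quantile(calibration_scores, 1.0 - 0.95)

# Conversely, a site's score maps to the fraction of calibration sites that
# would be retained at that score.
def calibration_sensitivity(score):
    return float(np.mean(calibration_scores >= score))
```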
Inputs
- Labeled-annotations HDF5 file (.annot.hdf5). Annotation data and metadata for labeled sites are stored in the HDF5 directory structure given in the documentation for the ExtractVariantAnnotations tool. In typical usage, both the LabeledVariantAnnotationsData#TRAINING_LABEL and LabeledVariantAnnotationsData#CALIBRATION_LABEL labels would be available for non-empty sets of sites of the requested variant type.
- (Optional) Unlabeled-annotations HDF5 file (.unlabeled.annot.hdf5). Annotation data and metadata for unlabeled sites are stored in the HDF5 directory structure given in the documentation for the ExtractVariantAnnotations tool. If provided, a positive-negative modeling approach (similar to that used in VariantRecalibrator) will be used.
- Variant types (i.e., SNP and/or INDEL) for which to train models. Logic for determining variant type was retained from VariantRecalibrator; see VariantType. A separate model will be trained for each variant type and separate sets of outputs with corresponding tags in the filenames (i.e., "snp" or "indel") will be produced. Alternatively, the tool can be run twice, once for each variant type; this may be useful if one wishes to use different argument values or modeling approaches.
- (Optional) Model backend. The Python isolation-forest backend is currently the default backend. A custom backend can also be specified in conjunction with the PYTHON_SCRIPT_LONG_NAME argument.
- (Optional) Model hyperparameters JSON file. This file can be used to specify backend-specific hyperparameters in JSON format, which is to be consumed by the modeling script. This is required if a custom backend is used.
- (Optional) Calibration-set sensitivity threshold. The same threshold will be used for both SNP and INDEL variant types. If different thresholds are desired, the tool can be run twice, once for each variant type; see the example following this list.
- Output prefix. This is used as the basename for output files.
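For instance, separate runs with different calibration-sensitivity thresholds might look like the following (the threshold values here are illustrative); since outputs are tagged by variant type, both runs can share the same output prefix:

```
gatk TrainVariantAnnotationsModel \
    --annotations-hdf5 extract.annot.hdf5 \
    --unlabeled-annotations-hdf5 extract.unlabeled.annot.hdf5 \
    --mode SNP \
    --calibration-sensitivity-threshold 0.95 \
    -O train

gatk TrainVariantAnnotationsModel \
    --annotations-hdf5 extract.annot.hdf5 \
    --unlabeled-annotations-hdf5 extract.unlabeled.annot.hdf5 \
    --mode INDEL \
    --calibration-sensitivity-threshold 0.99 \
    -O train
```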
Outputs
The following outputs are produced for each variant type specified by the MODE_LONG_NAME argument and are delineated by type-specific tags in the filename of each output, which take the form of {output-prefix}.{variant-type}.{file-suffix}. For example, scores for the SNP calibration set will be output to the {output-prefix}.snp.calibrationScores.hdf5 file.
- Training-set positive-model scores HDF5 file (.trainingScores.hdf5).
- Positive-model serialized scorer file (.scorer.pkl for the default PYTHON_IFOREST model backend).
- (Optional) Unlabeled-set positive-model scores HDF5 file (.unlabeledScores.hdf5). This is only output if a positive-negative modeling approach is used.
- (Optional) Calibration-set scores HDF5 file (.calibrationScores.hdf5). This is only output if a calibration set is provided. If a positive-only modeling approach is used, scores will be generated from the positive model; if a positive-negative modeling approach is used, scores will be generated from the positive-negative model.
- (Optional) Negative-model serialized scorer file (.negative.scorer.pkl for the default PYTHON_IFOREST model backend). This is only output if a positive-negative modeling approach is used. A sketch of loading these scorer files in Python follows this list.
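The serialized scorer files from the default backend can be loaded in Python, assuming (as the backend's dill dependency suggests) that they were written with dill; what the loaded object exposes is an implementation detail of the backend script.

```python
import dill

# Load a serialized positive-model scorer produced by the default
# PYTHON_IFOREST backend; the file name matches the usage examples below.
with open("train.snp.scorer.pkl", "rb") as f:
    scorer = dill.load(f)
```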
Usage examples
Train SNP and INDEL models using the default Python IsolationForest model backend with a positive-only approach, given an input labeled-annotations HDF5 file generated by ExtractVariantAnnotations that contains labels for both training and calibration sets, producing the outputs 1) train.snp.scorer.pkl, 2) train.snp.trainingScores.hdf5, and 3) train.snp.calibrationScores.hdf5, as well as analogous files for the INDEL model. Note that the MODE_LONG_NAME arguments are made explicit here, although both SNP and INDEL modes are selected by default.
```
gatk TrainVariantAnnotationsModel \
    --annotations-hdf5 extract.annot.hdf5 \
    --mode SNP \
    --mode INDEL \
    -O train
```
Train SNP and INDEL models using the default Python IsolationForest model backend with a positive-negative approach (using a calibration-sensitivity threshold of 0.95 to select sites for training the negative model), given an input labeled-annotations HDF5 file that contains labels for both training and calibration sets and an input unlabeled-annotations HDF5 file (with both HDF5 files generated by ExtractVariantAnnotations), producing the outputs 1) train.snp.scorer.pkl, 2) train.snp.negative.scorer.pkl, 3) train.snp.trainingScores.hdf5, 4) train.snp.calibrationScores.hdf5, and 5) train.snp.unlabeledScores.hdf5, as well as analogous files for the INDEL model. Note that the MODE_LONG_NAME arguments are made explicit here, although both SNP and INDEL modes are selected by default.
```
gatk TrainVariantAnnotationsModel \
    --annotations-hdf5 extract.annot.hdf5 \
    --unlabeled-annotations-hdf5 extract.unlabeled.annot.hdf5 \
    --mode SNP \
    --mode INDEL \
    --calibration-sensitivity-threshold 0.95 \
    -O train
```
Custom modeling/scoring backends (ADVANCED)
The primary modeling functionality performed by this tool is accomplished by a "modeling backend" whose fundamental contract is to take an input HDF5 file containing an annotation matrix for sites of a single variant type (i.e., SNP or INDEL) and to output a serialized scorer for that variant type. Rather than using one of the available, implemented backends, advanced users may provide their own backend via the PYTHON_SCRIPT_LONG_NAME argument. See documentation in the modeling and scoring interfaces (VariantAnnotationsModel and VariantAnnotationsScorer, respectively), as well as the default Python IsolationForest implementation at PythonSklearnVariantAnnotationsModel and src/main/resources/org/broadinstitute/hellbender/tools/walkers/vqsr/scalable/isolation-forest.py.
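A skeleton of a custom backend might look like the following. The command-line argument names, the HDF5 dataset path, and the output file name shown here are assumptions for illustration only; consult the default isolation-forest.py implementation for the actual contract the tool uses when invoking the script.

```python
import argparse

import dill
import h5py
import numpy as np

# Skeleton of a hypothetical custom modeling backend; argument names and
# output conventions are assumptions, not the tool's documented contract.
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--annotations_file")  # hypothetical argument name
    parser.add_argument("--output_prefix")     # hypothetical argument name
    args = parser.parse_args()

    with h5py.File(args.annotations_file, "r") as f:
        # Dataset path is illustrative; see the ExtractVariantAnnotations
        # documentation for the actual HDF5 layout.
        annotations = np.array(f["/annotations"])

    mean = annotations.mean(axis=0)  # a trivial stand-in "model"

    def score_samples(X):
        # Toy score: negative distance from the training mean, so sites far
        # from the training distribution receive lower scores.
        return -np.linalg.norm(X - mean, axis=1)

    with open(args.output_prefix + ".scorer.pkl", "wb") as f:
        dill.dump(score_samples, f)

if __name__ == "__main__":
    main()
```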
Extremely advanced users could potentially substitute their own implementation for the entire TrainVariantAnnotationsModel tool, while still making use of the up/downstream ExtractVariantAnnotations and ScoreVariantAnnotations tools. To do so, one would additionally have to implement functionality for subsetting training/calibration sets by variant type, calling modeling backends as appropriate, and scoring calibration sets.
TrainVariantAnnotationsModel specific arguments
This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list below the table, or click on an argument name to jump directly to that entry in the list.
| Argument name(s) | Default value | Summary |
|---|---|---|
| **Required Arguments** | | |
| --annotations-hdf5 | | HDF5 file containing annotations extracted with ExtractVariantAnnotations. |
| --output, -O | | Output prefix. |
| **Optional Tool Arguments** | | |
| --arguments_file | | read one or more arguments files and add them to the command line |
| --calibration-sensitivity-threshold | | Calibration-sensitivity threshold that determines which sites will be used for training the negative model in the positive-unlabeled modeling approach. Increasing this will decrease the corresponding positive-model score threshold; sites with scores below this score threshold will be used for training the negative model. Thus, this parameter should typically be chosen to be close to 1, so that sites that score highly according to the positive model will not be used to train the negative model. The unlabeled-annotations-hdf5 argument must be specified in conjunction with this argument. If separate thresholds for SNP and INDEL models are desired, run the tool separately for each mode with its respective threshold. |
| --gcs-max-retries, -gcs-retries | 20 | If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection |
| --gcs-project-for-requester-pays | | Project to bill when accessing "requester pays" buckets. If unset, these buckets cannot be accessed. User must have storage.buckets.get permission on the bucket being accessed. |
| --help, -h | false | display the help message |
| --hyperparameters-json | | JSON file containing hyperparameters. Optional if the PYTHON_IFOREST backend is used (if not specified, a default set of hyperparameters will be used); otherwise required. |
| --mode | [SNP, INDEL] | Variant types for which to train models. Duplicate values will be ignored. |
| --model-backend | PYTHON_IFOREST | Backend to use for training models. JAVA_BGMM will use a pure Java implementation (ported from Python scikit-learn) of the Bayesian Gaussian Mixture Model. PYTHON_IFOREST will use the Python scikit-learn implementation of the IsolationForest method and will require that the corresponding Python dependencies are present in the environment. PYTHON_SCRIPT will use the script specified by the python-script argument. See the tool documentation for more details. |
| --python-script | | Python script used for specifying a custom scoring backend. If provided, model-backend must also be set to PYTHON_SCRIPT. |
| --unlabeled-annotations-hdf5 | | HDF5 file containing annotations extracted with ExtractVariantAnnotations. If specified with calibration-sensitivity-threshold, a positive-unlabeled modeling approach will be used; otherwise, a positive-only modeling approach will be used. |
| --version | false | display the version number for this tool |
| **Optional Common Arguments** | | |
| --gatk-config-file | | A configuration file to use with the GATK. |
| --QUIET | false | Whether to suppress job-summary info on System.err. |
| --tmp-dir | | Temp directory to use. |
| --use-jdk-deflater, -jdk-deflater | false | Whether to use the JdkDeflater (as opposed to IntelDeflater) |
| --use-jdk-inflater, -jdk-inflater | false | Whether to use the JdkInflater (as opposed to IntelInflater) |
| --verbosity | INFO | Control verbosity of logging. |
| **Advanced Arguments** | | |
| --showHidden | false | display hidden arguments |
Argument details
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
--annotations-hdf5
HDF5 file containing annotations extracted with ExtractVariantAnnotations.
R File null
--arguments_file
read one or more arguments files and add them to the command line
List[File] []
--calibration-sensitivity-threshold
Calibration-sensitivity threshold that determines which sites will be used for training the negative model in the positive-unlabeled modeling approach. Increasing this will decrease the corresponding positive-model score threshold; sites with scores below this score threshold will be used for training the negative model. Thus, this parameter should typically be chosen to be close to 1, so that sites that score highly according to the positive model will not be used to train the negative model. The unlabeled-annotations-hdf5 argument must be specified in conjunction with this argument. If separate thresholds for SNP and INDEL models are desired, run the tool separately for each mode with its respective threshold.
Double null
--gatk-config-file
A configuration file to use with the GATK.
String null
--gcs-max-retries / -gcs-retries
If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection
int 20 [-∞, ∞]
--gcs-project-for-requester-pays
Project to bill when accessing "requester pays" buckets. If unset, these buckets cannot be accessed. User must have storage.buckets.get permission on the bucket being accessed.
String ""
--help / -h
display the help message
boolean false
--hyperparameters-json
JSON file containing hyperparameters. Optional if the PYTHON_IFOREST backend is used (if not specified, a default set of hyperparameters will be used); otherwise required.
File null
--mode
Variant types for which to train models. Duplicate values will be ignored.
The --mode argument is an enumerated type (List[VariantType]), which can have one of the following values:
- SNP
- INDEL
List[VariantType] [SNP, INDEL]
--model-backend
Backend to use for training models. JAVA_BGMM will use a pure Java implementation (ported from Python scikit-learn) of the Bayesian Gaussian Mixture Model. PYTHON_IFOREST will use the Python scikit-learn implementation of the IsolationForest method and will require that the corresponding Python dependencies are present in the environment. PYTHON_SCRIPT will use the script specified by the python-script argument. See the tool documentation for more details.
The --model-backend argument is an enumerated type (VariantAnnotationsModelBackend), which can have one of the following values:
- JAVA_BGMM
- PYTHON_IFOREST: Use the script at org/broadinstitute/hellbender/tools/walkers/vqsr/scalable/isolation-forest.py
- PYTHON_SCRIPT: Use a user-provided script.
VariantAnnotationsModelBackend PYTHON_IFOREST
--output / -O
Output prefix.
R String null
--python-script
Python script used for specifying a custom scoring backend. If provided, model-backend must also be set to PYTHON_SCRIPT.
File null
--QUIET
Whether to suppress job-summary info on System.err.
Boolean false
--showHidden / -showHidden
display hidden arguments
boolean false
--tmp-dir
Temp directory to use.
GATKPath null
--unlabeled-annotations-hdf5
HDF5 file containing annotations extracted with ExtractVariantAnnotations. If specified with calibration-sensitivity-threshold, a positive-unlabeled modeling approach will be used; otherwise, a positive-only modeling approach will be used.
File null
--use-jdk-deflater / -jdk-deflater
Whether to use the JdkDeflater (as opposed to IntelDeflater)
boolean false
--use-jdk-inflater / -jdk-inflater
Whether to use the JdkInflater (as opposed to IntelInflater)
boolean false
--verbosity / -verbosity
Control verbosity of logging.
The --verbosity argument is an enumerated type (LogLevel), which can have one of the following values:
- ERROR
- WARNING
- INFO
- DEBUG
LogLevel INFO
--version
display the version number for this tool
boolean false
GATK version 4.3.0.0 built at Wed, 12 Oct 2022 21:04:44 -0400.