Calls copy-number variants in germline samples given their counts and the output of DetermineGermlineContigPloidy
Category Copy Number Variant Discovery
Overview
Calls copy-number variants in germline samples given their counts and the corresponding output of DetermineGermlineContigPloidy. The former should be either HDF5 or TSV count files generated by CollectFragmentCounts.Introduction
Reliable detection of copy-number variation (CNV) from read-depth ("coverage" or "counts") data such as whole exome sequencing (WES), whole genome sequencing (WGS), and custom gene sequencing panels requires a comprehensive model to account for technical variation in library preparation and sequencing. The Bayesian model and the associated inference scheme implemented in GermlineCNVCaller includes provisions for inferring and explaining away much of the technical variation. Furthermore, CNV calling confidence is automatically adjusted for each sample and genomic interval.
The parameters of the probabilistic model for read-depth bias and variance estimation (hereafter, "the coverage model") can be automatically inferred by GermlineCNVCaller by providing a cohort of germline samples sequenced using the same sequencing platform and library preparation protocol (in case of WES, the same capture kit). We refer to this run mode of the tool as the COHORT mode. The number of samples required for the COHORT mode depends on several factors such as the sequencing depth, tissue type/quality and similarity in the cohort, and the stringency of following the library preparation and sequencing protocols. For WES and WGS samples, we recommend including at least 30 samples.
The parametrized coverage model can be used for CNV calling on future case samples provided that they are strictly compatible with the cohort used to generate the model parameters (in terms of tissue type(s), library preparation and sequencing protocols). We refer to this mode as the CASE run mode. There is no lower limit on the number of samples for running GermlineCNVCaller in the case mode.
In both tool run modes, GermlineCNVCaller requires karyotyping and global read-depth information for all samples. Such information can be automatically generated by running DetermineGermlineContigPloidy on all samples, and passed on to GermlineCNVCaller by providing the ploidy output calls using the argument contig-ploidy-calls. The ploidy state of a contig is used as the baseline ("default") copy-number state of all intervals contained in that contig for the corresponding sample. All other copy-number states are treated as alternative states and get equal shares from the total alternative state probability (set using the p-alt argument).
Python environment setup
The computation done by this tool, aside from input data parsing and validation, is performed outside of the Java Virtual Machine and using the gCNV computational python module, namely gcnvkernel. It is crucial that the user has properly set up a python conda environment with gcnvkernel and its dependencies installed. If the user intends to run GermlineCNVCaller using one of the official GATK Docker images, the python environment is already set up. Otherwise, the environment must be created and activated as described in the main GATK README.md file.
Tool run modes
- COHORT mode:
-
The tool will be run in the COHORT mode using the argument run-mode COHORT. In this mode, coverage model parameters are inferred simultaneously with the CNV states. Depending on available memory, it may be necessary to run the tool over a subset of all intervals, which can be specified by -L. The specified intervals must be present in all of the input count files. The output will contain two subdirectories, one ending with "-model" and the other with "-calls".
The model subdirectory contains a snapshot of the inferred parameters of the coverage model, which may be used later for CNV calling in one or more similarly-sequenced samples as mentioned earlier. Optionally, the path to a previously obtained coverage model parametrization can be provided via the model argument in the COHORT mode, in which case, the provided parameters will be only used for model initialization and a new parametrization will be generated based on the input count files. Furthermore, the genomic intervals are set to those used in creating the previous parameterization and interval-related arguments will be ignored. Note that the newly obtained parametrization ultimately reflects the input count files from the last run, regardless of whether or not an initialization parameter set is given. If the users wishes to model coverage data from two or more cohorts simultaneously, all of the input counts files must be given to the tool at once.
The calls subdirectory contains sequentially-ordered subdirectories for each sample, each listing various sample-specific estimated quantities such as the probability of various copy-number states for each interval, a parametrization of the GC curve, sample-specific unexplained variance, read-depth, and loadings of coverage bias factors.
- CASE mode:
-
The tool will be run in the CASE mode using the argument run-mode CASE. The path to a previously obtained coverage model parameter bundle must be provided via the model argument in this mode. The modeled intervals are specified by the parameter bundle, all interval-related arguments are ignored in this mode, and all model intervals must be present in all of the input count files. The tool output in the CASE mode is only the "-calls" subdirectory and is organized similarly to the COHORT mode.
Note that at the moment, this tool does not automatically verify the compatibility of the provided parametrization with the provided count files. Model compatibility may be assessed a posteriori by inspecting the magnitude of sample-specific unexplained variance of each sample, and asserting that they lie within the same range as those obtained from the cohort used to generate the parametrization. This manual step is expected to be made automatic in the future.
Important Remarks
- Choice of hyperparameters:
The quality of coverage model parametrization and the sensitivity/precision of germline CNV calling are sensitive to the choice of model hyperparameters, including the prior probability of alternative copy-number states (set using the p-alt argument), prevalence of active (i.e. CNV-rich) intervals (set via the p-active argument), the coherence length of CNV events and active/silent domains across intervals (set using the cnv-coherence-length and class-coherence-length arguments, respectively), and the typical scale of interval- and sample-specific unexplained variance (set using the interval-psi-scale and sample-psi-scale arguments, respectively). It is crucial to note that these hyperparameters are not universal and must be tuned for each sequencing protocol and properly set at runtime.
- Running GermlineCNVCaller on a subset of intervals:
As mentioned earlier, it may be necessary to run the tool over a subset of all intervals depending on available memory. The number of intervals must be large enough to include a genomically diverse set of regions for reliable inference of the GC bias curve, as well as other bias factors. For WES and WGS, we recommend no less than 10000 consecutive intervals spanning at least 10 - 50 mb.
- Memory requirements for the python subprocess ("gcnvkernel"):
The computation done by this tool, for the most part, is performed outside of JVM and via a spawned python subprocess. The Java heap memory is only used for loading sample counts and preparing raw data for the python subprocess. The user must ensure that the machine has enough free physical memory for spawning and executing the python subprocess. Generally speaking, the resource requirements of this tool scale linearly with each of the number of samples, the number of modeled intervals, the highest copy number state, the number of bias factors, and the number of knobs on the GC curve. For example, the python subprocess requires approximately 16GB of physical memory for modeling 10000 intervals for 100 samples, with 16 maximum bias factors, maximum copy-number state of 10, and explicit GC bias modeling.
Usage examples
COHORT mode:
gatk GermlineCNVCaller \ --run-mode COHORT \ -L intervals.interval_list \ --contig-ploidy-calls path_to_contig_ploidy_calls --input normal_1.counts.hdf5 \ --input normal_2.counts.hdf5 \ ... \ --output output_dir \ --output-prefix normal_cohort_run
CASE mode:
gatk GermlineCNVCaller \ --run-mode CASE \ -L intervals.interval_list \ --contig-ploidy-calls path_to_contig_ploidy_calls --model previous_model_path \ --input normal_1.counts.hdf5 \ ... \ --output output_dir \ --output-prefix normal_case_run
GermlineCNVCaller specific arguments
This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.
Argument name(s) | Default value | Summary | |
---|---|---|---|
Required Arguments | |||
--contig-ploidy-calls |
null | Input contig-ploidy calls directory (output of DetermlineGermlineContigPloidy). | |
--input |
[] | Input read-count files containing integer read counts in genomic intervals for all samples. All intervals specified via -L must be contained; if none are specified, then intervals must be identical and in the same order for all samples. | |
--output |
null | Output directory. | |
--output-prefix |
null | Prefix for output filenames. | |
--run-mode |
null | Tool run-mode. | |
Optional Tool Arguments | |||
--active-class-padding-hybrid-mode |
50000 | If copy-number-posterior-expectation-mode is set to HYBRID, CNV-active intervals determined at any time will be padded by this value (in the units of bp) in order to obtain the set of intervals on which copy number posterior expectation is performed exactly. | |
--adamax-beta-1 |
0.9 | Adamax optimizer first moment estimation forgetting factor. | |
--adamax-beta-2 |
0.99 | Adamax optimizer second moment estimation forgetting factor. | |
--annotated-intervals |
null | Input annotated-interval file containing annotations for GC content in genomic intervals (output of AnnotateIntervals). All intervals specified via -L must be contained. This input should not be provided if an input denoising-model directory is given (the latter already contains the annotated-interval file). | |
--arguments_file |
[] | read one or more arguments files and add them to the command line | |
--caller-admixing-rate |
0.75 | Admixing ratio of new and old called posteriors (between 0 and 1; larger values implies using more of the new posterior and less of the old posterior). | |
--caller-update-convergence-threshold |
0.001 | Maximum tolerated calling update size for convergence. | |
--class-coherence-length |
10000.0 | Coherence length of CNV-active and CNV-silent domains (in the units of bp). | |
--cnv-coherence-length |
10000.0 | Coherence length of CNV events (in the units of bp). | |
--convergence-snr-averaging-window |
500 | Averaging window for calculating training signal-to-noise ratio (SNR) for convergence checking. | |
--convergence-snr-countdown-window |
10 | The number of ADVI iterations during which the SNR is required to stay below the set threshold for convergence. | |
--convergence-snr-trigger-threshold |
0.1 | The SNR threshold to be reached before triggering the convergence countdown. | |
--copy-number-posterior-expectation-mode |
HYBRID | The strategy for calculating copy number posterior expectations in the coverage denoising model. | |
--depth-correction-tau |
10000.0 | Precision of read depth pinning to its global value. | |
--disable-annealing |
false | (advanced) Disable annealing. | |
--disable-caller |
false | (advanced) Disable caller. | |
--disable-sampler |
false | (advanced) Disable sampler. | |
--enable-bias-factors |
true | Enable discovery of bias factors. | |
--gc-curve-standard-deviation |
1.0 | Prior standard deviation of the GC curve from flat. | |
--gcs-max-retries -gcs-retries |
20 | If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection | |
--help -h |
false | display the help message | |
--init-ard-rel-unexplained-variance |
0.1 | Initial value of ARD prior precisions relative to the scale of interval-specific unexplained variance. | |
--initial-temperature |
2.0 | Initial temperature (for DA-ADVI). | |
--interval-merging-rule -imr |
ALL | Interval merging rule for abutting intervals | |
--interval-psi-scale |
0.001 | Typical scale of interval-specific unexplained variance. | |
--intervals -L |
[] | One or more genomic intervals over which to operate | |
--learning-rate |
0.05 | Adamax optimizer learning rate. | |
--log-emission-samples-per-round |
50 | Log emission samples drawn per round of sampling. | |
--log-emission-sampling-median-rel-error |
0.005 | Maximum tolerated median relative error in log emission sampling. | |
--log-emission-sampling-rounds |
10 | Log emission maximum sampling rounds. | |
--log-mean-bias-standard-deviation |
0.1 | Standard deviation of log mean bias. | |
--mapping-error-rate |
0.01 | Typical mapping error rate. | |
--max-advi-iter-first-epoch |
100 | Maximum ADVI iterations in the first epoch. | |
--max-advi-iter-subsequent-epochs |
100 | Maximum ADVI iterations in subsequent epochs. | |
--max-bias-factors |
5 | Maximum number of bias factors. | |
--max-calling-iters |
10 | Maximum number of internal self-consistency iterations within each calling step. | |
--max-copy-number |
5 | Highest allowed copy-number state. | |
--max-training-epochs |
50 | Maximum number of training epochs. | |
--min-training-epochs |
10 | Minimum number of training epochs. | |
--model |
null | Input denoising-model directory. In the COHORT mode, this argument is optional and if provided,a new model will be built using this input model to initialize. In the CASE mode, the denoising model parameters set to this input model and therefore, this argument is required. | |
--num-gc-bins |
20 | Number of bins used to represent the GC-bias curves. | |
--num-thermal-epochs |
20 | Number of thermal epochs (for DA-ADVI). | |
--p-active |
0.01 | Prior probability of treating an interval as CNV-active (in a CNV-active domains, all copy-number states are equally likely to be called). | |
--p-alt |
1.0E-6 | Total prior probability of alternative copy-number states (the reference copy-number is set to the contig integer ploidy) | |
--sample-psi-scale |
1.0E-4 | Typical scale of sample-specific correction to the unexplained variance. | |
--version |
false | display the version number for this tool | |
Optional Common Arguments | |||
--exclude-intervals -XL |
[] | One or more genomic intervals to exclude from processing | |
--gatk-config-file |
null | A configuration file to use with the GATK. | |
--interval-exclusion-padding -ixp |
0 | Amount of padding (in bp) to add to each interval you are excluding. | |
--interval-padding -ip |
0 | Amount of padding (in bp) to add to each interval you are including. | |
--interval-set-rule -isr |
UNION | Set merging approach to use for combining interval inputs | |
--QUIET |
false | Whether to suppress job-summary info on System.err. | |
--TMP_DIR |
[] | Undocumented option | |
--use-jdk-deflater -jdk-deflater |
false | Whether to use the JdkDeflater (as opposed to IntelDeflater) | |
--use-jdk-inflater -jdk-inflater |
false | Whether to use the JdkInflater (as opposed to IntelInflater) | |
--verbosity |
INFO | Control verbosity of logging. | |
Advanced Arguments | |||
--showHidden |
false | display hidden arguments |
Argument details
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
--active-class-padding-hybrid-mode / NA
If copy-number-posterior-expectation-mode is set to HYBRID, CNV-active intervals determined at any time will be padded by this value (in the units of bp) in order to obtain the set of intervals on which copy number posterior expectation is performed exactly.
int 50000 [ [ -∞ ∞ ] ]
--adamax-beta-1 / NA
Adamax optimizer first moment estimation forgetting factor.
double 0.9 [ [ 0 1 ] ]
--adamax-beta-2 / NA
Adamax optimizer second moment estimation forgetting factor.
double 0.99 [ [ 0 1 ] ]
--annotated-intervals / NA
Input annotated-interval file containing annotations for GC content in genomic intervals (output of AnnotateIntervals). All intervals specified via -L must be contained. This input should not be provided if an input denoising-model directory is given (the latter already contains the annotated-interval file).
File null
--arguments_file / NA
read one or more arguments files and add them to the command line
List[File] []
--caller-admixing-rate / NA
Admixing ratio of new and old called posteriors (between 0 and 1; larger values implies using more of the new posterior and less of the old posterior).
double 0.75 [ [ 0 ∞ ] ]
--caller-update-convergence-threshold / NA
Maximum tolerated calling update size for convergence.
double 0.001 [ [ 0 ∞ ] ]
--class-coherence-length / NA
Coherence length of CNV-active and CNV-silent domains (in the units of bp).
double 10000.0 [ [ 0 ∞ ] ]
--cnv-coherence-length / NA
Coherence length of CNV events (in the units of bp).
double 10000.0 [ [ 0 ∞ ] ]
--contig-ploidy-calls / NA
Input contig-ploidy calls directory (output of DetermlineGermlineContigPloidy).
R String null
--convergence-snr-averaging-window / NA
Averaging window for calculating training signal-to-noise ratio (SNR) for convergence checking.
int 500 [ [ 0 ∞ ] ]
--convergence-snr-countdown-window / NA
The number of ADVI iterations during which the SNR is required to stay below the set threshold for convergence.
int 10 [ [ 0 ∞ ] ]
--convergence-snr-trigger-threshold / NA
The SNR threshold to be reached before triggering the convergence countdown.
double 0.1 [ [ 0 ∞ ] ]
--copy-number-posterior-expectation-mode / NA
The strategy for calculating copy number posterior expectations in the coverage denoising model.
The --copy-number-posterior-expectation-mode argument is an enumerated type (CopyNumberPosteriorExpectationMode), which can have one of the following values:
- MAP
- EXACT
- HYBRID
CopyNumberPosteriorExpectationMode HYBRID
--depth-correction-tau / NA
Precision of read depth pinning to its global value.
double 10000.0 [ [ 0 ∞ ] ]
--disable-annealing / NA
(advanced) Disable annealing.
boolean false
--disable-caller / NA
(advanced) Disable caller.
boolean false
--disable-sampler / NA
(advanced) Disable sampler.
boolean false
--enable-bias-factors / NA
Enable discovery of bias factors.
boolean true
--exclude-intervals / -XL
One or more genomic intervals to exclude from processing
Use this argument to exclude certain parts of the genome from the analysis (like -L, but the opposite).
This argument can be specified multiple times. You can use samtools-style intervals either explicitly on the
command line (e.g. -XL 1 or -XL 1:100-200) or by loading in a file containing a list of intervals
(e.g. -XL myFile.intervals).
List[String] []
--gatk-config-file / NA
A configuration file to use with the GATK.
String null
--gc-curve-standard-deviation / NA
Prior standard deviation of the GC curve from flat.
double 1.0 [ [ 0 ∞ ] ]
--gcs-max-retries / -gcs-retries
If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection
int 20 [ [ -∞ ∞ ] ]
--help / -h
display the help message
boolean false
--init-ard-rel-unexplained-variance / NA
Initial value of ARD prior precisions relative to the scale of interval-specific unexplained variance.
double 0.1 [ [ 0 ∞ ] ]
--initial-temperature / NA
Initial temperature (for DA-ADVI).
double 2.0 [ [ 0 ∞ ] ]
--input / NA
Input read-count files containing integer read counts in genomic intervals for all samples. All intervals specified via -L must be contained; if none are specified, then intervals must be identical and in the same order for all samples.
R List[File] []
--interval-exclusion-padding / -ixp
Amount of padding (in bp) to add to each interval you are excluding.
Use this to add padding to the intervals specified using -XL. For example, '-XL 1:100' with a
padding value of 20 would turn into '-XL 1:80-120'. This is typically used to add padding around targets when
analyzing exomes.
int 0 [ [ -∞ ∞ ] ]
--interval-merging-rule / -imr
Interval merging rule for abutting intervals
By default, the program merges abutting intervals (i.e. intervals that are directly side-by-side but do not
actually overlap) into a single continuous interval. However you can change this behavior if you want them to be
treated as separate intervals instead.
The --interval-merging-rule argument is an enumerated type (IntervalMergingRule), which can have one of the following values:
- ALL
- OVERLAPPING_ONLY
IntervalMergingRule ALL
--interval-padding / -ip
Amount of padding (in bp) to add to each interval you are including.
Use this to add padding to the intervals specified using -L. For example, '-L 1:100' with a
padding value of 20 would turn into '-L 1:80-120'. This is typically used to add padding around targets when
analyzing exomes.
int 0 [ [ -∞ ∞ ] ]
--interval-psi-scale / NA
Typical scale of interval-specific unexplained variance.
double 0.001 [ [ 0 ∞ ] ]
--interval-set-rule / -isr
Set merging approach to use for combining interval inputs
By default, the program will take the UNION of all intervals specified using -L and/or -XL. However, you can
change this setting for -L, for example if you want to take the INTERSECTION of the sets instead. E.g. to
perform the analysis only on chromosome 1 exomes, you could specify -L exomes.intervals -L 1 --interval-set-rule
INTERSECTION. However, it is not possible to modify the merging approach for intervals passed using -XL (they will
always be merged using UNION).
Note that if you specify both -L and -XL, the -XL interval set will be subtracted from the -L interval set.
The --interval-set-rule argument is an enumerated type (IntervalSetRule), which can have one of the following values:
- UNION
- Take the union of all intervals
- INTERSECTION
- Take the intersection of intervals (the subset that overlaps all intervals specified)
IntervalSetRule UNION
--intervals / -L
One or more genomic intervals over which to operate
List[String] []
--learning-rate / NA
Adamax optimizer learning rate.
double 0.05 [ [ 0 ∞ ] ]
--log-emission-samples-per-round / NA
Log emission samples drawn per round of sampling.
int 50 [ [ 0 ∞ ] ]
--log-emission-sampling-median-rel-error / NA
Maximum tolerated median relative error in log emission sampling.
double 0.005 [ [ 0 ∞ ] ]
--log-emission-sampling-rounds / NA
Log emission maximum sampling rounds.
int 10 [ [ 0 ∞ ] ]
--log-mean-bias-standard-deviation / NA
Standard deviation of log mean bias.
double 0.1 [ [ 0 ∞ ] ]
--mapping-error-rate / NA
Typical mapping error rate.
double 0.01 [ [ 0 ∞ ] ]
--max-advi-iter-first-epoch / NA
Maximum ADVI iterations in the first epoch.
int 100 [ [ 0 ∞ ] ]
--max-advi-iter-subsequent-epochs / NA
Maximum ADVI iterations in subsequent epochs.
int 100 [ [ 0 ∞ ] ]
--max-bias-factors / NA
Maximum number of bias factors.
int 5 [ [ 0 ∞ ] ]
--max-calling-iters / NA
Maximum number of internal self-consistency iterations within each calling step.
int 10 [ [ 0 ∞ ] ]
--max-copy-number / NA
Highest allowed copy-number state.
int 5 [ [ 0 ∞ ] ]
--max-training-epochs / NA
Maximum number of training epochs.
int 50 [ [ 0 ∞ ] ]
--min-training-epochs / NA
Minimum number of training epochs.
int 10 [ [ 0 ∞ ] ]
--model / NA
Input denoising-model directory. In the COHORT mode, this argument is optional and if provided,a new model will be built using this input model to initialize. In the CASE mode, the denoising model parameters set to this input model and therefore, this argument is required.
String null
--num-gc-bins / NA
Number of bins used to represent the GC-bias curves.
int 20 [ [ 1 ∞ ] ]
--num-thermal-epochs / NA
Number of thermal epochs (for DA-ADVI).
int 20 [ [ 0 ∞ ] ]
--output / NA
Output directory.
R String null
--output-prefix / NA
Prefix for output filenames.
R String null
--p-active / NA
Prior probability of treating an interval as CNV-active (in a CNV-active domains, all copy-number states are equally likely to be called).
double 0.01 [ [ 0 ∞ ] ]
--p-alt / NA
Total prior probability of alternative copy-number states (the reference copy-number is set to the contig integer ploidy)
double 1.0E-6 [ [ 0 ∞ ] ]
--QUIET / NA
Whether to suppress job-summary info on System.err.
Boolean false
--run-mode / NA
Tool run-mode.
The --run-mode argument is an enumerated type (RunMode), which can have one of the following values:
- COHORT
- CASE
R RunMode null
--sample-psi-scale / NA
Typical scale of sample-specific correction to the unexplained variance.
double 1.0E-4 [ [ 0 ∞ ] ]
--showHidden / -showHidden
display hidden arguments
boolean false
--TMP_DIR / NA
Undocumented option
List[File] []
--use-jdk-deflater / -jdk-deflater
Whether to use the JdkDeflater (as opposed to IntelDeflater)
boolean false
--use-jdk-inflater / -jdk-inflater
Whether to use the JdkInflater (as opposed to IntelInflater)
boolean false
--verbosity / -verbosity
Control verbosity of logging.
The --verbosity argument is an enumerated type (LogLevel), which can have one of the following values:
- ERROR
- WARNING
- INFO
- DEBUG
LogLevel INFO
--version / NA
display the version number for this tool
boolean false
GATK version 4.0.1.0 built at 27-28-2019 10:28:59.
0 comments
Please sign in to leave a comment.