Calls copy-number variants in germline samples given their counts and the output of DetermineGermlineContigPloidy.
Category Copy Number Variant Discovery
Overview
Calls copy-number variants in germline samples given their counts and the corresponding output of DetermineGermlineContigPloidy. The former should be either HDF5 or TSV count files generated by CollectFragmentCounts.
Introduction
Reliable detection of copy-number variation (CNV) from read-depth ("coverage" or "counts") data such as whole exome sequencing (WES), whole genome sequencing (WGS), and gene panel coverage profiles requires a comprehensive model of library preparation and sequencing biases. The Bayesian model and the associated inference scheme implemented in GermlineCNVCaller include provisions for inferring and explaining away much of the technical variation and for automatically determining CNV calling confidence along the genome.
GermlineCNVCaller can automatically infer the parameters of the probabilistic model for read-depth bias and variance (hereafter, "the coverage model") when it is given a cohort of germline samples sequenced using the same sequencing platform and library preparation protocol (in the case of WES, the same capture kit). We refer to this mode as the COHORT mode. The number of samples required for the COHORT mode depends on many factors, such as the quality of the sequenced samples and how strictly the library preparation and sequencing protocols were followed. For WES and WGS, we recommend including at least 30 samples.
The parametrized coverage model can be used for CNV detection on future case samples provided that they are strictly compatible (in terms of library preparation and sequencing protocol) with the cohort used to generate the model parameters. We refer to this mode as the CASE mode. There is no lower limit on the number of samples for running GermlineCNVCaller in the CASE mode.
In both modes, the output calls of DetermineGermlineContigPloidy are required for all samples. The germline contig ploidy estimates are used for choosing the baseline copy-number state (in particular, for sex chromosomes).
Tool run modes
- COHORT mode:
The tool is run in the COHORT mode by passing the argument --run-mode COHORT. In this mode, coverage model parameters are inferred simultaneously with the CNV events. Depending on available memory, it may be necessary to run the tool over a subset of all intervals, which can be specified by -L and must be present in all of the count files. The output will contain two subdirectories, one ending with "-model" and the other with "-calls".
The model subdirectory contains the inferred parameters of the coverage model, which may be used later for CNV calling in one or more similarly-sequenced samples. If a previously obtained coverage model parameter bundle is provided via --model <previous_model_path> in this mode, those parameters will only be used for initialization and a new parameter bundle will be generated based on the provided cohort. Furthermore, the range of genomic intervals is set to the range used for creating the previous parameter bundle and interval-related arguments will be ignored.
The calls subdirectory contains one subdirectory for each sample, listing various sample-specific quantities such as the probability of various copy-number states for each interval, the GC curve, sample-specific unexplained variance, read depth, and loadings of various coverage bias factors.
- CASE mode:
The tool is run in the CASE mode by passing the argument --run-mode CASE. The path to a previously obtained coverage model parameter bundle must be provided via --model <previous_model_path>. The range of genomic intervals is set to the range used for creating the parameter bundle and interval-related arguments will be ignored. The output of the CASE mode is only the "-calls" subdirectory.
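The practical difference between the two modes on the command line is whether --model is optional or required. A minimal sketch of assembling the argument list in Python follows. It uses only flags documented on this page; the helper name `build_gcnv_command` and every path passed to it are hypothetical, and the sketch only builds the list without running the tool.

```python
from typing import List, Optional


def build_gcnv_command(run_mode: str,
                       ploidy_calls: str,
                       inputs: List[str],
                       output_dir: str,
                       output_prefix: str,
                       model: Optional[str] = None,
                       intervals: Optional[str] = None) -> List[str]:
    """Assemble a GermlineCNVCaller argument list (hypothetical helper)."""
    if run_mode not in ("COHORT", "CASE"):
        raise ValueError("run_mode must be COHORT or CASE")
    if run_mode == "CASE" and model is None:
        # CASE mode needs a previously inferred model parameter bundle.
        raise ValueError("--model is required in CASE mode")
    if not inputs:
        raise ValueError("at least one --input count file is required")

    cmd = ["gatk", "GermlineCNVCaller",
           "--run-mode", run_mode,
           "--contig-ploidy-calls", ploidy_calls]
    if intervals is not None:
        # Per the text above, interval arguments are ignored
        # whenever a model bundle is supplied.
        cmd += ["-L", intervals]
    if model is not None:
        cmd += ["--model", model]
    for path in inputs:
        cmd += ["--input", path]
    cmd += ["--output", output_dir, "--output-prefix", output_prefix]
    return cmd
```

The returned list can be handed to `subprocess.run` as-is, which avoids shell quoting issues with paths.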
Important Remarks
- Choice of hyperparameters:
The quality of inferred coverage model parameters and germline CNV events is sensitive to the choice of model hyperparameters, such as the prior probability of alternative copy-number states, prevalence of active regions, the coherence length of CNV events and active/silent domains, and the typical scale of interval- and sample-specific unexplained variance. These hyperparameters are not universal and must be properly tuned for each sequencing protocol.
- Running GermlineCNVCaller on a subset of intervals:
As mentioned earlier, it may be necessary to run the tool over a subset of all intervals depending on available memory. The number of intervals must be large enough to include a contextually diverse set of regions for reliable inference of the GC bias curve, as well as other bias factors. For WES and WGS, we recommend no fewer than 10000 consecutive intervals spanning at least 10 - 50 Mb.
- Memory Requirements for the python subprocess ("gcnvkernel"):
The computation done by this tool is, for the most part, performed outside of the JVM by a spawned Python subprocess. The Java heap memory is only used for loading sample counts and preparing raw data for the Python subprocess. The user must ensure that the machine has enough free physical memory for spawning and executing the Python subprocess. Generally speaking, the resource requirements of this tool scale linearly with each of the number of samples, the number of modeled intervals, the highest copy number state, the number of bias factors, and the number of knobs on the GC curve. For example, the Python subprocess requires approximately 16 GB of RAM for modeling 10000 intervals for 100 samples, with 16 maximum bias factors and explicit GC bias modeling.
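The linear scaling described in the memory remark can be turned into a rough back-of-the-envelope estimate. The sketch below extrapolates from the single quoted reference point (about 16 GB for 10000 intervals and 100 samples) and varies only the number of intervals and samples. It assumes the remaining factors (bias factors, maximum copy number, GC knobs) match that reference run, which the text does not fully specify, so treat the numbers as order-of-magnitude guidance only. The function names are hypothetical.

```python
REF_MEMORY_GB = 16.0    # reference figure quoted in the text
REF_INTERVALS = 10_000  # intervals in the reference run
REF_SAMPLES = 100       # samples in the reference run


def estimate_python_memory_gb(num_intervals: int, num_samples: int) -> float:
    """Scale the reference figure linearly in intervals and samples."""
    return (REF_MEMORY_GB
            * (num_intervals / REF_INTERVALS)
            * (num_samples / REF_SAMPLES))


def max_intervals_for_budget(budget_gb: float, num_samples: int) -> int:
    """Largest interval count whose estimate fits within budget_gb."""
    return int(budget_gb * REF_INTERVALS * REF_SAMPLES
               / (REF_MEMORY_GB * num_samples))
```

If `max_intervals_for_budget` returns fewer than the 10000 intervals recommended above, the machine is likely too small for reliable inference on that cohort size.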
Usage examples
COHORT mode:
gatk GermlineCNVCaller \
  --run-mode COHORT \
  -L intervals.interval_list \
  --contig-ploidy-calls path_to_contig_ploidy_calls \
  --input normal_1.counts.hdf5 \
  --input normal_2.counts.hdf5 \
  ... \
  --output output_dir \
  --output-prefix normal_cohort_run
CASE mode:
gatk GermlineCNVCaller \
  --run-mode CASE \
  -L intervals.interval_list \
  --contig-ploidy-calls path_to_contig_ploidy_calls \
  --model previous_model_path \
  --input normal_1.counts.hdf5 \
  ... \
  --output output_dir \
  --output-prefix normal_case_run
GermlineCNVCaller specific arguments
This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.
Argument name(s) | Default value | Summary
---|---|---
Required Arguments | |
--contig-ploidy-calls | null | Input contig-ploidy calls directory (output of DetermineGermlineContigPloidy).
--input | [] | Input read-count files containing integer read counts in genomic intervals for all samples. All intervals specified via -L must be contained; if none are specified, then intervals must be identical and in the same order for all samples.
--output | null | Output directory.
--output-prefix | null | Prefix for output filenames.
--run-mode | null | Tool run-mode.
Optional Tool Arguments | |
--active-class-padding-hybrid-mode | 50000 | If copy-number-posterior-expectation-mode is set to hybrid, pad active intervals determined at any time by this value (in the units of bp) in order to obtain the set of intervals on which copy number posterior expectation is performed exactly.
--adamax-beta-1 | 0.9 | Adamax optimizer first moment estimation forgetting factor.
--adamax-beta-2 | 0.99 | Adamax optimizer second moment estimation forgetting factor.
--annotated-intervals | null | Input annotated-interval file containing annotations for GC content in genomic intervals (output of AnnotateIntervals). All intervals specified via -L must be contained. This input should not be provided if an input denoising-model directory is given (the latter already contains the annotated-interval file).
--arguments_file | [] | read one or more arguments files and add them to the command line
--caller-admixing-rate | 0.75 | Admixing ratio of new and old caller posteriors (between 0 and 1; higher means using more of the new posterior)
--caller-update-convergence-threshold | 0.001 | Maximum tolerated calling update size for convergence.
--class-coherence-length | 10000.0 | Coherence length of CNV class domains (in the units of bp).
--cnv-coherence-length | 10000.0 | Coherence length of CNV events (in the units of bp).
--convergence-snr-averaging-window | 500 | Averaging window for calculating training SNR for evaluating convergence.
--convergence-snr-countdown-window | 10 | The number of ADVI iterations during which the SNR is required to stay below the set threshold for convergence.
--convergence-snr-trigger-threshold | 0.1 | The SNR threshold to be reached for triggering convergence.
--copy-number-posterior-expectation-mode | HYBRID | The strategy for calculating copy number posterior expectations in the denoising model.
--depth-correction-tau | 10000.0 | Precision of read depth pinning to its global value.
--disable-annealing | false | (advanced) Disable annealing.
--disable-caller | false | (advanced) Disable caller.
--disable-sampler | false | (advanced) Disable sampler.
--enable-bias-factors | true | Enable discovery of bias factors.
--gc-curve-standard-deviation | 1.0 | Prior standard deviation of the GC curve from flat.
--gcs-max-retries -gcs-retries | 20 | If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection
--help -h | false | display the help message
--init-ard-rel-unexplained-variance | 0.1 | Initial value of ARD prior precision relative to the typical interval-specific unexplained variance scale.
--initial-temperature | 2.0 | Initial temperature (for DA-ADVI).
--interval-merging-rule -imr | ALL | Interval merging rule for abutting intervals
--interval-psi-scale | 0.001 | Typical scale of interval-specific unexplained variance.
--intervals -L | [] | One or more genomic intervals over which to operate
--learning-rate | 0.05 | Adamax optimizer learning rate.
--log-emission-samples-per-round | 50 | Log emission samples drawn per round of sampling.
--log-emission-sampling-median-rel-error | 0.005 | Maximum tolerated median relative error in log emission sampling.
--log-emission-sampling-rounds | 10 | Log emission maximum sampling rounds.
--log-mean-bias-standard-deviation | 0.1 | Standard deviation of log mean bias.
--mapping-error-rate | 0.01 | Typical mapping error rate.
--max-advi-iter-first-epoch | 100 | Maximum ADVI iterations in the first epoch.
--max-advi-iter-subsequent-epochs | 100 | Maximum ADVI iterations in the subsequent epochs.
--max-bias-factors | 5 | Maximum number of bias factors.
--max-calling-iters | 10 | Maximum number of calling internal self-consistency iterations.
--max-copy-number | 5 | Highest considered copy-number.
--max-training-epochs | 50 | Maximum number of training epochs.
--min-training-epochs | 10 | Minimum number of training epochs.
--model | null | Input denoising-model directory. In the COHORT mode, this argument is optional and, if provided, a new model will be built using this input model to initialize. In the CASE mode, the denoising model parameters are set to this input model and therefore this argument is required.
--num-gc-bins | 20 | Number of knobs on the GC curves.
--num-thermal-epochs | 20 | Number of thermal epochs (for DA-ADVI).
--p-active | 0.01 | Prior probability of treating an interval as CNV-active
--p-alt | 1.0E-6 | Prior probability of alt copy-number with respect to contig baseline state in the reference copy number.
--sample-psi-scale | 1.0E-4 | Typical scale of sample-specific unexplained variance.
--version | false | display the version number for this tool
Optional Common Arguments | |
--exclude-intervals -XL | [] | One or more genomic intervals to exclude from processing
--gatk-config-file | null | A configuration file to use with the GATK.
--interval-exclusion-padding -ixp | 0 | Amount of padding (in bp) to add to each interval you are excluding.
--interval-padding -ip | 0 | Amount of padding (in bp) to add to each interval you are including.
--interval-set-rule -isr | UNION | Set merging approach to use for combining interval inputs
--QUIET | false | Whether to suppress job-summary info on System.err.
--TMP_DIR | [] | Undocumented option
--use-jdk-deflater -jdk-deflater | false | Whether to use the JdkDeflater (as opposed to IntelDeflater)
--use-jdk-inflater -jdk-inflater | false | Whether to use the JdkInflater (as opposed to IntelInflater)
--verbosity | INFO | Control verbosity of logging.
Advanced Arguments | |
--showHidden | false | display hidden arguments
Argument details
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
--active-class-padding-hybrid-mode / NA
If copy-number-posterior-expectation-mode is set to hybrid, pad active intervals determined at any time by this value (in the units of bp) in order to obtain the set of intervals on which copy number posterior expectation is performed exactly.
int 50000 [ [ -∞ ∞ ] ]
--adamax-beta-1 / NA
Adamax optimizer first moment estimation forgetting factor.
double 0.9 [ [ 0 1 ] ]
--adamax-beta-2 / NA
Adamax optimizer second moment estimation forgetting factor.
double 0.99 [ [ 0 1 ] ]
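For readers unfamiliar with the two forgetting factors, the following is a sketch of one textbook Adamax update (as published by Kingma and Ba), using this page's defaults for --adamax-beta-1, --adamax-beta-2 and --learning-rate. It is an illustration of the general algorithm, not the tool's own code; the small `eps` guard against division by zero is an addition made here, and the function name is hypothetical.

```python
from typing import List, Tuple


def adamax_step(theta: List[float], grad: List[float],
                m: List[float], u: List[float], t: int,
                learning_rate: float = 0.05,
                beta_1: float = 0.9, beta_2: float = 0.99,
                eps: float = 1e-8
                ) -> Tuple[List[float], List[float], List[float]]:
    """One textbook Adamax update; t is the 1-based iteration count."""
    new_theta, new_m, new_u = [], [], []
    for x, g, m_i, u_i in zip(theta, grad, m, u):
        # First moment: exponential moving average of the gradient.
        m_i = beta_1 * m_i + (1.0 - beta_1) * g
        # Second "moment": exponentially decayed infinity norm.
        u_i = max(beta_2 * u_i, abs(g))
        # Bias-corrected step; eps is an assumption added for safety.
        step = (learning_rate / (1.0 - beta_1 ** t)) * m_i / (u_i + eps)
        new_theta.append(x - step)
        new_m.append(m_i)
        new_u.append(u_i)
    return new_theta, new_m, new_u
```

Values of beta_1 closer to 1 make the gradient average forget more slowly; values of beta_2 closer to 1 make the step-size normalizer decay more slowly.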
--annotated-intervals / NA
Input annotated-interval file containing annotations for GC content in genomic intervals (output of AnnotateIntervals). All intervals specified via -L must be contained. This input should not be provided if an input denoising-model directory is given (the latter already contains the annotated-interval file).
File null
--arguments_file / NA
read one or more arguments files and add them to the command line
List[File] []
--caller-admixing-rate / NA
Admixing ratio of new and old caller posteriors (between 0 and 1; higher means using more of the new posterior)
double 0.75 [ [ 0 ∞ ] ]
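One way to read the admixing ratio is as a convex combination of the previous and the newly computed posterior, sketched below. This is an assumption made for illustration: the page only states that a higher value uses more of the new posterior, and the tool may apply the mixing differently internally. The function name is hypothetical.

```python
from typing import List


def admix_posteriors(old: List[float], new: List[float],
                     admixing_rate: float = 0.75) -> List[float]:
    """Blend old and new posteriors; rate 1.0 keeps only the new one."""
    if not 0.0 <= admixing_rate <= 1.0:
        raise ValueError("admixing_rate must lie between 0 and 1")
    if len(old) != len(new):
        raise ValueError("posteriors must have the same length")
    mixed = [admixing_rate * n + (1.0 - admixing_rate) * o
             for o, n in zip(old, new)]
    total = sum(mixed)
    # Renormalize to guard against rounding drift.
    return [p / total for p in mixed]
```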
--caller-update-convergence-threshold / NA
Maximum tolerated calling update size for convergence.
double 0.001 [ [ 0 ∞ ] ]
--class-coherence-length / NA
Coherence length of CNV class domains (in the units of bp).
double 10000.0 [ [ 0 ∞ ] ]
--cnv-coherence-length / NA
Coherence length of CNV events (in the units of bp).
double 10000.0 [ [ 0 ∞ ] ]
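A coherence length is commonly interpreted as the decay scale of correlations along the genome: two intervals much closer than this length tend to share a state, while intervals much farther apart are nearly independent. The sketch below illustrates that interpretation with an exponential decay and a simple mixing transition matrix. Both the functional form and the function names are assumptions made for illustration, not a description of the tool's internals.

```python
import math
from typing import List


def stay_probability(distance_bp: float,
                     coherence_length_bp: float = 10000.0) -> float:
    """Probability that the state is carried over a gap of distance_bp."""
    return math.exp(-distance_bp / coherence_length_bp)


def transition_matrix(prior: List[float], distance_bp: float,
                      coherence_length_bp: float = 10000.0
                      ) -> List[List[float]]:
    """Keep the current state with probability `stay`, otherwise
    redraw the state from the prior."""
    stay = stay_probability(distance_bp, coherence_length_bp)
    n = len(prior)
    return [[(1.0 - stay) * prior[j] + (stay if i == j else 0.0)
             for j in range(n)]
            for i in range(n)]
```

Under this reading, increasing the coherence length makes short, isolated state changes less likely, which favors longer events.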
--contig-ploidy-calls / NA
Input contig-ploidy calls directory (output of DetermineGermlineContigPloidy).
R String null
--convergence-snr-averaging-window / NA
Averaging window for calculating training SNR for evaluating convergence.
int 500 [ [ 0 ∞ ] ]
--convergence-snr-countdown-window / NA
The number of ADVI iterations during which the SNR is required to stay below the set threshold for convergence.
int 10 [ [ 0 ∞ ] ]
--convergence-snr-trigger-threshold / NA
The SNR threshold to be reached for triggering convergence.
double 0.1 [ [ 0 ∞ ] ]
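The trigger threshold and the countdown window work together: convergence is declared only after the SNR has stayed below the threshold for the required number of consecutive iterations. A minimal sketch of that countdown logic follows. It takes already-averaged SNR values as input and does not attempt to reproduce how the tool computes the SNR over the averaging window; the class name is hypothetical.

```python
class ConvergenceCountdown:
    """Declare convergence after `countdown_window` consecutive
    iterations with SNR below `trigger_threshold`."""

    def __init__(self, trigger_threshold: float = 0.1,
                 countdown_window: int = 10) -> None:
        self.trigger_threshold = trigger_threshold
        self.countdown_window = countdown_window
        self._below = 0  # consecutive iterations under the threshold

    def update(self, snr: float) -> bool:
        """Record one iteration's SNR; return True once converged."""
        if snr < self.trigger_threshold:
            self._below += 1
        else:
            # Any excursion above the threshold restarts the countdown.
            self._below = 0
        return self._below >= self.countdown_window
```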
--copy-number-posterior-expectation-mode / NA
The strategy for calculating copy number posterior expectations in the denoising model.
The --copy-number-posterior-expectation-mode argument is an enumerated type (CopyNumberPosteriorExpectationMode), which can have one of the following values:
- MAP
- EXACT
- HYBRID
CopyNumberPosteriorExpectationMode HYBRID
--depth-correction-tau / NA
Precision of read depth pinning to its global value.
double 10000.0 [ [ 0 ∞ ] ]
--disable-annealing / NA
(advanced) Disable annealing.
boolean false
--disable-caller / NA
(advanced) Disable caller.
boolean false
--disable-sampler / NA
(advanced) Disable sampler.
boolean false
--enable-bias-factors / NA
Enable discovery of bias factors.
boolean true
--exclude-intervals / -XL
One or more genomic intervals to exclude from processing
Use this argument to exclude certain parts of the genome from the analysis (like -L, but the opposite).
This argument can be specified multiple times. You can use samtools-style intervals either explicitly on the
command line (e.g. -XL 1 or -XL 1:100-200) or by loading in a file containing a list of intervals
(e.g. -XL myFile.intervals).
List[String] []
--gatk-config-file / NA
A configuration file to use with the GATK.
String null
--gc-curve-standard-deviation / NA
Prior standard deviation of the GC curve from flat.
double 1.0 [ [ 0 ∞ ] ]
--gcs-max-retries / -gcs-retries
If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection
int 20 [ [ -∞ ∞ ] ]
--help / -h
display the help message
boolean false
--init-ard-rel-unexplained-variance / NA
Initial value of ARD prior precision relative to the typical interval-specific unexplained variance scale.
double 0.1 [ [ 0 ∞ ] ]
--initial-temperature / NA
Initial temperature (for DA-ADVI).
double 2.0 [ [ 0 ∞ ] ]
--input / NA
Input read-count files containing integer read counts in genomic intervals for all samples. All intervals specified via -L must be contained; if none are specified, then intervals must be identical and in the same order for all samples.
R List[File] []
--interval-exclusion-padding / -ixp
Amount of padding (in bp) to add to each interval you are excluding.
Use this to add padding to the intervals specified using -XL. For example, '-XL 1:100' with a
padding value of 20 would turn into '-XL 1:80-120'. This is typically used to add padding around targets when
analyzing exomes.
int 0 [ [ -∞ ∞ ] ]
--interval-merging-rule / -imr
Interval merging rule for abutting intervals
By default, the program merges abutting intervals (i.e. intervals that are directly side-by-side but do not
actually overlap) into a single continuous interval. However you can change this behavior if you want them to be
treated as separate intervals instead.
The --interval-merging-rule argument is an enumerated type (IntervalMergingRule), which can have one of the following values:
- ALL
- OVERLAPPING_ONLY
IntervalMergingRule ALL
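The difference between the two rules can be illustrated on plain coordinate pairs. The sketch below treats intervals on a single contig as 1-based inclusive (start, end) tuples, so (1, 100) and (101, 200) abut without overlapping. It is a simplified illustration of the behavior described above, with a hypothetical function name, and it ignores contigs entirely.

```python
from typing import List, Tuple


def merge_intervals(intervals: List[Tuple[int, int]],
                    rule: str = "ALL") -> List[Tuple[int, int]]:
    """Merge 1-based inclusive intervals on one contig."""
    if rule not in ("ALL", "OVERLAPPING_ONLY"):
        raise ValueError("rule must be ALL or OVERLAPPING_ONLY")
    # ALL also merges abutting intervals (a gap of exactly zero bases).
    gap = 1 if rule == "ALL" else 0
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```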
--interval-padding / -ip
Amount of padding (in bp) to add to each interval you are including.
Use this to add padding to the intervals specified using -L. For example, '-L 1:100' with a
padding value of 20 would turn into '-L 1:80-120'. This is typically used to add padding around targets when
analyzing exomes.
int 0 [ [ -∞ ∞ ] ]
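The padding arithmetic in the example above ('1:100' with a padding of 20 becoming '1:80-120') can be sketched as follows. The helper name is hypothetical. Clamping the start at position 1 is an assumption added here, and the sketch does not clamp the end to the contig length because it has no access to a sequence dictionary.

```python
def pad_interval(interval: str, padding: int) -> str:
    """Pad a samtools-style 'contig:start[-end]' interval on both sides."""
    contig, sep, coords = interval.rpartition(":")
    if not sep or not contig:
        raise ValueError("expected contig:start or contig:start-end")
    start_text, _, end_text = coords.partition("-")
    start = int(start_text)
    # A single position such as '1:100' is an interval of length one.
    end = int(end_text) if end_text else start
    new_start = max(1, start - padding)  # assumed clamp at contig start
    return f"{contig}:{new_start}-{end + padding}"
```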
--interval-psi-scale / NA
Typical scale of interval-specific unexplained variance.
double 0.001 [ [ 0 ∞ ] ]
--interval-set-rule / -isr
Set merging approach to use for combining interval inputs
By default, the program will take the UNION of all intervals specified using -L and/or -XL. However, you can
change this setting for -L, for example if you want to take the INTERSECTION of the sets instead. E.g. to
perform the analysis only on chromosome 1 exomes, you could specify -L exomes.intervals -L 1 --interval-set-rule
INTERSECTION. However, it is not possible to modify the merging approach for intervals passed using -XL (they will
always be merged using UNION).
Note that if you specify both -L and -XL, the -XL interval set will be subtracted from the -L interval set.
The --interval-set-rule argument is an enumerated type (IntervalSetRule), which can have one of the following values:
- UNION
- Take the union of all intervals
- INTERSECTION
- Take the intersection of intervals (the subset that overlaps all intervals specified)
IntervalSetRule UNION
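The interplay between -L, -XL and the set rule can be illustrated with plain Python sets of positions. This is a toy model of the semantics described above, practical only for small ranges: the -L sets are combined by the chosen rule, the -XL sets are always combined by union, and the exclusions are then subtracted. The function name is hypothetical, and the sketch requires at least one -L set.

```python
from typing import Iterable, Set


def resolve_positions(include_sets: Iterable[Set[int]],
                      exclude_sets: Iterable[Set[int]],
                      rule: str = "UNION") -> Set[int]:
    """Combine -L sets by `rule`, then subtract the union of -XL sets."""
    includes = [set(s) for s in include_sets]
    if not includes:
        raise ValueError("this sketch needs at least one -L set")
    if rule == "UNION":
        included = set().union(*includes)
    elif rule == "INTERSECTION":
        included = set.intersection(*includes)
    else:
        raise ValueError("rule must be UNION or INTERSECTION")
    # Exclusions are always merged with UNION, whatever the rule.
    excluded = set().union(*[set(s) for s in exclude_sets])
    return included - excluded
```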
--intervals / -L
One or more genomic intervals over which to operate
List[String] []
--learning-rate / NA
Adamax optimizer learning rate.
double 0.05 [ [ 0 ∞ ] ]
--log-emission-samples-per-round / NA
Log emission samples drawn per round of sampling.
int 50 [ [ 0 ∞ ] ]
--log-emission-sampling-median-rel-error / NA
Maximum tolerated median relative error in log emission sampling.
double 0.005 [ [ 0 ∞ ] ]
--log-emission-sampling-rounds / NA
Log emission maximum sampling rounds.
int 10 [ [ 0 ∞ ] ]
--log-mean-bias-standard-deviation / NA
Standard deviation of log mean bias.
double 0.1 [ [ 0 ∞ ] ]
--mapping-error-rate / NA
Typical mapping error rate.
double 0.01 [ [ 0 ∞ ] ]
--max-advi-iter-first-epoch / NA
Maximum ADVI iterations in the first epoch.
int 100 [ [ 0 ∞ ] ]
--max-advi-iter-subsequent-epochs / NA
Maximum ADVI iterations in the subsequent epochs.
int 100 [ [ 0 ∞ ] ]
--max-bias-factors / NA
Maximum number of bias factors.
int 5 [ [ 0 ∞ ] ]
--max-calling-iters / NA
Maximum number of calling internal self-consistency iterations.
int 10 [ [ 0 ∞ ] ]
--max-copy-number / NA
Highest considered copy-number.
int 5 [ [ 0 ∞ ] ]
--max-training-epochs / NA
Maximum number of training epochs.
int 50 [ [ 0 ∞ ] ]
--min-training-epochs / NA
Minimum number of training epochs.
int 10 [ [ 0 ∞ ] ]
--model / NA
Input denoising-model directory. In the COHORT mode, this argument is optional and, if provided, a new model will be built using this input model to initialize. In the CASE mode, the denoising model parameters are set to this input model and therefore this argument is required.
String null
--num-gc-bins / NA
Number of knobs on the GC curves.
int 20 [ [ 1 ∞ ] ]
--num-thermal-epochs / NA
Number of thermal epochs (for DA-ADVI).
int 20 [ [ 0 ∞ ] ]
--output / NA
Output directory.
R String null
--output-prefix / NA
Prefix for output filenames.
R String null
--p-active / NA
Prior probability of treating an interval as CNV-active
double 0.01 [ [ 0 ∞ ] ]
--p-alt / NA
Prior probability of alt copy-number with respect to contig baseline state in the reference copy number.
double 1.0E-6 [ [ 0 ∞ ] ]
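To make the role of --p-alt concrete, the sketch below builds a prior over copy-number states 0 through --max-copy-number for a contig with a given baseline state. It assumes that every non-baseline state receives probability p_alt and that the baseline state receives the remainder. That allocation is an interpretation made for illustration, and the function name is hypothetical.

```python
from typing import List


def copy_number_prior(baseline: int, max_copy_number: int = 5,
                      p_alt: float = 1.0e-6) -> List[float]:
    """Prior over states 0..max_copy_number around a baseline state."""
    if not 0 <= baseline <= max_copy_number:
        raise ValueError("baseline must lie in 0..max_copy_number")
    num_alt = max_copy_number  # every state except the baseline
    if p_alt * num_alt >= 1.0:
        raise ValueError("p_alt is too large for this many states")
    prior = [p_alt] * (max_copy_number + 1)
    prior[baseline] = 1.0 - p_alt * num_alt
    return prior
```

For a haploid contig such as chrX in a male sample, the baseline would be 1 rather than 2, which is where the contig ploidy calls enter.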
--QUIET / NA
Whether to suppress job-summary info on System.err.
Boolean false
--run-mode / NA
Tool run-mode.
The --run-mode argument is an enumerated type (RunMode), which can have one of the following values:
- COHORT
- CASE
R RunMode null
--sample-psi-scale / NA
Typical scale of sample-specific unexplained variance.
double 1.0E-4 [ [ 0 ∞ ] ]
--showHidden / -showHidden
display hidden arguments
boolean false
--TMP_DIR / NA
Undocumented option
List[File] []
--use-jdk-deflater / -jdk-deflater
Whether to use the JdkDeflater (as opposed to IntelDeflater)
boolean false
--use-jdk-inflater / -jdk-inflater
Whether to use the JdkInflater (as opposed to IntelInflater)
boolean false
--verbosity / -verbosity
Control verbosity of logging.
The --verbosity argument is an enumerated type (LogLevel), which can have one of the following values:
- ERROR
- WARNING
- INFO
- DEBUG
LogLevel INFO
--version / NA
display the version number for this tool
boolean false
GATK version 4.0.0.0 built at 27-36-2019 11:36:13.