Determines the baseline contig ploidy for germline samples given counts data
Category Copy Number Variant Discovery
Overview
Determines the integer ploidy state of all contigs for germline samples given counts data. These should be either HDF5 or TSV count files generated by CollectReadCounts; TSV files may be compressed (e.g., with bgzip), but must then have filenames ending with the extension .gz. See the documentation for the input argument for details on enabling streaming of indexed count files from Google Cloud Storage.Introduction
Germline karyotyping is a frequently performed task in bioinformatics pipelines, e.g. for sex determination and aneuploidy identification. This tool uses counts data for germline karyotyping.
Performing germline karyotyping using counts data requires calibrating ("modeling") the technical coverage bias and variance for each contig. The Bayesian model and the associated inference scheme implemented in DetermineGermlineContigPloidy includes provisions for inferring and explaining away much of the technical variation. Furthermore, karyotyping confidence is automatically adjusted for individual samples and contigs.
Running DetermineGermlineContigPloidy is the first computational step in the GATK germline CNV calling pipeline. It provides a baseline ("default") copy-number state for each contig/sample with respect to which the probability of alternative states is allocated.
Python environment setup
The computation done by this tool, aside from input data parsing and validation, is performed outside of the Java Virtual Machine and using the gCNV computational python module, namely gcnvkernel. It is crucial that the user has properly set up a python conda environment with gcnvkernel and its dependencies installed. If the user intends to run DetermineGermlineContigPloidy using one of the official GATK Docker images, the python environment is already set up. Otherwise, the environment must be created and activated as described in the main GATK README.md file.
Advanced users may wish to set the THEANO_FLAGS
environment variable to override the GATK theano
configuration. For example, by running
THEANO_FLAGS="base_compiledir=PATH/TO/BASE_COMPILEDIR" gatk DetermineGermlineContigPloidy ...
, users can specify
the theano compilation directory (which is set to $HOME/.theano
by default). See theano documentation
at
https://theano-pymc.readthedocs.io/en/latest/library/config.html.
Tool run modes
This tool has two operation modes as described below:
- COHORT mode:
- If a ploidy model parameter path is not provided via the model argument, the tool will run in
the COHORT mode. In this mode, ploidy model parameters (e.g. coverage bias and variance for each contig) are
inferred, along with baseline contig ploidy states of each sample. It is possible to run the tool over a subset
of all intervals present in the input count files, which can be specified by -L; this can be used to pass a
filtered interval list produced by FilterIntervals to mask intervals from modeling. Intervals may also be
blacklisted using -XL. The specified intervals that result from resolving -L/-XL inputs must be exactly present
in all of the input count files.
A TSV file specifying prior probabilities for each integer ploidy state and for each contig is required in this mode and must be specified via the contig-ploidy-priors argument. The following shows an example of such a table:
CONTIG_NAME PLOIDY_PRIOR_0 PLOIDY_PRIOR_1 PLOIDY_PRIOR_2 PLOIDY_PRIOR_3 1 0.01 0.01 0.97 0.01 2 0.01 0.01 0.97 0.01 X 0.01 0.49 0.49 0.01 Y 0.50 0.50 0.00 0.00 Note that the contig names appearing under CONTIG_NAME column must match contig names in the input counts files, and all contigs appearing in the input counts files must have a corresponding entry in the priors table. The order of contigs is immaterial in the priors table. The highest ploidy state is determined by the prior table (3 in the above example). A ploidy state can be strictly forbidden by setting its prior probability to 0. For example, the Y contig in the above example can only assume 0 and 1 ploidy states.
If the provided count data only contains intervals from a single chromosome, the model degeneracy in this case will lead to no meaningful inference and the ploidy states will be resolved to the most likely ploidy states given by the ploidy prior table. As an example, autosomal contigs will assume ploidy 2, and X could assume either 1 or 2 depending on tie breakers.
The tool output in the COHORT mode will contain two subdirectories, one ending with "-model" and the other ending with "-calls". The model subdirectory contains the inferred parameters of the ploidy model, which may be used later on for karyotyping one or more similarly-sequenced samples (see below). The calls subdirectory contains one subdirectory for each sample, listing various sample-specific quantities such as the global read-depth, average ploidy, per-contig baseline ploidies, and per-contig coverage variance estimates.
- CASE mode:
- If a path containing previously inferred ploidy model parameters is provided via the
model argument, then the tool will run in the CASE mode. In this mode, the parameters of the ploidy
model are loaded from the provided directory and only sample-specific quantities are inferred. The modeled
intervals are then specified by a file contained in the model directory, all interval-related arguments are
ignored in this mode, and all model intervals must be present in all of the input count files. The tool output
in the CASE mode is only the "-calls" subdirectory and is organized similarly to the COHORT mode.
In the CASE mode, the contig ploidy prior table is taken directly from the provided model parameters path and must be not provided again.
Important Remarks
- Choice of hyperparameters:
The quality of ploidy model parametrization and the sensitivity/precision of germline karyotyping are sensitive to the choice of model hyperparameters, including standard deviation of mean contig coverage bias (set using the mean-bias-standard-deviation argument), mapping error rate (set using the mapping-error-rate argument), and the typical scale of contig- and sample-specific unexplained variance (set using the global-psi-scale and sample-psi-scale arguments, respectively). It is crucial to note that these hyperparameters are not universal and must be tuned for each sequencing protocol and properly set at runtime.
- Mosaicism and fractional ploidies:
The model underlying this tool assumes integer ploidy states (in contrast to fractional/variable ploidy states). Therefore, it is to be used strictly on germline samples and for the purpose of sex determination, autosomal aneuploidy detection, or as a part of the GATK germline CNV calling pipeline. The presence of large somatic events and mosaicism (e.g., sex chromosome loss and somatic trisomy) will naturally lead to unreliable results. We strongly recommended inspecting genotyping qualities (GQ) from the tool output and considering to drop low-GQ contigs in downstream analyses. Finally, given the Bayesian status of this tool, we suggest including as many high-quality germline samples as possible for ploidy model parametrizaton in the COHORT mode. This will downplay the role of questionable samples and will yield a more reliable estimation of genuine sequencing biases.
- Coverage-based germline karyotyping:
Accurate germline karyotyping requires incorporating SNP allele-fraction data and counts data in a unified probabilistic model and is beyond the scope of the present tool. The current implementation only uses counts data for karyotyping and while being fast, it may not provide the most reliable results.
- ALL
- OVERLAPPING_ONLY
- UNION
- Take the union of all intervals
- INTERSECTION
- Take the intersection of intervals (the subset that overlaps all intervals specified)
- ERROR
- WARNING
- INFO
- DEBUG
Usage examples
COHORT mode:
gatk DetermineGermlineContigPloidy \ --input normal_1.counts.hdf5 \ --input normal_2.counts.hdf5 \ ... \ --contig-ploidy-priors a_valid_ploidy_priors_table.tsv --output output_dir \ --output-prefix normal_cohort
COHORT mode (with optional interval filtering):
gatk DetermineGermlineContigPloidy \ -L intervals.interval_list \ -XL blacklist_intervals.interval_list \ --interval-merging-rule OVERLAPPING_ONLY \ --input normal_1.counts.hdf5 \ --input normal_2.counts.hdf5 \ ... \ --contig-ploidy-priors a_valid_ploidy_priors_table.tsv --output output_dir \ --output-prefix normal_cohort
CASE mode:
gatk DetermineGermlineContigPloidy \ --model a_valid_ploidy_model_dir --input normal_1.counts.hdf5 \ --input normal_2.counts.hdf5 \ ... \ --output output_dir \ --output-prefix normal_case
DetermineGermlineContigPloidy specific arguments
This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.
Argument name(s) | Default value | Summary | |
---|---|---|---|
Required Arguments | |||
--input -I |
Input paths for read-count files containing integer read counts in genomic intervals for all samples. All intervals specified via -L/-XL must be contained; if none are specified, then intervals must be identical and in the same order for all samples. If read-count files are given by Google Cloud Storage paths, have the extension .counts.tsv or .counts.tsv.gz, and have been indexed by IndexFeatureFile, only the specified intervals will be queried and streamed; this can reduce disk usage by avoiding the complete localization of all read-count files. | ||
--output -O |
Output directory. This will be created if it does not exist. | ||
--output-prefix |
Prefix for output filenames. | ||
Optional Tool Arguments | |||
--adamax-beta-1 |
0.9 | Adamax optimizer first moment estimation forgetting factor. | |
--adamax-beta-2 |
0.999 | Adamax optimizer second moment estimation forgetting factor. | |
--arguments_file |
read one or more arguments files and add them to the command line | ||
--caller-external-admixing-rate |
0.75 | Admixing ratio of new and old called posteriors (between 0 and 1; larger values implies using more of the new posterior and less of the old posterior) after convergence. | |
--caller-internal-admixing-rate |
0.75 | Admixing ratio of new and old called posteriors (between 0 and 1; larger values implies using more of the new posterior and less of the old posterior) for internal convergence loops. | |
--caller-update-convergence-threshold |
0.001 | Maximum tolerated calling update size for convergence. | |
--contig-ploidy-priors |
Input file specifying contig-ploidy priors. If only a single sample is specified, this input should not be provided. If multiple samples are specified, this input is required. | ||
--convergence-snr-averaging-window |
5000 | Averaging window for calculating training signal-to-noise ratio (SNR) for convergence checking. | |
--convergence-snr-countdown-window |
10 | The number of ADVI iterations during which the SNR is required to stay below the set threshold for convergence. | |
--convergence-snr-trigger-threshold |
0.1 | The SNR threshold to be reached before triggering the convergence countdown. | |
--disable-annealing |
false | (advanced) Disable annealing. | |
--disable-caller |
false | (advanced) Disable caller. | |
--disable-sampler |
false | (advanced) Disable sampler. | |
--gcs-max-retries -gcs-retries |
20 | If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection | |
--gcs-project-for-requester-pays |
Project to bill when accessing "requester pays" buckets. If unset, these buckets cannot be accessed. User must have storage.buckets.get permission on the bucket being accessed. | ||
--global-psi-scale |
0.001 | Prior scale of contig coverage unexplained variance. If a single sample is provided, this input will be ignored. | |
--help -h |
false | display the help message | |
--initial-temperature |
2.0 | Initial temperature (for DA-ADVI). | |
--interval-merging-rule -imr |
ALL | Interval merging rule for abutting intervals | |
--intervals -L |
One or more genomic intervals over which to operate | ||
--learning-rate |
0.05 | Adamax optimizer learning rate. | |
--log-emission-samples-per-round |
2000 | Log emission samples drawn per round of sampling. | |
--log-emission-sampling-median-rel-error |
5.0E-4 | Maximum tolerated median relative error in log emission sampling. | |
--log-emission-sampling-rounds |
100 | Log emission maximum sampling rounds. | |
--mapping-error-rate |
0.01 | Typical mapping error rate. | |
--max-advi-iter-first-epoch |
1000 | Maximum ADVI iterations in the first epoch. | |
--max-advi-iter-subsequent-epochs |
1000 | Maximum ADVI iterations in subsequent epochs. | |
--max-calling-iters |
1 | Maximum number of internal self-consistency iterations within each calling step. | |
--max-training-epochs |
100 | Maximum number of training epochs. | |
--mean-bias-standard-deviation |
0.01 | Prior standard deviation of the contig-level mean coverage bias. If a single sample is provided, this input will be ignored. | |
--min-training-epochs |
20 | Minimum number of training epochs. | |
--model |
Input ploidy-model directory. If only a single sample is specified, this input is required. If multiple samples are specified, this input should not be provided. | ||
--num-thermal-advi-iters |
5000 | Number of thermal ADVI iterations (for DA-ADVI). | |
--sample-psi-scale |
1.0E-4 | Prior scale of the sample-specific correction to the coverage unexplained variance. | |
--version |
false | display the version number for this tool | |
Optional Common Arguments | |||
--exclude-intervals -XL |
One or more genomic intervals to exclude from processing | ||
--gatk-config-file |
A configuration file to use with the GATK. | ||
--interval-exclusion-padding -ixp |
0 | Amount of padding (in bp) to add to each interval you are excluding. | |
--interval-padding -ip |
0 | Amount of padding (in bp) to add to each interval you are including. | |
--interval-set-rule -isr |
UNION | Set merging approach to use for combining interval inputs | |
--QUIET |
false | Whether to suppress job-summary info on System.err. | |
--tmp-dir |
Temp directory to use. | ||
--use-jdk-deflater -jdk-deflater |
false | Whether to use the JdkDeflater (as opposed to IntelDeflater) | |
--use-jdk-inflater -jdk-inflater |
false | Whether to use the JdkInflater (as opposed to IntelInflater) | |
--verbosity |
INFO | Control verbosity of logging. | |
Advanced Arguments | |||
--showHidden |
false | display hidden arguments |
Argument details
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
--adamax-beta-1
Adamax optimizer first moment estimation forgetting factor.
double 0.9 [ [ 0 1 ] ]
--adamax-beta-2
Adamax optimizer second moment estimation forgetting factor.
double 0.999 [ [ 0 1 ] ]
--arguments_file
read one or more arguments files and add them to the command line
List[File] []
--caller-external-admixing-rate
Admixing ratio of new and old called posteriors (between 0 and 1; larger values implies using more of the new posterior and less of the old posterior) after convergence.
double 0.75 [ [ 0 ∞ ] ]
--caller-internal-admixing-rate
Admixing ratio of new and old called posteriors (between 0 and 1; larger values implies using more of the new posterior and less of the old posterior) for internal convergence loops.
double 0.75 [ [ 0 ∞ ] ]
--caller-update-convergence-threshold
Maximum tolerated calling update size for convergence.
double 0.001 [ [ 0 ∞ ] ]
--contig-ploidy-priors
Input file specifying contig-ploidy priors. If only a single sample is specified, this input should not be provided. If multiple samples are specified, this input is required.
File null
--convergence-snr-averaging-window
Averaging window for calculating training signal-to-noise ratio (SNR) for convergence checking.
int 5000 [ [ 0 ∞ ] ]
--convergence-snr-countdown-window
The number of ADVI iterations during which the SNR is required to stay below the set threshold for convergence.
int 10 [ [ 0 ∞ ] ]
--convergence-snr-trigger-threshold
The SNR threshold to be reached before triggering the convergence countdown.
double 0.1 [ [ 0 ∞ ] ]
--disable-annealing
(advanced) Disable annealing.
boolean false
--disable-caller
(advanced) Disable caller.
boolean false
--disable-sampler
(advanced) Disable sampler.
boolean false
--exclude-intervals / -XL
One or more genomic intervals to exclude from processing
Use this argument to exclude certain parts of the genome from the analysis (like -L, but the opposite).
This argument can be specified multiple times. You can use samtools-style intervals either explicitly on the
command line (e.g. -XL 1 or -XL 1:100-200) or by loading in a file containing a list of intervals
(e.g. -XL myFile.intervals).
List[String] []
--gatk-config-file
A configuration file to use with the GATK.
String null
--gcs-max-retries / -gcs-retries
If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection
int 20 [ [ -∞ ∞ ] ]
--gcs-project-for-requester-pays
Project to bill when accessing "requester pays" buckets. If unset, these buckets cannot be accessed. User must have storage.buckets.get permission on the bucket being accessed.
String ""
--global-psi-scale
Prior scale of contig coverage unexplained variance. If a single sample is provided, this input will be ignored.
double 0.001 [ [ 0 ∞ ] ]
--help / -h
display the help message
boolean false
--initial-temperature
Initial temperature (for DA-ADVI).
double 2.0 [ [ 0 ∞ ] ]
--input / -I
Input paths for read-count files containing integer read counts in genomic intervals for all samples. All intervals specified via -L/-XL must be contained; if none are specified, then intervals must be identical and in the same order for all samples. If read-count files are given by Google Cloud Storage paths, have the extension .counts.tsv or .counts.tsv.gz, and have been indexed by IndexFeatureFile, only the specified intervals will be queried and streamed; this can reduce disk usage by avoiding the complete localization of all read-count files.
R List[String] []
--interval-exclusion-padding / -ixp
Amount of padding (in bp) to add to each interval you are excluding.
Use this to add padding to the intervals specified using -XL. For example, '-XL 1:100' with a
padding value of 20 would turn into '-XL 1:80-120'. This is typically used to add padding around targets when
analyzing exomes.
int 0 [ [ -∞ ∞ ] ]
--interval-merging-rule / -imr
Interval merging rule for abutting intervals
By default, the program merges abutting intervals (i.e. intervals that are directly side-by-side but do not
actually overlap) into a single continuous interval. However you can change this behavior if you want them to be
treated as separate intervals instead.
The --interval-merging-rule argument is an enumerated type (IntervalMergingRule), which can have one of the following values:
IntervalMergingRule ALL
--interval-padding / -ip
Amount of padding (in bp) to add to each interval you are including.
Use this to add padding to the intervals specified using -L. For example, '-L 1:100' with a
padding value of 20 would turn into '-L 1:80-120'. This is typically used to add padding around targets when
analyzing exomes.
int 0 [ [ -∞ ∞ ] ]
--interval-set-rule / -isr
Set merging approach to use for combining interval inputs
By default, the program will take the UNION of all intervals specified using -L and/or -XL. However, you can
change this setting for -L, for example if you want to take the INTERSECTION of the sets instead. E.g. to
perform the analysis only on chromosome 1 exomes, you could specify -L exomes.intervals -L 1 --interval-set-rule
INTERSECTION. However, it is not possible to modify the merging approach for intervals passed using -XL (they will
always be merged using UNION).
Note that if you specify both -L and -XL, the -XL interval set will be subtracted from the -L interval set.
The --interval-set-rule argument is an enumerated type (IntervalSetRule), which can have one of the following values:
IntervalSetRule UNION
--intervals / -L
One or more genomic intervals over which to operate
List[String] []
--learning-rate
Adamax optimizer learning rate.
double 0.05 [ [ 0 ∞ ] ]
--log-emission-samples-per-round
Log emission samples drawn per round of sampling.
int 2000 [ [ 0 ∞ ] ]
--log-emission-sampling-median-rel-error
Maximum tolerated median relative error in log emission sampling.
double 5.0E-4 [ [ 0 ∞ ] ]
--log-emission-sampling-rounds
Log emission maximum sampling rounds.
int 100 [ [ 0 ∞ ] ]
--mapping-error-rate
Typical mapping error rate.
double 0.01 [ [ 0 1 ] ]
--max-advi-iter-first-epoch
Maximum ADVI iterations in the first epoch.
int 1000 [ [ 0 ∞ ] ]
--max-advi-iter-subsequent-epochs
Maximum ADVI iterations in subsequent epochs.
int 1000 [ [ 0 ∞ ] ]
--max-calling-iters
Maximum number of internal self-consistency iterations within each calling step.
int 1 [ [ 0 ∞ ] ]
--max-training-epochs
Maximum number of training epochs.
int 100 [ [ 0 ∞ ] ]
--mean-bias-standard-deviation
Prior standard deviation of the contig-level mean coverage bias. If a single sample is provided, this input will be ignored.
double 0.01 [ [ 0 ∞ ] ]
--min-training-epochs
Minimum number of training epochs.
int 20 [ [ 0 ∞ ] ]
--model
Input ploidy-model directory. If only a single sample is specified, this input is required. If multiple samples are specified, this input should not be provided.
File null
--num-thermal-advi-iters
Number of thermal ADVI iterations (for DA-ADVI).
int 5000 [ [ 0 ∞ ] ]
--output / -O
Output directory. This will be created if it does not exist.
R File null
--output-prefix
Prefix for output filenames.
R String null
--QUIET
Whether to suppress job-summary info on System.err.
Boolean false
--sample-psi-scale
Prior scale of the sample-specific correction to the coverage unexplained variance.
double 1.0E-4 [ [ 0 ∞ ] ]
--showHidden / -showHidden
display hidden arguments
boolean false
--tmp-dir
Temp directory to use.
GATKPath null
--use-jdk-deflater / -jdk-deflater
Whether to use the JdkDeflater (as opposed to IntelDeflater)
boolean false
--use-jdk-inflater / -jdk-inflater
Whether to use the JdkInflater (as opposed to IntelInflater)
boolean false
--verbosity / -verbosity
Control verbosity of logging.
The --verbosity argument is an enumerated type (LogLevel), which can have one of the following values:
LogLevel INFO
--version
display the version number for this tool
boolean false
GATK version 4.3.0.0 built at Wed, 12 Oct 2022 21:04:44 -0400.
1 comment
I am trying to run the gCNV. But I am kind of stuck at running this tool, which requires me to give a --contig-ploidy-priors table. In the tutorial a the table is given for that example data.
But I need to try this on my Exome data and there is no explaination given about how to make this table. The contigs for my data is very long and it is not easy to create manually.
Please help me out about how to make this table.
Thanks
Please sign in to leave a comment.