The tutorial outlines steps in detecting germline copy number variants (gCNVs) and illustrates two workflow modes--cohort mode and case mode. The cohort mode simultaneously generates a cohort model and calls CNVs for the cohort samples. The case mode analyzes a single sample against an already constructed cohort model. The same workflow steps apply to both targeted exome and whole genome sequencing (WGS) data. The workflow is able to call both rare and common events and intelligently handles allosomal ploidies, i.e. cohorts of mixed male and female samples.
For the cohort mode, the general recommendation is at least a hundred samples to start. Researchers should expect to tune workflow parameters from the provided defaults. In particular, GermlineCNVCaller's default inference parameters are conservatively set for efficient run times.
The figure diagrams the workflow tools. Section 1 creates an intervals list and counts read alignments overlapping the intervals. Section 2 shows optional but recommended cohort mode steps to annotate intervals with covariates for use in filtering intervals as well as for use in explicit modeling. The section also removes outlier counts intervals. Section 3 generates global baseline observations for the data and models and calls the ploidy of each contig. Section 4 is at the heart of the workflow and models per-interval copy number. Because of the high intensity of compute model fitting requires, the section shows how to analyze data in parts. Finally, Section 5 calls per-sample copy number events per interval and per segment. Results are in VCF format.
► A highly recommended whitepaper detailing the methods is in the gatk GitHub repository's docs/CNV directory.
► For pipelined workflows, see the gatk GitHub repository's scripts/cnv wdl. Be sure to obtain a tagged version of the script, e.g. v4.1.0.0, following instructions in Section 4 of Article #23405.
► This workflow is not appropriate for bulk tumor samples, as it infers absolute copy numbers. For somatic copy number alteration calling, see Tutorial #11682.
Article #11687 visualizes the results in IGV and provides followup discussion. Towards data exploration, here are two illustrative Jupyter Notebook reports that dissect the results.
-
Notebook #11685 shows an approach to measuring concordance of sample NA19017 gCNV calls to 1000 Genomes Project truth set calls using tutorial
chr20sub
small data. - Notebook #11686 examines gCNV callset annotations using larger data, namely chr20 gCNV results from the tutorial's 24-sample cohort.
Jump to a section
-
Collect raw counts data with PreprocessIntervals and CollectFragmentCounts
☞ 1.1 How do I view HDF5 format data? - (Optional) Annotate intervals with features and subset regions of interest with FilterIntervals
- Call autosomal and allosomal contig ploidy with DetermineGermlineContigPloidy
-
Model interval copy number for a cohort or case sample with GermlineCNVCaller
☞ 4.1 How do I increase the sensitivity of detection?
☞ 4.2 How do I make interval lists for scattering? - Call copy number segments and consolidate sample results with PostprocessGermlineCNVCalls
Tools involved
- GATK 4.1.0.0
- Workflow tools DetermineGermlineContigPloidy, GermlineCNVCaller and PostprocessGermlineCNVCalls require a Python environment with specific packages, e.g. the gCNV computational python module gcnvkernel. See Article #12836 for instructions on setting up and managing the environment with the user-conda. Once the conda environment is set up, e.g. with
conda env create -f gatkcondaenv.yml
, activate it withsource activate gatk
orconda activate gatk
before running the tool. Alternatively, use the broadinstitute/gatk Docker, which activates the Python environment by default. Allocation of at least 8GB memory to Docker is recommended for the tutorial commands. See Article #11090 for instructions to launch a Docker container.
Download example data
The tutorial provides example small WGS data sourced from the 1000 Genomes Project. Cohort mode illustrations use 24 samples, while case mode illustrations analyze one sample against a cohort model made from the remaining 23 samples. The tutorial uses a fraction of the workflow's recommended hundred samples for ease of illustration. Furthermore, commands in each step use one of three differently sized intervals lists for efficiency. Coverage data are from the entirety of chr20, chrX and chrY. So although a step may analyze a subset of regions, it is possible to instead analyze all three contigs in case or cohort modes.
Download tutorial_11684.tar.gz either from the GoogleDrive or from the FTP site. The bundle includes data for Notebook #11685 and Notebook #11686. To access the ftp site, leave the password field blank. If the GoogleDrive link is broken, please let us know. The tutorial also requires the GRCh38 reference FASTA, dictionary and index. These are available from the GATK Resource Bundle. The example data is from the 1000 Genomes project Phase 3 aligned to GRCh38.
1. Collect raw counts data with PreprocessIntervals and CollectReadCounts
PreprocessIntervals pads exome targets and bins WGS intervals. Binning refers to creating equally sized intervals across the reference. For example, 1000 base binning would define chr1:1-1000 as the first bin. Because counts of reads on reference N
bases are not meaningful, the tool automatically excludes bins with all N
s. For GRCh38 chr1, non-N sequences start at base 10,001, so the first few bin become:
For WGS data, bin entirety of reference, e.g. with 1000 base intervals.
gatk PreprocessIntervals \ -R ~/ref/Homo_sapiens_assembly38.fasta \ --padding 0 \ -imr OVERLAPPING_ONLY \ -O grch38.preprocessed.interval_list
This produces a Picard-style intervals list of 1000 base bins.
For exome data, pad target regions, e.g. with 250 bases.
gatk PreprocessIntervals \ -R ~/ref/Homo_sapiens_assembly38.fasta \ -L targets.interval_list \ --bin-length 0 \ -imr OVERLAPPING_ONLY \ -O targets.preprocessed.interval_list
This produces a Picard-style intervals list of exome target regions padded by 250 bases on either side.
For the tutorial, bins three contigs.
The contigs in gcnv-chr20XY-contig.list
subset the reference to chr20, chrX and chrY.
gatk PreprocessIntervals \ -R ref/Homo_sapiens_assembly38.fasta \ --padding 0 \ -L gcnv-chr20XY-contig.list \ -imr OVERLAPPING_ONLY \ -O chr20XY.interval_list
This generates a Picard-style intervals list with 242,549 intervals. The file has a header section with @
header lines and a five-column body. See Article #11009 for a description of the columns.
Comments on select parameters
- For WGS, the default 1000
--bin-length
is the recommended starting point for typical 30x data. Be sure to set--padding 0
to disable padding outside of given genomic regions. Bin size should correlate with depth of coverage, e.g. lower coverage data should use larger bin size while higher coverage data can support smaller bin size. The size of the bin defines the resolution of CNV calls. The factors to consider in sizing include how noisy the data is, average coverage depth and how even coverage is across the reference. - For targeted exomes, provide the exome capture kit's target intervals with
-L
, set--bin-length 0
to disable binning and pad the intervals with--padding 250
or other desired length. - Provide intervals to exclude from analysis with
--exclude-intervals
or-XL
, e.g. centromeric regions. Consider using this option especially if data is aligned to a reference other than GRCh38. The workflow enables excluding regions later again using-XL
. A frugal strategy is to collect read counts using the entirety of intervals and then to exclude undesirable regions later at the FilterIntervals step (section 2), the DetermineGermlineContigPloidy step (section 3), at the GermlineCNVCaller step (section 5) and/or post-calling.
CollectReadCounts tabulates the raw integer counts of reads overlapping an interval. The tutorial has already collected read counts ahead of time for the three contigs--chr20, chrX and chrY. Here, we collect read counts on small data.
Count reads per bin using CollectReadCounts
gatk CollectReadCounts \ -L chr20sub.interval_list \ -R ref/Homo_sapiens_assembly38.fasta \ -imr OVERLAPPING_ONLY \ -I NA19017.chr20sub.bam \ --format TSV \ -O NA19017.tsv
This generates a TSV format table of read counts.
Comments on select parameters
- The tutorial generates text-based TSV (tab-separated-value) format data instead of the default HDF5 format by adding
--format TSV
to the command. Omit this option to generate the default HDF5 format. Downstream tools process HDF5 format more efficiently. - Here and elsewhere in the workflow, set
--interval-merging-rule
(-imr
) toOVERLAPPING_ONLY
, to prevent the tool from merging abutting intervals. - The tool employs a number of engine-level read filters. Of note are NotDuplicateReadFilter and MappingQualityReadFilter. This means the tool excludes reads marked as duplicate and excludes reads with mapping quality less than 10. Change the mapping quality threshold with the
--minimum-mapping-quality
option.
After the SAM format header section, denoted by lines starting with @
, the body of the data has a column header line followed by read counts for every interval.
☞ 1.1 How do I view HDF5 format data?
See Article #11508 for an overview of the format and instructions on how to navigate the data with external application HDFView. The article illustrates features of the format using data generated in another tutorial.
2. (Optional) Annotate intervals with features and subset regions of interest with FilterIntervals
The steps in this section pertain to the cohort mode.
Researchers may desire to subset the intervals that GermlineCNVCaller will analyze, either to exclude potentially problematic regions or to retain only regions of interest. For example one may wish to exclude regions where all samples in a large cohort have copy number zero. Filtering intervals can be especially impactful for analyses that utilize references other than GRCh38 or that are based on sequencing technologies affected by sequence context, e.g. targeted exomes. The tutorial data is WGS data aligned to GRCh38, and the gCNV workflow can process the entirety of the data, without the need for any interval filtering.
Towards deciding which regions to exclude, AnnotateIntervals labels the given intervals with GC content and additionally with mappability and segmental duplication content if given the respective optional resource files. FilterIntervals then subsets the intervals list based on the annotations and other tunable thresholds. Later, GermlineCNVCaller also takes in the annotated intervals to use as covariates towards analysis.
Explicit GC-correction, although optional, is recommended. The default v4.1.0.0 cnv_germline_cohort_workflow.wdl
pipeline workflow omits explicit gc-correction and we activate it in the pipeline by setting do_explicit_gc_correction":"True"
. The tutorial illustrates the optional AnnotateIntervals step by performing the recommended explicit GC-content-based filtering.
AnnotateIntervals with GC content
gatk AnnotateIntervals \ -L chr20XY.interval_list \ -R ref/Homo_sapiens_assembly38.fasta \ -imr OVERLAPPING_ONLY \ -O chr20XY.annotated.tsv
This produces a four-column table where the fourth column gives the fraction of GC content.
Comments on select parameters
- The tool requires the
-R
reference and the-L
intervals. The tool calculates GC-content for the intervals using the reference. -
Although optional for the tool, we recommend annotating mappability by providing a
--mappability-track
regions file in either.bed
or.bed.gz
format. Be sure to merge any overlapping intervals beforehand. The tutorial omits use of this resource.GATK recommends use of the the single-read mappability track, as the multi-read track requires much longer times to process. For example, the Hoffman lab at the University of Toronto provides human and mouse mappability BED files for various kmer lengths at https://bismap.hoffmanlab.org. The accompanying publication is titled Umap and Bismap: quantifying genome and methylome mappability.
-
Optionally and additionally, annotate segmental duplication content by providing a
--segmental-duplication-track
regions file in either.bed
or.bed.gz
format. - Exclude undesirable intervals with the
-XL
parameter, e.g. intervals corresponding to centromeric regions.
FilterIntervals takes preprocessed intervals and either annotated intervals or read counts or both. It can also exclude intervals given with -XL
. When given both types of data, the tool retains the intervals that intersect from filtering on each data type. The v4.1.0.0 cnv_germline_cohort_workflow.wdl
pipeline script requires read counts files, and so by default the pipeline script always performs the FilterIntervals step on read counts.
FilterIntervals based on GC-content and cohort extreme counts
gatk FilterIntervals \ -L chr20XY.interval_list \ --annotated-intervals chr20XY.annotated.tsv \ -I cvg/HG00096.tsv -I cvg/HG00268.tsv -I cvg/HG00419.tsv -I cvg/HG00759.tsv \ -I cvg/HG01051.tsv -I cvg/HG01112.tsv -I cvg/HG01500.tsv -I cvg/HG01565.tsv \ -I cvg/HG01583.tsv -I cvg/HG01595.tsv -I cvg/HG01879.tsv -I cvg/HG02568.tsv \ -I cvg/HG02922.tsv -I cvg/HG03006.tsv -I cvg/HG03052.tsv -I cvg/HG03642.tsv \ -I cvg/HG03742.tsv -I cvg/NA18525.tsv -I cvg/NA18939.tsv -I cvg/NA19017.tsv \ -I cvg/NA19625.tsv -I cvg/NA19648.tsv -I cvg/NA20502.tsv -I cvg/NA20845.tsv \ -imr OVERLAPPING_ONLY \ -O chr20XY.cohort.gc.filtered.interval_list
This produces a Picard-style intervals list containing a subset of the starting intervals (230,126 of 242,549). Of the filtered intervals, GC-content filters 42 intervals, and extreme coverage counts from the 24-sample cohort remove an additional 12,381 intervals for a total of 12,423 removed filtered intervals (5.122% of starting).
Comments on select parameters
- The tool requires the preprocessed intervals, provided with
-L
, from Section 1. -
Given annotated intervals with
--annotated-intervals
, the tool filters intervals on the given annotation(s).- GC-content thresholds are set by
--minimum-gc-content
and--maximum-gc-content
, where defaults are 0.1 and 0.9, respectively. - Mappability thresholds are set by
--minimum-mappability
and--maximum-mappability
. Defaults are 0.9 and 1.0, respectively. - Segmental duplication content thresholds are set by
--minimum-segmental-duplication-content
and--maximum-segmental-duplication-content
. Defaults are 0.0 and 0.5, respectively.
- GC-content thresholds are set by
-
Given read counts files, each with
-I
and in either HDF5 or TSV format, the tool filters intervals on low and extreme read counts with the following tunable thresholds.-
--low-count-filter-count-threshold
default is 10 -
--low-count-filter-percentage-of-samples
default is 50.0 -
--extreme-count-filter-minimum-percentile
default is 1.0 -
--extreme-count-filter-maximum-percentile
default is 99.0 -
--extreme-count-filter-percentage-of-samples
default is 90.0
The read counts data must match each other in intervals.
For the default parameters, the tool first filters intervals with a count less than 10 in greater than 50% of the samples. The tool then filters the remaining intervals with a count percentile less than 1 or greater than 99 in a percentage of samples greater than 90%. These parameters effectively exclude intervals where all samples have extreme outlier counts, e.g. are deleted.
To disable counts based filtering, omit the read counts or, e.g. when using the v4.1.0.0
cnv_germline_cohort_workflow.wdl
pipeline script, set the twopercentage-of-samples
parameters as follows. -
--low-count-filter-percentage-of-samples 100 \ --extreme-count-filter-percentage-of-samples 100 \
- Provide intervals to exclude from analysis with
--exclude-intervals
or-XL
, e.g. centromeric regions. A frugal strategy is to collect read counts using the entirety of intervals and then to exclude undesirable regions later at the FilterIntervals step (section 2), the DetermineGermlineContigPloidy step (section 3), at the GermlineCNVCaller step (section 5) and/or post-calling.
3. Call autosomal and allosomal contig ploidy with DetermineGermlineContigPloidy
DetermineGermlineContigPloidy calls contig level ploidies for both autosomal, e.g. human chr20, and allosomal contigs, e.g. human chrX. The tool determines baseline contig ploidies using sample coverages and contig ploidy priors that give the prior probabilities for each ploidy state for each contig. In this process, the tool generates global baseline coverage and noise data GermlineCNVCaller will use in section 5.
The tool determines baseline contig ploidies using the total read count per contig. Researchers should consider the impact of this for their data. For example, for the tutorial WGS data, the contribution of the PAR regions to total coverage counts on chrX is small and the tool correctly calls allosomal ploidies. However, consider blacklisting PAR regions for data where the contribution is disporportionate, e.g. targeted panels.
DetermineGermlineContigPloidy in COHORT MODE
The cohort mode requires a --contig-ploidy-priors
table and produces a ploidy model.
gatk DetermineGermlineContigPloidy \ -L chr20XY.cohort.gc.filtered.interval_list \ --interval-merging-rule OVERLAPPING_ONLY \ -I cvg/HG00096.tsv -I cvg/HG00268.tsv -I cvg/HG00419.tsv -I cvg/HG00759.tsv \ -I cvg/HG01051.tsv -I cvg/HG01112.tsv -I cvg/HG01500.tsv -I cvg/HG01565.tsv \ -I cvg/HG01583.tsv -I cvg/HG01595.tsv -I cvg/HG01879.tsv -I cvg/HG02568.tsv \ -I cvg/HG02922.tsv -I cvg/HG03006.tsv -I cvg/HG03052.tsv -I cvg/HG03642.tsv \ -I cvg/HG03742.tsv -I cvg/NA18525.tsv -I cvg/NA18939.tsv -I cvg/NA19017.tsv \ -I cvg/NA19625.tsv -I cvg/NA19648.tsv -I cvg/NA20502.tsv -I cvg/NA20845.tsv \ --contig-ploidy-priors chr20XY_contig_ploidy_priors.tsv \ --output . \ --output-prefix ploidy \ --verbosity DEBUG
This produces a ploidy-calls
directory and a ploidy-model
directory. The ploidy-calls
directory contains a folder of data for each sample in the cohort including the contig ploidy calls. Each sample directory, e.g. ploidy-calls/SAMPLE_0
, contains five files.
-
contig_ploidy.tsv
notes the ploidy and genotype quality (GQ) of the ploidy call for each contig. -
global_read_depth.tsv
notes an average depth value and an average ploidy across all the intervals of the sample. -
mu_psi_s_log__.tsv
captures the posterior mean for all of the modeled parameters. -
sample_name.txt
contains the readgroup sample (RG SM) name. -
std_psi_s_log__.tsv
captures the standard deviation for all of the modeled paramters.
The ploidy-model
directory contains aggregated model data for the cohort. This is the model to provide to a case-mode DetermineGermlineContigPloidy analysis and to GermlineCNVCaller. The tutorial ploidy-model
directory contains the eight files as follows.
-
contig_ploidy_prior.tsv
is a copy of the ploidy priors given to the tool. -
gcnvkernel_version.json
notes the version of the kernel. -
interval_list.tsv
recapitulates the intervals used, e.g. the filtered intervals. mu_mean_bias_j_lowerbound__.tsv
mu_psi_j_log__.tsv
ploidy_config.json
std_mean_bias_j_lowerbound__.tsv
std_psi_j_log__.tsv
The theano model automatically generates mu_
and std_
files and may append transformations it performs to the file name, e.g. log
or lowerbound
as we see above. These are likely of interest only to advanced users.
DetermineGermlineContigPloidy in CASE MODE
The case mode calls contig ploidies for each sample against the ploidy model given by --model
. The following command runs sample NA19017 against a 23-sample cohort model.
gatk DetermineGermlineContigPloidy \ --model cohort-23wgs-20190213-contig-ploidy-model \ -I cvg/NA19017.tsv \ -O . \ --output-prefix ploidy-case \ --verbosity DEBUG
This produces a ploidy-case-calls
directory, which in turn contains a directory of sample data, SAMPLE_0
. A list of the five resulting files is some paragraphs above.
Comments on select parameters
- It is possible to analyze multiple samples simultaneously in a case mode command. Provide each sample with
-I
. - For the
-L
intervals, supply the most processed intervals list. For the tutorial, this is the filtered intervals. Note the case mode does not require explicit intervals because the ploidy model provides them. - Provide a
--contig-ploidy-priors
table containing the per-contig prior probabilities for integer ploidy state. Again, the case mode does not require an explicit priors file as the ploidy model provides them. Tool index describes this resource in detail. The tutorial uses the following contig ploidy priors. - Optionally provide intervals to exclude from analysis with
--exclude-intervals
or-XL
, e.g. pseudoautosomal (PAR) regions, which can skew results for certain data.
The results for NA19017, from either the cohort mode or the case mode, show ploidy 2 for chr20 and chrX and ploidy 0 for chrY. The PLOIDY_GQ quality metrics differ slightly for the modes. The entirety of NA19017's contig_ploidy.tsv
is shown.
Checking the ploidy calls for each of the 24 samples against metadata confirms expectations. The following table summarizes results for the 24 samples. The data was collated from DetermineGermlineContigPloidy results using a bashscript.
It should be noted, the tutorial's default parameter run gives XY samples CN1 for the majority of chrX, including for PAR regions, where coverage is actually on par with the CN2 of XX samples. See Article#11687 for further discussion.
4. Call copy number variants with GermlineCNVCaller
GermlineCNVCaller learns a denoising model per scattered shard while consistently calling CNVs across the shards. The tool models systematic biases and CNVs simultaneously, which allows for sensitive detection of both rare and common CNVs. For a description of innovations, see Blog #23439 As the tool index states under Important Remarks (v4.1.0.0), the tool should see data from a large enough genomic region so as to be exposed to diverse genomic features. The current recommendation is to provide at least ~10–50Mbp genomic coverage per scatter. This applies to exomes or WGS. This allows reliable inference of bias factors including GC bias. The limitation of analyzing larger regions is available memory. As an analysis covers more data, memory requirements increase.
For expediency, the tutorial commands below analyze small data, specifically the 1400 bins in twelveregions.cohort.gc.filtered.interval_list
and use default parameters. The tutorial splits the 1400 bins into two shards with 700 bins each to illustrate scattering. This results in ~0.7Mbp genomic coverage per shard. See section 4.2 for how to split interval lists by a given number of intervals. Default inference parameters are conservatively set for efficient run times.
The tutorial coverage data are sufficient to analyze the ~15Mb in
chr20sub.cohort.gc.filtered.interval_list
as well as the entirety of chr20, chrX and chrY using the ~230Mb ofchr20XY.cohort.gc.filtered.interval_list
. The former, at 5K bins per shard, give three shards. When running the default parameters in a GATKv4.1.0.0 Docker locally on a MacBook Pro, each cohort-mode shard analysis takes ~20 minutes. The latter gives 46 shards at 5K bins per shard. When running the default parameters of the v4.1.0.0 WDL cohort-mode workflow on the cloud, the majority of the shard analyses complete in half an hour.
GermlineCNVCaller in COHORT MODE
Call gCNVs on the 24-sample cohort in two scatters. Notice the different -L
intervals and --output-prefix
basenames.
gatk GermlineCNVCaller \ --run-mode COHORT \ -L scatter-sm/twelve_1of2.interval_list \ -I cvg/HG00096.tsv -I cvg/HG00268.tsv -I cvg/HG00419.tsv -I cvg/HG00759.tsv \ -I cvg/HG01051.tsv -I cvg/HG01112.tsv -I cvg/HG01500.tsv -I cvg/HG01565.tsv \ -I cvg/HG01583.tsv -I cvg/HG01595.tsv -I cvg/HG01879.tsv -I cvg/HG02568.tsv \ -I cvg/HG02922.tsv -I cvg/HG03006.tsv -I cvg/HG03052.tsv -I cvg/HG03642.tsv \ -I cvg/HG03742.tsv -I cvg/NA18525.tsv -I cvg/NA18939.tsv -I cvg/NA19017.tsv \ -I cvg/NA19625.tsv -I cvg/NA19648.tsv -I cvg/NA20502.tsv -I cvg/NA20845.tsv \ --contig-ploidy-calls ploidy-calls \ --annotated-intervals twelveregions.annotated.tsv \ --interval-merging-rule OVERLAPPING_ONLY \ --output cohort24-twelve \ --output-prefix cohort24-twelve_1of2 \ --verbosity DEBUG
gatk GermlineCNVCaller \ --run-mode COHORT \ -L scatter-sm/twelve_2of2.interval_list \ -I cvg/HG00096.tsv -I cvg/HG00268.tsv -I cvg/HG00419.tsv -I cvg/HG00759.tsv \ -I cvg/HG01051.tsv -I cvg/HG01112.tsv -I cvg/HG01500.tsv -I cvg/HG01565.tsv \ -I cvg/HG01583.tsv -I cvg/HG01595.tsv -I cvg/HG01879.tsv -I cvg/HG02568.tsv \ -I cvg/HG02922.tsv -I cvg/HG03006.tsv -I cvg/HG03052.tsv -I cvg/HG03642.tsv \ -I cvg/HG03742.tsv -I cvg/NA18525.tsv -I cvg/NA18939.tsv -I cvg/NA19017.tsv \ -I cvg/NA19625.tsv -I cvg/NA19648.tsv -I cvg/NA20502.tsv -I cvg/NA20845.tsv \ --contig-ploidy-calls ploidy-calls \ --annotated-intervals twelveregions.annotated.tsv \ --interval-merging-rule OVERLAPPING_ONLY \ --output cohort24-twelve \ --output-prefix cohort24-twelve_2of2 \ --verbosity DEBUG
This produces per-interval gCNV calls for each of the cohort samples and a gCNV model for the cohort. Each command produces three directories within cohort24-twelve
: a cohort24-twelve_1of2-calls
folder of per sample gCNV call results, a cohort24-twelve_1of2-model
folder of cohort model data and a cohort24-twelve_1of2-tracking
folder of data that tracks model fitting. The table below lists the cohort mode data files alongside case mode files.
GermlineCNVCaller in CASE MODE
Call gCNVs on a sample against a cohort model. The case analysis must use the same scatter approach as the model generation. So, as above, we run two shard analyses. Here, --model
and --output-prefix
differ between the scatter the commands.
gatk GermlineCNVCaller \ --run-mode CASE \ -I cvg/NA19017.tsv \ --contig-ploidy-calls ploidy-case-calls \ --model cohort23-twelve/cohort23-twelve_1of2-model \ --output case-twelve-vs-cohort23 \ --output-prefix case-twelve-vs-cohort23_1of2 \ --verbosity DEBUG
gatk GermlineCNVCaller \ --run-mode CASE \ -I cvg/NA19017.tsv \ --contig-ploidy-calls ploidy-case-calls \ --model cohort23-twelve/cohort23-twelve_2of2-model \ --output case-twelve-vs-cohort23 \ --output-prefix case-twelve-vs-cohort23_2of2 \ --verbosity DEBUG
This produces both calls
and tracking
folders with, e.g. the case-twelve-vs-cohort23_1of2
basename. The case-twelve-vs-cohort23_1of2-calls
folder contains case sample gCNV call results and the case-twelve-vs-cohort23_1of2-tracking
folder contains model fitting results. The case mode results files are listed in the table below alongside cohort mode data files.
Comments on select parameters
- The
-O
output directory must be extant before running the command. Future releases (v4.1.1.0) will create the directory. - The default
--max-copy-number
is capped at 5. This means the tool reports any events with more copies as CN5. - For the cohort mode, optionally provide
--annotated-intervals
to include the annotations as covariates. These must contain all of the-L
intervals. The-L
intervals is an exact match or a subset of the annotated intervals. - For the case mode, the tool accepts only a single
--model
directory at a time. So the case must be analyzed with the same number of scatters as the cohort model run. The case mode parameters appear fewer than the cohort mode because the--model
directory provides the seemingly missing requirements, i.e. the scatter intervals and the annotated intervals. - For both modes, provide the
--contig-ploidy-calls
results from DetermineGermlineContigPloidy (Section 3). This not only informs ploidy but also establishes baseline coverage and noise levels for each sample. Later, in section 5, GermlineCNVCaller's shard analyses refer back to these global observations. -
--verbosity DEBUG
allows tracking the Python gcnvkernel model fitting in the stdout, e.g. with information on denoising epochs and whether the model converged. The default INFO level verbosity is the next most verbose and emits only GATK Engine level messages.
At this point, the workflow has done its most heavy lifting to produce data towards copy number calling. In Section 5, we consolidate the data from the scattered GermlineCNVCaller runs, perform segmentation and call copy number states.
One artificial construct of the tutorial is the use of full three contig ploidy calls data even when modeling copy number states for much smaller subset regions. This effectively stabilizes the small analysis.
☞ 4.1 How do I increase the sensitivity of detection?
The tutorial uses default GermlineCNVCaller modeling parameters. However, researchers should expect to tune parameters for data, e.g. from different sequencing technologies. For tuning, first consider the coherence length parameters, p-alt, p-active and the psi-scale parameters. These hyperparameters are just a few of the plethora of adjustable parameters GermlineCNVCaller offers. Refer to the GermlineCNVCaller tool documentation for detailed explanations, and ask on the GATK Forum for further guidance.
The tutorial illustrates one set of parameter changes for WGS data provided by Mark Walker of the GATK SV (Structural Variants) team that dramatically increase the sensitivity of calling on the tutorial data. Article #11687 and Notebook #11686 compare the results of using default vs. the increased-sensitivity parameters. Given the absence of off-the-shelf filtering solutions for CNV calls, when tuning parameters to increase sensitivity, researchers should expect to perform additional due diligence, especially for analyses requiring high precision calls.
__WGS parameters that increase the sensitivity of calling from Mark Walker
--class-coherence-length 1000.0 \ --cnv-coherence-length 1000.0 \ --enable-bias-factors false \ --interval-psi-scale 1.0E-6 \ --log-mean-bias-standard-deviation 0.01 \ --sample-psi-scale 1.0E-6 \
Comments on select sensitivity parameters
- Decreasing
--class-coherence-length
from its default of 10,000bp to 1000bp decreases the expected length of contiguous segments. Factor for bin size when tuning. - Decreasing
--cnv-coherence-length
from its default 10,000bp to 1000bp decreases the expected length of CNV events. Factor for bin size when tuning. - Turning off
--enable-bias-factors
from the defaulttrue
state tofalse
turns off active discovery of learnable bias factors. This should always be on for targeted exome data and in general can be turned off for WGS data. - Decreasing
--interval-psi-scale
from its default of 0.001 to 1.0E-6 reduces the scale the tool considers normal in per-interval noise. - Decreasing
--log-mean-bias-standard-deviation
from its default of 0.1 to 0.01 reduces what is considered normal noise in bias factors. - Decreasing
--sample-psi-scale
from its default of 0.0001 to 1.0E-6 reduces the scale that is considered normal in sample-to-sample variance.
Additional parameters to consider include --depth-correction-tau
, --p-active
and --p-alt
.
-
--depth-correction-tau
has a default of 10000.0 (10K) and defines the precision of read-depth concordance with the global depth value. -
--p-active
has a default of 1e-2 (0.01) and defines the prior probability of common CNV states. -
p-alt
has a default of 1e-6 (0.000001) and defines the expected probability of CNV events (in rare CNV states).
☞ 4.2 How do I make interval lists for scattering?
This step applies to the cohort mode. It is unnecessary for case mode analyses as the model implies the scatter intervals.
The v4.1.0.0 cnv_germline_cohort_workflow.wdl
pipeline workflow scatters the GermlineCNVCaller step. Each scattered analysis is on genomic intervals subset from intervals produced either from PreprocessIntervals (section 1) or from FilterIntervals (section 2). The workflow uses Picard IntervalListTools to break up the intervals list into roughly balanced lists.
gatk IntervalListTools \ --INPUT chr20sub.cohort.gc.filtered.interval_list \ --SUBDIVISION_MODE INTERVAL_COUNT \ --SCATTER_CONTENT 5000 \ --OUTPUT scatter
This produces three intervals lists with ~5K intervals each. For the tutorial's 1Kbp bins, this gives ~5Mbp genomic coverage per scatter. Each list is identically named scattered.interval_list
within its own folder within the scatter
directory. IntervalListTools systematically names the intermediate folders, e.g. temp_0001_of_3
, temp_0002_of_3
and temp_0002_of_3
.
Comments on select parameters
- The
--SUBDIVISION_MODE INTERVAL_COUNT
mode scatters intervals into similarly sized lists according to the count of intervals regardless of the base count. The tool intelligently breaks up thechr20sub.cohort.gc.filtered.interval_list
's ~15K intervals into lists of 5031, 5031 and 5033 intervals. This is preferable to having a fourth interval list with just 95 intervals. - The tool has another useful feature in the context of the gCNV workflow. To subset
-I
binned intervals, provide the regions of interest with-SI
(--SECOND_INPUT
) and use the--ACTION OVERLAPS
mode to create a new intervals list of the overlapping bins. Adding--SUBDIVISION_MODE INTERVAL_COUNT --SCATTER_CONTENT 5000
will produce scatter intervals concurrently with the subsetting.
5. Call copy number segments and consolidate sample results with PostprocessGermlineCNVCalls
PostprocessGermlineCNVCalls consolidates the scattered GermlineCNVCaller results, performs segmentation and calls copy number states. The tool generates per-interval and per-segment sample calls in VCF format and runs on a single sample at a time.
PostprocessGermlineCNVCalls COHORT MODE
Process a single sample from the 24-sample cohort using the sample index. For NA19017, the sample index is 19.
gatk PostprocessGermlineCNVCalls \ --model-shard-path cohort24-twelve/cohort24-twelve_1of2-model \ --model-shard-path cohort24-twelve/cohort24-twelve_2of2-model \ --calls-shard-path cohort24-twelve/cohort24-twelve_1of2-calls \ --calls-shard-path cohort24-twelve/cohort24-twelve_2of2-calls \ --allosomal-contig chrX --allosomal-contig chrY \ --contig-ploidy-calls ploidy-calls \ --sample-index 19 \ --output-genotyped-intervals genotyped-intervals-cohort24-twelve-NA19017.vcf.gz \ --output-genotyped-segments genotyped-segments-cohort24-twelve-NA19017.vcf.gz \ --sequence-dictionary ref/Homo_sapiens_assembly38.dict
PostprocessGermlineCNVCalls CASE MODE
NA19017 is the singular sample with index 0.
gatk PostprocessGermlineCNVCalls \ --model-shard-path cohort23-twelve/cohort23-twelve_1of2-model \ --model-shard-path cohort23-twelve/cohort23-twelve_2of2-model \ --calls-shard-path case-twelve-vs-cohort23/case-twelve-vs-cohort23_1of2-calls \ --calls-shard-path case-twelve-vs-cohort23/case-twelve-vs-cohort23_2of2-calls \ --allosomal-contig chrX --allosomal-contig chrY \ --contig-ploidy-calls ploidy-case-calls \ --sample-index 0 \ --output-genotyped-intervals genotyped-intervals-case-twelve-vs-cohort23.vcf.gz \ --output-genotyped-segments genotyped-segments-case-twelve-vs-cohort23.vcf.gz \ --sequence-dictionary ref/Homo_sapiens_assembly38.dict
Each command generates two VCFs with indices. The genotyped-intervals
VCF contains variant records for each analysis bin and therefore data covers only the interval regions. For the tutorial's small data, this gives 1400 records. The genotyped-segments
VCF contains records for each contiguous copy number state segment. For the tutorial's small data, this is 30 and 31 records for cohort and case mode analyses, respectively.
The two modes--cohort and case--give highly concordant but slightly different results for sample NA19017. The factor that explains the difference is the contribution of the sample itself to the model.
Comments on select parameters
- Specify a
--model-shard-path
directory for each scatter of the cohort model. - Specify a
--calls-shard-path
directory for each scatter of the cohort or case analysis. - Specify the
--contig-ploidy-calls
directory for the cohort or case analysis. - By default
--autosomal-ref-copy-number
is set to 2. - Define allosomal contigs with the
--allosomal-contig
parameter. - The tool requires specifying the
--output-genotyped-intervals
VCF. - Optionally generate segmented VCF results with
--output-genotyped-segments
. The tool segments the regions between the starting bin and the ending bin on a contig. The implication of this is that even if there is a gap between two analysis bins on the same contig, if the copy number state is equal for the bins, then the bins and the entire region between can end up a part of the same segment. The extent of this smoothing over gaps depends on the--cnv-coherence-length
parameter. - The
--sample-index
refers to the index number given to a sample by GermlineCNVCaller. In a case mode analysis of a single sample, the index will always be zero. - The
--sequence-dictionary
is optional. Without it, the tool generates unindexed VCF results. Alternatively, to produce the VCF indices, provide the-R
reference FASTA or use IndexFeatureFile afterward. The v4.1.0.0cnv_germline_cohort_workflow.wdl
pipeline workflow omits index generation.
Here is the result. The header section with lines starting with ##
gives information on the analysis and define the annotations. Notice the singular END
annotation in the INFO
column that denotes the end position of the event. This use of the END
notation is reminiscent of GVCF format GVCF blocks.
In the body of the data, as with any VCF, the first two columns give the contig and genomic start position for the variant. The third ID
column concatenates together CNV_contig_start_stop
, e.g. CNV_chr20_1606001_1609000
. The REF
column is always N and the ALT
column gives the two types of CNV events of interest in symbolic allele notation--<DEL>
for deletion and <DUP>
for duplication or amplification. Again, the INFO
field gives the END
position of the variant. The FORMAT
field lists sample-level annotations GT:CN:NP:QA:QS:QSE:QSS
. GT
of 0 indicates normal ploidy, 1 indicates deletion and 2 denotes duplication. The CN
annotation indicates the copy number state. For the tutorial small data, CN
ranges from 0 to 3.
Developer Andrey Smirnov notes filtering on QS can increase specificity. For a discussion of results, see Article #11687.
10 comments
Is there any way to get the log2 output instead of the CN from PostprocessGermlineCNVCalls?
Hi, I believe the *.tsv files in the tutorial_11684.tar.gz either from the GoogleDrive or from the FTP site are deprecated and cannot run through GermlineCNVCaller since GATK v4.1.x.x. I managed to hack the format and fixed it myself. You might want to update the tutorial files as well.
Hi,
Thanks for the great tutorial.
I'm trying to run the pipeline for 300 low coverage samples (~5X). At the step of running GermlineCNVCaller, I'm seeing the tool keeps increasing the number of epochs because CNV calling is not converged. It is now at 50 epochs. Is this something expected or is it possible the optimisation procedure has been "trapped" ?
Thanks for the tutorial! Could you help me to understand her the NA19017.chr20sub.bam file was prepared? Is it just a BWA mapping reads? Does it got the sort and marked duplicates steps?
I would like to know where I can get the following files:
ploidy_priors.tsv
mappability-track regions file (in either .bed or .bed.gz format).
segmental-duplication-track regions file (in either .bed or .bed.gz format).
contig-ploidy-priors_contig_
This link below is broken from above. Has there been an update with the Tutorial which matches the latest WDL pipeline?
ftp://ftp.broadinstitute.org/tutorials/dataset
Download tutorial_11684.tar.gz either from the GoogleDrive or from the FTP site. The bundle includes data for Notebook #11685 and Notebook #11686. To access the ftp site, leave the password field blank. If the GoogleDrive link is broken, please let us know. The tutorial also requires the GRCh38 reference FASTA, dictionary and index. These are available from the GATK Resource Bundle. The example data is from the 1000 Genomes project Phase 3 aligned to GRCh38.
Thanks for the tutorial! I have some troubles when using your tutorial to call CNV. Can you give me some suggestions? Here is my questions:
Can't generate ploidy-calls directory and ploidy-calls/SAMPLE_0 when use DetermineGermlineContigPloidy – GATK (broadinstitute.org).
Hi,
I am planning to run the Germline CNVs in docker on 106 targeted exomes. I am not planning to use the wld pipeline. I already run the Preprocessing step and I wonder how to run the next step, CollectReadsCounts over all those exams using the script below. Should I use a for loop to iterate over each bam sample and get every hdf5 sample result?
In addition, is necessary to generate the cohort model on "normal samples" or it can be done on the same batch of affected ones?
thanks
Hi,
In step 4, the script contains all the input files. Running this script will take a lot of time. Is it possible to create a script for each input file? Will the results obtained be consistent with the results obtained from the script that contains all the input files?
thanks
I'm trying to run this on a WES cohort with 200k intervals, split into 5k groups by the scatter method described in section 4.2
When I compare between scattered & non-scattered results (on a smaller subset that is not intractable to run whole), the segments called are not the same. I presume this is because I am missing any CNVs that span a boundary between scatter groups. How can I get around this?
Thank you
Please sign in to leave a comment.