GATK Germline CNV calling
Hello,
I very recently started my journey in bioinformatics.
I have been tasked with building a variant calling pipeline (WES & WGS). So far I have successfully set it up for calling SNPs and indels using GATK's Best Practices workflow, in both normal and DRAGEN mode. I am now trying to proceed with CNVs for whole genome (hg19), but I am not able to find the right interval list in GATK's public resource bundle. I used the command:
gatk PreprocessIntervals -R Homo_sapiens_assembly19.fasta --padding 0 -imr OVERLAPPING_ONLY -O hg19.preprocessed.interval_list
to generate my own interval list. I then followed the steps in https://gatk.broadinstitute.org/hc/en-us/articles/360035531152--How-to-Call-rare-germline-copy-number-variants#1 and built the sequence of commands below, which I tested on a batch of 3 samples (I've included only a rough version, without path or variable details). I am not sure if I am doing it right: I build a cohort model from my 3 samples and then call each sample again in case mode against that same model. Where do I use the panel of normals I created? And since I am doing this for the entire genome without specific intervals, do I skip the
--output-genotyped-segments
flag in GermlineCNVCaller? Also, how do I filter the CNVs after calling them? Is there a standard filtration protocol like the one we use for SNPs and indels?
# ------------------------------------------------------------------------------------------------------------------
# STEP 1: Collect Read Counts & Create Panel of Normals- CNV
# ------------------------------------------------------------------------------------------------------------------
$gatk PreprocessIntervals \
-R ${ref} \
--padding 0 \
-imr OVERLAPPING_ONLY \
-O ${interval}
# Collect read counts for each BAM file
cd ${aligned_reads}
# Find all bam files in given Sample directory
files=$(find . -type f -name "*GATK_sorted_dedup_reads.bam")
for i in ${files}; do
filename=$(basename "$i" .bam) # Extract the base filename without the .bam extension
$gatk CollectReadCounts \
-I $i \
--interval-merging-rule OVERLAPPING_ONLY \
-L ${interval} \
-O "${data}/${filename}_read_counts.hdf5"
done
$gatk AnnotateIntervals \
-L ${interval} \
-R ${ref} \
-imr OVERLAPPING_ONLY \
-O ${MAIN_DIR}/SUPP_FILES/REFERENCES/${reference}/Homo_sapiens_assembly19_annotated_interval_list.tsv
SAMPLE_INPUT=""
for counts_file in ${data}/*_read_counts.hdf5; do
SAMPLE_INPUT+="--input $counts_file "
done
$gatk FilterIntervals \
-L ${interval} \
--annotated-intervals ${MAIN_DIR}/SUPP_FILES/REFERENCES/${reference}/Homo_sapiens_assembly19_annotated_interval_list.tsv \
$SAMPLE_INPUT \
-imr OVERLAPPING_ONLY \
-O ${MAIN_DIR}/SUPP_FILES/REFERENCES/${reference}/Homo_sapiens_assembly19_filtered.interval_list
# Reuse the SAMPLE_INPUT argument list built above
$gatk CreateReadCountPanelOfNormals \
$SAMPLE_INPUT \
-O pon.hdf5
# ------------------------------------------------------------------------------------------------------------------
# STEP 2: Denoise Read Counts
# ------------------------------------------------------------------------------------------------------------------
for counts_file in ${data}/*_read_counts.hdf5; do
filename=$(basename "$counts_file" "_read_counts.hdf5") # Extract the base filename without the read_counts.hdf5 extension
$gatk DenoiseReadCounts \
-I $counts_file \
--count-panel-of-normals pon.hdf5 \
--standardized-copy-ratios "${data}/${filename}_standardized_CR.tsv" \
--denoised-copy-ratios "${data}/${filename}_denoised_CR.tsv"
$gatk PlotDenoisedCopyRatios \
--standardized-copy-ratios "${data}/${filename}_standardized_CR.tsv" \
--denoised-copy-ratios "${data}/${filename}_denoised_CR.tsv" \
--sequence-dictionary ${dict} \
--output-prefix ${filename} \
-O .
done
# ------------------------------------------------------------------------------------------------------------------
# STEP 3: Determine Germline Contig Ploidy
# ------------------------------------------------------------------------------------------------------------------
PLOIDY_INPUT=""
for counts_file in ${data}/*_read_counts.hdf5; do
PLOIDY_INPUT+="--input $counts_file "
done
$gatk DetermineGermlineContigPloidy \
-L ${MAIN_DIR}/SUPP_FILES/REFERENCES/${reference}/Homo_sapiens_assembly19_filtered.interval_list \
--interval-merging-rule OVERLAPPING_ONLY \
$PLOIDY_INPUT \
--contig-ploidy-priors "${ploidy_priors}" \
--output ${data} \
--output-prefix ploidy \
--verbosity DEBUG
CPLOIDY_INPUT=""
for counts_file in ${data}/*_read_counts.hdf5; do
CPLOIDY_INPUT+="--input $counts_file "
done
$gatk DetermineGermlineContigPloidy \
--model ${data}/ploidy-model \
$CPLOIDY_INPUT \
--output ${data} \
--output-prefix ploidy_case \
--verbosity DEBUG
# ------------------------------------------------------------------------------------------------------------------
# STEP 4: Run GermlineCNVCaller in Cohort Mode
# ------------------------------------------------------------------------------------------------------------------
COHORT_INPUT=""
for counts_file in ${data}/*_read_counts.hdf5; do
COHORT_INPUT+="--input $counts_file "
done
$gatk GermlineCNVCaller \
-L ${MAIN_DIR}/SUPP_FILES/REFERENCES/${reference}/Homo_sapiens_assembly19_filtered.interval_list \
--interval-merging-rule OVERLAPPING_ONLY \
--run-mode COHORT \
--contig-ploidy-calls ${data}/ploidy-calls \
$COHORT_INPUT \
--annotated-intervals ${MAIN_DIR}/SUPP_FILES/REFERENCES/${reference}/Homo_sapiens_assembly19_annotated_interval_list.tsv \
--output ${data}/CNVcalls \
--output-prefix cohort \
--verbosity DEBUG
# ------------------------------------------------------------------------------------------------------------------
# STEP 5: Run GermlineCNVCaller in Case Mode for Each Sample
# ------------------------------------------------------------------------------------------------------------------
CASE_INPUT=""
for counts_file in ${data}/*_read_counts.hdf5; do
CASE_INPUT+="--input $counts_file "
done
# filename=$(basename "$counts_file" _read_counts.hdf5)
$gatk GermlineCNVCaller \
--run-mode CASE \
--contig-ploidy-calls ${data}/ploidy_case-calls \
--model ${data}/CNVcalls/cohort-model \
$CASE_INPUT \
--output ${data}/CNVcasecalls \
--output-prefix pcase \
--verbosity DEBUG
# ------------------------------------------------------------------------------------------------------------------
# STEP 6: Postprocess Germline CNV Calls to Generate VCF Files
# ------------------------------------------------------------------------------------------------------------------
i=0
for counts_file in ${data}/*_read_counts.hdf5; do
filename=$(basename "$counts_file" _read_counts.hdf5)
echo "Iteration $i"
echo "Generating final vcf file for: $counts_file"
$gatk PostprocessGermlineCNVCalls \
--model-shard-path ${data}/CNVcalls/cohort-model \
--calls-shard-path ${data}/CNVcalls/cohort-calls \
--contig-ploidy-calls ${data}/ploidy-calls \
--sample-index $i \
--sequence-dictionary ${MAIN_DIR}/SUPP_FILES/REFERENCES/${reference}/Homo_sapiens_assembly19.dict \
--output-genotyped-intervals ${results}/${filename}_gen_intv.vcf.gz \
--output-genotyped-segments ${results}/${filename}_gen_seg.vcf.gz \
--output-denoised-copy-ratios ${results}/${filename}_denoised_copy_ratios.tsv
i=$((i+1))
done
Thank you, I really appreciate your help in this!
-
Hi Ramesh
GCNV has 2 separate modes, each serving a different purpose. COHORT mode runs without a prior model and generates the model on the fly as it detects and calls CNVs. If all your samples are run in COHORT mode, you don't have to re-run them in CASE mode. COHORT mode usually requires a high number of samples, preferably 30 or more.
CASE mode is useful when you have a low number of samples and a comparable model that you previously generated from a normal cohort.
Looking at the flow you put together (though we cannot fix any script errors here), I spotted a few tools that are not part of the actual GCNV workflow: CreateReadCountPanelOfNormals, DenoiseReadCounts and PlotDenoisedCopyRatios belong to the somatic CNV analysis workflow, so you can remove them.
You may refer to our chart below
About your question on interval lists: you can prepare one yourself that suits your capture kit and/or reference genome using the PreprocessIntervals tool.
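For reference, the GATK tutorial you linked distinguishes the two sample types at this step: WGS data is binned across the whole reference, while WES data uses the capture-kit targets with padding. A hedged sketch (file names are placeholders; 1000 bp bins and 250 bp padding are the values used in the tutorial):

```shell
# WGS: bin the whole reference into uniform intervals
gatk PreprocessIntervals \
    -R Homo_sapiens_assembly19.fasta \
    --bin-length 1000 \
    --padding 0 \
    -imr OVERLAPPING_ONLY \
    -O wgs.preprocessed.interval_list

# WES: pad the capture-kit targets instead of binning
# (--bin-length 0 disables binning; targets file is a placeholder)
gatk PreprocessIntervals \
    -R Homo_sapiens_assembly19.fasta \
    -L capture_kit_targets.interval_list \
    --bin-length 0 \
    --padding 250 \
    -imr OVERLAPPING_ONLY \
    -O wes.preprocessed.interval_list
```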
Finally, we have a publication on the GCNV workflow applied to UKBB exomes, which you may refer to when filtering exome CNV calls.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10904014/
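As a starting point before any study-specific filtering, one common approach is to threshold on the per-segment quality in the segments VCF produced by PostprocessGermlineCNVCalls (its FORMAT fields include QS, the segment quality). A hedged sketch; the QS cutoff of 50 is purely illustrative, not an official recommendation:

```shell
# Keep only non-reference segments passing an illustrative QS cutoff
bcftools view \
    -i 'GT="alt" && FMT/QS>=50' \
    sample_gen_seg.vcf.gz \
    -Oz -o sample_gen_seg.filtered.vcf.gz
```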
I hope this helps.
-
Thank you so much for replying, and for the flow chart, Gökalp Çelik. I will remove the unnecessary tools/steps. I was testing the pipeline for both scenarios, and hence ran it in both CASE and COHORT mode. Ideally we would be running in COHORT mode.
We are trying to identify CNVs in our WGS data, and sometimes we might run it on WES. Are there any specific differences in the process between the two, other than the interval definition and the command in the PreprocessIntervals step?
Thank you for your time and guidance.
-
Hi again.
Those 2 sample types might require different parameter tuning. Our defaults in GATK 4.5 and above are mostly tuned for whole-exome data, based on the study published in the link below:
https://www.nature.com/articles/s41588-023-01449-0
Whole-genome sequencing samples could be tuned with the parameters indicated in the document below.
I hope this helps.
Regards.
-
Thanks again Gökalp Çelik.
Can you also kindly clarify about the interval list and ploidy priors?
I'm currently using wgs_calling_regions.hg38.interval_list from GATK's resource bundle.
I have excluded all the non-standard and random contigs and kept just chr1-22, X and Y.
My ploidy priors file is:
CONTIG_NAME PLOIDY_PRIOR_0 PLOIDY_PRIOR_1 PLOIDY_PRIOR_2 PLOIDY_PRIOR_3
chr1 0 0.01 0.98 0.01
chr2 0 0.01 0.98 0.01
chr3 0 0.01 0.98 0.01
chr4 0 0.01 0.98 0.01
chr5 0 0.01 0.98 0.01
chr6 0 0.01 0.98 0.01
chr7 0 0.01 0.98 0.01
chr8 0 0.01 0.98 0.01
chr9 0 0.01 0.98 0.01
chr10 0 0.01 0.98 0.01
chr11 0 0.01 0.98 0.01
chr12 0 0.01 0.98 0.01
chr13 0 0.01 0.98 0.01
chr14 0 0.01 0.98 0.01
chr15 0 0.01 0.98 0.01
chr16 0 0.01 0.98 0.01
chr17 0 0.01 0.98 0.01
chr18 0 0.01 0.98 0.01
chr19 0 0.01 0.98 0.01
chr20 0 0.01 0.98 0.01
chr21 0 0.01 0.98 0.01
chr22 0 0.01 0.98 0.01
chrX 0.01 0.49 0.49 0.01
chrY 0.495 0.495 0.01 0
When I run GermlineCNVCaller in cohort mode, I get these warnings:
15:21:38.476 WARNING gcnvkernel.structs.metadata - Sample 2908946 has an anomalous ploidy (3) for contig chr1. The presence of unmasked PAR regions and regions of low mappability in the coverage metadata can result in unreliable ploidy designations. It is recommended that the user verifies this designation by orthogonal methods.
15:21:38.493 WARNING gcnvkernel.structs.metadata - Sample 2908956 has an anomalous ploidy (3) for contig chr1. The presence of unmasked PAR regions and regions of low mappability in the coverage metadata can result in unreliable ploidy designations. It is recommended that the user verifies this designation by orthogonal methods.
15:21:38.494 WARNING gcnvkernel.structs.metadata - Sample 2908956 has an anomalous karyotype ({'X': 3, 'Y': 0}). The presence of unmasked PAR regions and regions of low mappability in the coverage metadata can result in unreliable ploidy designations. It is recommended that the user verifies this designation by orthogonal methods.
15:21:38.512 WARNING gcnvkernel.structs.metadata - Sample 2908966 has an anomalous ploidy (3) for contig chr1. The presence of unmasked PAR regions and regions of low mappability in the coverage metadata can result in unreliable ploidy designations. It is recommended that the user verifies this designation by orthogonal methods.
Now, my BAM and reference files include the non-standard contigs. Is this causing the issue?
What can I do to avoid this? Should I remove the reads on non-standard contigs from my BAM files? What is the recommended way to handle this? I tried to get mappability tracks from UCSC, but I am unsure how, and at which step of the CNV analysis, to use them.
This is my current command:
$gatk GermlineCNVCaller \
-L ${MAIN_DIR}/SUPP_FILES/REFERENCES/${reference}/wgs_calling_regions_hg38_filtered.interval_list \
--interval-merging-rule OVERLAPPING_ONLY \
--run-mode COHORT \
--contig-ploidy-calls ${data}/ploidy-calls \
$COHORT_INPUT \
--annotated-intervals ${MAIN_DIR}/SUPP_FILES/REFERENCES/${reference}/wgs_calling_regions_hg38_annotated_interval_list.tsv \
--output ${data}/CNVcalls \
--output-prefix cohort \
--verbosity DEBUG
Thank you again for patiently guiding me through this.
-
Hi Ramesh
Ploidy priors are empirical values representing the baseline probabilities of aneuploidies in your data. If you wish to keep your samples closer to the diploid state and only accept non-marginal differences as aneuploidies, you may need to increase the prior probability of the diploid state and reduce all others. Regardless, the GermlineCNVCaller step will reassess and report copy number estimations whatever your ploidy state.
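Following this suggestion, an autosomal row weighted more heavily toward the diploid state might look like the line below (same format as the priors file above; the exact probabilities are illustrative, not recommended values):

```
CONTIG_NAME PLOIDY_PRIOR_0 PLOIDY_PRIOR_1 PLOIDY_PRIOR_2 PLOIDY_PRIOR_3
chr1        0              0.001          0.998          0.001
```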
For the intervals, we recommend using intervals that reside in the more mappable parts of the genome, which can be selected by filtering with additional mappability and segmental duplication annotations.
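Concretely, this filtering can be sketched with AnnotateIntervals and FilterIntervals, which accept mappability and segmental-duplication tracks and content thresholds. File names are placeholders (the tracks must be downloaded, e.g. Umap/Bismap and UCSC genomicSuperDups BEDs, and must match the reference), and the thresholds shown are, to my understanding, the tools' defaults:

```shell
# Annotate the bins with mappability and segmental-duplication content
gatk AnnotateIntervals \
    -R Homo_sapiens_assembly38.fasta \
    -L wgs_calling_regions_hg38_filtered.interval_list \
    --mappability-track k100.umap.bed.gz \
    --segmental-duplication-track genomicSuperDups.bed.gz \
    -imr OVERLAPPING_ONLY \
    -O annotated_intervals.tsv

# Drop bins with low mappability or high segmental-duplication content
gatk FilterIntervals \
    -L wgs_calling_regions_hg38_filtered.interval_list \
    --annotated-intervals annotated_intervals.tsv \
    --minimum-mappability 0.9 \
    --maximum-segmental-duplication-content 0.5 \
    -imr OVERLAPPING_ONLY \
    -O mappable_filtered.interval_list
```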
I hope this helps.
-
Thank you for that clarification, Gökalp Çelik. Are there any specific mappability tracks or annotations recommended as best practice?
-
Hi again.
You may want to refer to the Hoffman lab's pages for mappability, and to the UCSC segmental duplications track page for segmental duplication annotations.
https://bismap.hoffmanlab.org/
https://genome.ucsc.edu/cgi-bin/hgTrackUi?g=genomicSuperDups
I hope this helps.
Regards.
-
Thank you so much!