DetermineGermlineContigPloidy with exome germline data
Hello there,
I am currently trying to analyse germline exome data captured with Truseq and Nextera capture kits. I would like to check for any exon deletions in single sample cases.
I have thus far completed steps 1 and 2 of the GATK docs: https://gatk.broadinstitute.org/hc/en-us/articles/360035531152--How-to-Call-common-and-rare-germline-copy-number-variants
My output thus far seems okay:
CONTIG START END GC_CONTENT MAPPABILITY
chr1 685470 686904 0.400000 0.000000
chr1 924181 925203 0.767351 1.000000
chr1 925671 926268 0.678930 1.000000
chr1 929904 930591 0.659884 1.000000
I am now up to step 3 and am unsure which cohort model to use. Do I need a cohort model to run a single sample? I tried using the cohort model provided in the tutorial data at the link above:
cohort-23wgs-20190213-contig-ploidy-model
But am not sure if that is the way to tackle this problem.
Q1. Is my pipeline correct for what I want to achieve?
Q2. Using the tutorial cohort model resulted in an error regarding gcnvkernel:
Traceback (most recent call last):
File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'gcnvkernel'
Any help would be greatly appreciated.
Thanks
-
You have to generate your own model. Single-sample cases can only be analyzed using a model prepared from a cohort with the same sample prep method and sequencing technique.
First gather 30 or more exome samples prepared with the same method, keeping in mind that the coverages should be close to each other.
Run the whole germline CNV workflow in COHORT mode to generate a model, as well as to infer calls for the samples in the cohort.
Then, using the model generated, analyze your upcoming samples prepared with the same kit and method in CASE mode.
When running the workflow, make sure the GATK Python environment is set up and active, or use the Docker image provided by the team; see the sketch below.
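For example, something along these lines (the image tag and mount path are placeholders; adjust to your setup):

# Option 1: activate the GATK conda environment (assuming the default env name)
source activate gatk

# Option 2: run inside the official Docker image, which ships with the environment preconfigured
docker run -v /path/to/data:/gatk/my_data -it broadinstitute/gatk:4.2.2.0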
That is all.
-
Thanks for providing your insight, SkyWarrior!
-
Thank you, SkyWarrior - much appreciated.
-
Hi there,
Another quick question. In the following command:
gatk PreprocessIntervals \
    -R ~/ref/Homo_sapiens_assembly38.fasta \
    -L targets.interval_list \
    --bin-length 0 \
    -imr OVERLAPPING_ONLY \
    -O targets.preprocessed.interval_list
Does -L targets.interval_list pertain to the capture kit used (i.e. Nextera, TruSeq), and would it require a sorted BED file of exon regions for that capture kit, e.g.:
GRCh38_ensembl_99_exon_regions_UTR_CDS_SS_merged.sorted.bed
Kind regards
-
Use the enrichment kit's original target file. Otherwise you will get lots of false positives.
-
Thank you SkyWarrior.
To make sure I get this correct, the pipeline for cohort mode would go like this:
Illumina only provide the hg19 enrichment kit .bed files, so I lifted these over to hg38 (which my BAM samples were aligned to) with CrossMap, as sketched below.
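For reference, the liftover was along these lines (a sketch; it assumes the UCSC hg19-to-hg38 chain file and my input file name):

# lift the hg19 capture targets over to hg38
CrossMap.py bed hg19ToHg38.over.chain.gz \
    nextera_exome_target_capture.hg19.bed \
    nextera_exome_target_capture_liftedover_by_crossmap.h19tohg38.bed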
1.
gatk PreprocessIntervals \
    -R ~/ref/Homo_sapiens_assembly38.fasta \
    -L nextera_exome_target_capture_liftedover_by_crossmap.h19tohg38.bed \
    --bin-length 0 \
    --padding 250 \
    -imr OVERLAPPING_ONLY \
    -O nextera_targets.preprocessed.interval_list

2. I would then do the following for each of the 30 BAM sample files I have?
gatk CollectReadCounts \
    -L nextera_targets.preprocessed.interval_list \
    -R ref/Homo_sapiens_assembly38.fasta \
    -imr OVERLAPPING_ONLY \
    -I Sample1.bam \
    --format TSV \
    -O sample1.tsv

3.
gatk AnnotateIntervals \
    -L nextera_targets.preprocessed.interval_list \
    -R ref/Homo_sapiens_assembly38.fasta \
    --mappability-track ref/k36.umap.bed \
    -imr OVERLAPPING_ONLY \
    -O nextera_targets.annotated.tsv

(k36.umap.bed: I wasn't sure which k-mer single-read mappability track to use, so I will try k36.)
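(For anyone following along: the Umap single-read tracks I found are distributed in bigBed/bigWig form, so I converted to BED roughly like this; file names are just what I used.)

# convert the Umap k36 single-read track from bigBed to BED (UCSC bigBedToBed)
bigBedToBed k36.umap.bb ref/k36.umap.bed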
Below with step 4, I am assuming the defaults will be applied as per the GATK 'How to'.
4.
gatk FilterIntervals \
    -L nextera_targets.preprocessed.interval_list \
    --annotated-intervals nextera_targets.annotated.tsv \
    -I sample1.tsv -I sample2.tsv -I sample3.tsv ... -I sample30.tsv \
    -imr OVERLAPPING_ONLY \
    -O nextera_targets.cohort.gc.filtered.interval_list

5.
gatk DetermineGermlineContigPloidy \
    -L nextera_targets.cohort.gc.filtered.interval_list \
    --interval-merging-rule OVERLAPPING_ONLY \
    -I sample1.tsv -I sample2.tsv -I sample3.tsv ... -I sample30.tsv \
    --contig-ploidy-priors 30sample_cohort_contig_ploidy_priors.tsv \
    --output . \
    --output-prefix ploidy \
    --verbosity DEBUG

6.
gatk DetermineGermlineContigPloidy \
    --model 30samples_cohort-contig-ploidy-model \
    -I NewSample.tsv \
    -O . \
    --output-prefix ploidy-case \
    --verbosity DEBUG

Would this be correct?
Thank you.
Edit: So after step 5, go straight to the below?

gatk GermlineCNVCaller \
    --run-mode COHORT \
    -L nextera_targets.cohort.gc.filtered.interval_list \
    -I sample1.tsv -I sample2.tsv -I sample3.tsv ... -I sample30.tsv \
    --contig-ploidy-calls ploidy-calls \
    --annotated-intervals nextera_targets.annotated.tsv \
    --interval-merging-rule OVERLAPPING_ONLY \
    --output cohort30 \
    --output-prefix cohort30 \
    --verbosity DEBUG
-
Once you run DetermineGermlineContigPloidy you need to move on to GermlineCNVCaller to call CNVs and to generate a model for calling CNVs in future case samples.
You need to keep both the contig-ploidy model and the CNV-call model for analyzing future samples case by case.
Cohort mode will also call CNVs for this set of samples, so you don't need to recall them in a separate case mode. A sketch of the future case-mode commands follows below.
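For a future single sample, the case-mode pair would look roughly like this (names are placeholders following the directories your cohort commands above would produce):

# ploidy for the new sample, against the cohort ploidy model
gatk DetermineGermlineContigPloidy \
    --model 30samples_cohort-contig-ploidy-model \
    -I NewSample.tsv \
    -O . \
    --output-prefix ploidy-case

# CNV calls for the new sample, against the cohort CNV model
gatk GermlineCNVCaller \
    --run-mode CASE \
    -I NewSample.tsv \
    --contig-ploidy-calls ploidy-case-calls \
    --model cohort30/cohort30-model \
    --output case30 \
    --output-prefix NewSample-vs-cohort30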
-
Thank you SkyWarrior,
I added an edit to the previous post to make sure I have understood what you've written.
-
Yes, skip number 6 for cohort mode.
Case mode will be a separate issue after this cohort.
-
Thank you SkyWarrior. I'm up to step 5 (DetermineGermlineContigPloidy), and the command requires a contig_ploidy_priors.tsv file. Is this a blank file you create for the script, or something generated earlier? I don't see it generated in the previous steps.
--contig-ploidy-priors chr20XY_contig_ploidy_priors.tsv \
-
That is something you need to determine empirically, or by trial and error, until you get the correct ploidy profile for all samples in your cohort. To begin with, you may use the example provided in the tutorial.
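If it helps, a rough way to eyeball the expected ploidies is to compare per-contig coverage (only approximate for exome data, since target density varies by contig):

# mapped reads per bp for each contig; an X/autosome ratio near 1 suggests XX, near 0.5 suggests XY
samtools idxstats Sample1.bam | awk '$2 > 0 {print $1, $3/$2}'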
-
Thanks SkyWarrior. I have seen the tutorial file but seem to be in the same situation as this person (https://www.biostars.org/p/450343/). I haven't done CNV calling before and don't understand how this file can be generated, empirically or otherwise.
I don't mind putting numbers in a table under the following headers, but I'm not sure which numbers to put in, or where they should come from.
CONTIG_NAME PLOIDY_PRIOR_0 PLOIDY_PRIOR_1 PLOIDY_PRIOR_2 PLOIDY_PRIOR_3
chr20 0 0.01 0.98 0.01
chrX 0.01 0.49 0.49 0.01
chrY 0.495 0.495 0.01 0

I created a test ploidy file (just to give it a whirl) with the following; not sure if this would make sense:
CONTIG_NAME PLOIDY_PRIOR_0 PLOIDY_PRIOR_1 PLOIDY_PRIOR_2 PLOIDY_PRIOR_3
chr1 0.01 0.01 0.97 0.01
chr2 0.01 0.01 0.97 0.01
chr3 0.01 0.01 0.97 0.01
chr4 0.01 0.01 0.97 0.01
chr5 0.01 0.01 0.97 0.01
chr6 0.01 0.01 0.97 0.01
chr7 0.01 0.01 0.97 0.01
chr8 0.01 0.01 0.97 0.01
...etc.
But I receive this error. Am I missing something very obvious? Any help from the GATK community would be greatly appreciated.
14:56:22.498 DEBUG ScriptExecutor - --contig_ploidy_prior_table=/gatk/my_data/cohort30_test_contig_ploidy_priors.tsv
14:56:22.498 DEBUG ScriptExecutor - --output_model_path=/gatk/my_data/ploidy-model
15:01:31.102 INFO cohort_determine_ploidy_and_depth - THEANO_FLAGS environment variable has been set to: device=cpu,floatX=float64,optimizer=fast_run,compute_test_value=ignore,openmp=true,blas.ldflags=-lmkl_rt,openmp_elemwise_minsize=10
15:01:32.140 INFO gcnvkernel.structs.metadata - Generating intervals metadata...
Traceback (most recent call last):
File "/tmp/cohort_determine_ploidy_and_depth.8287152907514644081.py", line 112, in <module>
ploidy_config = gcnvkernel.PloidyModelConfig.from_args_dict(args_dict)
File "/opt/miniconda/envs/gatk/lib/python3.6/site-packages/gcnvkernel/models/model_ploidy.py", line 127, in from_args_dict
return PloidyModelConfig(**relevant_kwargs)
File "/opt/miniconda/envs/gatk/lib/python3.6/site-packages/gcnvkernel/models/model_ploidy.py", line 46, in __init__
contig_ploidy_prior_map)
File "/opt/miniconda/envs/gatk/lib/python3.6/site-packages/gcnvkernel/models/model_ploidy.py", line 61, in _get_validated_contig_ploidy_prior_map
padded_validated_prior = commons.get_normalized_prob_vector(padded_validated_prior, config.prob_sum_tol)
File "/opt/miniconda/envs/gatk/lib/python3.6/site-packages/gcnvkernel/models/commons.py", line 29, in get_normalized_prob_vector
assert all(prob_vector >= 0), "Probabilities must be non-negative"
AssertionError: Probabilities must be non-negative
15:01:32.862 DEBUG ScriptExecutor - Result: 1
15:01:32.864 INFO DetermineGermlineContigPloidy - Shutting down engine
[August 26, 2021 3:01:32 PM GMT] org.broadinstitute.hellbender.tools.copynumber.DetermineGermlineContigPloidy done. Elapsed time: 5.99 minutes.
Runtime.totalMemory()=254279680
org.broadinstitute.hellbender.utils.python.PythonScriptExecutorException:
python exited with 1
Command Line: python /tmp/cohort_determine_ploidy_and_depth.8287152907514644081.py --sample_coverage_metadata=/tmp/samples-by-coverage-per-contig7725728372826703434.tsv --output_calls_path=/gatk/my_data/ploidy-calls --mapping_error_rate=1.000000e-02 --psi_s_scale=1.000000e-04 --mean_bias_sd=1.000000e-02 --psi_j_scale=1.000000e-03 --learning_rate=5.000000e-02 --adamax_beta1=9.000000e-01 --adamax_beta2=9.990000e-01 --log_emission_samples_per_round=2000 --log_emission_sampling_rounds=100 --log_emission_sampling_median_rel_error=5.000000e-04 --max_advi_iter_first_epoch=1000 --max_advi_iter_subsequent_epochs=1000 --min_training_epochs=20 --max_training_epochs=100 --initial_temperature=2.000000e+00 --num_thermal_advi_iters=5000 --convergence_snr_averaging_window=5000 --convergence_snr_trigger_threshold=1.000000e-01 --convergence_snr_countdown_window=10 --max_calling_iters=1 --caller_update_convergence_threshold=1.000000e-03 --caller_internal_admixing_rate=7.500000e-01 --caller_external_admixing_rate=7.500000e-01 --disable_caller=false --disable_sampler=false --disable_annealing=false --interval_list=/tmp/intervals7908972937829736917.tsv --contig_ploidy_prior_table=/gatk/my_data/cohort30_test_contig_ploidy_priors.tsv --output_model_path=/gatk/my_data/ploidy-model
at org.broadinstitute.hellbender.utils.python.PythonExecutorBase.getScriptException(PythonExecutorBase.java:75)
at org.broadinstitute.hellbender.utils.runtime.ScriptExecutor.executeCuratedArgs(ScriptExecutor.java:130)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeArgs(PythonScriptExecutor.java:170)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeScript(PythonScriptExecutor.java:151)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeScript(PythonScriptExecutor.java:121)
at org.broadinstitute.hellbender.tools.copynumber.DetermineGermlineContigPloidy.executeDeterminePloidyAndDepthPythonScript(DetermineGermlineContigPloidy.java:420)
at org.broadinstitute.hellbender.tools.copynumber.DetermineGermlineContigPloidy.doWork(DetermineGermlineContigPloidy.java:317)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
-
It is a simple TSV file where you give prior probabilities for each ploidy on each contig: PLOIDY_PRIOR_2 is diploid, 0 is nullisomy, 1 is monoploid, and so on.
You need to generate an empirical table for your contigs. Then you can modify it based on the results you get at the end of the DetermineGermlineContigPloidy step (step 5). If any of the samples does not fit expectations, modify the prior probabilities, or check whether you used a wrong reference genome (one without properly masked repeats, etc.). This step determines the proper ploidy designation for each chromosome of each sample, so you can detect major chromosomal aneuploidies even before going into the CNV calling step.
-
Here is a sample one, prepared for hs37d5 as used by Broad and 1KG. All fields are separated by tabs, not spaces. Make sure that your file is also tab-separated; some text editors insert spaces when a tab is typed.
CONTIG_NAME PLOIDY_PRIOR_0 PLOIDY_PRIOR_1 PLOIDY_PRIOR_2 PLOIDY_PRIOR_3
1 0.01 0.01 0.97 0.01
2 0.01 0.01 0.97 0.01
3 0.01 0.01 0.97 0.01
4 0.01 0.01 0.97 0.01
5 0.01 0.01 0.97 0.01
6 0.01 0.01 0.97 0.01
7 0.01 0.01 0.97 0.01
8 0.01 0.01 0.97 0.01
9 0.01 0.01 0.97 0.01
10 0.01 0.01 0.97 0.01
11 0.01 0.01 0.97 0.01
12 0.01 0.01 0.97 0.01
13 0.01 0.01 0.97 0.01
14 0.01 0.01 0.97 0.01
15 0.01 0.01 0.97 0.01
16 0.01 0.01 0.97 0.01
17 0.01 0.01 0.97 0.01
18 0.01 0.01 0.97 0.01
19 0.01 0.01 0.97 0.01
20 0.01 0.01 0.97 0.01
21 0.01 0.01 0.97 0.01
22 0.01 0.01 0.97 0.01
X 0.01 0.49 0.49 0.01
Y 0.49 0.49 0.1 0.1
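If in doubt, a quick check that every row really is tab-delimited (a sketch; it expects exactly 5 tab-separated fields per row):

awk -F'\t' 'NF != 5 {print "line " NR " is not 5 tab-separated fields"}' contig_ploidy_priors.tsv
-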
Indeed, it was a lone space instead of a tab that caused the error. Thank you SkyWarrior.
-
Thank you for all of your help SkyWarrior!
-
Thank you SkyWarrior
The calling (step 6) was nearly complete until I received this memory error. I increased the shared memory to 280G with
docker run --shm-size=280G -v /Users/adminskose/Desktop/COHORT:/gatk/my_data -it broadinstitute/gatk:4.1.9.0
I also tried increasing everything I could in the Docker resources section. Is there a particular memory setting required for this process that you know of?
Many thanks
Error:
20:48:51.377 DEBUG ScriptExecutor - Result: 137
20:48:51.868 INFO GermlineCNVCaller - Shutting down engine
[August 27, 2021 8:48:51 PM GMT] org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller done. Elapsed time: 422.51 minutes.
Runtime.totalMemory()=1311244288
org.broadinstitute.hellbender.utils.python.PythonScriptExecutorException:
python exited with 137
Command Line: python /tmp/cohort_denoising_calling.5738132176042270420.py --ploidy_calls_path=/gatk/my_data/ploidy-calls --output_calls_path=/gatk/my_data/cohort30/called-calls --output_tracking_path=/gatk/my_data/cohort30/called-tracking --modeling_interval_list=/tmp/intervals7421483875993324401.tsv --output_model_path=/gatk/my_data/cohort30/called-model --enable_explicit_gc_bias_modeling=True --read_count_tsv_files /tmp/Aunt_00227275247155116590099.tsv /tmp/Father_07642202045537147100435.tsv /tmp/Mother_03242472401014775770475.tsv /tmp/Mother_21444404844645603723824.tsv /tmp/Mother_52407598246531532512606.tsv /tmp/Cousin_06171987603320518605800.tsv /tmp/Father_08191032913929519815218.tsv /tmp/Mother_06307855512767211927590.tsv /tmp/Mother_21546747478566758013771.tsv /tmp/Proband1_03495136839957016008636.tsv /tmp/Father_00023523627836568411360.tsv /tmp/Father_15161280105794603916062.tsv /tmp/Mother_07643028147355291748090.tsv /tmp/Mother_22395758833663134263753.tsv /tmp/Proband_00685066893391276255137.tsv /tmp/Father_19937842005244523613977.tsv /tmp/Mother_08198213079318454091213.tsv /tmp/Mother_22504682910223960392545.tsv /tmp/Proband_02472889380613820790475.tsv /tmp/Father_06305771446938808048257.tsv /tmp/Mother_0002411156890795892901.tsv /tmp/Mother_15163026192816970364793.tsv /tmp/Mother_22828895388085527777662.tsv /tmp/Proband_03248660237427068947680.tsv /tmp/Father_06325780077682864208095.tsv /tmp/Mother_00681007256202893352551.tsv /tmp/Mother_19931275797041203466900.tsv /tmp/Mother_3701180657078747230590.tsv /tmp/Proband_08192886203270378102281.tsv /tmp/Proband_10606081273717948166214.tsv --psi_s_scale=1.000000e-04 --mapping_error_rate=1.000000e-02 --depth_correction_tau=1.000000e+04 --q_c_expectation_mode=hybrid --max_bias_factors=5 --psi_t_scale=1.000000e-03 --log_mean_bias_std=1.000000e-01 --init_ard_rel_unexplained_variance=1.000000e-01 --num_gc_bins=20 --gc_curve_sd=1.000000e+00 --active_class_padding_hybrid_mode=50000 --enable_bias_factors=True --disable_bias_factors_in_active_class=False --p_alt=1.000000e-06 --cnv_coherence_length=1.000000e+04 --max_copy_number=5 --p_active=0.010000 --class_coherence_length=10000.000000 --learning_rate=1.000000e-02 --adamax_beta1=9.000000e-01 --adamax_beta2=9.900000e-01 --log_emission_samples_per_round=50 --log_emission_sampling_rounds=10 --log_emission_sampling_median_rel_error=5.000000e-03 --max_advi_iter_first_epoch=5000 --max_advi_iter_subsequent_epochs=200 --min_training_epochs=10 --max_training_epochs=50 --initial_temperature=1.500000e+00 --num_thermal_advi_iters=2500 --convergence_snr_averaging_window=500 --convergence_snr_trigger_threshold=1.000000e-01 --convergence_snr_countdown_window=10 --max_calling_iters=10 --caller_update_convergence_threshold=1.000000e-03 --caller_internal_admixing_rate=7.500000e-01 --caller_external_admixing_rate=1.000000e+00 --disable_caller=false --disable_sampler=false --disable_annealing=false
The exit code indicates that the process was terminated. This may mean the process requires additional memory.
-
How many samples do you have?
What is the number of segments/intervals for each sample?
What CPU do you have on the server?
-
I have 30 samples. I'm not entirely sure about the number of intervals per sample, but the gcnvkernel.tasks.inference_task_base output was the following:
20:46:24.764 INFO gcnvkernel.tasks.inference_task_base - (denoising (warm-up) epoch 1) ELBO: -4.646 +/- 0.000, SNR: 50.0, T: 1.00: 100%|#########9| 4997/5000 [6:55:40<00:07, 2.55s/it]
20:46:27.306 INFO gcnvkernel.tasks.inference_task_base - (denoising (warm-up) epoch 1) ELBO: -4.646 +/- 0.000, SNR: 50.0, T: 1.00: 100%|#########9| 4998/5000 [6:55:42<00:05, 2.55s/it]
20:46:29.838 INFO gcnvkernel.tasks.inference_task_base - (denoising (warm-up) epoch 1) ELBO: -4.646 +/- 0.000, SNR: 49.9, T: 1.00: 100%|#########9| 4999/5000 [6:55:45<00:02, 2.54s/it]
20:46:32.448 INFO gcnvkernel.tasks.inference_task_base - (denoising (warm-up) epoch 1) ELBO: -4.646 +/- 0.000, SNR: 49.7, T: 1.00: 100%|##########| 5000/5000 [6:55:47<00:00, 2.56s/it]
20:46:32.456 INFO gcnvkernel.tasks.inference_task_base - (denoising (warm-up) epoch 1) ELBO: -4.646 +/- 0.000, SNR: 49.7, T: 1.00: 100%|##########| 5000/5000 [6:55:47<00:00, 4.99s/it]
20:46:32.459 WARNING gcnvkernel.tasks.inference_task_base - Inference task completed successfully but convergence not achieved.
20:46:32.463 INFO gcnvkernel.tasks.task_cohort_denoising_calling - Instantiating the denoising model (main)...
20:48:51.377 DEBUG ScriptExecutor - Result: 137
20:48:51.868 INFO GermlineCNVCaller - Shutting down engine
[August 27, 2021 8:48:51 PM GMT] org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller done. Elapsed time: 422.51 minutes.
Runtime.totalMemory()=1311244288

As for servers, I have a personal machine with a 2 GHz quad-core Intel Core i5 and 16 GB memory, but I can access an HPC. Is it possible to do this on the local specs?
Thank you
-
I would recommend no less than 32 GB for 200k intervals with 30 samples. I would also recommend 4.2.2.1; versions 4.1.8.0 to 4.2.0.0 may have issues with Skylake CPUs, Xeon or not.
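Exit code 137 means the process was killed, almost always for running out of memory. The gCNV model runs in Python rather than in the JVM, so raising the Java heap alone will not help; the container itself needs more memory. Something along these lines (the limit and paths are placeholders, and on Docker Desktop the VM memory under Preferences > Resources caps this too):

docker run --memory=32g -v /path/to/COHORT:/gatk/my_data -it broadinstitute/gatk:4.2.2.1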
-
Thank you SkyWarrior.
I think I'm nearly through. Is there a straightforward way to visualise the results?
I found this article: https://gatk.broadinstitute.org/hc/en-us/articles/360035531452-After-gCNV-calling-considerations
Is IGV the only way to have a quick look at whether there is a significant del/dup? I am looking for a 6.6 Mb deletion, so it should be distinct enough to be detected/seen. Perhaps there's a well-known R package?
Thanks
-
Not really. Use IGV and compare against the allelic imbalances in the short-variants VCF and you should be set. Make sure that the changes you track are not repeated in unrelated samples; otherwise it is just a false positive or a common copy-number variant, which usually means nothing.
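If you want a quick tabular look before IGV, something like this can work (a sketch; it assumes the post-processed segments VCF carries a per-sample CN genotype field, and POS is 1-based so the output is only BED-like):

# list segments whose called copy number differs from 2
bcftools query -f '%CHROM\t%POS\t%INFO/END\t[%CN]\n' segments-output.vcf \
    | awk '$4 != 2' > non-diploid-segments.txt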
-
Thank you SkyWarrior.
By comparing the allelic imbalances of the short-variants VCF, do you mean the segments or intervals VCF produced by post-processing?
The image below shows the segmented-output.vcf of three individuals; the first two are related, the bottom one is not and should have a large deletion at the end of chr22. From what I see in the image, this is the case. Am I interpreting this correctly?
Pardon my ignorance, but I am also not 100% sure what the top grey and bottom grey blocks/segments mean for each individual (each allele?).
Thanks in advance.
-
I am confused by the contig ploidy priors table. Please correct me if my understanding is wrong. Let's say I am working on a sample with XYY syndrome. In that case, should my ploidy table have 0.97 in PLOIDY_PRIOR_2 for the chrY row?