The optimization step for ELBO update returned a NaN
GATK version used: 4.1.4.1
I am trying to build a pipeline to call CNVs in around 3000 genomes, and I keep getting this error when I run GermlineCNVCaller. I have tried many of the fixes suggested in similar posts about NaNs at this step. I am using 4.1.4.1, which introduced epsilons in the code. I have also tried another specific fix from https://github.com/broadinstitute/gatk/pull/6613, but it did not solve the problem.
Neither of these suggested fixes seems to apply to my issue, and I'm not sure how to proceed.
Commands used:
# Commands run prior to the error:
gatk AnnotateIntervals \
-L ${binnedIntervals10k} \
-R ${refGen} \
-imr OVERLAPPING_ONLY \
-O ${annotatedIntervalData}
gatk FilterIntervals \
-L ${binnedIntervals10k} \
--annotated-intervals ${annotatedIntervalData} \
-imr OVERLAPPING_ONLY \
-O ${filteredIntervalData} \
--arguments_file ${tsvCommands}
gatk DetermineGermlineContigPloidy \
-L ${filteredIntervalData} \
--interval-merging-rule OVERLAPPING_ONLY \
--output . \
--output-prefix ${ploidyPrefix} \
--verbosity DEBUG \
--contig-ploidy-priors ${contigPloidyPriors} \
--arguments_file ${tsvCommands}
# This is the command where the error is thrown:
gatk GermlineCNVCaller --run-mode COHORT -L ${binnedIntervals10k} \
--contig-ploidy-calls ${ploidyPrefix}-calls \
--annotated-intervals ${annotatedIntervalData} \
--interval-merging-rule OVERLAPPING_ONLY \
--output ${cohortOutputName} \
--output-prefix ${cohortOutputPrefix} \
--verbosity DEBUG \
--arguments_file ${tsvCommands}
The error happens at this point in the output:
...
22:09:06.536 INFO gcnvkernel.tasks.inference_task_base - (denoising (warm-up) epoch 1) ELBO: -4.098 +/- 0.012, SNR: 12.8, T: 1.00: 55%|#####4 | 2733/5000 [1:38:36<1:21:47, 2.16s/it]
22:09:08.649 INFO gcnvkernel.tasks.inference_task_base - (denoising (warm-up) epoch 1) ELBO: -4.097 +/- 0.012, SNR: 12.8, T: 1.00: 55%|#####4 | 2734/5000 [1:38:38<1:21:45, 2.16s/it]
22:09:10.762 INFO gcnvkernel.tasks.inference_task_base - (denoising (warm-up) epoch 1) ELBO: -4.097 +/- 0.012, SNR: 12.8, T: 1.00: 55%|#####4 | 2735/5000 [1:38:40<1:21:43, 2.16s/it]
22:09:13.109 INFO gcnvkernel.tasks.inference_task_base - (denoising (warm-up) epoch 1) ELBO: -4.096 +/- 0.012, SNR: 12.8, T: 1.00: 55%|#####4 | 2736/5000 [1:38:42<1:21:41, 2.16s/it]
22:09:15.239 INFO gcnvkernel.tasks.inference_task_base - (denoising (warm-up) epoch 1) ELBO: -4.096 +/- 0.012, SNR: 12.8, T: 1.00: 55%|#####4 | 2737/5000 [1:38:44<1:21:38, 2.16s/it]
22:09:17.360 INFO gcnvkernel.tasks.inference_task_base - (denoising (warm-up) epoch 1) ELBO: -4.095 +/- 0.012, SNR: 12.8, T: 1.00: 55%|#####4 | 2738/5000 [1:38:47<1:21:36, 2.16s/it]
22:09:19.485 INFO gcnvkernel.tasks.inference_task_base - (denoising (warm-up) epoch 1) ELBO: -4.094 +/- 0.016, SNR: 9.4, T: 1.00: 55%|#####4 | 2739/5000 [1:38:49<1:21:34, 2.16s/it]
This was the error thrown:
20:29:27.263 DEBUG ScriptExecutor - Executing:
20:29:27.263 DEBUG ScriptExecutor - python
20:29:27.263 DEBUG ScriptExecutor - /tmp/cohort_denoising_calling.2950272073405165184.py
20:29:27.263 DEBUG ScriptExecutor - --ploidy_calls_path=/home/lees50/gatkBgAnalysis/ploidy-calls
20:29:27.263 DEBUG ScriptExecutor - --output_calls_path=/home/lees50/gatkBgAnalysis/cnvCohort/cnvCohort_1of1-calls
20:29:27.263 DEBUG ScriptExecutor - --output_tracking_path=/home/lees50/gatkBgAnalysis/cnvCohort/cnvCohort_1of1-tracking
20:29:27.263 DEBUG ScriptExecutor - --modeling_interval_list=/tmp/intervals2154483483635100359.tsv
20:29:27.263 DEBUG ScriptExecutor - --output_model_path=/home/lees50/gatkBgAnalysis/cnvCohort/cnvCohort_1of1-model
20:29:27.263 DEBUG ScriptExecutor - --enable_explicit_gc_bias_modeling=True
20:29:27.263 DEBUG ScriptExecutor - --read_count_tsv_files
20:29:27.263 DEBUG ScriptExecutor - /tmp/sample-[2917 files]
...
20:29:27.275 DEBUG ScriptExecutor - --psi_s_scale=1.000000e-04
20:29:27.275 DEBUG ScriptExecutor - --mapping_error_rate=1.000000e-02
20:29:27.275 DEBUG ScriptExecutor - --depth_correction_tau=1.000000e+04
20:29:27.275 DEBUG ScriptExecutor - --q_c_expectation_mode=hybrid
20:29:27.275 DEBUG ScriptExecutor - --max_bias_factors=5
20:29:27.275 DEBUG ScriptExecutor - --psi_t_scale=1.000000e-03
20:29:27.275 DEBUG ScriptExecutor - --log_mean_bias_std=1.000000e-01
20:29:27.275 DEBUG ScriptExecutor - --init_ard_rel_unexplained_variance=1.000000e-01
20:29:27.275 DEBUG ScriptExecutor - --num_gc_bins=20
20:29:27.275 DEBUG ScriptExecutor - --gc_curve_sd=1.000000e+00
20:29:27.275 DEBUG ScriptExecutor - --active_class_padding_hybrid_mode=50000
20:29:27.275 DEBUG ScriptExecutor - --enable_bias_factors=True
20:29:27.275 DEBUG ScriptExecutor - --disable_bias_factors_in_active_class=False
20:29:27.275 DEBUG ScriptExecutor - --p_alt=1.000000e-06
20:29:27.275 DEBUG ScriptExecutor - --cnv_coherence_length=1.000000e+04
20:29:27.275 DEBUG ScriptExecutor - --max_copy_number=5
20:29:27.275 DEBUG ScriptExecutor - --p_active=0.010000
20:29:27.275 DEBUG ScriptExecutor - --class_coherence_length=10000.000000
20:29:27.275 DEBUG ScriptExecutor - --learning_rate=1.000000e-02
20:29:27.275 DEBUG ScriptExecutor - --adamax_beta1=9.000000e-01
20:29:27.275 DEBUG ScriptExecutor - --adamax_beta2=9.900000e-01
20:29:27.275 DEBUG ScriptExecutor - --log_emission_samples_per_round=50
20:29:27.275 DEBUG ScriptExecutor - --log_emission_sampling_rounds=10
20:29:27.275 DEBUG ScriptExecutor - --log_emission_sampling_median_rel_error=5.000000e-03
20:29:27.275 DEBUG ScriptExecutor - --max_advi_iter_first_epoch=5000
20:29:27.275 DEBUG ScriptExecutor - --max_advi_iter_subsequent_epochs=200
20:29:27.275 DEBUG ScriptExecutor - --min_training_epochs=10
20:29:27.275 DEBUG ScriptExecutor - --max_training_epochs=50
20:29:27.275 DEBUG ScriptExecutor - --initial_temperature=1.500000e+00
20:29:27.275 DEBUG ScriptExecutor - --num_thermal_advi_iters=2500
20:29:27.275 DEBUG ScriptExecutor - --convergence_snr_averaging_window=500
20:29:27.275 DEBUG ScriptExecutor - --convergence_snr_trigger_threshold=1.000000e-01
20:29:27.275 DEBUG ScriptExecutor - --convergence_snr_countdown_window=10
20:29:27.275 DEBUG ScriptExecutor - --max_calling_iters=10
20:29:27.275 DEBUG ScriptExecutor - --caller_update_convergence_threshold=1.000000e-03
20:29:27.275 DEBUG ScriptExecutor - --caller_internal_admixing_rate=7.500000e-01
20:29:27.275 DEBUG ScriptExecutor - --caller_external_admixing_rate=1.000000e+00
20:29:27.275 DEBUG ScriptExecutor - --disable_caller=false
20:29:27.275 DEBUG ScriptExecutor - --disable_sampler=false
20:29:27.275 DEBUG ScriptExecutor - --disable_annealing=false
Traceback (most recent call last):
File "/tmp/cohort_denoising_calling.2950272073405165184.py", line 168, in <module>
warm_up_task.engage()
File "/home/lees50/.conda/envs/gatk/lib/python3.6/site-packages/gcnvkernel/tasks/inference_task_base.py", line 339, in engage
converged_continuous = self._update_continuous_posteriors()
File "/home/lees50/.conda/envs/gatk/lib/python3.6/site-packages/gcnvkernel/tasks/inference_task_base.py", line 395, in _update_continuous_posteriors
assert not np.isnan(loss), "The optimization step for ELBO update returned a NaN"
AssertionError: The optimization step for ELBO update returned a NaN
22:09:22.380 DEBUG ScriptExecutor - Result: 1
22:09:22.382 INFO GermlineCNVCaller - Shutting down engine
[January 29, 2021 at 10:09:22 PM AEST] org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller done. Elapsed time: 100.29 minutes.
Runtime.totalMemory()=966787072
org.broadinstitute.hellbender.utils.python.PythonScriptExecutorException:
python exited with 1
Command Line: python /tmp/cohort_denoising_calling.2950272073405165184.py --ploidy_calls_path=/home/lees50/gatkBgAnalysis/ploidy-calls --output_calls_path=/home/lees50/gatkBgAnalysis/cnvCohort/cnvCohort_1of1-calls --output_tracking_path=/home/lees50/gatkBgAnalysis/cnvCohort/cnvCohort_1of1-tracking --modeling_interval_list=/tmp/intervals2154483483635100359.tsv --output_model_path=/home/lees50/gatkBgAnalysis/cnvCohort/cnvCohort_1of1-model --enable_explicit_gc_bias_modeling=True --read_count_tsv_files /tmp/sample-[numbers].tsv --psi_s_scale=1.000000e-04 --mapping_error_rate=1.000000e-02 --depth_correction_tau=1.000000e+04 --q_c_expectation_mode=hybrid --max_bias_factors=5 --psi_t_scale=1.000000e-03 --log_mean_bias_std=1.000000e-01 --init_ard_rel_unexplained_variance=1.000000e-01 --num_gc_bins=20 --gc_curve_sd=1.000000e+00 --active_class_padding_hybrid_mode=50000 --enable_bias_factors=True --disable_bias_factors_in_active_class=False --p_alt=1.000000e-06 --cnv_coherence_length=1.000000e+04 --max_copy_number=5 --p_active=0.010000 --class_coherence_length=10000.000000 --learning_rate=1.000000e-02 --adamax_beta1=9.000000e-01 --adamax_beta2=9.900000e-01 --log_emission_samples_per_round=50 --log_emission_sampling_rounds=10 --log_emission_sampling_median_rel_error=5.000000e-03 --max_advi_iter_first_epoch=5000 --max_advi_iter_subsequent_epochs=200 --min_training_epochs=10 --max_training_epochs=50 --initial_temperature=1.500000e+00 --num_thermal_advi_iters=2500 --convergence_snr_averaging_window=500 --convergence_snr_trigger_threshold=1.000000e-01 --convergence_snr_countdown_window=10 --max_calling_iters=10 --caller_update_convergence_threshold=1.000000e-03 --caller_internal_admixing_rate=7.500000e-01 --caller_external_admixing_rate=1.000000e+00 --disable_caller=false --disable_sampler=false --disable_annealing=false
at org.broadinstitute.hellbender.utils.python.PythonExecutorBase.getScriptException(PythonExecutorBase.java:75)
at org.broadinstitute.hellbender.utils.runtime.ScriptExecutor.executeCuratedArgs(ScriptExecutor.java:126)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeArgs(PythonScriptExecutor.java:170)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeScript(PythonScriptExecutor.java:151)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeScript(PythonScriptExecutor.java:121)
at org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller.executeGermlineCNVCallerPythonScript(GermlineCNVCaller.java:449)
at org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller.doWork(GermlineCNVCaller.java:300)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206)
at org.broadinstitute.hellbender.Main.main(Main.java:292)
Using GATK jar /home/lees50/0_Tools/gatk2/4.1.4.1-gcccore-8.3.0-java-11/gatk-package-4.1.4.1-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/lees50/0_Tools/gatk2/4.1.4.1-gcccore-8.3.0-java-11/gatk-package-4.1.4.1-local.jar GermlineCNVCaller --run-mode COHORT -L binnedIntervals10k.interval_list --contig-ploidy-calls ploidy-calls --annotated-intervals intervalDataAnnotated.tsv --interval-merging-rule OVERLAPPING_ONLY --output cnvCohort --output-prefix cnvCohort_1of1 --verbosity DEBUG --arguments_file /home/lees50/gatkBgAnalysis/alignments/tsvCommands
-
3000 genomes seems like a lot. How much memory do you have available for this analysis?
Also, have you tried the Docker version of GATK, especially 4.1.7.0 (the latest one compatible with Skylake and newer processors without issues)? Your conda libraries may differ from those installed in the Docker image, which may further change your results.
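Something along these lines, for example (an illustrative invocation only; the mount path and file names are taken from the command line in your log above, so adjust them to your setup):
# Illustrative only: pull the official GATK image and rerun the failing step inside it.
# The mount path and file names below come from the log; adapt to wherever your inputs live.
docker pull broadinstitute/gatk:4.1.7.0
docker run --rm -v /home/lees50/gatkBgAnalysis:/data -w /data broadinstitute/gatk:4.1.7.0 \
    gatk GermlineCNVCaller --run-mode COHORT \
        -L binnedIntervals10k.interval_list \
        --contig-ploidy-calls ploidy-calls \
        --annotated-intervals intervalDataAnnotated.tsv \
        --interval-merging-rule OVERLAPPING_ONLY \
        --output cnvCohort \
        --output-prefix cnvCohort_1of1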
-
Thanks for the reply.
I am analyzing only a small portion of each genome: 46 blood group genes, totaling 1.4 Mbp. I am running on an HPC node with 8 cores and 80 GB of memory.
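In case it is relevant, I can raise the JVM heap via --java-options, although as I understand it most of GermlineCNVCaller's memory use is in the Python gcnvkernel process rather than the JVM. A sketch, using the same variables as the commands above:
# Sketch only: same GermlineCNVCaller invocation as above, with an explicit 64 GB JVM heap.
# Note that the gcnvkernel Python subprocess allocates its memory outside this limit.
gatk --java-options "-Xmx64g" GermlineCNVCaller --run-mode COHORT \
    -L ${binnedIntervals10k} \
    --contig-ploidy-calls ${ploidyPrefix}-calls \
    --annotated-intervals ${annotatedIntervalData} \
    --interval-merging-rule OVERLAPPING_ONLY \
    --output ${cohortOutputName} \
    --output-prefix ${cohortOutputPrefix} \
    --arguments_file ${tsvCommands}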
I am also thinking of splitting the intervals further into shards, but this does not seem necessary given the suggestions in this CNV tutorial: https://gatk.broadinstitute.org/hc/en-us/articles/360035531152--How-to-Call-common-and-rare-germline-copy-number-variants.
I will try running GATK through Docker and see if that helps.
-
Hi simon lee, let us know if using docker helps. If not, we can look into other solutions as well.
-
I think I solved the problem, and it seems to have been memory related.
I noticed that runs with fewer than 500 genomes did not throw the error. I then ran all 3000 genomes with my intervals split into multiple shards, and the analysis completed successfully with no errors.
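Roughly what the sharding looked like (a minimal sketch using the variables from the commands above; the exact SplitIntervals invocation and the shard file names are illustrative and may vary by GATK version):
# Sketch only: scatter the binned intervals into shards, then run
# GermlineCNVCaller once per shard against the same ploidy calls.
# Shard file names (e.g. 0000-scattered.interval_list) may differ by version.
gatk SplitIntervals \
    -R ${refGen} \
    -L ${binnedIntervals10k} \
    --scatter-count 10 \
    -O scatteredIntervals
for shard in scatteredIntervals/*-scattered.interval_list; do
    shardName=$(basename ${shard} .interval_list)
    gatk GermlineCNVCaller --run-mode COHORT \
        -L ${shard} \
        --contig-ploidy-calls ${ploidyPrefix}-calls \
        --annotated-intervals ${annotatedIntervalData} \
        --interval-merging-rule OVERLAPPING_ONLY \
        --output ${cohortOutputName} \
        --output-prefix ${cohortOutputPrefix}_${shardName} \
        --arguments_file ${tsvCommands}
done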
Thanks for the suggestions.
-
Thanks for the update simon lee!
-
Hi, I know it's been a while, but I am running into the same problem at step 3.
Can someone please tell me how to solve it?
I saw that simon lee used interval splitting; if that is the fix, I would appreciate an explanation of how to do it.
Thanks!
-
Hi Shai Casif
What version are you using?
-
Hi Gökalp Çelik
I am using GATK version 4.1.0.0. I split the .interval_list using SplitIntervals with a scatter count of 100, but I am still encountering the same issue.
-
Hi again.
This issue was fixed in a later version of GATK, specifically 4.1.4.1:
https://github.com/broadinstitute/gatk/releases/tag/4.1.4.1
More improvements and fixes were added in subsequent versions, so we recommend using the latest version (4.6.0.0) as the definitive solution.
Regards.
-
Thanks Gökalp Çelik
After that, would you recommend restarting the pipeline from the beginning, or can I continue my work from where I left off?
-
It would be better to redo all the steps from the beginning. gCNV models are also not compatible between versions 4.1 and 4.6.