GermlineCNVCaller edge case?
I am observing strange behavior with GermlineCNVCaller: it used to work fine, but it is now misbehaving on some machines.
Previously I performed CNV calling on 11 WGS samples using the GATK 4.1.8.1 Docker image, and it all worked fine.
Now I have added 3 more samples and moved to the 4.1.9.0 Docker image, and boom. The DetermineGermlineContigPloidy (DCP) step completes without problems, but the GermlineCNVCaller Python script segfaults with exit code 139 right before the denoising warm-up stage starts. This happened on 2 separate systems: a Skylake i9 with 128 GB of memory (Debian Buster, kernel 4.19) and a Coffee Lake i5 with 32 GB of memory (Fedora 33, kernel 5.9.11). I tried limiting the number of targets (excluding different chromosomes on each run) to make sure the process does not consume more than 20 GB of RAM, and both systems still crashed with exit code 139 from the Python script, a possible segfault during the compile stage.
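For context on the exit code: shells report a process killed by a signal as 128 plus the signal number, so 139 corresponds to signal 11 (SIGSEGV, a segfault) rather than, say, 137 (SIGKILL, typical of the OOM killer). A minimal stdlib check:

```python
import signal

# Exit codes above 128 encode "killed by signal (code - 128)" on POSIX shells.
for exit_code in (139, 137):
    sig = signal.Signals(exit_code - 128)
    print(exit_code, "->", sig.name)
```

This is consistent with the crashes not being memory-related: the process was limited well below available RAM and still died with 139.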
Interestingly, the very same code and samples worked flawlessly on 2 other systems: one with a Sandy Bridge Xeon and 32 GB of memory (Fedora 33, kernel 5.9.11, non-ECC memory) and the other with a Cascade Lake Xeon and 128 GB of memory (CentOS 7, kernel 3.10, ECC registered memory).
All trials included GATK 4.1.9.0, 4.1.8.1, and the latest nightly from 11-22. The first 2 systems crashed, yet the other 2 survived.
I was wondering if this could be an edge case in gcnvkernel or some other part of the Python environment, because I was able to call CNVs on many exome samples without a hitch on the same 2 systems that crash here.
I may be able to upload my files if any of the developers wishes to replicate the issue.
Here is my log for the failed instances; they are all the same:
03:50:19.829 INFO gcnvkernel.tasks.task_cohort_denoising_calling - Instantiating the denoising model (warm-up)...
03:51:14.133 DEBUG ScriptExecutor - Result: 139
03:51:14.135 INFO GermlineCNVCaller - Shutting down engine
[December 2, 2020 3:51:14 AM GMT] org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller done. Elapsed time: 2.92 minutes.
Runtime.totalMemory()=7434928128
org.broadinstitute.hellbender.utils.python.PythonScriptExecutorException:
python exited with 139
Command Line: python /home/WGS_GATK_CNV/tmp/cohort_denoising_calling.4881618556333046966.py --ploidy_calls_path=/home/WGS_GATK_CNV/DCPoutput/WGS_GATK_CNV_CNV-calls --output_calls_path=/home/WGS_GATK_CNV/CNVCalls/GATK_CNV-calls --output_tracking_path=/home/WGS_GATK_CNV/CNVCalls/GATK_CNV-tracking --modeling_interval_list=/home/WGS_GATK_CNV/tmp/intervals4393369786929696184.tsv --output_model_path=/home/WGS_GATK_CNV/CNVCalls/GATK_CNV-model --enable_explicit_gc_bias_modeling=True --read_count_tsv_files /home/WGS_GATK_CNV/tmp/Wes11964524832530795686064.tsv /home/WGS_GATK_CNV/tmp/Wes1197737677997828074616.tsv /home/WGS_GATK_CNV/tmp/Wes12698469994795682841508.tsv /home/WGS_GATK_CNV/tmp/Wes12708000605781211140356.tsv /home/WGS_GATK_CNV/tmp/Wes12715414547641366378717.tsv /home/WGS_GATK_CNV/tmp/Wes2921459800353038366848.tsv /home/WGS_GATK_CNV/tmp/Wes7325362233466406331624.tsv /home/WGS_GATK_CNV/tmp/Wes7333435403601142225470.tsv /home/WGS_GATK_CNV/tmp/Wes7342723515437658413118.tsv /home/WGS_GATK_CNV/tmp/Wes7358999144600508588065.tsv /home/WGS_GATK_CNV/tmp/Wes7367280700237727048706.tsv /home/WGS_GATK_CNV/tmp/Wes7372152427712545879429.tsv /home/WGS_GATK_CNV/tmp/Wes7381789128275660999213.tsv /home/WGS_GATK_CNV/tmp/Wes7596516274389032021867.tsv --psi_s_scale=1.000000e-04 --mapping_error_rate=1.000000e-02 --depth_correction_tau=1.000000e+04 --q_c_expectation_mode=hybrid --max_bias_factors=5 --psi_t_scale=1.000000e-03 --log_mean_bias_std=1.000000e-01 --init_ard_rel_unexplained_variance=1.000000e-01 --num_gc_bins=20 --gc_curve_sd=1.000000e+00 --active_class_padding_hybrid_mode=50000 --enable_bias_factors=True --disable_bias_factors_in_active_class=False --p_alt=1.000000e-06 --cnv_coherence_length=1.000000e+04 --max_copy_number=5 --p_active=0.010000 --class_coherence_length=10000.000000 --learning_rate=1.000000e-02 --adamax_beta1=9.000000e-01 --adamax_beta2=9.900000e-01 --log_emission_samples_per_round=50 --log_emission_sampling_rounds=10 
--log_emission_sampling_median_rel_error=5.000000e-03 --max_advi_iter_first_epoch=5000 --max_advi_iter_subsequent_epochs=200 --min_training_epochs=10 --max_training_epochs=100 --initial_temperature=1.500000e+00 --num_thermal_advi_iters=2500 --convergence_snr_averaging_window=500 --convergence_snr_trigger_threshold=1.000000e-01 --convergence_snr_countdown_window=10 --max_calling_iters=10 --caller_update_convergence_threshold=1.000000e-03 --caller_internal_admixing_rate=7.500000e-01 --caller_external_admixing_rate=1.000000e+00 --disable_caller=false --disable_sampler=false --disable_annealing=false
at org.broadinstitute.hellbender.utils.python.PythonExecutorBase.getScriptException(PythonExecutorBase.java:75)
at org.broadinstitute.hellbender.utils.runtime.ScriptExecutor.executeCuratedArgs(ScriptExecutor.java:130)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeArgs(PythonScriptExecutor.java:170)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeScript(PythonScriptExecutor.java:151)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeScript(PythonScriptExecutor.java:121)
at org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller.executeGermlineCNVCallerPythonScript(GermlineCNVCaller.java:438)
at org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller.doWork(GermlineCNVCaller.java:309)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
Thanks in advance.
-
Hi SkyWarrior, we think this might be from changes to the Docker image that occurred in GATK 4.1.8.0. Would you be able to test this with GATK 4.1.7.0 and see if you get the same issue?
Also, how many targets and samples were used as input?
-
Thanks for the response. I had not tried 4.1.7.0 before, but it now seems to run fine on the Skylake i9 system. I will extend my tests to the other systems as well and repost here once I can confidently say that the problem is limited to 4.1.8.1 and above.
I am keeping the number of targets to 500,000 at most, and the number of samples is 14.
-
Thanks for the update SkyWarrior, let us know if you confirm with those tests.
-
I wonder if this may be an issue with numpy 1.17.5, which we updated to in the 4.1.8.0 Docker/conda changes. See https://numpy.org/doc/1.17/release.html: "Downstream developers should use Cython >= 0.29.13 for Python 3.8 support and OpenBLAS >= 3.7 to avoid errors on the Skylake architecture."
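A quick way to see which numpy version and BLAS backend an environment is actually using (a sketch, assuming numpy is installed in the active environment, e.g. run inside the GATK conda env):

```python
import numpy as np

print("numpy", np.__version__)
# show_config() prints the BLAS/LAPACK libraries numpy was built against;
# look for "mkl" vs "openblas" in the reported library names.
np.show_config()
```

If the crashing machines report an MKL-linked numpy at 1.17.x, that would line up with the Skylake caveat quoted above.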
Many of the Docker/conda changes were made to ensure that the CNN tool was able to properly use MKL-enabled numpy and Tensorflow. Specifically, the relevant packages were conda-installed from channels that compiled them against MKL (rather than pip-installed, which resulted in dependency clobbering).
If this causes issues with more recent architectures, it may be possible to either configure a new conda environment against OpenBLAS packages and/or run GermlineCNVCaller with theano flags that disable MKL. From the tool docs: "Advanced users may wish to set the THEANO_FLAGS environment variable to override the GATK theano configuration. For example, by running THEANO_FLAGS="base_compiledir=PATH/TO/BASE_COMPILEDIR" gatk GermlineCNVCaller ..., users can specify the theano compilation directory (which is set to $HOME/.theano by default). See theano documentation at http://deeplearning.net/software/theano/library/config.html."
I would expect that gCNV runtime might differ between MKL and OpenBLAS, but probably not substantially. Unfortunately, I don't have the bandwidth at the moment to put together such an environment, but let me know if you need further pointers.
Note that more changes to the conda environment may be on their way in the next few months (to update PyMC3, which is used by gCNV, and include pyro), so if rolling back to 4.1.7.0 works, I might just stick with that for now.
Thanks as always SkyWarrior for the detailed reports. It's especially difficult for us to test across different architectures, so any data points you can provide are extremely helpful!
-
Thanks Samuel Lee
I can confidently say that the problem is confined to the 4.1.8.1-and-above Docker images; the 4.1.7.0 Docker image works fine on all the systems.
About the use of MKL: I am in no position to judge, but MKL's bias against certain CPU brands (GenuineIntel vs. AuthenticAMD) might affect the performance of CNN and gCNV on the disadvantaged CPUs. I am not sure how many people keep track of what they have or use as infrastructure or cloud, but some cloud providers are serving more non-Intel CPUs as alternatives, and those may go unnoticed. I currently don't have any AMD CPUs, but chances are I will procure some in the mid-term due to pricing differences.
Regards.
-
Thanks. I will take a look at using packages compiled against OpenBLAS for the next iteration of the environment and will update here when I can.
-
Hi Everyone,
As a quick solution, I recommend the following conda environment, which worked well for me. Note that not all of the packages listed are required Python dependencies. SkyWarrior, your comments are always useful. Thanks.
channels:
- bioconda
- anaconda
- conda-forge
- BioBuilds
dependencies:
- conda-forge::python=3.6.13
- picard
- vcftools
- bcftools
- samtools
- bamtools
- bioconda::gatk4<4.1.8.1
- gcnvkernel
- bwa
- htslib
- fastqc
- multiqc
- trimmomatic
- tabix
- moreutils
- anaconda::numpy=1.15
- pip
- mkl
- mkl-service
- theano
- tensorflow
- scipy
- pymc3
- h5py
- keras
- intel-openmp
- scikit-learn
- matplotlib
- pandas
- conda-forge::qt=5.12.9
- conda-forge::libglib=2.70.1
- conda-forge::libffi=3.4.2
- biopython
- pyvcf
- pysam
Cheers.
-
Thanks for posting this Mehdi!