GATK-gCNV for WES is taking too long on the server (138 WES samples running for >20 days)
REQUIRED for all errors and issues:
a) GATK version used: gatk:4.4.0.0
b) Exact command used: singularity exec docker://broadinstitute/gatk:4.4.0.0 gatk GermlineCNVCaller --run-mode COHORT -L /home/MW/gCNV/scatter/temp_0001_of_10/scattered.interval_list -I MW1_1_P.tsv -I MW_2_P.tsv -I MW_3_P.tsv -I MW_4_P.tsv -I MW_5_P.tsv -I MW_6_P.tsv -I MW_7_P.tsv -I MW_8_P.tsv -I MW_9_P.tsv -I MW_10_P.tsv ..............and so on till..... -I MW_138_P.tsv --contig-ploidy-calls /home/MW/gCNV/ploidy-calls --annotated-intervals /home/MW/gCNV/v8.annotated_intervals.tsv --interval-merging-rule OVERLAPPING_ONLY --output cohortMW-v8-shards --output-prefix normal_cohort_run_1of10 --verbosity DEBUG
I am trying to run gCNV on a cohort of 138 WES samples. I have followed the steps given in this link (https://gatk.broadinstitute.org/hc/en-us/articles/360035531152--How-to-Call-rare-germline-copy-number-variants) with the following order of steps:
gatk PreprocessIntervals
gatk AnnotateIntervals
gatk CollectReadCounts
gatk FilterIntervals
gatk DetermineGermlineContigPloidy (here all male samples were called as XXY while the females were called correctly; this is a separate problem whose cause I could not work out)
gatk IntervalListTools --INPUT v8_filtered.interval_list --SUBDIVISION_MODE INTERVAL_COUNT --SCATTER_CONTENT 30000 --OUTPUT scatter
This produced 10 scatters (scatter/temp_0001_of_10 being the first).
The first scatter alone has been running for 20 days and has still not completed. No new files are being written to the *shards folder; the only file there is interval_list.tsv, which was created on the day the command was started.
Does it normally take this long, or have I done something wrong? Should there be more than 10 scatters?
The server has 128 cores (nproc = 128), and 4 scatters are running in 4 screen sessions. The BAM files range from 3 GB to 17 GB.
Please help? Should I wait? Should I cancel?
Are there pre-existing hg38 WES trained ploidy-models that I can use to run GermlineCNVCaller in CASE MODE?
-
Hi S
We do not support using Singularity for container execution. That aside, each GermlineCNVCaller instance is designed to use all available threads unless the execution engine limits it, so I suggest you check whether that is the case here. We usually run our instances with as few cores as possible to avoid overloading virtual machines. In my experience, 4 to 8 cores are more than enough to run a single GermlineCNVCaller instance.
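One way to apply a per-instance core limit when several shards share one host is sketched below. It is an illustration under assumptions, not a supported recipe: `threads_per_job` is a hypothetical helper, the numbers (128 cores, 4 concurrent shards, a cap of 8) are taken from this thread, and it assumes the numerical libraries inside the container honor the standard `OMP_NUM_THREADS` and `MKL_NUM_THREADS` environment variables and that the container runtime passes the host environment through.

```shell
#!/bin/sh
# Hypothetical helper: split TOTAL_CORES across CONCURRENT_JOBS,
# never going below 1 or above MAX_PER_JOB threads per job.
# Usage: threads_per_job TOTAL_CORES CONCURRENT_JOBS MAX_PER_JOB
threads_per_job() {
    share=$(( $1 / $2 ))
    if [ "$share" -lt 1 ]; then share=1; fi
    if [ "$share" -gt "$3" ]; then share=$3; fi
    echo "$share"
}

# Values from this thread: 128 cores, 4 shards at once, at most 8 cores each.
N_THREADS=$(threads_per_job 128 4 8)

# Assumption: OpenMP/MKL-backed libraries in the container read these variables.
export OMP_NUM_THREADS="$N_THREADS"
export MKL_NUM_THREADS="$N_THREADS"

echo "threads per GermlineCNVCaller instance: $N_THREADS"
# The GermlineCNVCaller command would then be launched from this same shell.
```

Watching the process in a tool such as top while the job runs is a simple way to check whether the limit is actually being respected.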
Besides, there may be other reasons for the low performance:
1- GermlineCNVCaller uses a lot of memory, scaling with the number of samples × targets. You may wish to reduce the number of targets per task so that your compute environment needs less memory to complete each one.
2- I/O performance matters, since intermediate files are written to the temporary folder during operation; we recommend setting aside a large temporary space for THEANO/PYTORCH compilation.
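A minimal sketch of giving each shard its own scratch space follows. The names `SCRATCH_ROOT` and `GCNV_TMP` and the directory layout are hypothetical. It assumes the Theano backend is in use, so the compile directory is set through the `base_compiledir` setting in `THEANO_FLAGS`, and that the directory is visible inside the container.

```shell
#!/bin/sh
# Hypothetical scratch location; point SCRATCH_ROOT at a large, fast local disk.
SCRATCH_ROOT="${SCRATCH_ROOT:-${TMPDIR:-/tmp}}"
GCNV_TMP="$SCRATCH_ROOT/gcnv_shard_0001"

# Separate per-shard directories keep concurrent jobs from sharing a compile cache.
mkdir -p "$GCNV_TMP/theano"

# Assumption: Theano backend; base_compiledir sets where compiled code is cached.
export THEANO_FLAGS="base_compiledir=$GCNV_TMP/theano"

# Show how much space is free on the chosen file system before launching.
df -k "$GCNV_TMP"

# The same directory can then be passed to GATK with --tmp-dir "$GCNV_TMP".
```

If the `df` output shows little free space, moving `SCRATCH_ROOT` to a larger volume before starting the shards is the cheaper fix compared with discovering it days into a run.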
I hope these help.
Regards.
-
Thank you very much Gökalp Çelik
Based on your suggestion, I changed the IntervalListTools flag --SCATTER_CONTENT from 30000 to 10000, and that made a drastic difference. A single scatter shard that had been running for 20 days completed in a few hours.
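For anyone sizing their own shards, the arithmetic behind this change can be sketched as follows. The total of 300000 intervals is a hypothetical upper bound inferred from 10 shards of 30000, not a figure from the actual run, and `ceil_div` is an illustrative helper. IntervalListTools may split slightly differently from a plain ceiling division.

```shell
#!/bin/sh
# Hypothetical helper: integer ceiling of $1 / $2.
ceil_div() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

TOTAL_INTERVALS=300000   # assumption: at most 10 shards x 30000 intervals
N_SAMPLES=138            # cohort size from this thread

for per_shard in 30000 10000; do
    shards=$(ceil_div "$TOTAL_INTERVALS" "$per_shard")
    # Samples x targets is what drives memory use for each GermlineCNVCaller task.
    cells=$(( N_SAMPLES * per_shard ))
    echo "SCATTER_CONTENT=$per_shard: about $shards shards, up to $cells sample-interval cells each"
done
```

Under these assumptions, the smaller setting gives three times as many shards, each with a third of the sample-by-interval cells.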