Help with running GATK RNAseq workflow with docker backend and multiple cpus?
How do I configure the workflow to automatically run every tool with docker with more threads/cpus? I'm having a hard time.
I'm using this docker.conf inside my backend section and have set it as the default.
Do I need to change the main workflow WDL to add the cpu/memory to every task's runtime? Or will setting it only in my conf file ensure that my containers run with the prespecified cpu and memory settings? Do I need to pass cpu and memory in my input.json as well?
I tried:
- Hardcoding the runtime cpu/memory inside the main workflow WDL;
- Setting the unspecified default runtime config in my docker conf to the number of cpus/mem I wanted;
- Doing the same as above and also adding fields to my input.json to pass the number of cpus and memory to the main WDL.
However, nothing seems to work. I mean, I can use the default "localExample" config with the default settings, but it takes 3 days to process a sample, I believe because it only uses multithreading during the STAR steps.
My last error was this:
[2021-09-02 12:27:30,32] [info] WorkflowManagerActor Workflow a7905283-922a-43c2-bebd-2d86aa2a9eaf failed (during InitializingWorkflowState): Task HaplotypeCaller has an invalid runtime attribute cpu = !! NOT FOUND !!
Task gtfToCallingIntervals has an invalid runtime attribute cpu = !! NOT FOUND !!
Task MergeBamAlignment has an invalid runtime attribute cpu = !! NOT FOUND !!
Task BaseRecalibrator has an invalid runtime attribute cpu = !! NOT FOUND !!
Task SplitNCigarReads has an invalid runtime attribute cpu = !! NOT FOUND !!
Task RevertSam has an invalid runtime attribute cpu = !! NOT FOUND !!
Task ApplyBQSR has an invalid runtime attribute cpu = !! NOT FOUND !!
Task ScatterIntervalList has an invalid runtime attribute cpu = !! NOT FOUND !!
Task MarkDuplicates has an invalid runtime attribute cpu = !! NOT FOUND !!
Task SamToFastq has an invalid runtime attribute cpu = !! NOT FOUND !!
Task MergeVCFs has an invalid runtime attribute cpu = !! NOT FOUND !!
Task VariantFiltration has an invalid runtime attribute cpu = !! NOT FOUND !!
This is the workflow I forked:
https://github.com/gatk-workflows/gatk4-rnaseq-germline-snps-indels
Here's my custom .conf file: https://pastebin.com/HJt7L9RG
Thank you
-
Hi ThyagoLC,
The way you will want to configure your WDL will depend on where you are running it. If you are running on your local machine, the docker containers will have access to the available CPU and memory.
You can edit the commands in each task of the workflow to use a specific number of CPUs. For example, StarGenerateReferences (the task where STAR is called) allows you to specify the number of CPUs with --runThreadN ${threads}. Some tools do not have an option to specify CPUs, but you can adjust the memory for GATK commands with --java-options and the Xmx parameter. Here is an article covering resource specification with GATK: https://gatk.broadinstitute.org/hc/en-us/articles/360035532372-Java-is-using-too-many-resources-threads-memory-or-CPU-
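To make the memory part concrete, here is a minimal Python sketch that assembles a GATK command line with the Java heap capped through --java-options and -Xmx. The helper name build_gatk_command, the 4 GB value, and the file names are hypothetical placeholders for illustration; they are not taken from the workflow itself.

```python
import shlex


def build_gatk_command(tool, heap_gb, tool_args):
    """Return a GATK command (as an argument list) with a capped Java heap.

    Hypothetical helper for illustration only; it is not part of GATK
    or of the RNAseq workflow.
    """
    if heap_gb < 1:
        raise ValueError("heap_gb must be at least 1")
    # --java-options takes one string that is handed to the JVM as-is.
    java_options = f"-Xmx{heap_gb}g"
    return ["gatk", "--java-options", java_options, tool, *tool_args]


# Example with placeholder file names (assumptions, not real paths).
cmd = build_gatk_command(
    "HaplotypeCaller",
    heap_gb=4,
    tool_args=["-R", "ref.fasta", "-I", "sample.bam", "-O", "out.vcf.gz"],
)
print(shlex.join(cmd))
```

The same string would go inside the command section of the WDL task, so the heap size can be driven by a task input rather than hardcoded.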
Hope this helps!
Best,
Genevieve
-
Genevieve Brandt (she/her), Thanks!
Got it. So the problem is that, other than STAR, the other Picard tools don't support multi-threading, so I can't pass multiple cpus in the runtime section the way I can for STAR, right?
So, for these tools without a CPU option, I can only use the java option -XX:ConcGCThreads, set it to the maximum I have, and docker will automatically use those.
One last question: HaplotypeCaller runs on scattered intervals. Does each interval run on multiple cpus? Will I gain speed by increasing scatter_count from the default of 6?
Thank you.
-
No problem!
Your first two assumptions are correct, yes.
For your question, each HaplotypeCaller scatter will use a thread. Increasing the scatter count would help with speed because you would be running multiple shards at the same time. However, if the scatter count exceeds the available threads, the shards will compete for CPU resources, so with too many shards the CPU becomes a bottleneck and could slow down the workflow.
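As a rough sketch of that trade-off, the Python snippet below caps a requested scatter count at the number of shards that can run at once on the machine. The function name suggest_scatter_count and the one-thread-per-shard default are assumptions for illustration; this is a simple heuristic, not something the workflow or Cromwell does for you.

```python
import os


def suggest_scatter_count(requested, available_cpus=None, threads_per_shard=1):
    """Cap a requested scatter count at what the CPUs can run concurrently.

    Hypothetical helper: assumes each shard uses `threads_per_shard` threads.
    """
    if requested < 1 or threads_per_shard < 1:
        raise ValueError("requested and threads_per_shard must be >= 1")
    if available_cpus is None:
        # os.cpu_count() may return None, so fall back to a single CPU.
        available_cpus = os.cpu_count() or 1
    # Number of shards that can run side by side without sharing a CPU.
    max_concurrent = max(1, available_cpus // threads_per_shard)
    return min(requested, max_concurrent)


# On an 8-CPU machine, asking for 24 shards is capped at 8,
# while the default of 6 is left unchanged.
print(suggest_scatter_count(24, available_cpus=8))
print(suggest_scatter_count(6, available_cpus=8))
```

The returned value could then be passed as scatter_count in the input.json, leaving some headroom if other jobs share the machine.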
Best,
Genevieve