ReadsPipelineSparkMulticore.wdl, Unrecognized runtime attribute keys: disks, cpu
Hello, while running ReadsPipelineSparkMulticore.wdl (https://github.com/broadinstitute/gatk/tree/master/scripts/spark_wdl) I encountered the following problem. I would like to know what could be the reason? I use this command:
java -jar ../cromwell-77.jar run ReadsPipelineSparkMulticore.wdl -i exome/ReadsPiplineSpark_exome.json
Here is the JSON file:
I run this command on a local server.
Here is its configuration (CPU and RAM):
I found this error in my stderr file:
22/04/06 15:36:17 ERROR Executor: Exception in task 10.0 in stage 11.0 (TID 1596)
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.broadinstitute.hellbender.utils.recalibration.BaseRecalibrationEngine.calculateFractionalErrorArray(BaseRecalibrationEngine.java:440)
at org.broadinstitute.hellbender.utils.recalibration.BaseRecalibrationEngine.processRead(BaseRecalibrationEngine.java:141)
at org.broadinstitute.hellbender.tools.spark.transforms.BaseRecalibratorSparkFn.lambda$null$0(BaseRecalibratorSparkFn.java:33)
at org.broadinstitute.hellbender.tools.spark.transforms.BaseRecalibratorSparkFn$$Lambda$705/136574652.accept(Unknown Source)
at java.util.Iterator.forEachRemaining(Iterator.java:116)
at org.broadinstitute.hellbender.utils.iterators.CloseAtEndIterator.forEachRemaining(CloseAtEndIterator.java:47)
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
at org.broadinstitute.hellbender.tools.spark.transforms.BaseRecalibratorSparkFn.lambda$apply$6ed74b3e$1(BaseRecalibratorSparkFn.java:33)
at org.broadinstitute.hellbender.tools.spark.transforms.BaseRecalibratorSparkFn$$Lambda$635/777640102.call(Unknown Source)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:153)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:153)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:823)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:823)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I will be grateful for your help.
-
Hi Andrew,
The java.lang.OutOfMemoryError: GC overhead limit exceeded indicates that you ran out of memory. I would recommend using Java options to specify the memory you want to allocate.
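Something along these lines should work (a sketch; the -Xmx value is an assumption, choose it based on the RAM available on your server):
gatk ReadsPipelineSpark --java-options "-Xmx32G" [your other arguments]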
Let me know if this works!
Best,
Genevieve
-
Hi Andrew Erzunov,
It looks like this is a memory issue from the program log snippet you shared (java.lang.OutOfMemoryError: GC overhead limit exceeded).
Our GATK support team does not support personal Cromwell instances, so I'm not familiar with troubleshooting this WDL (ReadsPipelineSparkMulticore.wdl) or with fixing its memory issues. You can see if other users are able to help out in the comments on this post, and you can also check out these other Cromwell resources:
- Bioinformatics Stack Exchange
- Cromwell slack organization: cromwellhq.slack.com
- Cromwell Documentation
Alternatively, you can try running ReadsPipelineSpark directly within GATK, and I will be better able to determine whether there is something you can do about the memory problem. Here is a section in our README about running GATK4 Spark tools locally: https://github.com/broadinstitute/gatk#sparklocal.
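A rough sketch of what a local run could look like (the paths and thread count here are placeholders, not your actual files):
./gatk ReadsPipelineSpark -I input.bam -R reference.fa --known-sites known_sites.vcf -O output.vcf --spark-runner LOCAL --spark-master 'local[*]'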
Please let me know if you have any other questions.
Best,
Genevieve
-
Hi Genevieve,
Thanks a lot for your response.
I tried to run ReadsPipelineSpark within GATK locally. Here is the command I used:
docker run -v /xchg/local/pipelines/references/cromwell/gatk_wdl/exome_data:/data -it broadinstitute/gatk:latest ./gatk ReadsPipelineSpark -I /data/exome_align.bam -R /data/hg38_no_alt.fa --known-sites /data/resources_broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf -O /data/output_exome.vcf --spark-runner LOCAL --spark-master 'local[100]'
But I got the following errors:
I will be grateful for your help,
Andrew -
Thank you very much for your response, Genevieve.
Using the Java argument -Xmx really helped to solve the "java.lang.OutOfMemoryError" problem. ReadsPipelineSpark successfully executed on exome data, but when I passed genome data, I got the "The covariates table is missing ReadGroup someId in RecalTable0" error:
I set the Read Group parameter via bwa.
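Roughly along these lines (a sketch; the sample, library, and platform values shown here are illustrative, not the exact ones I used):
bwa mem -R '@RG\tID:someId\tSM:sample1\tLB:lib1\tPL:ILLUMINA' hg38_no_alt.fa reads_R1.fastq reads_R2.fastq > genome_align.sam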
The BAM file has been successfully validated:
The command I used for validation:
java -jar picard.jar ValidateSamFile I=genome/genome_align.bam MODE=SUMMARY
I will be grateful for your help,
Andrew -
Hi Andrew,
Could you post the program log from when the pipeline runs BaseRecalibrator? It looks like this error might be related to this reported issue: https://github.com/broadinstitute/gatk/issues/6242, where all the reads from one read group get filtered out during BaseRecalibrator, causing an error with ApplyBQSR.
Best,
Genevieve
-
Hello, Genevieve.
I am attaching the log file after running the following command:
./gatk-4.2.6.1/gatk ReadsPipelineSpark -I /media/gene/sdb/cromwell/gatk_wdl/genome/genome_align.bam -R /media/gene/sdb/cromwell/gatk_wdl/reference/hg38_no_alt.fa --known-sites /media/gene/sdb/cromwell/gatk_wdl/reference/Homo_sapiens_assembly38.dbsnp138.vcf -O /media/gene/sdb/cromwell/gatk_wdl/genome/output_genome.vcf --spark-runner LOCAL --spark-master 'local[45]' --java-options "-Xmx90G" --tmp-dir /media/gene/sdb/cromwell/gatk_wdl/temp_files
Here is log file:
https://drive.google.com/file/d/1SHNkuwBeYEZ48nsxUVsbdTkBBP4mVW3i/view?usp=sharing
I also tried using the "ReadGroupBlackListReadFilter" option with the above command:
--read-filter ReadGroupBlackListReadFilter --read-group-black-list RG:someId
But I got the following error:
I will be grateful for your help,
Andrew -
Hi Andrew,
I didn't see any information in the logs about reads being filtered, so I'm not sure about the cause. Could you verify that you successfully added read groups to your file with this command?
samtools view -H sample.bam | grep '^@RG'
I don't think that option will be your best step forward; it will probably be most helpful to find the cause of the read group problem.
Best,
Genevieve
-
Hello, Genevieve.
After applying the command above, I got the following output:
Faithfully,
Andrew -
Great, thank you! Since you only have one read group in your file, that is why blacklisting that one read group does not work.
I noticed you don't have PU in your read group; you might want to add that. Here's more information about read groups: https://gatk.broadinstitute.org/hc/en-us/articles/360035890671-Read-groups
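For illustration, a complete @RG header line could look something like this (all values here are placeholders):
@RG	ID:someId	SM:sample1	LB:lib1	PL:ILLUMINA	PU:flowcell1.lane1.sample1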
I don't think the PU issue is causing your error, so I'm going to keep looking into the ReadsPipelineSpark error.
-
Hi Genevieve,
I tried to add the PU field:
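Roughly like this (a sketch; using Picard here is just one way to do it, and the tag values are illustrative):
java -jar picard.jar AddOrReplaceReadGroups I=genome_align.bam O=genome_align_rg.bam RGID=someId RGSM=sample1 RGLB=lib1 RGPL=ILLUMINA RGPU=flowcell1.lane1.sample1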
And after running the following command:
I got this error:
Faithfully,
Andrew -
Hi Andrew,
It looks like your error is the same after you added the PU field. I'm having trouble finding out what is truly causing the problem during ApplyBQSR in this ReadsPipelineSpark pipeline. The pipeline does not seem to print out the results from the read filters during the BaseRecalibrator step; there is better error handling in the standalone tools themselves. Would you be able to run the tools that make up ReadsPipelineSpark separately, so we can figure out why you are getting this error? I know this is not ideal, but I don't think there is a better way to troubleshoot.
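For the recalibration part, a rough sketch of the separate steps could look like this (the paths are placeholders, and I'm only showing the two BQSR steps relevant to the error):
gatk BaseRecalibrator -I genome_align.bam -R hg38_no_alt.fa --known-sites Homo_sapiens_assembly38.dbsnp138.vcf -O recal_data.table
gatk ApplyBQSR -I genome_align.bam -R hg38_no_alt.fa --bqsr-recal-file recal_data.table -O genome_recalibrated.bam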
Best,
Genevieve
-
Hello Genevieve,
After running the BaseRecalibrator tool, I got the following result (BaseRecalibrator was able to recalibrate 0 reads):
https://drive.google.com/file/d/1mVtbBjgUcA9WFNaI3bOZYiXc-yH-KqtK/view?usp=sharing
And after running ApplyBQSR, I got the same error as after running ReadsPipelineSpark:
https://drive.google.com/file/d/1Mipt7O0_CR1gosxdfsuSsR_mODzQ_F9t/view?usp=sharing
Also, after using samtools coverage, I got the following result:
Faithfully,
Andrew -
Hi Andrew,
It looks like all your reads were filtered by the MappingQualityNotZeroReadFilter:
15:32:48.714 INFO BaseRecalibrator - 993014098 read(s) filtered by: MappingQualityNotZeroReadFilter
This indicates that something went wrong with your mapping, because all your reads have a mapping quality of 0. You can read more about the read filter here: https://gatk.broadinstitute.org/hc/en-us/articles/5358856018459-MappingQualityNotZeroReadFilter
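As a quick sanity check (a sketch; adjust the BAM path to your file), you could compare the total read count with the number of reads whose mapping quality is above 0:
samtools view -c genome_align.bam
samtools view -c -q 1 genome_align.bam
If the second count is 0 or close to it, that confirms the mapping qualities are the problem.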
Best,
Genevieve