MarkDuplicatesSpark Error
REQUIRED for all errors and issues:
a) GATK version used:
gatk-package-4.3.0.0
b) Exact command used:
#!/bin/sh
#SBATCH --time=2-10:15 -N 2 --cpus-per-task=24 -p l_long --qos ll
#SBATCH -J Mark_duplicates
#SBATCH --output=Mark_duplicates.o
#SBATCH --error=Mark_duplicates.e
gatk MarkDuplicatesSpark -I ${aligned_reads}/$1 -O ${aligned_reads}/MK264_sorted_dedup_reads.bam
c) Entire program log:
Runtime.totalMemory()=7683964928
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1899 in stage 1.0 failed 1 times, most recent failure: Lost>
at htsjdk.samtools.SAMRecordSparkCodec.encode(SAMRecordSparkCodec.java:114)
at org.broadinstitute.hellbender.engine.spark.SAMRecordToGATKReadAdapterSerializer.write(SAMRecordToGATKReadAdapterSerializ>
at org.broadinstitute.hellbender.engine.spark.SAMRecordToGATKReadAdapterSerializer.write(SAMRecordToGATKReadAdapterSerializ>
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:245)
at org.apache.spark.serializer.SerializationStream.writeKey(Serializer.scala:132)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:260)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:188)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGSchedule>
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1879)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1878)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:927)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2112)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:990)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
at org.apache.spark.rdd.RDD.collect(RDD.scala:989)
at org.apache.spark.api.java.JavaRDDLike$class.collect(JavaRDDLike.scala:361)
at org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
at org.broadinstitute.hellbender.utils.spark.SparkUtils.putReadsWithTheSameNameInTheSamePartition(SparkUtils.java:205)
at org.broadinstitute.hellbender.utils.spark.SparkUtils.sortReadsAccordingToHeader(SparkUtils.java:144)
at org.broadinstitute.hellbender.utils.spark.SparkUtils.querynameSortReadsIfNecessary(SparkUtils.java:306)
at org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSpark.mark(MarkDuplicatesSpark.java:20>
at org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSpark.mark(MarkDuplicatesSpark.java:27>
at org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSpark.runTool(MarkDuplicatesSpark.java>
at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:546)
at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:31)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
Hi Yosra Bejaoui,
I can't tell exactly what's happening here. It looks like the job is failing during a serialization operation, but the stack trace appears to be cut off somehow. However, the line it references in the part of the stack trace that is present (SAMRecordSparkCodec.java:114) is where an exception is thrown due to a problem in the input data:
throw new RuntimeException("Mismatch between read length and quals length writing read " +
        alignment.getReadName() + "; read length: " + alignment.getReadLength() +
        "; quals length: " + alignment.getBaseQualities().length);

I can't say for sure, but I suspect you have a read that is either missing its quality scores or has an error in the quality scores, so that there are too few or too many of them compared to the number of bases.
If you can identify the problematic reads, that would clarify things. If it's just a few reads with that problem, you could try applying a read filter to remove them:

--read-filter MatchingBasesAndQualsReadFilter

should remove any reads whose base and quality lengths don't match.
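For example, appended to the command from your script (untested on my end, but the flag should combine with MarkDuplicatesSpark like any other GATK read filter):

gatk MarkDuplicatesSpark \
    -I ${aligned_reads}/$1 \
    -O ${aligned_reads}/MK264_sorted_dedup_reads.bam \
    --read-filter MatchingBasesAndQualsReadFilter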