Proper way to prepare .bam for MarkDuplicatesSpark
Hi folks,
I am new to GATK, so please advise me on any details I am missing. I am trying out the protocol for identifying SNP/indel variants from RNA-seq data.
I performed a 2-pass alignment in STAR on 2 samples, with a manifest that looks like this:
s1_r1.fq.gz s1_r2.fq.gz sample1
s2_r1.fq.gz s2_r2.fq.gz sample2
Lane and platform information is irrelevant in this design, and STAR automatically adds ID: entries in the 3rd column if no other fields are provided.
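For what it's worth, my understanding of the manifest format is that the 3rd column can carry additional @RG attributes beyond ID: (SM, LB, PL, and so on, per the SAM spec), so an explicit manifest might look something like this sketch. The values here are placeholders, and the exact 3rd-column separator rules depend on the STAR version:
s1_r1.fq.gz    s1_r2.fq.gz    ID:sample1 SM:sample1 LB:lib1 PL:ILLUMINA
s2_r1.fq.gz    s2_r2.fq.gz    ID:sample2 SM:sample2 LB:lib1 PL:ILLUMINA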
It runs fine for about 5 minutes, then GATK complains that the .bam is malformed because certain reads are missing the RG tag.
I have read about some of the options available:
https://gatk.broadinstitute.org/hc/en-us/articles/360035890671-Read-groups
These posts suggest adding the RG with samtools' addreplacerg or Picard's AddOrReplaceReadGroups, which, from my understanding, are more for single-sample BAMs.
I also cannot understand why MarkDuplicatesSpark is raising this error when RG was included at the alignment stage in STAR.
Please advise on how the BAM should have been prepared.
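For reference, a quick way to double-check the header and tags (a sketch; this assumes samtools is on the PATH and the paths match my setup):
samtools view -H ~/test/output/pass2/Aligned.out.bam | grep '^@RG'     # list @RG header lines
samtools view ~/test/output/pass2/Aligned.out.bam | grep -vc 'RG:Z:'   # count reads with no RG tag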
REQUIRED for all errors and issues:
a) GATK version used: 4.2.6.1
b) Exact command used:
./gatk MarkDuplicatesSpark -I ~/test/output/pass2/Aligned.out.bam -O ~/test/output/pass2/markDup.bam -conf 'spark.local.dir=/mnt/Elements/tmp'
The preceding 1740 tasks executed successfully. The errors start from here.
c) Entire program log:
22/07/06 13:57:17 INFO DAGScheduler: Submitting 1740 missing tasks from ShuffleMapStage 4 (MapPartitionsRDD[27] at flatMapToPair at MarkDuplicatesSparkUtils.java:128) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
22/07/06 13:57:17 INFO TaskSchedulerImpl: Adding task set 4.0 with 1740 tasks
22/07/06 13:57:17 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 2828, localhost, executor driver, partition 0, PROCESS_LOCAL, 8410 bytes)
22/07/06 13:57:17 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 2829, localhost, executor driver, partition 1, PROCESS_LOCAL, 8410 bytes)
22/07/06 13:57:17 INFO TaskSetManager: Starting task 2.0 in stage 4.0 (TID 2830, localhost, executor driver, partition 2, PROCESS_LOCAL, 9092 bytes)
22/07/06 13:57:17 INFO TaskSetManager: Starting task 3.0 in stage 4.0 (TID 2831, localhost, executor driver, partition 3, PROCESS_LOCAL, 8422 bytes)
22/07/06 13:57:17 INFO TaskSetManager: Starting task 4.0 in stage 4.0 (TID 2832, localhost, executor driver, partition 4, PROCESS_LOCAL, 8410 bytes)
22/07/06 13:57:17 INFO TaskSetManager: Starting task 5.0 in stage 4.0 (TID 2833, localhost, executor driver, partition 5, PROCESS_LOCAL, 8738 bytes)
22/07/06 13:57:17 INFO TaskSetManager: Starting task 6.0 in stage 4.0 (TID 2834, localhost, executor driver, partition 6, PROCESS_LOCAL, 10427 bytes)
22/07/06 13:57:17 INFO TaskSetManager: Starting task 7.0 in stage 4.0 (TID 2835, localhost, executor driver, partition 7, PROCESS_LOCAL, 8754 bytes)
22/07/06 13:57:17 INFO TaskSetManager: Starting task 8.0 in stage 4.0 (TID 2836, localhost, executor driver, partition 8, PROCESS_LOCAL, 8417 bytes)
22/07/06 13:57:17 INFO TaskSetManager: Starting task 9.0 in stage 4.0 (TID 2837, localhost, executor driver, partition 9, PROCESS_LOCAL, 9395 bytes)
22/07/06 13:57:17 INFO TaskSetManager: Starting task 10.0 in stage 4.0 (TID 2838, localhost, executor driver, partition 10, PROCESS_LOCAL, 9088 bytes)
22/07/06 13:57:17 INFO TaskSetManager: Starting task 11.0 in stage 4.0 (TID 2839, localhost, executor driver, partition 11, PROCESS_LOCAL, 8413 bytes)
22/07/06 13:57:17 INFO Executor: Running task 0.0 in stage 4.0 (TID 2828)
22/07/06 13:57:17 INFO Executor: Running task 1.0 in stage 4.0 (TID 2829)
22/07/06 13:57:17 INFO Executor: Running task 9.0 in stage 4.0 (TID 2837)
22/07/06 13:57:17 INFO Executor: Running task 10.0 in stage 4.0 (TID 2838)
22/07/06 13:57:17 INFO Executor: Running task 3.0 in stage 4.0 (TID 2831)
22/07/06 13:57:17 INFO Executor: Running task 8.0 in stage 4.0 (TID 2836)
22/07/06 13:57:17 INFO Executor: Running task 2.0 in stage 4.0 (TID 2830)
22/07/06 13:57:17 INFO Executor: Running task 5.0 in stage 4.0 (TID 2833)
22/07/06 13:57:17 INFO Executor: Running task 6.0 in stage 4.0 (TID 2834)
22/07/06 13:57:17 INFO Executor: Running task 4.0 in stage 4.0 (TID 2832)
22/07/06 13:57:17 INFO Executor: Running task 7.0 in stage 4.0 (TID 2835)
22/07/06 13:57:17 INFO Executor: Running task 11.0 in stage 4.0 (TID 2839)
22/07/06 13:57:17 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks including 4 local blocks and 0 remote blocks
22/07/06 13:57:17 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
22/07/06 13:57:17 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks including 4 local blocks and 0 remote blocks
22/07/06 13:57:17 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
22/07/06 13:57:17 INFO ShuffleBlockFetcherIterator: Getting 5 non-empty blocks including 5 local blocks and 0 remote blocks
22/07/06 13:57:17 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
22/07/06 13:57:17 INFO ShuffleBlockFetcherIterator: Getting 6 non-empty blocks including 6 local blocks and 0 remote blocks
22/07/06 13:57:17 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
22/07/06 13:57:17 INFO ShuffleBlockFetcherIterator: Getting 7 non-empty blocks including 7 local blocks and 0 remote blocks
22/07/06 13:57:17 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
22/07/06 13:57:17 INFO ShuffleBlockFetcherIterator: Getting 3 non-empty blocks including 3 local blocks and 0 remote blocks
22/07/06 13:57:17 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
22/07/06 13:57:17 INFO ShuffleBlockFetcherIterator: Getting 5 non-empty blocks including 5 local blocks and 0 remote blocks
22/07/06 13:57:17 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
22/07/06 13:57:17 INFO ShuffleBlockFetcherIterator: Getting 7 non-empty blocks including 7 local blocks and 0 remote blocks
22/07/06 13:57:17 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
22/07/06 13:57:17 INFO ShuffleBlockFetcherIterator: Getting 7 non-empty blocks including 7 local blocks and 0 remote blocks
22/07/06 13:57:17 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
22/07/06 13:57:17 INFO ShuffleBlockFetcherIterator: Getting 5 non-empty blocks including 5 local blocks and 0 remote blocks
22/07/06 13:57:17 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
22/07/06 13:57:17 INFO ShuffleBlockFetcherIterator: Getting 5 non-empty blocks including 5 local blocks and 0 remote blocks
22/07/06 13:57:17 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
22/07/06 13:57:17 INFO ShuffleBlockFetcherIterator: Getting 5 non-empty blocks including 5 local blocks and 0 remote blocks
22/07/06 13:57:17 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
22/07/06 13:57:19 ERROR Executor: Exception in task 11.0 in stage 4.0 (TID 2839)
org.broadinstitute.hellbender.exceptions.UserException$ReadMissingReadGroup: SAM/BAM/CRAM file (unknown) is malformed: Read A00609:414:H33TFDSX3:1:1107:14579:28745 is missing the read group (RG) tag, which is required by the GATK. Please use http://gatkforums.broadinstitute.org/discussion/59/companion-utilities-replacereadgroups to fix this problem
at org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSparkUtils.getLibraryForRead(MarkDuplicatesSparkUtils.java:62)
at org.broadinstitute.hellbender.utils.read.markduplicates.sparkrecords.EmptyFragment.<init>(EmptyFragment.java:36)
at org.broadinstitute.hellbender.utils.read.markduplicates.sparkrecords.MarkDuplicatesSparkRecord.newEmptyFragment(MarkDuplicatesSparkRecord.java:37)
at org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSparkUtils.lambda$null$0(MarkDuplicatesSparkUtils.java:139)
at java.util.stream.ReferencePipeline$11$1.accept(ReferencePipeline.java:439)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
at org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSparkUtils.lambda$transformToDuplicateNames$8ccfa6f1$1(MarkDuplicatesSparkUtils.java:148)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$3$1.apply(JavaRDDLike.scala:143)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$3$1.apply(JavaRDDLike.scala:143)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
22/07/06 13:57:19 ERROR Executor: Exception in task 3.0 in stage 4.0 (TID 2831)
org.broadinstitute.hellbender.exceptions.UserException$ReadMissingReadGroup: SAM/BAM/CRAM file (unknown) is malformed: Read A00609:414:H33TFDSX3:1:1102:32334:16344 is missing the read group (RG) tag, which is required by the GATK. Please use http://gatkforums.broadinstitute.org/discussion/59/companion-utilities-replacereadgroups to fix this problem
at org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSparkUtils.getLibraryForRead(MarkDuplicatesSparkUtils.java:62)
at org.broadinstitute.hellbender.utils.read.markduplicates.sparkrecords.EmptyFragment.<init>(EmptyFragment.java:36)
at org.broadinstitute.hellbender.utils.read.markduplicates.sparkrecords.MarkDuplicatesSparkRecord.newEmptyFragment(MarkDuplicatesSparkRecord.java:37)
at org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSparkUtils.lambda$null$0(MarkDuplicatesSparkUtils.java:139)
at java.util.stream.ReferencePipeline$11$1.accept(ReferencePipeline.java:439)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
at org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSparkUtils.lambda$transformToDuplicateNames$8ccfa6f1$1(MarkDuplicatesSparkUtils.java:148)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$3$1.apply(JavaRDDLike.scala:143)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$3$1.apply(JavaRDDLike.scala:143)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
22/07/06 13:57:19 INFO TaskSetManager: Starting task 12.0 in stage 4.0 (TID 2840, localhost, executor driver, partition 12, PROCESS_LOCAL, 8744 bytes)
22/07/06 13:57:19 INFO Executor: Running task 12.0 in stage 4.0 (TID 2840)
22/07/06 13:57:19 INFO TaskSetManager: Starting task 13.0 in stage 4.0 (TID 2841, localhost, executor driver, partition 13, PROCESS_LOCAL, 8744 bytes)
22/07/06 13:57:19 INFO Executor: Running task 13.0 in stage 4.0 (TID 2841)
22/07/06 13:57:19 WARN TaskSetManager: Lost task 11.0 in stage 4.0 (TID 2839, localhost, executor driver): org.broadinstitute.hellbender.exceptions.UserException$ReadMissingReadGroup: SAM/BAM/CRAM file (unknown) is malformed: Read A00609:414:H33TFDSX3:1:1107:14579:28745 is missing the read group (RG) tag, which is required by the GATK. Please use http://gatkforums.broadinstitute.org/discussion/59/companion-utilities-replacereadgroups to fix this problem
at org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSparkUtils.getLibraryForRead(MarkDuplicatesSparkUtils.java:62)
at org.broadinstitute.hellbender.utils.read.markduplicates.sparkrecords.EmptyFragment.<init>(EmptyFragment.java:36)
at org.broadinstitute.hellbender.utils.read.markduplicates.sparkrecords.MarkDuplicatesSparkRecord.newEmptyFragment(MarkDuplicatesSparkRecord.java:37)
at org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSparkUtils.lambda$null$0(MarkDuplicatesSparkUtils.java:139)
at java.util.stream.ReferencePipeline$11$1.accept(ReferencePipeline.java:439)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
at org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSparkUtils.lambda$transformToDuplicateNames$8ccfa6f1$1(MarkDuplicatesSparkUtils.java:148)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$3$1.apply(JavaRDDLike.scala:143)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$3$1.apply(JavaRDDLike.scala:143)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
22/07/06 13:57:19 ERROR TaskSetManager: Task 11 in stage 4.0 failed 1 times; aborting job
22/07/06 13:57:19 WARN TaskSetManager: Lost task 3.0 in stage 4.0 (TID 2831, localhost, executor driver): org.broadinstitute.hellbender.exceptions.UserException$ReadMissingReadGroup: SAM/BAM/CRAM file (unknown) is malformed: Read A00609:414:H33TFDSX3:1:1102:32334:16344 is missing the read group (RG) tag, which is required by the GATK. Please use http://gatkforums.broadinstitute.org/discussion/59/companion-utilities-replacereadgroups to fix this problem
at org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSparkUtils.getLibraryForRead(MarkDuplicatesSparkUtils.java:62)
at org.broadinstitute.hellbender.utils.read.markduplicates.sparkrecords.EmptyFragment.<init>(EmptyFragment.java:36)
at org.broadinstitute.hellbender.utils.read.markduplicates.sparkrecords.MarkDuplicatesSparkRecord.newEmptyFragment(MarkDuplicatesSparkRecord.java:37)
at org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSparkUtils.lambda$null$0(MarkDuplicatesSparkUtils.java:139)
at java.util.stream.ReferencePipeline$11$1.accept(ReferencePipeline.java:439)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
at org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSparkUtils.lambda$transformToDuplicateNames$8ccfa6f1$1(MarkDuplicatesSparkUtils.java:148)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$3$1.apply(JavaRDDLike.scala:143)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$3$1.apply(JavaRDDLike.scala:143)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
22/07/06 13:57:19 INFO TaskSchedulerImpl: Cancelling stage 4
22/07/06 13:57:19 INFO TaskSchedulerImpl: Killing all running tasks in stage 4: Stage cancelled
22/07/06 13:57:19 INFO Executor: Executor is trying to kill task 7.0 in stage 4.0 (TID 2835), reason: Stage cancelled
22/07/06 13:57:19 INFO Executor: Executor is trying to kill task 4.0 in stage 4.0 (TID 2832), reason: Stage cancelled
22/07/06 13:57:19 INFO Executor: Executor is trying to kill task 1.0 in stage 4.0 (TID 2829), reason: Stage cancelled
22/07/06 13:57:19 INFO Executor: Executor is trying to kill task 8.0 in stage 4.0 (TID 2836), reason: Stage cancelled
22/07/06 13:57:19 INFO Executor: Executor is trying to kill task 5.0 in stage 4.0 (TID 2833), reason: Stage cancelled
22/07/06 13:57:19 INFO Executor: Executor is trying to kill task 12.0 in stage 4.0 (TID 2840), reason: Stage cancelled
22/07/06 13:57:19 INFO Executor: Executor is trying to kill task 9.0 in stage 4.0 (TID 2837), reason: Stage cancelled
22/07/06 13:57:19 INFO Executor: Executor is trying to kill task 6.0 in stage 4.0 (TID 2834), reason: Stage cancelled
22/07/06 13:57:19 INFO Executor: Executor is trying to kill task 13.0 in stage 4.0 (TID 2841), reason: Stage cancelled
22/07/06 13:57:19 INFO Executor: Executor is trying to kill task 10.0 in stage 4.0 (TID 2838), reason: Stage cancelled
22/07/06 13:57:19 INFO Executor: Executor is trying to kill task 2.0 in stage 4.0 (TID 2830), reason: Stage cancelled
22/07/06 13:57:19 INFO Executor: Executor is trying to kill task 0.0 in stage 4.0 (TID 2828), reason: Stage cancelled
22/07/06 13:57:19 INFO TaskSchedulerImpl: Stage 4 was cancelled
22/07/06 13:57:19 INFO DAGScheduler: ShuffleMapStage 4 (flatMapToPair at MarkDuplicatesSparkUtils.java:128) failed in 1.290 s due to Job aborted due to stage failure: Task 11 in stage 4.0 failed 1 times, most recent failure: Lost task 11.0 in stage 4.0 (TID 2839, localhost, executor driver): org.broadinstitute.hellbender.exceptions.UserException$ReadMissingReadGroup: SAM/BAM/CRAM file (unknown) is malformed: Read A00609:414:H33TFDSX3:1:1107:14579:28745 is missing the read group (RG) tag, which is required by the GATK. Please use http://gatkforums.broadinstitute.org/discussion/59/companion-utilities-replacereadgroups to fix this problem
at org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSparkUtils.getLibraryForRead(MarkDuplicatesSparkUtils.java:62)
at org.broadinstitute.hellbender.utils.read.markduplicates.sparkrecords.EmptyFragment.<init>(EmptyFragment.java:36)
at org.broadinstitute.hellbender.utils.read.markduplicates.sparkrecords.MarkDuplicatesSparkRecord.newEmptyFragment(MarkDuplicatesSparkRecord.java:37)
at org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSparkUtils.lambda$null$0(MarkDuplicatesSparkUtils.java:139)
at java.util.stream.ReferencePipeline$11$1.accept(ReferencePipeline.java:439)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
at org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSparkUtils.lambda$transformToDuplicateNames$8ccfa6f1$1(MarkDuplicatesSparkUtils.java:148)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$3$1.apply(JavaRDDLike.scala:143)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$3$1.apply(JavaRDDLike.scala:143)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
22/07/06 13:57:19 INFO DAGScheduler: Job 2 failed: sortByKey at SparkUtils.java:165, took 1.325682 s
22/07/06 13:57:19 INFO ShuffleBlockFetcherIterator: Getting 8 non-empty blocks including 8 local blocks and 0 remote blocks
22/07/06 13:57:19 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 4 ms
22/07/06 13:57:19 INFO ShuffleBlockFetcherIterator: Getting 7 non-empty blocks including 7 local blocks and 0 remote blocks
22/07/06 13:57:19 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
22/07/06 13:57:19 INFO Executor: Executor killed task 9.0 in stage 4.0 (TID 2837), reason: Stage cancelled
22/07/06 13:57:19 INFO Executor: Executor killed task 7.0 in stage 4.0 (TID 2835), reason: Stage cancelled
22/07/06 13:57:19 WARN TaskSetManager: Lost task 9.0 in stage 4.0 (TID 2837, localhost, executor driver): TaskKilled (Stage cancelled)
22/07/06 13:57:19 WARN TaskSetManager: Lost task 7.0 in stage 4.0 (TID 2835, localhost, executor driver): TaskKilled (Stage cancelled)
22/07/06 13:57:19 INFO Executor: Executor killed task 12.0 in stage 4.0 (TID 2840), reason: Stage cancelled
22/07/06 13:57:19 INFO Executor: Executor killed task 13.0 in stage 4.0 (TID 2841), reason: Stage cancelled
22/07/06 13:57:19 INFO SparkUI: Stopped Spark web UI at http://10.96.128.66:4040
22/07/06 13:57:19 WARN TaskSetManager: Lost task 12.0 in stage 4.0 (TID 2840, localhost, executor driver): TaskKilled (Stage cancelled)
22/07/06 13:57:19 WARN TaskSetManager: Lost task 13.0 in stage 4.0 (TID 2841, localhost, executor driver): TaskKilled (Stage cancelled)
22/07/06 13:57:19 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/07/06 13:57:19 INFO Executor: Executor killed task 4.0 in stage 4.0 (TID 2832), reason: Stage cancelled
22/07/06 13:57:19 INFO Executor: Executor killed task 8.0 in stage 4.0 (TID 2836), reason: Stage cancelled
22/07/06 13:57:19 INFO Executor: Executor killed task 1.0 in stage 4.0 (TID 2829), reason: Stage cancelled
22/07/06 13:57:19 INFO Executor: Executor killed task 10.0 in stage 4.0 (TID 2838), reason: Stage cancelled
22/07/06 13:57:19 INFO Executor: Executor killed task 6.0 in stage 4.0 (TID 2834), reason: Stage cancelled
22/07/06 13:57:19 INFO Executor: Executor killed task 2.0 in stage 4.0 (TID 2830), reason: Stage cancelled
22/07/06 13:57:19 INFO Executor: Executor killed task 0.0 in stage 4.0 (TID 2828), reason: Stage cancelled
22/07/06 13:57:19 INFO Executor: Executor killed task 5.0 in stage 4.0 (TID 2833), reason: Stage cancelled
22/07/06 13:57:20 INFO MemoryStore: MemoryStore cleared
22/07/06 13:57:20 INFO BlockManager: BlockManager stopped
22/07/06 13:57:20 INFO BlockManagerMaster: BlockManagerMaster stopped
22/07/06 13:57:20 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/07/06 13:57:20 INFO SparkContext: Successfully stopped SparkContext
13:57:20.693 INFO MarkDuplicatesSpark - Shutting down engine
[6 July, 2022 1:57:20 PM SGT] org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSpark done. Elapsed time: 7.00 minutes.
Runtime.totalMemory()=6234832896
org.apache.spark.SparkException: Job aborted due to stage failure: Task 11 in stage 4.0 failed 1 times, most recent failure: Lost task 11.0 in stage 4.0 (TID 2839, localhost, executor driver): org.broadinstitute.hellbender.exceptions.UserException$ReadMissingReadGroup: SAM/BAM/CRAM file (unknown) is malformed: Read A00609:414:H33TFDSX3:1:1107:14579:28745 is missing the read group (RG) tag, which is required by the GATK. Please use http://gatkforums.broadinstitute.org/discussion/59/companion-utilities-replacereadgroups to fix this problem
at org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSparkUtils.getLibraryForRead(MarkDuplicatesSparkUtils.java:62)
at org.broadinstitute.hellbender.utils.read.markduplicates.sparkrecords.EmptyFragment.<init>(EmptyFragment.java:36)
at org.broadinstitute.hellbender.utils.read.markduplicates.sparkrecords.MarkDuplicatesSparkRecord.newEmptyFragment(MarkDuplicatesSparkRecord.java:37)
at org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSparkUtils.lambda$null$0(MarkDuplicatesSparkUtils.java:139)
at java.util.stream.ReferencePipeline$11$1.accept(ReferencePipeline.java:439)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
at org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSparkUtils.lambda$transformToDuplicateNames$8ccfa6f1$1(MarkDuplicatesSparkUtils.java:148)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$3$1.apply(JavaRDDLike.scala:143)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$3$1.apply(JavaRDDLike.scala:143)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1891)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1879)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1878)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:927)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2112)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:990)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
at org.apache.spark.rdd.RDD.collect(RDD.scala:989)
at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:309)
at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:171)
at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:151)
at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:62)
at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:61)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
at org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:61)
at org.apache.spark.api.java.JavaPairRDD.sortByKey(JavaPairRDD.scala:936)
at org.broadinstitute.hellbender.utils.spark.SparkUtils.sortUsingElementsAsKeys(SparkUtils.java:165)
at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSink.sortSamRecordsToMatchHeader(ReadsSparkSink.java:207)
at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSink.writeReads(ReadsSparkSink.java:107)
at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.writeReads(GATKSparkTool.java:374)
at org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSpark.runTool(MarkDuplicatesSpark.java:367)
at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:546)
at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:31)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
Caused by: org.broadinstitute.hellbender.exceptions.UserException$ReadMissingReadGroup: SAM/BAM/CRAM file (unknown) is malformed: Read A00609:414:H33TFDSX3:1:1107:14579:28745 is missing the read group (RG) tag, which is required by the GATK. Please use http://gatkforums.broadinstitute.org/discussion/59/companion-utilities-replacereadgroups to fix this problem
at org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSparkUtils.getLibraryForRead(MarkDuplicatesSparkUtils.java:62)
at org.broadinstitute.hellbender.utils.read.markduplicates.sparkrecords.EmptyFragment.<init>(EmptyFragment.java:36)
at org.broadinstitute.hellbender.utils.read.markduplicates.sparkrecords.MarkDuplicatesSparkRecord.newEmptyFragment(MarkDuplicatesSparkRecord.java:37)
at org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSparkUtils.lambda$null$0(MarkDuplicatesSparkUtils.java:139)
at java.util.stream.ReferencePipeline$11$1.accept(ReferencePipeline.java:439)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
at org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSparkUtils.lambda$transformToDuplicateNames$8ccfa6f1$1(MarkDuplicatesSparkUtils.java:148)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$3$1.apply(JavaRDDLike.scala:143)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$3$1.apply(JavaRDDLike.scala:143)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
22/07/06 13:57:20 INFO ShutdownHookManager: Shutdown hook called
-
Hi samuel,
Welcome to using GATK! I hope that we can help get your pipeline sorted out and functioning.
To start off with, your input command appears to be using `-conf` instead of the `--conf` option. I didn't see an error related to that in your log file, so I am not sure that is the issue, but just in case I'd make sure your commands follow the tool doc.
Unfortunately, we do not typically run multi-sample BAMs with MarkDuplicatesSpark. This doesn't necessarily mean that the tool will never work with multi-sample BAMs, just that it would really depend on the way your input is formatted. You could try making sure that each read in your input file has the correct read group and library information for that sample and try again. If MarkDuplicatesSpark doesn't work, try using Picard's MarkDuplicates, which works differently than the Spark version and may end up giving you the result you want. No promises though, since MarkDuplicates (like the Spark version) doesn't explicitly support multi-sample BAMs.
That said, the fastest way to get your duplicates marked would definitely be to split up your BAM into single samples and run MarkDuplicatesSpark on them that way. Since the tool is expecting single-sample inputs, feeding in the right kind of files should resolve the issue. Try re-generating your BAMs if you have the raw data, or (if you only have the BAM) use samtools or another comparable tool to split them up based on sample/RG.
I hope that helps!
-
Thanks for the clarification! I will use single-sample BAMs from now on.
I didn't get an out-of-space error when I ran the program, so the missing hyphen in `--conf` is probably a typo from when I was typing the post.
Yes, I went ahead and used MarkDuplicates; I could get it to work after I included SM in the RG tags. I will re-run it with single-sample BAMs and compare the results.
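In case it helps anyone else, adding SM and then marking duplicates looked roughly like this (a sketch; the read-group values, library, and paths are placeholders):
./gatk AddOrReplaceReadGroups -I sample1.bam -O sample1.rg.bam --RGID sample1 --RGSM sample1 --RGLB lib1 --RGPL ILLUMINA --RGPU unit1
./gatk MarkDuplicates -I sample1.rg.bam -O sample1.markDup.bam -M sample1.markDup.metrics.txt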
Thanks a lot!
-
Great! I'm glad it worked out. Don't hesitate to come back with more questions if you have GATK-related trouble further on down the line.