Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK MarkDuplicatesSpark output error

Answered

3 comments

  • Genevieve Brandt (she/her)

    Hi Bilge Tabak,

    I don't see any sort of error in this program log. Could you run the program to completion and then post the error?

    Best,

    Genevieve

  • Bilge Tabak

    Hello Genevieve Brandt (she/her),

    As you recommended, I ran the program one more time, and I saw this at the very end of the output:

    22/03/13 19:17:12 INFO Executor: Finished task 3696.0 in stage 2.0 (TID 6008). 1783 bytes result sent to driver

    22/03/13 19:17:12 INFO TaskSetManager: Finished task 3696.0 in stage 2.0 (TID 6008) in 772 ms on localhost (executor driver) (3698/3698)

    22/03/13 19:17:12 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 

    22/03/13 19:17:12 INFO DAGScheduler: ResultStage 2 (collect at SparkUtils.java:205) finished in 419.557 s

    22/03/13 19:17:12 INFO DAGScheduler: Job 1 finished: collect at SparkUtils.java:205, took 778.398099 s

    22/03/13 19:17:12 INFO SparkUI: Stopped Spark web UI at http://bilgesmbpsecond.home:4040

    22/03/13 19:17:12 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!

    22/03/13 19:17:16 INFO MemoryStore: MemoryStore cleared

    22/03/13 19:17:16 INFO BlockManager: BlockManager stopped

    22/03/13 19:17:16 INFO BlockManagerMaster: BlockManagerMaster stopped

    22/03/13 19:17:16 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!

    22/03/13 19:17:16 INFO SparkContext: Successfully stopped SparkContext

    19:17:16.717 INFO  MarkDuplicatesSpark - Shutting down engine

    [March 13, 2022 7:17:16 PM EET] org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSpark done. Elapsed time: 14.84 minutes.

    Runtime.totalMemory()=4019191808

    ***********************************************************************




    A USER ERROR has occurred: Bad input: Sam file header missing Read Group fields. MarkDuplicatesSpark currently requires reads to be labeled with read group tags, please add read groups tags to your reads




    ***********************************************************************

    Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.

    22/03/13 19:17:16 INFO ShutdownHookManager: Shutdown hook called

    22/03/13 19:17:16 INFO ShutdownHookManager: Deleting directory /private/var/folders/xp/7ryg036s6gl1r209ppsh06cr0000gn/T/spark-4b0c83fa-6bcf-4b12-9faf-a0ababb0b788

    I guess it couldn't find the reads in the input file. I don't understand why it couldn't find them, since the run seemed to proceed normally.

    GATK normally recommends MarkDuplicatesSpark, with MarkDuplicates followed by SortSam as the alternative. We tried MarkDuplicatesSpark on our very first file, the alignment of our data against the reference genome (~140 GB), but we couldn't get any output. Then we tried sorting first (the sorted file, which is the one I mentioned, is ~40 GB) and ran MarkDuplicatesSpark on it, but that didn't work either.

    I ran the program on the alignment file one more time before sending this comment, and it gives the same error again:

    22/03/13 20:21:47 INFO Executor: Finished task 13323.0 in stage 2.0 (TID 21651). 1790 bytes result sent to driver

    22/03/13 20:21:47 INFO TaskSetManager: Finished task 13323.0 in stage 2.0 (TID 21651) in 136 ms on localhost (executor driver) (13324/13325)

    22/03/13 20:21:47 INFO Executor: Finished task 13324.0 in stage 2.0 (TID 21652). 1444 bytes result sent to driver

    22/03/13 20:21:47 INFO TaskSetManager: Finished task 13324.0 in stage 2.0 (TID 21652) in 139 ms on localhost (executor driver) (13325/13325)

    22/03/13 20:21:47 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 

    22/03/13 20:21:47 INFO DAGScheduler: ResultStage 2 (collect at SparkUtils.java:205) finished in 306.504 s

    22/03/13 20:21:47 INFO DAGScheduler: Job 1 finished: collect at SparkUtils.java:205, took 1032.350825 s

    22/03/13 20:21:47 INFO SparkUI: Stopped Spark web UI at http://bilgesmbpsecond.home:4040

    22/03/13 20:21:47 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!

    22/03/13 20:21:53 INFO MemoryStore: MemoryStore cleared

    22/03/13 20:21:53 INFO BlockManager: BlockManager stopped

    22/03/13 20:21:53 INFO BlockManagerMaster: BlockManagerMaster stopped

    22/03/13 20:21:53 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!

    22/03/13 20:21:53 INFO SparkContext: Successfully stopped SparkContext

    20:21:53.752 INFO  MarkDuplicatesSpark - Shutting down engine

    [March 13, 2022 8:21:53 PM EET] org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSpark done. Elapsed time: 22.94 minutes.

    Runtime.totalMemory()=3030908928

    ***********************************************************************




    A USER ERROR has occurred: Bad input: Sam file header missing Read Group fields. MarkDuplicatesSpark currently requires reads to be labeled with read group tags, please add read groups tags to your reads




    ***********************************************************************

    Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.

    22/03/13 20:21:53 INFO ShutdownHookManager: Shutdown hook called

    22/03/13 20:21:53 INFO ShutdownHookManager: Deleting directory /private/var/folders/xp/7ryg036s6gl1r209ppsh06cr0000gn/T/spark-6670b584-02a5-43fb-a2c7-76f4a6015e5e

    I am stuck at this point unfortunately. I am waiting for your reply.

  • Genevieve Brandt (she/her)

    Hi Bilge Tabak,

    Thank you for posting this! I can see the issue here, and you should easily be able to fix it. The error message does not say that MarkDuplicatesSpark cannot find the reads; it says that MarkDuplicatesSpark cannot find read groups:

    A USER ERROR has occurred: Bad input: Sam file header missing Read Group fields. MarkDuplicatesSpark currently requires reads to be labeled with read group tags, please add read groups tags to your reads

    Read groups are mandatory for using GATK. Please see these documents for more information about read groups and how to add them:

    1. https://gatk.broadinstitute.org/hc/en-us/articles/360035890671-Read-groups
    2. https://gatk.broadinstitute.org/hc/en-us/articles/360035532352-Errors-about-read-group-RG-information

    Please let me know if you have any other questions.

    Best,

    Genevieve
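
For readers hitting the same error, here is a minimal sketch (not part of the original answer) of how one might check a SAM/BAM header for read groups before starting a long run. It assumes the header text has already been dumped, for example with `samtools view -H`, and that the tags worth checking are ID, SM, LB, PL and PU; that tag list, the function names, and the placeholder values are all assumptions for illustration. The second function only assembles an `AddOrReplaceReadGroups` command line using the tool's documented long argument names; it does not run anything.

```python
from typing import Dict, List

# Assumption: the read group tags checked here; adjust to your pipeline's needs.
EXPECTED_TAGS = ("ID", "SM", "LB", "PL", "PU")


def read_groups(header_text: str) -> List[Dict[str, str]]:
    """Return one dict of tag -> value for each @RG line in a SAM header."""
    groups = []
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs after the record type.
        fields = line.split("\t")[1:]
        groups.append(dict(f.split(":", 1) for f in fields if ":" in f))
    return groups


def missing_tags(group: Dict[str, str]) -> List[str]:
    """List the expected tags that a read group does not define."""
    return [tag for tag in EXPECTED_TAGS if tag not in group]


def add_read_groups_command(in_bam: str, out_bam: str, sample: str) -> List[str]:
    """Build (but do not run) an AddOrReplaceReadGroups command.

    The ID, library, platform and platform-unit values are placeholders.
    """
    return [
        "gatk", "AddOrReplaceReadGroups",
        "-I", in_bam,
        "-O", out_bam,
        "--RGID", "rg1",
        "--RGLB", "lib1",
        "--RGPL", "ILLUMINA",
        "--RGPU", "unit1",
        "--RGSM", sample,
    ]


if __name__ == "__main__":
    header = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:248956422\n"
    if not read_groups(header):
        print("No @RG lines found; a fix could look like:")
        print(" ".join(add_read_groups_command("input.bam", "output.bam", "sample1")))
```

An empty result from `read_groups` corresponds to the situation reported in this thread: the header has no @RG lines, so MarkDuplicatesSpark rejects the input. Checking this up front takes seconds, whereas the runs above only reported the problem after 15 to 23 minutes.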
