Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data


Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

(How to) Run the Pathseq pipeline

9 comments

  • zhan li
    Download tutorial_10913.tar.gz from the ftp site. Extract the archive with the command:

    Sorry, the FTP link is unavailable. How can I get the data?

  • Joe Li

    The FTP link for the tutorial files is no longer available. Can you help?

  • sy zhang

    Excuse me, when I run PathSeqPipelineSpark using the pre-built microbe reference files from the GATK Resource Bundle to detect microbes in a mouse scRNA-seq BAM file, I get the following errors:

    ......

    23/02/20 02:26:31 INFO TaskSetManager: Finished task 534.0 in stage 4.0 (TID 2702) in 7371 ms on localhost (executor driver) (542/542)
    23/02/20 02:26:31 INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool 
    23/02/20 02:26:31 INFO DAGScheduler: ResultStage 4 (count at PSFilterFileLogger.java:47) finished in 80.017 s
    23/02/20 02:26:31 INFO DAGScheduler: Job 3 finished: count at PSFilterFileLogger.java:47, took 1635.386275 s
    23/02/20 02:26:32 INFO SparkUI: Stopped Spark web UI at http://10.10.10.152:4040
    23/02/20 02:26:33 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
    23/02/20 02:26:33 INFO MemoryStore: MemoryStore cleared
    23/02/20 02:26:33 INFO BlockManager: BlockManager stopped
    23/02/20 02:26:33 INFO BlockManagerMaster: BlockManagerMaster stopped
    23/02/20 02:26:33 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
    23/02/20 02:26:34 INFO SparkContext: Successfully stopped SparkContext
    02:26:34.177 INFO  PathSeqPipelineSpark - Shutting down engine
    [February 20, 2023 at 2:26:34 AM CST] org.broadinstitute.hellbender.tools.spark.pathseq.PathSeqPipelineSpark done. Elapsed time: 44.24 minutes.
    Runtime.totalMemory()=211493584896
    java.lang.IllegalArgumentException: Unsupported class file major version 55
        at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:166)
        at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:148)
        at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:136)
        at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:237)
        at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:49)
        at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:517)
        at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:500)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
        at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
        at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
        at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
        at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
        at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:134)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
        at org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:500)
        at org.apache.xbean.asm6.ClassReader.readCode(ClassReader.java:2175)
        at org.apache.xbean.asm6.ClassReader.readMethod(ClassReader.java:1238)
        at org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:631)
        at org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:355)
        at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:307)
        at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:306)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:306)
        at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
        at org.apache.spark.SparkContext.clean(SparkContext.scala:2326)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKeyWithClassTag$1.apply(PairRDDFunctions.scala:88)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKeyWithClassTag$1.apply(PairRDDFunctions.scala:77)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
        at org.apache.spark.rdd.PairRDDFunctions.combineByKeyWithClassTag(PairRDDFunctions.scala:77)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$groupByKey$1.apply(PairRDDFunctions.scala:505)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$groupByKey$1.apply(PairRDDFunctions.scala:498)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
        at org.apache.spark.rdd.PairRDDFunctions.groupByKey(PairRDDFunctions.scala:498)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$groupByKey$3.apply(PairRDDFunctions.scala:641)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$groupByKey$3.apply(PairRDDFunctions.scala:641)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
        at org.apache.spark.rdd.PairRDDFunctions.groupByKey(PairRDDFunctions.scala:640)
        at org.apache.spark.api.java.JavaPairRDD.groupByKey(JavaPairRDD.scala:559)
        at org.broadinstitute.hellbender.tools.spark.pathseq.PSFilter.filterDuplicateSequences(PSFilter.java:166)
        at org.broadinstitute.hellbender.tools.spark.pathseq.PSFilter.doFilter(PSFilter.java:289)
        at org.broadinstitute.hellbender.tools.spark.pathseq.PathSeqPipelineSpark.runTool(PathSeqPipelineSpark.java:238)
        at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:546)
        at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:31)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
        at org.broadinstitute.hellbender.Main.main(Main.java:289)
        Suppressed: java.lang.IllegalStateException: Cannot compute metrics if primary, pre-aligned host, quality, host, duplicate, or final paired read counts are not initialized
            at org.broadinstitute.hellbender.tools.spark.pathseq.loggers.PSFilterMetrics.computeDerivedMetrics(PSFilterMetrics.java:72)
            at org.broadinstitute.hellbender.tools.spark.pathseq.loggers.PSFilterFileLogger.close(PSFilterFileLogger.java:64)
            at org.broadinstitute.hellbender.tools.spark.pathseq.PathSeqPipelineSpark.runTool(PathSeqPipelineSpark.java:239)
            ... 8 more
    23/02/20 02:26:34 INFO ShutdownHookManager: Shutdown hook called
    23/02/20 02:26:34 INFO ShutdownHookManager: Deleting directory /tmp/spark-84326ac0-95d8-4bbb-8cf8-b7df3b1d7a3f

    ------------------------------------------------------------------------------------------------------------------------------------

    Can you help me? Please!

  • sy zhang

    Can FASTA files of different bacteria be entered into BwaMemIndexImageCreator or PathSeqBuildReferenceTaxonomy together, or only one by one?

  • Xingyu Liao

    Hello, the FTP link below seems no longer available.

    Download tutorial_10913.tar.gz from the ftp site

    Could you kindly point me in the right direction? Is there an alternative link that I could use? Or do you have any other suggestions on how I might be able to obtain this data?
  • Shi

    Hi, GATK team,

    I recently used the GATK PathSeq pipeline to detect viruses in RNA-seq data. It seems to have succeeded; the last few lines of the log file are shown below:

    ..........
    20/03/04 10:03:30 INFO TaskSchedulerImpl: Removed TaskSet 43.0, whose tasks have all completed, from pool
    20/03/04 10:03:30 INFO DAGScheduler: ResultStage 43 (foreach at BwaMemIndexCache.java:84) finished in 2.808 s
    20/03/04 10:03:30 INFO DAGScheduler: Job 9 finished: foreach at BwaMemIndexCache.java:84, took 2.810214 s
    20/03/04 10:03:30 INFO SparkUI: Stopped Spark web UI at http://bioRUN:4040
    20/03/04 10:03:30 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
    20/03/04 10:03:32 INFO MemoryStore: MemoryStore cleared
    20/03/04 10:03:32 INFO BlockManager: BlockManager stopped
    20/03/04 10:03:32 INFO BlockManagerMaster: BlockManagerMaster stopped
    20/03/04 10:03:32 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
    20/03/04 10:03:32 INFO SparkContext: Successfully stopped SparkContext
    10:03:32.092 INFO PathSeqPipelineSpark - Shutting down engine
    [March 4, 2020 10:03:32 AM EST] org.broadinstitute.hellbender.tools.spark.pathseq.PathSeqPipelineSpark done. Elapsed time: 7.70 minutes.
    Runtime.totalMemory()=13398704128
    20/03/04 10:03:32 INFO ShutdownHookManager: Shutdown hook called
    20/03/04 10:03:32 INFO ShutdownHookManager: Deleting directory /tmp/spark-a39060d4-c445-4785-bcce-02d955a812a5

    But the output files confused me. I obtained many dsDNA_viruses,_no_RNA_stage hits from RNA-seq data. If these viruses have no RNA stage, why could I detect them in RNA-seq data? Some results are shown below:

    196896 root|Viruses|dsDNA_viruses,_no_RNA_stage|Caudovirales|Myoviridae|unclassified_Myoviridae no_rank unclassified_Myoviridae Viruses 0.17882352941176471 5.697151424287859 2 0 0
    197310 root|Viruses|dsDNA_viruses,_no_RNA_stage|Caudovirales|Myoviridae|Tevenvirinae|T4virus|unclassified_T4virus|Enterobacteria_phage_RB14 species Enterobacteria_phage_RB14 Viruses 0.039607843137254906 1.261869065467267 2 0 165429
    66711 root|Viruses|dsDNA_viruses,_no_RNA_stage|Caudovirales|Myoviridae|Tevenvirinae|T4virus|Escherichia_virus_AR1|Escherichia_phage_AR1 no_rank Escherichia_phage_AR1 Viruses 0.039607843137254906 1.261869065467267 2 0 167435
    329380 root|Viruses|dsDNA_viruses,_no_RNA_stage|Caudovirales|Myoviridae|Tevenvirinae|T4virus|unclassified_T4virus no_rank unclassified_T4virus Viruses 0.43568627450980385 13.88055972013994 2 0 0

    Any help greatly appreciated. Thank you!

    Best

  • Christopher Koch

    Are there pre-built GRCh37 reference files available somewhere? I only see a link to GRCh38 references.

  • Erika Zuljan

    Hello! Under the point "Java heap out of memory error" I am trying to understand the sentence "This should generally be set to a value greater than the sum of all reference files." Does this mean all the *.fasta, *.img, and *.db files used to run PathSeqPipelineSpark? In my case that would come to around 220 GB of memory just to run one sample. Do I understand this correctly? (I am using the reference files from Broad's 'gcp-public-data' Google Bucket.)
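
    One way to sanity-check such a sizing estimate (this is my reading, not an official statement: the files PathSeq holds in memory are the host and microbe BWA index images, *.img, and the taxonomy database, *.db) is to sum the file sizes and round up for -Xmx. A sketch with placeholder files and names:

```shell
# Placeholder reference files (real PathSeq *.img/*.db files are far larger);
# 'truncate' just creates sparse files of the requested size.
truncate -s 2G   host.fa.img
truncate -s 1G   microbes.fa.img
truncate -s 512M taxonomy.db

# Sum their sizes in bytes and round up to a whole number of GiB.
total=0
for f in host.fa.img microbes.fa.img taxonomy.db; do
    total=$(( total + $(stat -c%s "$f") ))
done
echo "suggested heap: -Xmx$(( total / 1073741824 + 1 ))g"
# prints: suggested heap: -Xmx4g
# which would be passed as: gatk --java-options "-Xmx4g" PathSeqPipelineSpark ...
```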

  • Cameron Griffiths

    sy zhang

    Have you made sure that you are running the correct version of Java for GATK? I ran into the same error and it was fixed by changing my Java version.

    https://gatk.broadinstitute.org/hc/en-us/articles/360035532332-Java-version-issues

    https://gatk.broadinstitute.org/hc/en-us/articles/360035889531-What-are-the-requirements-for-running-GATK
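
    For what it's worth, the number in the error can be decoded directly: a class file's major version N corresponds to Java release N − 44 (for Java 5 and later), so "major version 55" means classes built for Java 11, which the Spark version bundled with older GATK releases (expecting Java 8, major version 52) cannot parse. A minimal sketch of the arithmetic:

```shell
# Class file major version N maps to Java release N - 44 (Java 5+),
# so "Unsupported class file major version 55" points at Java 11.
major=55
echo "major version $major = Java $(( major - 44 ))"
# prints: major version 55 = Java 11

# Conversely, Java 8 emits class files with major version 52:
echo "Java 8 emits major version $(( 8 + 44 ))"
# prints: Java 8 emits major version 52
```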

     

    Also, multiple microbes can indeed be entered into BwaMemIndexImageCreator or PathSeqBuildReferenceTaxonomy together. You can arrange them sequentially in your .fasta file with one header per microbe.
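
    A minimal sketch of that workflow (file names are placeholders, and the gatk flags shown in the trailing comments are assumptions based on the PathSeq reference-building docs, not verified against your GATK version):

```shell
# Two placeholder bacterial genomes, each with its own FASTA header.
printf '>bacterium_A\nACGTACGTACGT\n' > bacteria_A.fasta
printf '>bacterium_B\nGGCCTTAAGGCC\n' > bacteria_B.fasta

# Concatenate them into a single multi-FASTA reference.
cat bacteria_A.fasta bacteria_B.fasta > microbes.fasta
grep -c '^>' microbes.fasta   # prints 2: one record per microbe

# The combined file is then indexed once (flags are assumptions):
#   gatk BwaMemIndexImageCreator -I microbes.fasta -O microbes.fasta.img
#   gatk PathSeqBuildReferenceTaxonomy --reference microbes.fasta \
#       --refseq-catalog <RefSeq catalog> --tax-dump <taxdump.tar.gz> \
#       --output microbes.db
```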

     

