Beta tutorial. Please report any issues in the comments section.
Overview
PathSeq is a GATK pipeline for detecting microbial organisms in short-read deep sequencing samples taken from a host organism (e.g. human). The diagram below summarizes how it works. In brief, the pipeline performs read quality filtering, subtracts reads derived from the host, aligns the remaining (non-host) reads to a reference of microbe genomes, and generates a table of detected microbial organisms. The results can be used to determine the presence and abundance of microbial organisms as well as to discover novel microbial sequences.
PathSeq pipeline diagram Boxes outlined with dashed lines represent files. The green boxes at the top depict the three phases of the pipeline: read quality filtering / host subtraction, microbe alignment, and taxonomic abundance scoring. The blue boxes show tools used for pre-processing the host and microbe references for use with PathSeq.
Tutorial outline
This tutorial describes:
- How to run the full PathSeq pipeline on a simulated mixture of human and E. coli reads using pre-built small-scale reference files
- How to prepare custom host and microbe reference files for use with PathSeq
A more detailed introduction of the pipeline can be found in the PathSeqPipelineSpark tool documentation. For more information about the other tools, see the Metagenomics section of the GATK documentation.
How to obtain reference files
Host and microbe references must be prepared for PathSeq as described in this tutorial. The tutorial files provided below contain references that are designed specifically for this tutorial and should not be used in practice. Users can download recommended pre-built reference files for use with PathSeq from the Broad's 'gcp-public-data' Google Bucket. This tutorial also covers how to build custom host and microbe references.
Tutorial Requirements
The PathSeq tools are bundled with the GATK 4 release. For the most up-to-date GATK installation instructions, please see https://github.com/broadinstitute/gatk. This tutorial assumes you are using a POSIX (e.g. Linux or MacOS) operating system with at least 2 GB of memory.
Obtain tutorial files
Download tutorial_10913.tar.gz from the FTP site. Extract the archive with the commands:

tar xzvf pathseq_tutorial.tar.gz
cd pathseq_tutorial
You should now have the following files in your current directory:
- test_sample.bam : simulated sample of 3M paired-end 151-bp reads from human and E. coli
- hg19mini.fasta : human reference sequences (indexed)
- hg19mini.fasta.img : PathSeq BWA-MEM index image for the human reference
- hg19mini.hss : PathSeq k-mer file for the human reference
- e_coli_k12.fasta : E. coli reference sequences (indexed)
- e_coli_k12.fasta.img : PathSeq BWA-MEM index image
- e_coli_k12.db : PathSeq taxonomy file
Run the PathSeq pipeline
The pipeline accepts reads in BAM format (if you have FASTQ files, please see this discussion on how to convert to BAM). In this example, the pipeline can be run using the following command:
gatk PathSeqPipelineSpark \
    --input test_sample.bam \
    --filter-bwa-image hg19mini.fasta.img \
    --kmer-file hg19mini.hss \
    --min-clipped-read-length 70 \
    --microbe-fasta e_coli_k12.fasta \
    --microbe-bwa-image e_coli_k12.fasta.img \
    --taxonomy-file e_coli_k12.db \
    --output output.pathseq.bam \
    --scores-output output.pathseq.txt
This ran in 2 minutes on a MacBook Pro with a 2.8 GHz quad-core CPU and 16 GB of RAM. If running on a local workstation, users can monitor the progress of the pipeline through a web browser at http://localhost:4040.
Interpreting the output
The PathSeq output files are:
- output.pathseq.bam : contains all high-quality non-host reads aligned to the microbe reference. The YP read tag lists the NCBI taxonomy IDs of any aligned species meeting the alignment identity criteria (see the --min-score-identity and --identity-margin parameters). This tag is omitted if the read was not successfully mapped, which may indicate the presence of organisms not represented in the microbe database.
- output.pathseq.txt : a tab-delimited table of the input sample’s microbial composition. This can be imported into Excel and organized by selecting Data -> Filter from the menu:
tax_id | taxonomy | type | name | kingdom | score | score_normalized | reads | unambiguous | reference_length |
---|---|---|---|---|---|---|---|---|---|
1 | root | root | root | root | 189580 | 100 | 189580 | 189580 | 0 |
131567 | root cellular_organisms | no_rank | cellular_organisms | root | 189580 | 100 | 189580 | 189580 | 0 |
2 | ... cellular_organisms Bacteria | superkingdom | Bacteria | Bacteria | 189580 | 100 | 189580 | 189580 | 0 |
1224 | ... Proteobacteria | phylum | Proteobacteria | Bacteria | 189580 | 100 | 189580 | 189580 | 0 |
1236 | ... Proteobacteria Gammaproteobacteria | class | Gammaproteobacteria | Bacteria | 189580 | 100 | 189580 | 189580 | 0 |
91347 | ... Gammaproteobacteria Enterobacterales | order | Enterobacterales | Bacteria | 189580 | 100 | 189580 | 189580 | 0 |
543 | ... Enterobacterales Enterobacteriaceae | family | Enterobacteriaceae | Bacteria | 189580 | 100 | 189580 | 189580 | 0 |
561 | ... Enterobacteriaceae Escherichia | genus | Escherichia | Bacteria | 189580 | 100 | 189580 | 189580 | 0 |
562 | ... Escherichia Escherichia_coli | species | Escherichia_coli | Bacteria | 189580 | 100 | 189580 | 189580 | 0 |
83333 | ... Escherichia_coli Escherichia_coli_K-12 | no_rank | Escherichia_coli_K-12 | Bacteria | 189580 | 100 | 189580 | 189580 | 0 |
511145 | ... Escherichia_coli_str._K-12_substr._MG1655 | no_rank | Escherichia_coli_str._K-12_substr._MG1655 | Bacteria | 189580 | 100 | 189580 | 189580 | 4641652 |
Each line provides information for a single node in the taxonomic tree. A "root" node corresponding to the top of the tree is always listed. Columns to the right of the taxonomic information are:
- score : indicates the amount of evidence that this taxon is present, based on the number of reads that aligned to references in this taxon. This takes into account uncertainty due to ambiguously mapped reads by dividing their weight across each possible hit. It is also normalized by genome length.
- score_normalized : the same as score, but normalized to sum to 100 within each kingdom.
- reads : number of mapped reads (ambiguous or unambiguous)
- unambiguous : number of unambiguously mapped reads
- reference_length : reference length (in bases) if there is a reference assigned to this taxon. Unlike scores, this number is not propagated up the tree, i.e. it is 0 if there is no reference corresponding directly to the taxon. In the above example, the MG1655 strain reference length is only shown in the strain row (4,641,652 bases).
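As an illustrative sketch of the two ideas above (this is not the exact PathSeq scoring code, just the general principle of splitting ambiguous-read weight across hits and normalizing by reference length):

```python
# Illustrative sketch of ambiguous-read weighting and genome-length
# normalization; NOT the exact PathSeq implementation.

def taxon_scores(read_hits, ref_lengths):
    """read_hits: one hit list per read (the taxon IDs the read mapped to).
    ref_lengths: taxon ID -> reference length in bases.
    Returns taxon ID -> score (read weight per megabase of reference)."""
    weights = {}
    for hits in read_hits:
        w = 1.0 / len(hits)  # divide the read's weight across ambiguous hits
        for taxon in hits:
            weights[taxon] = weights.get(taxon, 0.0) + w
    # normalize by reference length (expressed per megabase)
    return {t: w / ref_lengths[t] * 1e6 for t, w in weights.items()}

# Example: 3 reads; the third maps ambiguously to both taxa,
# so the read weights are A = 2.5 and B = 0.5 before normalization.
hits = [["A"], ["A"], ["A", "B"]]
lengths = {"A": 5_000_000, "B": 1_000_000}
print(taxon_scores(hits, lengths))
```

Note how length normalization matters: taxon A receives five times the read weight of B, but its reference is also five times longer, so the two end up with equal scores.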
In this example, one can see that PathSeq detected 189,580 reads that mapped to the strain reference for E. coli K-12 MG1655. This read count is propagated up the tree (species, genus, family, etc.) to the root node. If other species were present, their read counts would be listed and added to their corresponding ancestral taxonomic classes.
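Since the scores file is plain tab-delimited text, it can also be parsed programmatically instead of in Excel. A minimal sketch using Python's standard csv module, assuming the column names shown in the table above (the filename is the example output from this tutorial):

```python
# Parse a PathSeq scores table (tab-delimited) and list species-level hits.
import csv

def species_rows(path):
    """Return (name, score, reads) tuples for rows with type == 'species'."""
    with open(path) as f:
        reader = csv.DictReader(f, delimiter="\t")
        return [(row["name"], float(row["score"]), int(row["reads"]))
                for row in reader
                if row["type"] == "species"]

# e.g. species_rows("output.pathseq.txt")
```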
Microbe discovery
PathSeq can also be used to discover novel microorganisms by analyzing the unmapped reads, e.g. using BLAST or de novo assembly. To get the number of non-host (microbe plus unmapped) reads, use the samtools view command:

samtools view -c output.pathseq.bam
189580
Since the reported number of E. coli reads is the same as the number of reads in the output BAM, there are 0 reads of unknown origin in this sample.
Preparing Custom Reference Files
Custom host and microbe references must both be prepared for use with PathSeq. The references should be supplied as FASTA files with proper indices and sequence dictionaries. The host reference is used to build a BWA-MEM index image and a k-mer file. The microbe reference is used to build another BWA-MEM index image and a taxonomy file. Here we assume you are starting with the FASTA reference files that have been properly indexed:
- host.fasta : your custom host reference sequences
- microbe.fasta : your custom microbe reference sequences
Build the host and microbe BWA index images
The BWA index images must be built using BwaMemIndexImageCreator:

gatk BwaMemIndexImageCreator -I host.fasta
gatk BwaMemIndexImageCreator -I microbe.fasta
Generate the host k-mer library file
The PathSeqBuildKmers tool creates a library of k-mers from a host reference FASTA file. Create a hash set of all k-mers in the host reference with the following command:
gatk PathSeqBuildKmers \
    --reference host.fasta \
    -O host.hss
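Conceptually, the k-mer library is just a hash set of every length-k substring of the host genome, which later lets the filtering phase flag reads containing host k-mers. A minimal sketch of the idea (the real PathSeqBuildKmers tool writes a specialized .hss file and has additional options, so this is illustrative only):

```python
# Conceptual sketch of a host k-mer set; the real tool's on-disk
# format and options differ.

def build_kmer_set(sequence, k=31):
    """Return the set of all k-length substrings of the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

# Toy example with a short sequence and small k:
host = "ACGTACGTACGT"
kmers = build_kmer_set(host, k=4)
# During filtering, a read whose k-mers appear in this set can be
# flagged as host-derived and subtracted.
```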
Build the taxonomy file
Download the latest RefSeq accession catalog RefSeq-releaseXX.catalog.gz, where XX is the latest RefSeq release number, at:
ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-catalog/
Download the NCBI taxonomy data dump (no need to extract the archive):
ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
Assuming these files are now in your current working directory, build the taxonomy file using PathSeqBuildReferenceTaxonomy:
gatk PathSeqBuildReferenceTaxonomy \
    -R microbe.fasta \
    --refseq-catalog RefSeq-releaseXX.catalog.gz \
    --tax-dump taxdump.tar.gz \
    -O microbe.db
Example reference build script
The preceding instructions can be conveniently executed with the following bash script:
#!/bin/bash
set -eu
GATK_HOME=/path/to/gatk
REFSEQ_CATALOG=/path/to/RefSeq-releaseXX.catalog.gz
TAXDUMP=/path/to/taxdump.tar.gz

echo "Building pathogen reference..."
$GATK_HOME/gatk BwaMemIndexImageCreator -I microbe.fasta
$GATK_HOME/gatk PathSeqBuildReferenceTaxonomy -R microbe.fasta --refseq-catalog $REFSEQ_CATALOG --tax-dump $TAXDUMP -O microbe.db

echo "Building host reference..."
$GATK_HOME/gatk BwaMemIndexImageCreator -I host.fasta
$GATK_HOME/gatk PathSeqBuildKmers --reference host.fasta -O host.hss
Troubleshooting
Java heap out of memory error
Increase the Java heap limit. For example, to increase the limit to 4GB with the --java-options flag:
gatk --java-options "-Xmx4G" ...
This should generally be set to a value greater than the combined size of all the reference files.
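For instance, one quick way to estimate a starting heap size is to sum the sizes of the reference files on disk and add some headroom. A small sketch (the file list and the 2 GB headroom are example assumptions, not a GATK recommendation):

```python
# Estimate a Java heap size from the combined size of reference files.
import os

def suggested_heap_gb(paths, overhead_gb=2):
    """Sum the file sizes in GB and add working-memory headroom."""
    total_bytes = sum(os.path.getsize(p) for p in paths)
    return total_bytes / 1e9 + overhead_gb

# Example usage with this tutorial's files (paths are illustrative):
# refs = ["hg19mini.fasta.img", "hg19mini.hss",
#         "e_coli_k12.fasta.img", "e_coli_k12.db"]
# print(f'-Xmx{suggested_heap_gb(refs):.0f}G')
```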
The output is empty
The input reads must pass an initial validity filter, WellFormedReadFilter. A common cause of empty output is that the input reads do not pass this filter, often because none of the reads have been assigned to a read group (with an RG tag). For instructions on adding read groups, see this article, but note that PathSeqPipelineSpark and PathSeqFilterSpark do not require the input BAM to be sorted or indexed.
9 comments
Sorry, the FTP link is unavailable. How can I get the data?
The FTP link for the tutorial files is no longer available. Please help?
Excuse me, when I run PathSeqPipelineSpark using the pre-built microbe reference files from the GATK Resource Bundle to detect microbes in a mouse scRNA-seq BAM file, there are some errors:
......
23/02/20 02:26:31 INFO TaskSetManager: Finished task 534.0 in stage 4.0 (TID 2702) in 7371 ms on localhost (executor driver) (542/542)
23/02/20 02:26:31 INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool
23/02/20 02:26:31 INFO DAGScheduler: ResultStage 4 (count at PSFilterFileLogger.java:47) finished in 80.017 s
23/02/20 02:26:31 INFO DAGScheduler: Job 3 finished: count at PSFilterFileLogger.java:47, took 1635.386275 s
23/02/20 02:26:32 INFO SparkUI: Stopped Spark web UI at http://10.10.10.152:4040
23/02/20 02:26:33 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
23/02/20 02:26:33 INFO MemoryStore: MemoryStore cleared
23/02/20 02:26:33 INFO BlockManager: BlockManager stopped
23/02/20 02:26:33 INFO BlockManagerMaster: BlockManagerMaster stopped
23/02/20 02:26:33 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
23/02/20 02:26:34 INFO SparkContext: Successfully stopped SparkContext
02:26:34.177 INFO PathSeqPipelineSpark - Shutting down engine
[February 20, 2023 at 2:26:34 AM CST] org.broadinstitute.hellbender.tools.spark.pathseq.PathSeqPipelineSpark done. Elapsed time: 44.24 minutes.
Runtime.totalMemory()=211493584896
java.lang.IllegalArgumentException: Unsupported class file major version 55
at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:166)
at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:148)
at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:136)
at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:237)
at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:49)
at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:517)
at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:500)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:134)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:500)
at org.apache.xbean.asm6.ClassReader.readCode(ClassReader.java:2175)
at org.apache.xbean.asm6.ClassReader.readMethod(ClassReader.java:1238)
at org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:631)
at org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:355)
at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:307)
at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:306)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:306)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2326)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKeyWithClassTag$1.apply(PairRDDFunctions.scala:88)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKeyWithClassTag$1.apply(PairRDDFunctions.scala:77)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
at org.apache.spark.rdd.PairRDDFunctions.combineByKeyWithClassTag(PairRDDFunctions.scala:77)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$groupByKey$1.apply(PairRDDFunctions.scala:505)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$groupByKey$1.apply(PairRDDFunctions.scala:498)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
at org.apache.spark.rdd.PairRDDFunctions.groupByKey(PairRDDFunctions.scala:498)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$groupByKey$3.apply(PairRDDFunctions.scala:641)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$groupByKey$3.apply(PairRDDFunctions.scala:641)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
at org.apache.spark.rdd.PairRDDFunctions.groupByKey(PairRDDFunctions.scala:640)
at org.apache.spark.api.java.JavaPairRDD.groupByKey(JavaPairRDD.scala:559)
at org.broadinstitute.hellbender.tools.spark.pathseq.PSFilter.filterDuplicateSequences(PSFilter.java:166)
at org.broadinstitute.hellbender.tools.spark.pathseq.PSFilter.doFilter(PSFilter.java:289)
at org.broadinstitute.hellbender.tools.spark.pathseq.PathSeqPipelineSpark.runTool(PathSeqPipelineSpark.java:238)
at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:546)
at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:31)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
Suppressed: java.lang.IllegalStateException: Cannot compute metrics if primary, pre-aligned host, quality, host, duplicate, or final paired read counts are not initialized
at org.broadinstitute.hellbender.tools.spark.pathseq.loggers.PSFilterMetrics.computeDerivedMetrics(PSFilterMetrics.java:72)
at org.broadinstitute.hellbender.tools.spark.pathseq.loggers.PSFilterFileLogger.close(PSFilterFileLogger.java:64)
at org.broadinstitute.hellbender.tools.spark.pathseq.PathSeqPipelineSpark.runTool(PathSeqPipelineSpark.java:239)
... 8 more
23/02/20 02:26:34 INFO ShutdownHookManager: Shutdown hook called
23/02/20 02:26:34 INFO ShutdownHookManager: Deleting directory /tmp/spark-84326ac0-95d8-4bbb-8cf8-b7df3b1d7a3f
Can you help me? Please!
Can fasta of different bacteria be entered into BwaMemIndexImageCreator or PathSeqBuildReferenceTaxonomy together? Or one by one?
Hello, the FTP link below seems no longer available.
Could you kindly point me in the right direction? Is there an alternative link that I could use? Or do you have any other suggestions on how I might be able to obtain this data?
Hi, GATK team,
I recently used the GATK PathSeq pipeline to detect virus from RNA-seq data. It seems succeeded, the last few lines of the log file are shown below:
..........
20/03/04 10:03:30 INFO TaskSchedulerImpl: Removed TaskSet 43.0, whose tasks have all completed, from pool
20/03/04 10:03:30 INFO DAGScheduler: ResultStage 43 (foreach at BwaMemIndexCache.java:84) finished in 2.808 s
20/03/04 10:03:30 INFO DAGScheduler: Job 9 finished: foreach at BwaMemIndexCache.java:84, took 2.810214 s
20/03/04 10:03:30 INFO SparkUI: Stopped Spark web UI at http://bioRUN:4040
20/03/04 10:03:30 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/03/04 10:03:32 INFO MemoryStore: MemoryStore cleared
20/03/04 10:03:32 INFO BlockManager: BlockManager stopped
20/03/04 10:03:32 INFO BlockManagerMaster: BlockManagerMaster stopped
20/03/04 10:03:32 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/03/04 10:03:32 INFO SparkContext: Successfully stopped SparkContext
10:03:32.092 INFO PathSeqPipelineSpark - Shutting down engine
[March 4, 2020 10:03:32 AM EST] org.broadinstitute.hellbender.tools.spark.pathseq.PathSeqPipelineSpark done. Elapsed time: 7.70 minutes.
Runtime.totalMemory()=13398704128
20/03/04 10:03:32 INFO ShutdownHookManager: Shutdown hook called
20/03/04 10:03:32 INFO ShutdownHookManager: Deleting directory /tmp/spark-a39060d4-c445-4785-bcce-02d955a812a5
But the results in the output files confused me. I obtained many dsDNA_viruses,_no_RNA_stage hits from RNA-seq data. If these viruses have no RNA stage, why could I detect them in RNA-seq data? Some results are shown below:
196896 root|Viruses|dsDNA_viruses,_no_RNA_stage|Caudovirales|Myoviridae|unclassified_Myoviridae no_rank unclassified_Myoviridae Viruses 0.17882352941176471 5.697151424287859 2 0 0
197310 root|Viruses|dsDNA_viruses,_no_RNA_stage|Caudovirales|Myoviridae|Tevenvirinae|T4virus|unclassified_T4virus|Enterobacteria_phage_RB14 species Enterobacteria_phage_RB14 Viruses 0.039607843137254906 1.261869065467267 2 0 165429
66711 root|Viruses|dsDNA_viruses,_no_RNA_stage|Caudovirales|Myoviridae|Tevenvirinae|T4virus|Escherichia_virus_AR1|Escherichia_phage_AR1 no_rank Escherichia_phage_AR1 Viruses 0.039607843137254906 1.261869065467267 2 0 167435
329380 root|Viruses|dsDNA_viruses,_no_RNA_stage|Caudovirales|Myoviridae|Tevenvirinae|T4virus|unclassified_T4virus no_rank unclassified_T4virus Viruses 0.43568627450980385 13.88055972013994 2 0 0
Any help greatly appreciated. Thank you!
Best
Are there pre-built GRCh37 reference files available somewhere? I only see a link to GRCh38 references.
Hello! Under the point "Java heap out of memory error" I am trying to understand the sentence "This should generally be set to a value greater than the sum of all reference files." Is this meant for all the *.fasta, *.img and *.db files used to run PathSeqPipelineSpark? This would make in my case around 220G of memory to run only one sample. Do I understand this correctly? (I am using the reference files from Broad's 'gcp-public-data' Google Bucket)
sy zhang
Have you made sure that you are running the correct version of Java for GATK? I ran into the same error and it was fixed by changing my Java version.
https://gatk.broadinstitute.org/hc/en-us/articles/360035532332-Java-version-issues
https://gatk.broadinstitute.org/hc/en-us/articles/360035889531-What-are-the-requirements-for-running-GATK
Also, multiple microbes can indeed be entered into BwaMemIndexImageCreator or PathSeqBuildReferenceTaxonomy together. You can arrange them sequentially in your .fasta file with one header per microbe.
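To illustrate the multi-FASTA approach described in that answer, per-organism FASTA files can simply be concatenated before indexing (a plain `cat` works equally well; file names below are examples):

```python
# Concatenate several per-microbe FASTA files into a single multi-FASTA,
# keeping each header on its own line even if an input lacks a trailing newline.

def concatenate_fastas(inputs, output):
    with open(output, "w") as out:
        for path in inputs:
            with open(path) as f:
                contents = f.read()
            out.write(contents)
            if contents and not contents.endswith("\n"):
                out.write("\n")

# concatenate_fastas(["e_coli.fasta", "s_aureus.fasta"], "microbe.fasta")
```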