PathseqPipelineSpark stops without error message
Hi,
I was trying to run the following command with a bam file as input:
gatk --java-options '-Xmx64G' PathSeqPipelineSpark \
--input Samples/58.sorted.bam \
--kmer-file Files/host_reference_kmers-built.hss \
--filter-bwa-image Files/hg38.fasta.img \
--microbe-bwa-image Files/pathseq_microbe.fa.img \
--microbe-fasta Files/pathseq_microbe.fa \
--taxonomy-file Files/pathseq_taxonomy.db \
--scores-output Results/58/58_scores.txt \
--output Results/58/58_output_reads.bam \
--filter-metrics Results/58/58_filter-metrics.txt \
--score-metrics Results/58/58_score_metrics.txt 2>&1 | tee Results/58/58_pipeline_out.txt
I also tried with other bam files. For all of them, at a given moment (stage 36), the process shuts down without error message. After that, the only output available is filter-metrics.txt as probably the other ones were not built until this stage.
I don't know what is happening. I already tried allocating more memory and changing the tmp directory, without success. Doing the former I noticed that at the end there are a lot of non-deleted temporary files (therefore probably the process didn't stop "peacefully").
Anybody there already faced this issue?
Below I show you the last lines of stdout. Notice how everything is okay until the process stops. Always at this step, whatever the input.
20/02/06 15:34:46 INFO Executor: Running task 0.0 in stage 36.0 (TID 6033)
20/02/06 15:34:46 INFO Executor: Running task 2.0 in stage 36.0 (TID 6035)
20/02/06 15:34:46 INFO Executor: Running task 1.0 in stage 36.0 (TID 6034)
20/02/06 15:34:46 INFO Executor: Running task 6.0 in stage 36.0 (TID 6039)
20/02/06 15:34:46 INFO Executor: Running task 3.0 in stage 36.0 (TID 6036)
20/02/06 15:34:46 INFO Executor: Running task 10.0 in stage 36.0 (TID 6043)
20/02/06 15:34:46 INFO Executor: Running task 8.0 in stage 36.0 (TID 6041)
20/02/06 15:34:46 INFO Executor: Running task 13.0 in stage 36.0 (TID 6046)
20/02/06 15:34:46 INFO Executor: Running task 18.0 in stage 36.0 (TID 6051)
20/02/06 15:34:46 INFO Executor: Running task 19.0 in stage 36.0 (TID 6052)
20/02/06 15:34:46 INFO Executor: Running task 23.0 in stage 36.0 (TID 6056)
20/02/06 15:34:46 INFO Executor: Running task 9.0 in stage 36.0 (TID 6042)
20/02/06 15:34:46 INFO Executor: Running task 22.0 in stage 36.0 (TID 6055)
20/02/06 15:34:46 INFO Executor: Running task 21.0 in stage 36.0 (TID 6054)
20/02/06 15:34:46 INFO Executor: Running task 17.0 in stage 36.0 (TID 6050)
20/02/06 15:34:46 INFO Executor: Running task 15.0 in stage 36.0 (TID 6048)
20/02/06 15:34:46 INFO Executor: Running task 16.0 in stage 36.0 (TID 6049)
20/02/06 15:34:46 INFO Executor: Running task 12.0 in stage 36.0 (TID 6045)
20/02/06 15:34:46 INFO Executor: Running task 11.0 in stage 36.0 (TID 6044)
20/02/06 15:34:46 INFO Executor: Running task 14.0 in stage 36.0 (TID 6047)
20/02/06 15:34:46 INFO Executor: Running task 5.0 in stage 36.0 (TID 6038)
20/02/06 15:34:46 INFO Executor: Running task 4.0 in stage 36.0 (TID 6037)
20/02/06 15:34:46 INFO Executor: Running task 7.0 in stage 36.0 (TID 6040)
20/02/06 15:34:46 INFO Executor: Running task 20.0 in stage 36.0 (TID 6053)
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Getting 399 non-empty blocks including 399 local blocks and 0 remote blocks
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Getting 399 non-empty blocks including 399 local blocks and 0 remote blocks
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Getting 399 non-empty blocks including 399 local blocks and 0 remote blocks
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Getting 399 non-empty blocks including 399 local blocks and 0 remote blocks
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Getting 399 non-empty blocks including 399 local blocks and 0 remote blocks
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Getting 399 non-empty blocks including 399 local blocks and 0 remote blocks
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Getting 399 non-empty blocks including 399 local blocks and 0 remote blocks
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Getting 399 non-empty blocks including 399 local blocks and 0 remote blocks
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Getting 399 non-empty blocks including 399 local blocks and 0 remote blocks
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Getting 399 non-empty blocks including 399 local blocks and 0 remote blocks
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Getting 399 non-empty blocks including 399 local blocks and 0 remote blocks
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Getting 399 non-empty blocks including 399 local blocks and 0 remote blocks
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Getting 399 non-empty blocks including 399 local blocks and 0 remote blocks
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Getting 399 non-empty blocks including 399 local blocks and 0 remote blocks
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 2 ms
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Getting 399 non-empty blocks including 399 local blocks and 0 remote blocks
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Getting 399 non-empty blocks including 399 local blocks and 0 remote blocks
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Getting 399 non-empty blocks including 399 local blocks and 0 remote blocks
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Getting 399 non-empty blocks including 399 local blocks and 0 remote blocks
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Getting 399 non-empty blocks including 399 local blocks and 0 remote blocks
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Getting 399 non-empty blocks including 399 local blocks and 0 remote blocks
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Getting 399 non-empty blocks including 399 local blocks and 0 remote blocks
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Getting 399 non-empty blocks including 399 local blocks and 0 remote blocks
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Getting 399 non-empty blocks including 399 local blocks and 0 remote blocks
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Getting 399 non-empty blocks including 399 local blocks and 0 remote blocks
20/02/06 15:34:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 144
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 176
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 139
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 135
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 141
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 130
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 129
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 149
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 173
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 197
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 191
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 193
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 168
20/02/06 15:54:29 INFO BlockManagerInfo: Removed broadcast_8_piece0 on 10.100.108.31:39008 in memory (size: 3.5 KB, free: 50.9 GB)
20/02/06 15:54:29 INFO BlockManagerInfo: Removed broadcast_20_piece0 on 10.100.108.31:39008 in memory (size: 4.3 KB, free: 50.9 GB)
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 184
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 183
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 174
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 140
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 160
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 138
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 166
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 126
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 169
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 171
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 196
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 175
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 170
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 156
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 178
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 125
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 161
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 152
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 148
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 153
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 142
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 185
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 136
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 134
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 159
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 150
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 189
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 147
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 154
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 162
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 195
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 163
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 133
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 180
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 194
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 199
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 131
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 137
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 128
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 164
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 188
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 192
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 132
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 157
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 151
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 181
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 186
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 155
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 167
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 145
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 143
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 198
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 179
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 127
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 146
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 190
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 158
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 182
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 187
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 165
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 172
20/02/06 15:54:29 INFO ContextCleaner: Cleaned accumulator 177
-
The most common user error with pathseq is not having read groups in the bam, which causes all reads to get filtered. Can you please check if that is the issue here and/or post the header from one of the bams.
-
Thank you for your reply
Apparently this is not the case, because, as I said in the post, I am able to see the filter-metrics output and for all samples there is a non-zero number of reads after filtering stage.
For example:
## METRICS CLASS org.broadinstitute.hellbender.tools.spark.pathseq.loggers.PSFilterMetrics
PRIMARY_READS 149937214READS_AFTER_PREALIGNED_HOST_FILTER 149937214
READS_AFTER_QUALITY_AND_COMPLEXITY_FILTER 65196700
READS_AFTER_HOST_FILTER 1672161
READS_AFTER_DEDUPLICATION 1433205
FINAL_PAIRED_READS 624720
FINAL_UNPAIRED_READS 808485
FINAL_TOTAL_READS 1433205
LOW_QUALITY_OR_LOW_COMPLEXITY_READS_FILTERED 84740514
HOST_READS_FILTERED 63524539
DUPLICATE_READS_FILTERED 238956
Thanks again
-
Can you please post the header from one of the bams.
-
Hello vitor,
Thank you for clarifying that. There are a few things I am wondering about:
1) Double-check the logs for any useful messages. What was the last message generated by GATK? What you are seeing at the end is from Spark and usually does not contain useful information.
2) Possibly running out of memory. If the JVM tries to allocate too much memory it will be killed by the OS. Is there a return code for the process?
3) What environment are you running in? Is it a single machine/VM or a Spark cluster? If the latter, the error may be occurring on one of the workers, in which case it would not be produced in the executor log (although I would expect it to exit gracefully if this were the case).
3) It may be easier to troubleshoot this if you run the pipeline piecemeal with PathSeqFilterSpark, PathSeqBwaSpark, and PathSeqScoreSpark rather than PathseqPipelineSpark.
-
Hi Mark Walker
Thank your for the alternatives suggested.
I just ran a test using the lightest bam file (5gb) of my dataset and it worked perfectly. So probably the process is running out of memory for the other heavier files (10gb+). How the input size influence the memory used by the process? Currently I am setting 'Xmx64G" to run these tests. Any tips on how many more gb should I allocate with 10gb+ files as input?
-
Memory bottlenecks usually occur during alignment to the microbe reference. At this stage, pathseq requires memory for both the reference image as well as reads. However, the reference is loaded outside of the JVM, so there must be at least as much memory available to the OS as the size of the reference. The JVM must have enough memory to store the non-host reads that remain after filtering. Therefore memory usage does not scale with input BAM size per se, but rather the number of microbial reads present.
I would recommend doubling the memory to 128GB to accommodate the microbial reference.
Please sign in to leave a comment.
6 comments