Error using an edited GATK WDL (bam-unmapped-bam workflow) to accept Array[File] as input rather than a single .bam file
Dear GATK,
I modified a GATK pre-processing WDL workflow (`bam-unmapped-bam`) to accept an array of files rather than a single `.bam` file as input, but it keeps throwing this error when I run it:
`Failed to evaluate input 'input_bam' (reason 1 of 1): No coercion defined from wom value(s) '["gs://fc-ab3dca22-b0ad-45da-9687-a165c0408145/Blood_germline/HI.3746.007.N710---N506.AFN-01584.bam", "gs://fc-ab3dca22-b0ad-45da-9687-a165c0408145/Blood_germline/HI.3746.008.N710---N506.AFN-01584.bam"]' of type 'Array[File]' to 'File'.`
My input is a `.txt` file containing the paths of multiple `.bam` files per sample, because the reads come from different lanes.
I suspect my `Array[File]` output specifications are incorrect, but changing them has not helped either.
- Any suggestions on how I can get it to run successfully?
Thanks,
Sam
The original WDL script was retrieved from here: https://dockstore.org/workflows/github.com/gatk-workflows/gatk4-data-processing/processing-for-variant-discovery-gatk4:1.1.0?tab=files
My edited script can be found below.
```
## Copyright Broad Institute, 2018
##
## This WDL converts BAM to unmapped BAMs
##
## Requirements/expectations :
## - BAM file
##
## Outputs :
## - Sorted Unmapped BAMs
##
## Cromwell version support
## - Successfully tested on v31
## - Does not work on versions < v23 due to output syntax
##
## Runtime parameters are optimized for Broad's Google Cloud Platform implementation.
## For program versions, see docker containers.
##
## LICENSING :
## This script is released under the WDL source code license (BSD-3) (see LICENSE in
## https://github.com/broadinstitute/wdl). Note however that the programs it calls may
## be subject to different licenses. Users are responsible for checking that they are
## authorized to run all programs before running this script. Please see the docker
## page at https://hub.docker.com/r/broadinstitute/genomes-in-the-cloud/ for detailed
## licensing information pertaining to the included programs.

# WORKFLOW DEFINITION
workflow BamToUnmappedBams {
  File flowcell_bam0
  Array[File] flowcell_bams0 = read_lines(flowcell_bam0)

  Int? additional_disk_size
  Int additional_disk = select_first([additional_disk_size, 20])

  Float input_size = size(flowcell_bam0, "GB")

  String? gatk_path
  String path2gatk = select_first([gatk_path, "/gatk/gatk"])

  String? gitc_docker
  String gitc_image = select_first([gitc_docker, "us.gcr.io/broad-gotc-prod/genomes-in-the-cloud:2.3.3-1513176735"])
  String? gatk_docker
  String gatk_image = select_first([gatk_docker, "us.gcr.io/broad-gatk/gatk"])

  call GenerateOutputMap {
    input:
      input_bam = flowcell_bams0,
      disk_size = ceil(input_size) + additional_disk,
      docker = gitc_image
  }

  call RevertSam {
    input:
      input_bam = flowcell_bams0,
      output_map = GenerateOutputMap.output_map,
      disk_size = ceil(input_size * 3) + additional_disk,
      docker = gatk_image,
      gatk_path = path2gatk
  }

  scatter (unmapped_bam in RevertSam.unmapped_bams) {
    String output_basename = basename(unmapped_bam, ".coord.sorted.unmapped.bam")
    Float unmapped_bam_size = size(unmapped_bam, "GB")

    call SortSam {
      input:
        input_bam = unmapped_bam,
        sorted_bam_name = output_basename + ".unmapped.bam",
        disk_size = ceil(unmapped_bam_size * 6) + additional_disk,
        docker = gatk_image,
        gatk_path = path2gatk
    }
  }

  output {
    Array[File] output_bams = SortSam.sorted_bam
  }
}

task GenerateOutputMap {
  File input_bam
  Int disk_size

  String docker

  command {
    set -e

    samtools view -H ${input_bam} | grep '^@RG' | cut -f2 | sed s/ID:// > readgroups.txt

    echo -e "READ_GROUP_ID\tOUTPUT" > output_map.tsv

    for rg in `cat readgroups.txt`; do
      echo -e "$rg\t$rg.coord.sorted.unmapped.bam" >> output_map.tsv
    done
  }

  runtime {
    docker: docker
    disks: "local-disk " + disk_size + " HDD"
    preemptible: "3"
    memory: "1 GB"
  }

  output {
    File output_map = "output_map.tsv"
  }
}

task RevertSam {
  File input_bam
  File output_map
  Int disk_size

  String gatk_path

  String docker

  command {
    ${gatk_path} --java-options "-Xmx10000m" \
      RevertSam \
      --INPUT ${input_bam} \
      --OUTPUT_MAP ${output_map} \
      --OUTPUT_BY_READGROUP true \
      --VALIDATION_STRINGENCY LENIENT \
      --ATTRIBUTE_TO_CLEAR FT \
      --ATTRIBUTE_TO_CLEAR CO \
      --SORT_ORDER coordinate
  }

  runtime {
    docker: docker
    disks: "local-disk " + disk_size + " HDD"
    memory: "12000 MB"
  }

  output {
    Array[File] unmapped_bams = glob("*.bam")
  }
}

task SortSam {
  File input_bam
  String sorted_bam_name
  Int disk_size

  String gatk_path

  String docker

  command {
    ${gatk_path} --java-options "-Xmx34000m" \
      SortSam \
      --INPUT ${input_bam} \
      --OUTPUT ${sorted_bam_name} \
      --SORT_ORDER queryname \
      --MAX_RECORDS_IN_RAM 1000000
  }

  runtime {
    docker: docker
    disks: "local-disk " + disk_size + " HDD"
    memory: "36000 MB"
    preemptible: 3
  }

  output {
    File sorted_bam = "${sorted_bam_name}"
  }
}
```
-
Hi sahuno
This is a Terra/WDL question rather than a GATK one. I have let the WDL team know, and someone from that team will be in touch shortly.
-
Hey sahuno! I'll need to take a closer look at this, but at first glance, I think you'll want to move that scatter above the RevertSam call in your workflow (between the last input declaration and `call RevertSam`), because `flowcell_bams0` is an Array and the task expects a single File. It could look something like this:
scatter (bams in flowcell_bams0) {
  call RevertSam {
    input:
      input_bam = bams,
      ...
  }
}
Similar to this:
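For anyone hitting the same coercion error, here is a fuller sketch of the scatter suggestion applied to the workflow section posted above. It is untested and rests on a few assumptions: the three tasks stay exactly as posted (each still takes a single `File input_bam`), `GenerateOutputMap` is moved inside the scatter along with `RevertSam` because it also takes one BAM, the disk size is computed from each BAM rather than from the `.txt` list, and the Cromwell version in use provides `flatten()`.

```wdl
# Untested sketch: workflow section only. The tasks GenerateOutputMap, RevertSam
# and SortSam are assumed to be unchanged from the script posted above.
workflow BamToUnmappedBams {
  File flowcell_bam0                                      # text file, one BAM path per line
  Array[File] flowcell_bams0 = read_lines(flowcell_bam0)

  Int? additional_disk_size
  Int additional_disk = select_first([additional_disk_size, 20])

  String? gatk_path
  String path2gatk = select_first([gatk_path, "/gatk/gatk"])

  String? gitc_docker
  String gitc_image = select_first([gitc_docker, "us.gcr.io/broad-gotc-prod/genomes-in-the-cloud:2.3.3-1513176735"])
  String? gatk_docker
  String gatk_image = select_first([gatk_docker, "us.gcr.io/broad-gatk/gatk"])

  # One iteration per lane-level BAM, so every task call receives a single File.
  scatter (bam in flowcell_bams0) {
    # Size of the BAM itself, not of the text file that lists the paths.
    Float input_size = size(bam, "GB")

    call GenerateOutputMap {
      input:
        input_bam = bam,
        disk_size = ceil(input_size) + additional_disk,
        docker = gitc_image
    }

    call RevertSam {
      input:
        input_bam = bam,
        output_map = GenerateOutputMap.output_map,
        disk_size = ceil(input_size * 3) + additional_disk,
        docker = gatk_image,
        gatk_path = path2gatk
    }

    # Inner scatter: one SortSam call per read-group BAM produced by RevertSam.
    scatter (unmapped_bam in RevertSam.unmapped_bams) {
      String output_basename = basename(unmapped_bam, ".coord.sorted.unmapped.bam")
      Float unmapped_bam_size = size(unmapped_bam, "GB")

      call SortSam {
        input:
          input_bam = unmapped_bam,
          sorted_bam_name = output_basename + ".unmapped.bam",
          disk_size = ceil(unmapped_bam_size * 6) + additional_disk,
          docker = gatk_image,
          gatk_path = path2gatk
      }
    }
  }

  output {
    # Two nested scatters yield Array[Array[File]]; flatten() merges them into one list.
    Array[File] output_bams = flatten(SortSam.sorted_bam)
  }
}
```

If `flatten()` is not available in your Cromwell version, declaring the output as `Array[Array[File]] output_bams = SortSam.sorted_bam` should type-check instead. Also note that if two lanes share a read group ID, the sorted BAMs from different scatter iterations will have the same file name, so check that before merging them downstream.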
-
sahuno, were you able to get it to work? Are you modifying this and running it in Terra?