Memory troubles and modified imports for a pipeline running on Google Cloud
Hi everyone,
I am using the five-dollar-genome-analysis-pipeline deployed on Google Cloud instances (the most recent version does not seem to be ready for use on the cloud). The command I run:
gcloud beta lifesciences pipelines run \
  --pipeline-file /home/giedi_prime/software/google-cloud-sdk-324.0.0-linux-x86_64/test1/wdl-runner/wdl_runner/wdl_pipeline.yaml \
  --location us-central1 \
  --regions us-central1 \
  --inputs-from-file WDL=/home/giedi_prime/software/google-cloud-sdk-324.0.0-linux-x86_64/test1/five-dollar-genome-analysis-pipeline/WholeGenomeGermlineSingleSample_xegen3.wdl,WORKFLOW_INPUTS=/home/giedi_prime/software/google-cloud-sdk-324.0.0-linux-x86_64/test1/five-dollar-genome-analysis-pipeline/initial_files/chs.json,WORKFLOW_OPTIONS=/home/giedi_prime/software/google-cloud-sdk-324.0.0-linux-x86_64/test1/five-dollar-genome-analysis-pipeline/generic.google-papi.options.json \
  --env-vars WORKSPACE=gs://xbucket2/titi/work,OUTPUTS=gs://xbucket2/titi/output \
  --logging gs://xbucket2/titi/logging
My chs.json input file points to a verified BAM of about 60 GB. The standard pipeline (WholeGenomeGermlineSingleSample.wdl) crashes at the step that converts the unmapped BAM into an aligned BAM:
# set the bash variable needed for the command-line
bash_ref_fasta=/cromwell_root/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta
# if reference_fasta.ref_alt has data in it,
if [ -s /cromwell_root/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.alt ]; then
java -Xms1000m -Xmx1000m -jar /usr/gitc/picard.jar \
SamToFastq \
INPUT=/cromwell_root/xbucket2/titi/work/WholeGenomeGermlineSingleSample/fad123e4-e255-4be5-bdfe-3821e19876e2/call-UnmappedBamToAlignedBam/UnmappedBamToAlignedBam/3cfeb7f1-9888-4f2d-aa55-8b4efd6105ee/call-SplitRG/shard-0/SplitLargeReadGroup/9e63119e-3644-4f22-895d-df97b9760bde/call-SamSplitter/attempt-2/glob-c9e7aa653484f6770ff653b8582f2d7c/shard_0006.bam \
FASTQ=/dev/stdout \
INTERLEAVE=true \
NON_PF=true | \
/usr/gitc/bwa mem -K 100000000 -p -v 3 -t 16 -Y $bash_ref_fasta /dev/stdin - 2> >(tee shard_0006.bwa.stderr.log >&2) | \
java -Dsamjdk.compression_level=2 -Xms1000m -Xmx1000m -jar /usr/gitc/picard.jar \
MergeBamAlignment \
VALIDATION_STRINGENCY=SILENT \
EXPECTED_ORIENTATIONS=FR \
ATTRIBUTES_TO_RETAIN=X0 \
ATTRIBUTES_TO_REMOVE=NM \
ATTRIBUTES_TO_REMOVE=MD \
ALIGNED_BAM=/dev/stdin \
UNMAPPED_BAM=/cromwell_root/xbucket2/titi/work/WholeGenomeGermlineSingleSample/fad123e4-e255-4be5-bdfe-3821e19876e2/call-UnmappedBamToAlignedBam/UnmappedBamToAlignedBam/3cfeb7f1-9888-4f2d-aa55-8b4efd6105ee/call-SplitRG/shard-0/SplitLargeReadGroup/9e63119e-3644-4f22-895d-df97b9760bde/call-SamSplitter/attempt-2/glob-c9e7aa653484f6770ff653b8582f2d7c/shard_0006.bam \
OUTPUT=shard_0006.bam \
REFERENCE_SEQUENCE=/cromwell_root/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta \
PAIRED_RUN=true \
SORT_ORDER="unsorted" \
IS_BISULFITE_SEQUENCE=false \
ALIGNED_READS_ONLY=false \
CLIP_ADAPTERS=false \
MAX_RECORDS_IN_RAM=2000000 \
ADD_MATE_CIGAR=true \
MAX_INSERTIONS_OR_DELETIONS=-1 \
PRIMARY_ALIGNMENT_STRATEGY=MostDistant \
PROGRAM_RECORD_ID="bwamem" \
PROGRAM_GROUP_VERSION="0.7.15-r1140" \
PROGRAM_GROUP_COMMAND_LINE="bwa mem -K 100000000 -p -v 3 -t 16 -Y $bash_ref_fasta" \
PROGRAM_GROUP_NAME="bwamem" \
UNMAPPED_READ_STRATEGY=COPY_TO_TAG \
ALIGNER_PROPER_PAIR_FLAGS=true \
UNMAP_CONTAMINANT_READS=true \
ADD_PG_TAG_TO_READS=false
I got this exception:
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMException: Error in writing fastq file /dev/stdout
at htsjdk.samtools.fastq.BasicFastqWriter.write(BasicFastqWriter.java:66)
at picard.sam.SamToFastq.writeRecord(SamToFastq.java:391)
at picard.sam.SamToFastq.handleRecord(SamToFastq.java:305)
at picard.sam.SamToFastq.doWork(SamToFastq.java:192)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)
htsjdk.samtools.SAMFormatException: Error parsing text SAM file. Not enough fields; File /dev/stdin; Line 18801059
Line: ERR3241665.340717062 147 chrX 58469590 60 150M = 58469323 -417 GTCTATTTGTAGACTCTGCAGCCGAATATTTTTGAGTGGTTTGAGGCCTGTGGTGAAAAATGTAATAACTTCACATAAAAACTAGACACAAGCTTTCTGAGAAACTTATTTCTGATGTGTGCATTCATCTCACAGAGGTGAATCTGTC
at htsjdk.samtools.SAMLineParser.reportFatalErrorParsingLine(SAMLineParser.java:446)
at htsjdk.samtools.SAMLineParser.parseLine(SAMLineParser.java:231)
at htsjdk.samtools.SAMTextReader$RecordIterator.parseLine(SAMTextReader.java:268)
at htsjdk.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:255)
at htsjdk.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:228)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:576)
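Note that in a three-stage pipe like picard | bwa | picard, a write error on /dev/stdout from the first stage is often just a symptom: when a downstream stage dies, the upstream writer is killed on its next write (SIGPIPE). The truncated SAM line reported on /dev/stdin ("Not enough fields") likewise suggests that bwa's output stopped mid-record. A minimal bash sketch of this behaviour (illustrative only, not part of the pipeline):

# head exits after one line, so seq is later killed by SIGPIPE mid-write
seq 1 10000000 | head -n 1
echo "per-stage exit codes: ${PIPESTATUS[@]}"   # typically "141 0"; 141 = 128 + SIGPIPE(13)

So the root cause may be the bwa or MergeBamAlignment stage running out of memory rather than SamToFastq itself.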
Since I use a very large BAM, I supposed it was a memory problem, so I modified the imported files UnmappedBamToAlignedBam.wdl and Alignment.wdl (I changed the Picard calls from -Xmx1000m to -Xmx8000m). I did this by deploying the modified files on our web server (xegen.fr) and by modifying the import section of WholeGenomeGermlineSingleSample.wdl, which now begins with:
import "http://xegen.fr/UnmappedBamToAlignedBam_xegen3.wdl" as ToBam
import "https://raw.githubusercontent.com/gatk-workflows/five-dollar-genome-analysis-pipeline/1.2.0/tasks/AggregatedBamQC.wdl" as AggregatedQC
and likewise
import "http://xegen.fr/Alignment_xegen3.wdl" as Alignment
import "https://raw.githubusercontent.com/gatk-workflows/five-dollar-genome-analysis-pipeline/1.2.0/tasks/SplitLargeReadGroup.wdl" as SplitRG
in the UnmappedBamToAlignedBam_xegen3.wdl file.
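For clarity, the only change in the xegen.fr copies is the Java heap size. The edited Picard call in Alignment_xegen3.wdl looks like this (hypothetical excerpt; argument list shortened, INPUT shown with a placeholder value):

# before: java -Xms1000m -Xmx1000m -jar /usr/gitc/picard.jar SamToFastq ...
# after:
java -Xms1000m -Xmx8000m -jar /usr/gitc/picard.jar \
  SamToFastq \
  INPUT=input.bam \
  FASTQ=/dev/stdout \
  INTERLEAVE=true \
  NON_PF=true

Note that a bigger -Xmx only helps if it still fits inside the memory the task requests in its WDL runtime block; otherwise the VM itself can run out of memory.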
I am sure these modifications are picked up, because if I deliberately misspell the xegen.fr URLs I get an error very quickly. But when I run the pipeline, after ten hours I get the same exception, and in the output file the piped command (picard | bwa | picard) is still called with the initial -Xmx1000m parameters. It is driving me crazy ;-)
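A stricter check than the misspelling test is to inspect what the server actually returns, for example with a one-liner like this (using the URLs above; adjust as needed):

curl -s http://xegen.fr/Alignment_xegen3.wdl | grep -n 'Xmx'

If this prints -Xmx8000m, the served file is correct and the old value must be coming from somewhere else (another import level, for example).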
Does anyone know why my imports are not actually used, or are being replaced by other code? Or does anyone know another way to avoid the crash at this step of the pipeline with such a big input BAM?
Thanks in advance.
-
Dear Beri,
I have a non-technical question: do you know whether the results produced by the pipeline incur a cost on my associated Google Cloud account, or is their persistence covered for free by the Terra platform?
Cheers
-
There is a cost to storing data in your workspace bucket; the cost is detailed in the right-hand column of the workspace Dashboard tab. For more Terra questions, please visit the Terra forum.
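If you want to estimate that cost yourself, a standard gsutil one-liner reports the total size of a bucket (replace the bucket name with your own):

gsutil du -sh gs://your-workspace-bucket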
-
Creating a data pipeline is quite easy in Google Cloud Data Fusion through the Data Pipeline Studio: you select your data source, choose the transformations you want to perform, and define the sink.
-
Thanks for your input, Rishab Sain!