Genomestrip preprocessing error
Hello, I ran SVPreprocess and got some errors.
It succeeds when I run it with the -L option on a chromosomal region, but it fails when I run it on whole chromosomes without the -L option. I ran it with 100 samples.
I have no idea why.
<ERROR>
WARN 11:57:22,535 RScriptExecutor - RScript exited with 1. Run with -l DEBUG for more info.
INFO 11:57:22,540 QCommandLine - Done with errors
INFO 11:57:22,635 QGraph - -------
INFO 11:57:22,637 QGraph - Failed: 'java' '-Xmx2048m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=/kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_100_last/.queue/tmp' '-cp' '/kimlab_wd/yuo1996/tools/svtoolkit//lib/SVToolkit.jar:/kimlab_wd/yuo1996/tools/svtoolkit//lib/gatk/GenomeAnalysisTK.jar:/kimlab_wd/yuo1996/tools/svtoolkit//lib/gatk/Queue.jar' '-cp' '/kimlab_wd/yuo1996/tools/svtoolkit/lib/SVToolkit.jar:/kimlab_wd/yuo1996/tools/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/kimlab_wd/yuo1996/tools/svtoolkit/lib/gatk/Queue.jar' 'org.broadinstitute.sv.apps.ReduceInsertSizeHistograms' '-I' '/kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_100_last/isd/KOREA1K-183.marked_deduplicates.printread.hist.bin' '-O' '/kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_100_last/isd/KOREA1K-183.marked_deduplicates.printread.dist.bin'
INFO 11:57:22,638 QGraph - Log: /kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_100_last/log/SVPreprocess-120.out
INFO 11:57:22,638 QGraph - -------
INFO 11:57:22,639 QGraph - Failed: 'java' '-Xmx2048m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=/kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_100_last/.queue/tmp' '-cp' '/kimlab_wd/yuo1996/tools/svtoolkit//lib/SVToolkit.jar:/kimlab_wd/yuo1996/tools/svtoolkit//lib/gatk/GenomeAnalysisTK.jar:/kimlab_wd/yuo1996/tools/svtoolkit//lib/gatk/Queue.jar' org.broadinstitute.sv.main.SVCommandLine '-T' 'ComputeInsertSizeHistogramsWalker' '-R' '/kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/Homo_sapiens_assembly38/bwa_index/Homo_sapiens_assembly38.fasta' '-I' '/kimlab_wd/yuo1996/C4_analysis/1.marked_bamfiles/batch1_100/KOREA1K-182/KOREA1K-182.marked_deduplicates.printread.bam' '-O' '/kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_100_last/isd/KOREA1K-182.marked_deduplicates.printread.hist.bin' '-disableGATKTraversal' 'true' '-md' '/kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference//metadata/batch1_100_last/' '-configFile' '/kimlab_wd/yuo1996/tools/svtoolkit/conf/genstrip_parameters.txt' '-configFile' '/kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/Homo_sapiens_assembly38/Homo_sapiens_assembly38.gsparams.txt' '-P' 'chimerism.use.correction:false' '-chimerismFile' '/kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_100_last/isd/KOREA1K-182.marked_deduplicates.printread.chimer.dat' '-createHistogramFile' 'true' -createEmpty
INFO 11:57:22,639 QGraph - Log: /kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_100_last/log/SVPreprocess-19.out
INFO 11:57:22,640 QCommandLine - Script failed: 1184 Pend, 0 Run, 2 Fail, 203 Done
REQUIRED for all errors and issues:
a) GATK version used: 4.2.5.0
b) Exact command used:
c) Entire program log: too large.
See forum topic details at forum guidelines page: https://gatk.broadinstitute.org/hc/en-us/articles/360053845952-Forum-Guidelines
-
Thank you for your post. Bob Handsaker has been tagged and will get back to you shortly.
-
You need to look for the log files for the failed jobs and provide information about what went wrong for me to be able to help you.
-
I found that I was using the reference bundle file "Homo_sapiens_assembly38_12Oct2016.tar.gz", so I replaced it with "Homo_sapiens_assembly38_30Aug2021.tar.gz".
Then I got the new errors below when I ran it with 2 samples.
Do you have any idea?
INFO 03:39:45,722 QJobsReporter - Plotting JobLogging GATKReport to file /kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_ref2021_headn2/SVPreprocess.jobreport.pdf
WARN 03:39:50,974 RScriptExecutor - RScript exited with 1. Run with -l DEBUG for more info.
INFO 03:39:50,977 QCommandLine - Done with errors
INFO 03:39:51,000 QGraph - -------
INFO 03:39:51,000 QGraph - Failed: 'java' '-Xmx2048m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=/kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_ref2021_headn2/.queue/tmp' '-cp' '/kimlab_wd/yuo1996/tools/svtoolkit//lib/SVToolkit.jar:/kimlab_wd/yuo1996/tools/svtoolkit//lib/gatk/GenomeAnalysisTK.jar:/kimlab_wd/yuo1996/tools/svtoolkit//lib/gatk/Queue.jar' '-cp' '/kimlab_wd/yuo1996/tools/svtoolkit/lib/SVToolkit.jar:/kimlab_wd/yuo1996/tools/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/kimlab_wd/yuo1996/tools/svtoolkit/lib/gatk/Queue.jar' 'org.broadinstitute.sv.apps.IndexReadCountFile' '-I' '/kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_ref2021_headn2/rccache.merge/P0084.rccache.bin' '-O' '/kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_ref2021_headn2/rccache.merge/P0084.rccache.bin.idx'
INFO 03:39:51,000 QGraph - Log: /kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_ref2021_headn2/log/SVPreprocess-195.out
INFO 03:39:51,001 QGraph - -------
INFO 03:39:51,001 QGraph - Failed: 'java' '-Xmx2048m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=/kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_ref2021_headn2/.queue/tmp' '-cp' '/kimlab_wd/yuo1996/tools/svtoolkit//lib/SVToolkit.jar:/kimlab_wd/yuo1996/tools/svtoolkit//lib/gatk/GenomeAnalysisTK.jar:/kimlab_wd/yuo1996/tools/svtoolkit//lib/gatk/Queue.jar' '-cp' '/kimlab_wd/yuo1996/tools/svtoolkit/lib/SVToolkit.jar:/kimlab_wd/yuo1996/tools/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/kimlab_wd/yuo1996/tools/svtoolkit/lib/gatk/Queue.jar' 'org.broadinstitute.sv.apps.MergeReadCounts' '-R' '/kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta' -L chr11:60000001-70000000 '-I' '/kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_ref2021_headn2/rccache.list' '-O' '/kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_ref2021_headn2/rccache.merge/P0185.rccache.bin'
INFO 03:39:51,001 QGraph - Log: /kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_ref2021_headn2/log/SVPreprocess-396.out
INFO 03:39:51,001 QGraph - -------
INFO 03:39:51,001 QGraph - Failed: 'java' '-Xmx2048m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=/kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_ref2021_headn2/.queue/tmp' '-cp' '/kimlab_wd/yuo1996/tools/svtoolkit//lib/SVToolkit.jar:/kimlab_wd/yuo1996/tools/svtoolkit//lib/gatk/GenomeAnalysisTK.jar:/kimlab_wd/yuo1996/tools/svtoolkit//lib/gatk/Queue.jar' org.broadinstitute.sv.main.SVCommandLine '-T' 'ComputeReadSpanCoverageWalker' '-R' '/kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta' '-I' '/kimlab_wd/yuo1996/C4_analysis/1.marked_bamfiles/batch1_100/KOREA1K-168/KOREA1K-168.marked_deduplicates.printread.bam' '-O' '/kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_ref2021_headn2/spans/KOREA1K-168.marked_deduplicates.printread.spans.txt' '-disableGATKTraversal' 'true' '-md' '/kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference//metadata/batch1_ref2021_headn2/' '-configFile' '/kimlab_wd/yuo1996/tools/svtoolkit/conf/genstrip_parameters.txt' '-configFile' '/kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/Homo_sapiens_assembly38/Homo_sapiens_assembly38.gsparams.txt' '-P' 'chimerism.use.correction:false' '-maxInsertSizeStandardDeviations' '3'
INFO 03:39:51,001 QGraph - Log: /kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_ref2021_headn2/log/SVPreprocess-17.out
INFO 03:39:51,001 QGraph - -------
INFO 03:39:51,001 QGraph - Failed: samtools index /kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_ref2021_headn2/headers.bam
INFO 03:39:51,001 QGraph - Log: /kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_ref2021_headn2/log/SVPreprocess-4.out
INFO 03:39:51,002 QCommandLine - Script failed: 58 Pend, 0 Run, 4 Fail, 643 Done
-
You need to look at the referenced log files to see what failed in the individual jobs.
E.g. /kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_ref2021_headn2/log/SVPreprocess-4.out and the others.
This is a samtools index command, though, so probably either samtools isn't installed correctly or there is something wrong with /kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_ref2021_headn2/headers.bam
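A minimal sketch of one way to collect those logs. It assumes the Queue console output was saved to a file (the name `queue_console.log` is hypothetical) and relies only on the `QGraph - Log:` lines shown in the output above. The helper names `failed_job_logs` and `show_failed_logs` are made up for illustration.

```shell
# Sketch: list and print the per-job logs that Queue reports after a failed run.
# Assumes the console output was captured to a file passed as the first argument.

failed_job_logs() {
    # Each failed job is followed by a line of the form
    #   INFO hh:mm:ss,mmm QGraph - Log: /path/to/SVPreprocess-N.out
    # so print only the path that follows "QGraph - Log: ".
    sed -n 's/^.*QGraph - Log: //p' "$1"
}

show_failed_logs() {
    # Print the tail of every referenced log and flag logs that are empty or missing.
    failed_job_logs "$1" | while IFS= read -r log; do
        echo "=== ${log}"
        if [ -s "${log}" ]; then
            tail -n 40 "${log}"
        else
            echo "(empty or missing)"
        fi
    done
}

# Usage (hypothetical file name):
#   show_failed_logs queue_console.log
```

An empty log, as opposed to one with a stack trace, is itself useful information when reporting the problem.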
-
I have samtools installed (latest version, 1.15), and nothing is printed in SVPreprocess-4.out.
The headers.bam and headers.bam.bai files are intact, and I checked that running 'samtools index headers.bam' by hand not only works but also produces the same .bai file that the pipeline produced.
What can I do?
-
My best guess would be that it was a transient error, that Queue somehow could not access the job completion status (a failure mode I have seen on some clusters), or that for some reason it thinks the job status was non-zero.
The Queue software that Genome STRiP uses for running pipelines is like "make": if there are failed jobs and you rerun, it will only retry the failed jobs and any dependencies. If there is no error in the log, and you can run the same command manually, then I would try to just rerun the pipeline and see if it succeeds.
You can apply this same methodology to the other errors as well. If they are transient and do not recur when you rerun, then you can ignore them.
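For the samtools step, the manual check can be sketched as below. This is only an illustration: the function name `recheck_bam_index` is hypothetical, and it assumes `samtools` is on the PATH and that the pipeline's index sits next to the BAM as `<bam>.bai`. It reruns the command on a copy, reports the exit status, and compares the resulting index with the existing one.

```shell
# Sketch: re-run "samtools index" by hand on a copy of the BAM and compare
# the result with the index the pipeline wrote next to the original file.
recheck_bam_index() {
    bam=$1                                   # e.g. .../headers.bam
    workdir=$(mktemp -d) || return 2
    cp "${bam}" "${workdir}/check.bam" || return 2

    # Run the same command the pipeline ran and record its exit status.
    samtools index "${workdir}/check.bam"
    status=$?
    echo "samtools index exit status: ${status}"

    # cmp -s is silent and returns 0 only when the two files are identical.
    if [ "${status}" -eq 0 ] && cmp -s "${workdir}/check.bam.bai" "${bam}.bai"; then
        echo "index matches the one written by the pipeline"
        result=0
    else
        echo "index differs or could not be built"
        result=1
    fi
    rm -rf "${workdir}"
    return "${result}"
}

# Usage:
#   recheck_bam_index /path/to/metadata/headers.bam
```

A zero exit status and an identical index would be consistent with a transient or status-reporting problem rather than a problem with the file.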
-
I am not running it on a cluster; I mean I have not installed any workflow manager.
I use a bash script to run it, shown below.
Is running the bash script again the same as rerunning the pipeline,
or should I rerun only the java command line?
#!/bin/bash
LIST=$1
# ex) /kimlab_wd/yuo1996/C4_korean1K/finish/batch1_100/batch1_100.list
BATCH=$2
# batch1_100
#set SV_DIR="/kimlab_wd/yuo1996/tools/svtoolkit"
REF_DIR=/kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/
# $BATCH.. ㅜㅜ
mkdir ${REF_DIR}/metadata/${BATCH}/
paste -d "/" ${LIST} ${LIST} | sed 's/$/.marked_deduplicates.printread.bam/g' | sed 's/^/\/kimlab_wd\/yuo1996\/C4_analysis\/1.marked_bamfiles\/batch1_100\//g' | head -n 2 > ${REF_DIR}/metadata/${BATCH}/input_bam_files.list
mkdir ${REF_DIR}/metadata/${BATCH}/log
cd ${REF_DIR}/metadata/${BATCH}/
#PATH=$PATH:/${PENNCNV_PATH_WARE}/PennCNV-1.0.5/
#PATH=$PATH:/${PENNCNV_PATH_WARE}/gw6/bin
# These executables must be on your path.
which java > /dev/null
which Rscript > /dev/null
which samtools > /dev/null
export SV_DIR=/kimlab_wd/yuo1996/tools/svtoolkit/
export PATH=${SV_DIR}/bwa:${PATH}
export LD_LIBRARY_PATH=${SV_DIR}/bwa:${LD_LIBRARY_PATH}
classpath="${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar:${SV_DIR}/lib/gatk/Queue.jar"
java -Xmx4g -cp ${classpath} org.broadinstitute.gatk.queue.QCommandLine \
-S ${SV_DIR}/qscript/SVPreprocess.q \
-S ${SV_DIR}/qscript/SVQScript.q \
-cp ${classpath} \
-gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
-configFile ${SV_DIR}/conf/genstrip_parameters.txt \
-R ${REF_DIR}/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta \
-I /kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/${BATCH}/input_bam_files.list \
-md ${REF_DIR}/metadata/${BATCH}/ \
-bamFilesAreDisjoint true \
-jobLogDir ${REF_DIR}/metadata/${BATCH}/log/ \
-rmd ${REF_DIR}/Homo_sapiens_assembly38/ \
-computeGCProfiles true \
-computeReadCounts true \
-reduceInsertSizeDistributions true \
-disableGATKTraversal \
-useMultiStep \
-P chimerism.use.correction:false \
-run
-
I see. Yes, you can try just running it again. Let me know if it runs the second time through. It is possible there is a dependency that is not correct (so it is running some things out of order).
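If several reruns turn out to be needed, they could be automated with a small wrapper like the sketch below. It assumes the wrapped command exits with a non-zero status when any job fails. The function name `run_until_success` and the script name in the usage comment are hypothetical.

```shell
# Sketch: re-run a command until it succeeds or a retry limit is reached.
# Assumes the command exits non-zero when the pipeline reports failed jobs.
run_until_success() {
    max_attempts=$1
    shift
    attempt=1
    while [ "${attempt}" -le "${max_attempts}" ]; do
        echo "attempt ${attempt} of ${max_attempts}" >&2
        if "$@"; then
            return 0
        fi
        attempt=$((attempt + 1))
    done
    echo "still failing after ${max_attempts} attempts" >&2
    return 1
}

# Usage (script name and arguments are hypothetical):
#   run_until_success 5 bash run_svpreprocess.sh batch1_100.list batch1_100
```

Because only failed jobs and their dependencies are retried, each pass should have less work left than the one before. A job that fails on every pass points to a real error and is worth checking in its log.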
-
After 4 repetitions, I succeeded!!
Now I'm trying to run it with my whole sample set (100 samples in a batch).
Thanks a lot!!
I have one more small question:
is there an option to set the number of threads? (I want to increase it because of the large sample set.)
-
Glad you were able to get it to work.
I was also able to recreate this behavior, so I will look to see if there is a problem with the dependencies and try to fix it.
A similar option to what you are doing would be to use
-jobRunner ParallelShell -maxConcurrentRun N
which will limit the number of concurrent jobs to N.
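As a rough sketch of how these two options could be spliced into an existing launch script (the helper name `parallel_shell_args` is hypothetical, and the choice of N is up to you). The failed jobs shown earlier in this thread were started with -Xmx2048m, so N concurrent jobs may need roughly N × 2 GB of memory; treat that as an estimate, not a guarantee.

```shell
# Sketch: print the two Queue options quoted above for a given job limit N.
parallel_shell_args() {
    n=$1
    # Reject empty, non-numeric, or zero values before building the options.
    case "${n}" in
        ''|*[!0-9]*|0)
            echo "N must be a positive integer" >&2
            return 1
            ;;
    esac
    echo "-jobRunner ParallelShell -maxConcurrentRun ${n}"
}

# Usage: add the output to the existing QCommandLine call, before -run, e.g.
#   java -Xmx4g -cp ${classpath} org.broadinstitute.gatk.queue.QCommandLine \
#       ...existing options... \
#       $(parallel_shell_args 8) \
#       -run
```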
For more scalability, Queue is designed to run jobs on a local HPC cluster. If your cluster has DRMAA support installed, for example, you can use
-jobRunner Drmaa
and this will dispatch jobs to the cluster. You usually need to set up some options specific to your cluster, because every HPC cluster is different.
We also have a version of the pipelines in Terra that runs on the Google Cloud Platform. The WDLs are included with the Genome STRiP code, so you can take a look there. I don't think we currently have a public Terra workspace you can clone from, but if you want to clone from a sample workspace (e.g. for 1000 Genomes) I can give you access.
-
Thank you for the further information!!
Actually, I did this preprocessing for the C4 copy number analysis that is provided in a Terra workspace.
It would be very helpful for working through the whole pipeline if you could give me access.
My whole sample set is about 1K samples too.
-
Send me your Terra (Google) ID or email address. The workflows are all public in Terra, but it can be helpful to have an example usage to work from. I don't have an example currently that is locked down with RequesterPays, etc.
-
This is my address!
I'm going to try it and check whether I can get faster or more convenient results for the next step.