Genome STRiP CNVDiscoveryPipeline fails when running in parallel
When I run Genome STRiP CNVDiscovery on an SGE cluster, I can start it with the SGE configuration -jobRunner Drmaa, but then I get this error:
ERROR 15:04:42,660 Retry - Caught error during attempt 3 of 4.
org.broadinstitute.gatk.queue.QException: Unable to submit job: denied: host "cnngb-compute-f10-69.cngb.sz.hpc" is not a submit host
I've found the reason: my school's cluster is divided into submit nodes and compute nodes, and compute nodes cannot submit jobs. So after the jobs are submitted (qsub) from a submit node to the compute nodes, the pipeline breaks when the inner CNVDiscoveryPipeline needs to parallelize and distribute work to more compute nodes.
So, what can I do? For example, is there an argument that makes the pipeline run in a single process instead of repeatedly distributing tasks to many nodes?
Script:
classpath="${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar:${SV_DIR}/lib/gatk/Queue.jar"
java -Xmx4g -cp ${classpath} \
org.broadinstitute.gatk.queue.QCommandLine \
-S ${SV_DIR}/qscript/discovery/cnv/CNVDiscoveryPipeline.q \
-S ${SV_DIR}/qscript/SVQScript.q \
-jobRunner Drmaa \
-gatkJobRunner Drmaa \
-jobNative "-cwd -l vf=8G,num_proc=1 -q st.q -P P18Z10200N0124 -binding linear:1 -v PATH=/hwfssz4/BC_PUB/Software/03.Soft_ALL/jdk1.8.0_202/bin:$PATH -v LD_LIBRARY_PATH=/opt/gridengine/lib/lx-amd64/:$LD_LIBRARY_PATH" \
-cp ${classpath} \
-gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
-configFile /zfssz2/ST_MCHRI/BIGDATA/USER/lizhichao/cnvnator/software/genomestrip/svtoolkit/installtest/conf/genstrip_parameters.txt \
-R /zfssz2/ST_MCHRI/BIGDATA/USER/lizhichao/cnvnator/software/genomestrip/svtoolkit/installtest/data/human_b36_chr1.fasta \
-I ${inputFile} \
-genderMapFile /zfssz2/ST_MCHRI/BIGDATA/USER/lizhichao/cnvnator/software/genomestrip/svtoolkit/installtest/data/installtest_gender.map \
-md ${runDir}/metadata \
-runDirectory ${runDir} \
-jobLogDir ${runDir}/logs \
-tilingWindowSize 1000 \
-tilingWindowOverlap 500 \
-maximumReferenceGapLength 1000 \
-boundaryPrecision 100 \
-minimumRefinedLength 500 \
-run
-
You have correctly diagnosed the problem, I believe.
If you cannot run on a set of hosts where the execution hosts are also submit hosts, then the best alternative would be to try to run the top-level Queue script with -jobRunner ParallelShell. This will cause all of the top-level Queue jobs to run on the same host, so you will need to create a large reservation for this job. You can limit the number of parallel shell jobs using -maxConcurrentRun. This will somewhat reduce overall parallelism, but depending on the size of your data set, you may be able to get it to run that way.
As a side note, -jobRunner Shell seems to be a little flaky, so we generally recommend using -jobRunner ParallelShell with -maxConcurrentRun 1 in preference to using -jobRunner Shell.
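As a rough sketch of the change described above, only the top-level job runner flags in the original script need to differ. The variable names MAX_JOBS and RUNNER_ARGS are hypothetical and used only for illustration, and the value 4 is an arbitrary placeholder to be matched to the size of your reservation. Whether -gatkJobRunner and -jobNative should also change is not covered here, so this sketch leaves them as they are in the original script.

```shell
# Hypothetical helper variables (not part of Genome STRiP).
# MAX_JOBS caps how many top-level Queue jobs run at once on the single host.
MAX_JOBS="${MAX_JOBS:-4}"

# Flags that would replace "-jobRunner Drmaa" in the original command.
RUNNER_ARGS="-jobRunner ParallelShell -maxConcurrentRun ${MAX_JOBS}"

# Print the flags so they can be checked before editing the script.
echo "${RUNNER_ARGS}"
```

In the original command, the line `-jobRunner Drmaa \` would then become `${RUNNER_ARGS} \` (unquoted, so that the shell splits it into separate arguments). Setting MAX_JOBS=1 gives the single-job behavior recommended in place of -jobRunner Shell.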
-
Thanks for your reply, I'm testing with that argument. In addition, I want to detect and genotype deletions, so should I run the CNVDiscovery pipeline or SVDiscovery + SVGenotyper, or are both OK? And when I run CNVDiscovery, should I run SVGenotyper after it?
-
The workflow is generally SVPreprocess followed by one or both of SVDiscovery + SVGenotyper (for deletions only) or CNVDiscovery for CNVs, which also genotypes as part of the discovery pipeline. You can then run SVGenotyper in additional samples if you want (or to get uniform genotyping if you are running in batches).
There is also an LCNV (large CNV) pipeline, which is designed to find "microarray resolution" CNVs and will also find mosaic CNVs. The output is not in VCF format, however.
-
Thanks. So if I want to focus on deletion genotyping, should I run SVDiscovery + SVGenotyper instead of CNVDiscovery? SVDiscovery + SVGenotyper seems to be faster than CNVDiscovery.
Is SVDiscovery + SVGenotyper not designed to run in parallel? I tested it successfully before.
-
For deletions only, you will get most of them with SVDiscovery + SVGenotyper. You will miss things that can be found only with read depth due to repetitive sequences.
SVDiscovery / SVGenotyper are much faster. They are parallelized, but don't recursively parallelize, so the execution hosts do not need to be submit hosts. That is only done in the CNVDiscovery pipeline.
-
Thanks. If I want to study deletion CNVs, which pipeline should I select?
-
I think you would have to explain the analysis you want to do in more detail. Feel free to write to me directly if that would be easier.
-
I just want to genotype homozygous and heterozygous deletion CNVs in a population, to find some associations.
-
The reason we wrote two methods / pipelines is that they detect different but sometimes overlapping sets of variants. So for best sensitivity, you should run both.
-
Thanks, I get what you mean.