run SVPreprocess and SVGenotyper for each sample
Dear Bob Handsaker
As you said that genotyping gains power by genotyping many samples together. However, the process of generating the summary metadata by using SVPreprocess will take a long time when I have several hundreds of samples. Thus, I wonder that when I run SVPreprocess and SVGenotyper for each sample and then using bcftools merge to merge the individual VCF file together (I have candidate variants, only genotyping not discovery), will there be much different from genotyping many samples together? Thank you.
Sincerely,
Zheng Zhuqing
-
Genotyping requires multiple samples to be accurate. For larger cohorts, we typically run in batches of 100. You can merge the genotyping results across batches. If you run in small batches (20 or 30 samples) you need to do very careful QC.
Preprocessing is inherently single-sample. There is a single-sample preprocessing workflow in Terra. For convenience and scalability, though, you should merge the preprocessing data into larger metadata.zip archives, typically one for each calling batch.
-
Dear Bob Handsaker
Thank you. You mean that I can run Processing for each single sample, and then merge the single metadata together, and genotype using the merged metadata as input finally. If that is ok, I will change to this to speed up processing. More, how to merge the metadata (I can not open the Terra in China)?
Best regards,
Zheng Zhuqing
-
I see. The process works well in Terra on google, but may be a bit clunky if not running on the cloud.The WDLs (I think up to date) and associated scripts are in the distribution.
For each sample, we run:
wdl/firecloud/gs_preprocess.wdl
which calls scripts/firecloud/preprocess.sh
which produces metadata.zip as output.Then on a collection of these single-sample metadata.zip files, we run
wdl/firecloud/gs_merge_metadata.wdl
which calls scripts/firecloud/gs_merge_metadata.sh
This last script actually unpacks each input metadata.zip then merges them together to create a merged metadata.zip.For cloud processing, and local processing too, we now use metadata.zip archives instead of metadata directories. These archives need to be created by the ZipMetadata utility, which does special processing to allow all of the other GS code to run against the metadata.zip archives without unzipping them. We could rewrite gs_merge_metadata.sh to not need to unzip the input archives either, but this is not currently how it works.
If you are not running in a cloud environment, it may be easier to run SVPreprocess.q on batches of N samples and then run ZipMetadata to pack up the metadata directory to make it easier to work with.
Metadata merging currently takes a lot of disk.
We typically allocate 350G for 100 single-sample metadata.zip archives with 25-30x coverage and scale up from there.Hope this helps,
-Bob -
Dear Bob Handsaker
Thank you. I found the workflow you mentioned on github https://github.com/broadinstitute/genomestrip-builds/blob/master/svtoolkit/scripts/firecloud/gs_merge_metadata.sh.
Best regards,
Zheng Zhuqing
-
I'm not sure what that repo is, but it's not maintained. The current release on the web site is r1958. The WDLs and scripts are in the tarball.
-Bob
-
Dear Bob Handsaker
Thank you for your reminding, I now refer to the scripts in the svtoolkit_2.00.1958.tar.gz. After trying with forty samples (I ran SVPreprocess on batches of twenty samples, thus I will got two metadata.zip for these samples), although finished, I still have some uncertainties and questions. I am sorry I do not how to highlight the code and error messages here.
- From the online document, I cann't find the description of the parameter embedded in the shell script gs_merge_metadata.sh. If possible, I hope you and your colleages can provide something like this http://software.broadinstitute.org/software/genomestrip/org_broadinstitute_sv_qscript_SVPreprocess.html.
- In the following part, I do not know which parameter points to the ${mdInputLocations}?
mdInputLocations=$(find ${sampleMdDir} -mindepth 1 -maxdepth 1 | awk '{print "-mdi " $0}' | xargs)
echo "mdInputLocations=${mdInputLocations}"java -cp ${SV_CLASSPATH} -Xmx4g \
org.broadinstitute.gatk.queue.QCommandLine \
-S ${SV_DIR}/qscript/preprocess/SVMergeMetadata.q \
-S ${SV_DIR}/qscript/SVQScript.q \
-gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
-cp ${SV_CLASSPATH} \
-configFile genstrip_parameters.txt \
-tempDir ${SV_TMPDIR} \
-jobLogDir logs \
-R ${referenceFile} \
${mdInputLocations} \
-md metadata \
-useParallelMerge true \
-jobRunner ParallelShell \
-run \
-l DEBUG \
|| exit 1- During test, the screen prints many messages "DEBUG 19:35:39,590 ParallelShellJobRunner - sh /home/zqzheng/03.CNV/02.GenomeStrip/02.SVMergeMetadata/tmp/.exec5170504113291417775 :: Got return on exit status in future: 0 ". Does this affect the results?
- When genotyping, the program failed with following message, hope you can help me.
INFO 20:26:29,891 HelpFormatter - -----------------------------------------------------------------------------------------
INFO 20:26:29,893 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7.GS-r1941-0-gb493839, Compiled 2020/03/25 20:13:34
INFO 20:26:29,893 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO 20:26:29,894 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
INFO 20:26:29,894 HelpFormatter - [Fri Jul 17 20:26:29 CST 2020] Executing on Linux 3.10.0-514.2.2.el7.x86_64 amd64
INFO 20:26:29,894 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_77-b03
INFO 20:26:29,897 HelpFormatter - Program Args: -T SVGenotyperWalker -R /home/zqzheng/01.meta_data/01.reference/reference.fa -O /home/zqzheng/03.CNV/02.GenomeStrip/SVGenotyper/P0107.genotypes.vcf.gz -disableGATKTraversal true -md /home/zqzheng/03.CNV/02.GenomeStrip/02.SVMergeMetadata/metadata -configFile /home/zqzheng/03.CNV/02.GenomeStrip/genstrip_parameters.txt -runDirectory /home/zqzheng/03.CNV/02.GenomeStrip/SVGenotyper -genderMapFile /home/zqzheng/03.CNV/02.GenomeStrip/gender.map -vcf /home/zqzheng/03.CNV/02.GenomeStrip/DEL.vcf -partitionName P0107 -partition records:10601-10700 -L 1:1-1
INFO 20:26:29,900 HelpFormatter - Executing as zqzheng@localhost.localdomain on Linux 3.10.0-514.2.2.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_77-b03.
INFO 20:26:29,900 HelpFormatter - Date/Time: 2020/07/17 20:26:29
INFO 20:26:29,900 HelpFormatter - -----------------------------------------------------------------------------------------
INFO 20:26:29,900 HelpFormatter - -----------------------------------------------------------------------------------------
INFO 20:26:29,906 17-Jul-2020 GenomeAnalysisEngine - Strictness is SILENT
INFO 20:26:30,021 17-Jul-2020 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 20:26:30,074 17-Jul-2020 IntervalUtils - Processing 1 bp from intervals
INFO 20:26:30,120 17-Jul-2020 GenomeAnalysisEngine - Preparing for traversal
INFO 20:26:30,120 17-Jul-2020 GenomeAnalysisEngine - Done preparing for traversal
INFO 20:26:30,121 17-Jul-2020 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 20:26:30,121 17-Jul-2020 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 20:26:30,121 17-Jul-2020 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
INFO 20:26:30,123 17-Jul-2020 SVGenotyper - Opening reference sequence ...
INFO 20:26:30,123 17-Jul-2020 SVGenotyper - Opened reference sequence.
INFO 20:26:30,124 17-Jul-2020 MetaData - Opening metadata ...
INFO 20:26:30,124 17-Jul-2020 MetaData - Adding metadata location /home/zqzheng/03.CNV/02.GenomeStrip/02.SVMergeMetadata/metadata ...
INFO 20:26:30,126 17-Jul-2020 MetaData - Opened metadata.
INFO 20:26:30,127 17-Jul-2020 SVGenotyper - Initializing input data set ...
INFO 20:26:30,633 17-Jul-2020 SVGenotyper - Initialized data set: 120 files, 120 read groups, 120 samples.
INFO 20:26:30,667 17-Jul-2020 MetaData - Loading insert size distributions ...
INFO 20:26:30,705 17-Jul-2020 ReadCountCache - Initializing read count cache with 1 file.
#ERROR: No insert size data for record: E00292:8:HH23MCCXY:2:1207:14326:70926
#ERROR: E00292:8:HH23MCCXY:2:1207:14326:70926 99 3 68186745 60 150M = 68186772 177 AGTACTGAAAAGCATATGACATTTGAACAAATTCTGCTATGTTTCAGAGCATGTACAAGTATCTCATAAAATATCTGTAACCTCATGGGCAACCCCAGACATGGGCAACAAAACCAAAAAAAACCTGCAGAGTTTTCACTAGTACTACTG AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ MC:Z:150M MD:Z:150 PG:Z:MarkDuplicates RG:Z:ZJ-1 NM:i:0 AS:i:150 XS:i:20
#ERROR: Error processing sample: ZJ-1
#ERROR: Error processing input file: /home/zqzheng/01.meta_data/03.alignment/ZJ-1.sorted.markdup.bam
Error: Exception processing cnp: No insert size data for record: E00292:8:HH23MCCXY:2:1207:14326:70926
CNP: . 3:68187887-68187998
##### ERROR --
##### ERROR stack trace
java.lang.RuntimeException: No insert size data for record: E00292:8:HH23MCCXY:2:1207:14326:70926
at org.broadinstitute.sv.genotyping.ReadPairMapper.selectAsPairedRead(ReadPairMapper.java:394)
at org.broadinstitute.sv.genotyping.ReadPairMapper.selectRecord(ReadPairMapper.java:250)
at org.broadinstitute.sv.genotyping.ReadPairMapper.searchWindow(ReadPairMapper.java:226)
at org.broadinstitute.sv.genotyping.ReadPairMapper.getReadPairs(ReadPairMapper.java:201)
at org.broadinstitute.sv.genotyping.GenotypingReadPairModule.getReadPairs(GenotypingReadPairModule.java:310)
at org.broadinstitute.sv.genotyping.GenotypingReadPairModule.genotypeSample(GenotypingReadPairModule.java:121)
at org.broadinstitute.sv.genotyping.GenotypingReadPairModule.genotypeCnp(GenotypingReadPairModule.java:78)
at org.broadinstitute.sv.genotyping.GenotypingReadPairModule.genotypeCnp(GenotypingReadPairModule.java:48)
at org.broadinstitute.sv.genotyping.GenotypingAlgorithm.genotypeCnpInternal(GenotypingAlgorithm.java:149)
at org.broadinstitute.sv.genotyping.GenotypingAlgorithm.genotypeCnp(GenotypingAlgorithm.java:114)
at org.broadinstitute.sv.genotyping.SVGenotyperWalker.processVCFFile(SVGenotyperWalker.java:274)
at org.broadinstitute.sv.genotyping.SVGenotyperWalker.map(SVGenotyperWalker.java:218)
at org.broadinstitute.sv.genotyping.SVGenotyperWalker.map(SVGenotyperWalker.java:58)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:106)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:98)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:316)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
at org.broadinstitute.sv.main.SVCommandLine.execute(SVCommandLine.java:145)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
at org.broadinstitute.sv.main.SVCommandLine.main(SVCommandLine.java:95)
at org.broadinstitute.sv.main.SVGenotyper.main(SVGenotyper.java:21)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.7.GS-r1941-0-gb493839):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: No insert size data for record: E00292:8:HH23MCCXY:2:1207:14326:70926
##### ERROR ------------------------------------------------------------------------------------------- I do not whether the deviations of breakpoint, maybe over 100bp or even large, affect the genotype results greatly.
Sincerely,
Zheng Zhuqing
-
Dear Bob Handsaker
I know why program failed when genotyping, as the bam file was not included in the process of SVPreprocess.
Sincerely,
Zheng Zhuqing
Please sign in to leave a comment.
7 comments