Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

run SVPreprocess and SVGenotyper for each sample

0

7 comments

  • Avatar
    Bob Handsaker

    Genotyping requires multiple samples to be accurate. For larger cohorts, we typically run in batches of 100. You can merge the genotyping results across batches. If you run in small batches (20 or 30 samples) you need to do very careful QC.

    Preprocessing is inherently single-sample. There is a single-sample preprocessing workflow in Terra. For convenience and scalability, though, you should merge the preprocessing data into larger metadata.zip archives, typically one for each calling batch.

    0
    Comment actions Permalink
  • Avatar
    zzq

    Dear Bob Handsaker

    Thank you. You mean that I can run Processing for each single sample, and then merge the single metadata together, and genotype using the merged metadata as input finally. If that is ok, I will change to this to speed up processing. More, how to merge the metadata (I can not open the Terra in China)?

    Best regards,

    Zheng Zhuqing

    0
    Comment actions Permalink
  • Avatar
    Bob Handsaker


    I see. The process works well in Terra on google, but may be a bit clunky if not running on the cloud.

    The WDLs (I think up to date) and associated scripts are in the distribution.
    For each sample, we run:
    wdl/firecloud/gs_preprocess.wdl
    which calls scripts/firecloud/preprocess.sh
    which produces metadata.zip as output.

    Then on a collection of these single-sample metadata.zip files, we run
    wdl/firecloud/gs_merge_metadata.wdl
    which calls scripts/firecloud/gs_merge_metadata.sh

    This last script actually unpacks each input metadata.zip then merges them together to create a merged metadata.zip.

    For cloud processing, and local processing too, we now use metadata.zip archives instead of metadata directories. These archives need to be created by the ZipMetadata utility, which does special processing to allow all of the other GS code to run against the metadata.zip archives without unzipping them. We could rewrite gs_merge_metadata.sh to not need to unzip the input archives either, but this is not currently how it works.

    If you are not running in a cloud environment, it may be easier to run SVPreprocess.q on batches of N samples and then run ZipMetadata to pack up the metadata directory to make it easier to work with.

    Metadata merging currently takes a lot of disk.
    We typically allocate 350G for 100 single-sample metadata.zip archives with 25-30x coverage and scale up from there.

    Hope this helps,
    -Bob

    1
    Comment actions Permalink
  • Avatar
    zzq

    Dear Bob Handsaker

     

    Thank you. I found the workflow you mentioned on github https://github.com/broadinstitute/genomestrip-builds/blob/master/svtoolkit/scripts/firecloud/gs_merge_metadata.sh.

     

    Best regards,

    Zheng Zhuqing

    0
    Comment actions Permalink
  • Avatar
    Bob Handsaker

    I'm not sure what that repo is, but it's not maintained. The current release on the web site is r1958. The WDLs and scripts are in the tarball.

    -Bob

     

    0
    Comment actions Permalink
  • Avatar
    zzq

    Dear Bob Handsaker

    Thank you for your reminding, I now refer to the scripts in the svtoolkit_2.00.1958.tar.gz.  After trying with forty samples (I ran SVPreprocess on batches of twenty samples, thus I will got two metadata.zip for these samples), although finished, I still have some uncertainties and questions. I am sorry I do not how to highlight the code and error messages here.

    mdInputLocations=$(find ${sampleMdDir} -mindepth 1 -maxdepth 1 | awk '{print "-mdi " $0}' | xargs)
    echo "mdInputLocations=${mdInputLocations}"

    java -cp ${SV_CLASSPATH} -Xmx4g \
    org.broadinstitute.gatk.queue.QCommandLine \
    -S ${SV_DIR}/qscript/preprocess/SVMergeMetadata.q \
    -S ${SV_DIR}/qscript/SVQScript.q \
    -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
    -cp ${SV_CLASSPATH} \
    -configFile genstrip_parameters.txt \
    -tempDir ${SV_TMPDIR} \
    -jobLogDir logs \
    -R ${referenceFile} \
    ${mdInputLocations} \
    -md metadata \
    -useParallelMerge true \
    -jobRunner ParallelShell \
    -run \
    -l DEBUG \
    || exit 1

    • During test, the screen prints many messages "DEBUG 19:35:39,590 ParallelShellJobRunner - sh /home/zqzheng/03.CNV/02.GenomeStrip/02.SVMergeMetadata/tmp/.exec5170504113291417775 :: Got return on exit status in future: 0 ". Does this affect the results?
    • When genotyping, the program failed with following message, hope you can help me.

     

    INFO 20:26:29,891 HelpFormatter - -----------------------------------------------------------------------------------------
    INFO 20:26:29,893 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7.GS-r1941-0-gb493839, Compiled 2020/03/25 20:13:34
    INFO 20:26:29,893 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
    INFO 20:26:29,894 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
    INFO 20:26:29,894 HelpFormatter - [Fri Jul 17 20:26:29 CST 2020] Executing on Linux 3.10.0-514.2.2.el7.x86_64 amd64
    INFO 20:26:29,894 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_77-b03
    INFO 20:26:29,897 HelpFormatter - Program Args: -T SVGenotyperWalker -R /home/zqzheng/01.meta_data/01.reference/reference.fa -O /home/zqzheng/03.CNV/02.GenomeStrip/SVGenotyper/P0107.genotypes.vcf.gz -disableGATKTraversal true -md /home/zqzheng/03.CNV/02.GenomeStrip/02.SVMergeMetadata/metadata -configFile /home/zqzheng/03.CNV/02.GenomeStrip/genstrip_parameters.txt -runDirectory /home/zqzheng/03.CNV/02.GenomeStrip/SVGenotyper -genderMapFile /home/zqzheng/03.CNV/02.GenomeStrip/gender.map -vcf /home/zqzheng/03.CNV/02.GenomeStrip/DEL.vcf -partitionName P0107 -partition records:10601-10700 -L 1:1-1
    INFO 20:26:29,900 HelpFormatter - Executing as zqzheng@localhost.localdomain on Linux 3.10.0-514.2.2.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_77-b03.
    INFO 20:26:29,900 HelpFormatter - Date/Time: 2020/07/17 20:26:29
    INFO 20:26:29,900 HelpFormatter - -----------------------------------------------------------------------------------------
    INFO 20:26:29,900 HelpFormatter - -----------------------------------------------------------------------------------------
    INFO 20:26:29,906 17-Jul-2020 GenomeAnalysisEngine - Strictness is SILENT
    INFO 20:26:30,021 17-Jul-2020 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
    INFO 20:26:30,074 17-Jul-2020 IntervalUtils - Processing 1 bp from intervals
    INFO 20:26:30,120 17-Jul-2020 GenomeAnalysisEngine - Preparing for traversal
    INFO 20:26:30,120 17-Jul-2020 GenomeAnalysisEngine - Done preparing for traversal
    INFO 20:26:30,121 17-Jul-2020 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
    INFO 20:26:30,121 17-Jul-2020 ProgressMeter - | processed | time | per 1M | | total | remaining
    INFO 20:26:30,121 17-Jul-2020 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
    INFO 20:26:30,123 17-Jul-2020 SVGenotyper - Opening reference sequence ...
    INFO 20:26:30,123 17-Jul-2020 SVGenotyper - Opened reference sequence.
    INFO 20:26:30,124 17-Jul-2020 MetaData - Opening metadata ...
    INFO 20:26:30,124 17-Jul-2020 MetaData - Adding metadata location /home/zqzheng/03.CNV/02.GenomeStrip/02.SVMergeMetadata/metadata ...
    INFO 20:26:30,126 17-Jul-2020 MetaData - Opened metadata.
    INFO 20:26:30,127 17-Jul-2020 SVGenotyper - Initializing input data set ...
    INFO 20:26:30,633 17-Jul-2020 SVGenotyper - Initialized data set: 120 files, 120 read groups, 120 samples.
    INFO 20:26:30,667 17-Jul-2020 MetaData - Loading insert size distributions ...
    INFO 20:26:30,705 17-Jul-2020 ReadCountCache - Initializing read count cache with 1 file.
    #ERROR: No insert size data for record: E00292:8:HH23MCCXY:2:1207:14326:70926
    #ERROR: E00292:8:HH23MCCXY:2:1207:14326:70926 99 3 68186745 60 150M = 68186772 177 AGTACTGAAAAGCATATGACATTTGAACAAATTCTGCTATGTTTCAGAGCATGTACAAGTATCTCATAAAATATCTGTAACCTCATGGGCAACCCCAGACATGGGCAACAAAACCAAAAAAAACCTGCAGAGTTTTCACTAGTACTACTG AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ MC:Z:150M MD:Z:150 PG:Z:MarkDuplicates RG:Z:ZJ-1 NM:i:0 AS:i:150 XS:i:20
    #ERROR: Error processing sample: ZJ-1
    #ERROR: Error processing input file: /home/zqzheng/01.meta_data/03.alignment/ZJ-1.sorted.markdup.bam
    Error: Exception processing cnp: No insert size data for record: E00292:8:HH23MCCXY:2:1207:14326:70926
    CNP: . 3:68187887-68187998
    ##### ERROR --
    ##### ERROR stack trace
    java.lang.RuntimeException: No insert size data for record: E00292:8:HH23MCCXY:2:1207:14326:70926
    at org.broadinstitute.sv.genotyping.ReadPairMapper.selectAsPairedRead(ReadPairMapper.java:394)
    at org.broadinstitute.sv.genotyping.ReadPairMapper.selectRecord(ReadPairMapper.java:250)
    at org.broadinstitute.sv.genotyping.ReadPairMapper.searchWindow(ReadPairMapper.java:226)
    at org.broadinstitute.sv.genotyping.ReadPairMapper.getReadPairs(ReadPairMapper.java:201)
    at org.broadinstitute.sv.genotyping.GenotypingReadPairModule.getReadPairs(GenotypingReadPairModule.java:310)
    at org.broadinstitute.sv.genotyping.GenotypingReadPairModule.genotypeSample(GenotypingReadPairModule.java:121)
    at org.broadinstitute.sv.genotyping.GenotypingReadPairModule.genotypeCnp(GenotypingReadPairModule.java:78)
    at org.broadinstitute.sv.genotyping.GenotypingReadPairModule.genotypeCnp(GenotypingReadPairModule.java:48)
    at org.broadinstitute.sv.genotyping.GenotypingAlgorithm.genotypeCnpInternal(GenotypingAlgorithm.java:149)
    at org.broadinstitute.sv.genotyping.GenotypingAlgorithm.genotypeCnp(GenotypingAlgorithm.java:114)
    at org.broadinstitute.sv.genotyping.SVGenotyperWalker.processVCFFile(SVGenotyperWalker.java:274)
    at org.broadinstitute.sv.genotyping.SVGenotyperWalker.map(SVGenotyperWalker.java:218)
    at org.broadinstitute.sv.genotyping.SVGenotyperWalker.map(SVGenotyperWalker.java:58)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:106)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
    at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:98)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:316)
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
    at org.broadinstitute.sv.main.SVCommandLine.execute(SVCommandLine.java:145)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
    at org.broadinstitute.sv.main.SVCommandLine.main(SVCommandLine.java:95)
    at org.broadinstitute.sv.main.SVGenotyper.main(SVGenotyper.java:21)
    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.7.GS-r1941-0-gb493839):
    ##### ERROR
    ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
    ##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
    ##### ERROR Visit our website and forum for extensive documentation and answers to
    ##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
    ##### ERROR
    ##### ERROR MESSAGE: No insert size data for record: E00292:8:HH23MCCXY:2:1207:14326:70926
    ##### ERROR ------------------------------------------------------------------------------------------

    • I do not whether the deviations of breakpoint, maybe over 100bp or even large, affect the genotype results greatly. 

     

    Sincerely,

    Zheng Zhuqing

    0
    Comment actions Permalink
  • Avatar
    zzq

    Dear Bob Handsaker

    I know why program failed when genotyping, as the bam file was not included in the process of SVPreprocess.

    Sincerely,

    Zheng Zhuqing

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk