Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Genomestrip preprocessing error

13 comments

  • Genevieve Brandt (she/her)

    Thank you for your post. Bob Handsaker has been tagged and will get back to you shortly.

  • Bob Handsaker

    You need to look for the log files for the failed jobs and provide information about what went wrong for me to be able to help you.

  • chae chae

    I found that I had been using the reference bundle "Homo_sapiens_assembly38_12Oct2016.tar.gz", so I replaced it with "Homo_sapiens_assembly38_30Aug2021.tar.gz".

    After that, I got the new errors below when running with 2 samples.

    Do you have any idea what is wrong?

    INFO  03:39:45,722 QJobsReporter - Plotting JobLogging GATKReport to file /kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_ref2021_headn2/SVPreprocess.jobreport.pdf
    WARN  03:39:50,974 RScriptExecutor - RScript exited with 1. Run with -l DEBUG for more info.
    INFO  03:39:50,977 QCommandLine - Done with errors
    INFO  03:39:51,000 QGraph - -------
    INFO  03:39:51,000 QGraph - Failed:   'java'  '-Xmx2048m'  '-XX:+UseParallelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_ref2021_headn2/.queue/tmp'  '-cp' '/kimlab_wd/yuo1996/tools/svtoolkit//lib/SVToolkit.jar:/kimlab_wd/yuo1996/tools/svtoolkit//lib/gatk/GenomeAnalysisTK.jar:/kimlab_wd/yuo1996/tools/svtoolkit//lib/gatk/Queue.jar'  '-cp' '/kimlab_wd/yuo1996/tools/svtoolkit/lib/SVToolkit.jar:/kimlab_wd/yuo1996/tools/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/kimlab_wd/yuo1996/tools/svtoolkit/lib/gatk/Queue.jar'  'org.broadinstitute.sv.apps.IndexReadCountFile'  '-I' '/kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_ref2021_headn2/rccache.merge/P0084.rccache.bin'  '-O' '/kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_ref2021_headn2/rccache.merge/P0084.rccache.bin.idx'
    INFO  03:39:51,000 QGraph - Log:     /kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_ref2021_headn2/log/SVPreprocess-195.out
    INFO  03:39:51,001 QGraph - -------
    INFO  03:39:51,001 QGraph - Failed:   'java'  '-Xmx2048m'  '-XX:+UseParallelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_ref2021_headn2/.queue/tmp'  '-cp' '/kimlab_wd/yuo1996/tools/svtoolkit//lib/SVToolkit.jar:/kimlab_wd/yuo1996/tools/svtoolkit//lib/gatk/GenomeAnalysisTK.jar:/kimlab_wd/yuo1996/tools/svtoolkit//lib/gatk/Queue.jar'  '-cp' '/kimlab_wd/yuo1996/tools/svtoolkit/lib/SVToolkit.jar:/kimlab_wd/yuo1996/tools/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/kimlab_wd/yuo1996/tools/svtoolkit/lib/gatk/Queue.jar'  'org.broadinstitute.sv.apps.MergeReadCounts'  '-R' '/kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta'  -L chr11:60000001-70000000 '-I' '/kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_ref2021_headn2/rccache.list'  '-O' '/kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_ref2021_headn2/rccache.merge/P0185.rccache.bin'
    INFO  03:39:51,001 QGraph - Log:     /kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_ref2021_headn2/log/SVPreprocess-396.out
    INFO  03:39:51,001 QGraph - -------
    INFO  03:39:51,001 QGraph - Failed:   'java'  '-Xmx2048m'  '-XX:+UseParallelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_ref2021_headn2/.queue/tmp'  '-cp' '/kimlab_wd/yuo1996/tools/svtoolkit//lib/SVToolkit.jar:/kimlab_wd/yuo1996/tools/svtoolkit//lib/gatk/GenomeAnalysisTK.jar:/kimlab_wd/yuo1996/tools/svtoolkit//lib/gatk/Queue.jar' org.broadinstitute.sv.main.SVCommandLine '-T' 'ComputeReadSpanCoverageWalker'  '-R' '/kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta'  '-I' '/kimlab_wd/yuo1996/C4_analysis/1.marked_bamfiles/batch1_100/KOREA1K-168/KOREA1K-168.marked_deduplicates.printread.bam'  '-O' '/kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_ref2021_headn2/spans/KOREA1K-168.marked_deduplicates.printread.spans.txt'  '-disableGATKTraversal' 'true'  '-md' '/kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference//metadata/batch1_ref2021_headn2/'  '-configFile' '/kimlab_wd/yuo1996/tools/svtoolkit/conf/genstrip_parameters.txt' '-configFile' '/kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/Homo_sapiens_assembly38/Homo_sapiens_assembly38.gsparams.txt'  '-P' 'chimerism.use.correction:false'  '-maxInsertSizeStandardDeviations' '3'
    INFO  03:39:51,001 QGraph - Log:     /kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_ref2021_headn2/log/SVPreprocess-17.out
    INFO  03:39:51,001 QGraph - -------
    INFO  03:39:51,001 QGraph - Failed:  samtools index /kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_ref2021_headn2/headers.bam
    INFO  03:39:51,001 QGraph - Log:     /kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_ref2021_headn2/log/SVPreprocess-4.out
    INFO  03:39:51,002 QCommandLine - Script failed: 58 Pend, 0 Run, 4 Fail, 643 Done
    ------------------------------------------------------------------------------------

  • Bob Handsaker

    You need to look at the referenced log files to see what failed in the individual jobs.

    E.g. /kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_ref2021_headn2/log/SVPreprocess-4.out and the others.

    This is a samtools index command, though, so probably either samtools isn't installed correctly or there is something wrong with /kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/batch1_ref2021_headn2/headers.bam
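
A minimal sketch of collecting those logs, assuming Queue's console output was saved to a file (the name queue_console.txt used below is hypothetical) and that the log paths contain no spaces. It pulls every path from the "QGraph - Log:" lines and prints the tail of each file, flagging logs that are empty or missing:

```shell
# Print the job log paths that Queue reports in lines such as
#   INFO  03:39:51,000 QGraph - Log:     /path/to/log/SVPreprocess-4.out
# Reads Queue's console output on stdin.
list_failed_job_logs() {
    awk '/QGraph - Log:/ { print $NF }'
}

# Show the last lines of each reported log file.
# $1 is a file holding the saved console output (hypothetical name).
show_failed_job_logs() {
    local queue_output=$1 log
    list_failed_job_logs < "$queue_output" | while read -r log; do
        echo "== $log"
        if [ -s "$log" ]; then
            tail -n 20 "$log"
        else
            echo "(empty or missing)"
        fi
    done
}
```

For example: show_failed_job_logs queue_console.txt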

  • chae chae

    I have samtools installed (latest version, 1.15), and SVPreprocess-4.out is empty.

    The headers.bam and headers.bam.bai files are intact. I also ran 'samtools index headers.bam' by hand: it worked fine and produced the same .bai file that the pipeline had created.

    What can I do?

  • Bob Handsaker

    My best guess is that it was a transient error, that Queue somehow could not access the job completion status (a failure mode I have seen on some clusters), or that for some reason it thinks the job's exit status was non-zero.

    The Queue software that Genome STRiP uses for running pipelines is like "make": if there are failed jobs and you rerun, it will only retry the failed jobs and the jobs that depend on them. If there is no error in the log, and you can run the same command manually, then I would try to just rerun the pipeline and see if it succeeds.

    You can apply this same methodology to the other errors as well. If they are transient and do not recur when you rerun, then you can ignore them.
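
The rerun advice above could be automated with a small retry wrapper. This is only a sketch: the function name retry is hypothetical, and it assumes the launch script exits with a non-zero status whenever Queue reports "Script failed", which should be checked before relying on it.

```shell
# Rerun a command until it exits with status 0, giving up after max_attempts.
# Assumption: the wrapped command exits non-zero when the pipeline fails.
retry() {
    local max_attempts=$1 attempt=1 status
    shift
    while [ "$attempt" -le "$max_attempts" ]; do
        # Capture the exit status without aborting under "set -e".
        "$@" && status=0 || status=$?
        if [ "$status" -eq 0 ]; then
            echo "attempt $attempt succeeded"
            return 0
        fi
        echo "attempt $attempt failed with exit status $status" >&2
        attempt=$((attempt + 1))
    done
    return 1
}
```

For example, with the launch script saved under the hypothetical name run_svpreprocess.sh: retry 5 bash run_svpreprocess.sh batch1_100.list batch1_100. Because Queue only retries failed and pending jobs, each pass should have less work to do than the previous one. The limit of 5 is arbitrary; errors that keep recurring are not transient and still need the job logs.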

  • chae chae

    I am not running it on a cluster, and I have not installed any workflow manager.

    I run it from a bash script, shown below.

    Does "rerunning" mean running the bash script again, or should I rerun only the java command line?

    #!/bin/bash

    LIST=$1
    # ex) /kimlab_wd/yuo1996/C4_korean1K/finish/batch1_100/batch1_100.list
    BATCH=$2
    # batch1_100

    #set SV_DIR="/kimlab_wd/yuo1996/tools/svtoolkit"
    REF_DIR=/kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/

    # $BATCH.. ㅜㅜ
    mkdir ${REF_DIR}/metadata/${BATCH}/
    paste -d "/" ${LIST} ${LIST} | sed 's/$/.marked_deduplicates.printread.bam/g' | sed 's/^/\/kimlab_wd\/yuo1996\/C4_analysis\/1.marked_bamfiles\/batch1_100\//g' | head -n 2 > ${REF_DIR}/metadata/${BATCH}/input_bam_files.list


    mkdir ${REF_DIR}/metadata/${BATCH}/log
    cd ${REF_DIR}/metadata/${BATCH}/

    #PATH=$PATH:/${PENNCNV_PATH_WARE}/PennCNV-1.0.5/
    #PATH=$PATH:/${PENNCNV_PATH_WARE}/gw6/bin

    # These executables must be on your path.
    which java > /dev/null
    which Rscript > /dev/null
    which samtools > /dev/null

    export SV_DIR=/kimlab_wd/yuo1996/tools/svtoolkit/
    export PATH=${SV_DIR}/bwa:${PATH}
    export LD_LIBRARY_PATH=${SV_DIR}/bwa:${LD_LIBRARY_PATH}
    classpath="${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar:${SV_DIR}/lib/gatk/Queue.jar"

    java -Xmx4g -cp ${classpath} org.broadinstitute.gatk.queue.QCommandLine \
    -S ${SV_DIR}/qscript/SVPreprocess.q \
    -S ${SV_DIR}/qscript/SVQScript.q \
    -cp ${classpath} \
    -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
    -configFile ${SV_DIR}/conf/genstrip_parameters.txt \
    -R ${REF_DIR}/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta \
    -I /kimlab_wd/yuo1996/C4_analysis/2.Genomestrip_preprocessing_reference/metadata/${BATCH}/input_bam_files.list \
    -md ${REF_DIR}/metadata/${BATCH}/ \
    -bamFilesAreDisjoint true \
    -jobLogDir ${REF_DIR}/metadata/${BATCH}/log/ \
    -rmd ${REF_DIR}/Homo_sapiens_assembly38/ \
    -computeGCProfiles true \
    -computeReadCounts true \
    -reduceInsertSizeDistributions true \
    -disableGATKTraversal \
    -useMultiStep \
    -P chimerism.use.correction:false \
    -run

  • Bob Handsaker

    I see. Yes, you can try just running it again. Let me know if it runs the second time through. It is possible there is a dependency that is not correct (so it is running some things out of order).

  • chae chae

    After 4 repetitions, it succeeded!

    Now I'm trying to run all of my samples (100 samples per batch).

    Thanks a lot!

    I have one more small question:

    Is there an option to set the number of threads? I want to increase it because my sample set is large.

  • Bob Handsaker

    Glad you were able to get it to work.

    I was also able to recreate this behavior, so I will look to see if there is a problem with the dependencies and try to fix it.

    A similar option to what you are doing would be to use

    -jobRunner ParallelShell -maxConcurrentRun N

    which will limit the number of concurrent jobs to N.

    For more scalability, Queue is designed to run jobs on a local HPC cluster. If your cluster has DRMAA support installed, for example, you can use

    -jobRunner Drmaa

    and this will dispatch jobs to the cluster. You usually need to set up some options specific to your cluster, because every HPC cluster is different.

    We also have a version of the pipelines in Terra that runs on the Google Cloud Platform. The WDLs are included with the Genome STRiP code, so you can take a look there. I don't think we currently have a public Terra workspace you can clone from, but if you want to clone from a sample workspace (e.g. for 1000 Genomes) I can give you access.
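
For the ParallelShell option mentioned above, a minimal sketch of building those two arguments in a launch script. The helper name parallel_shell_args and its max_jobs cap are hypothetical; only -jobRunner ParallelShell and -maxConcurrentRun come from the reply above. It picks N from the number of online CPUs, limited by the cap:

```shell
# Print the Queue options for running up to N jobs concurrently.
# $1 (max_jobs) is a hypothetical upper bound on N. The failed commands in
# this thread each start a JVM with -Xmx2048m, so memory, not CPU count,
# may be the tighter limit.
parallel_shell_args() {
    local max_jobs=$1 cpus
    cpus=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    if [ "$cpus" -gt "$max_jobs" ]; then
        cpus=$max_jobs
    fi
    echo "-jobRunner ParallelShell -maxConcurrentRun $cpus"
}
```

In the script posted earlier, the output could be spliced in just before -run, for example as $(parallel_shell_args 8), left unquoted so that it expands into separate arguments.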

  • chae chae

    Thank you for the additional information!

    Actually, I did this preprocessing for the C4 copy number analysis that is provided in a Terra workspace.

    Having access would be very helpful for working through the whole pipeline.

    My whole sample set is also about 1K samples.

  • Bob Handsaker

    Send me your Terra (Google) ID or email address. The workflows are all public in Terra, but it can be helpful to have an example usage to work from. I don't currently have an example that is locked down with RequesterPays, etc.

  • chae chae

    yuo1996@gmail.com 

    This is my address!

    I'm going to try it and see whether it gives me faster or more convenient results for the next step.
