Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data


Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Can I continue to run HaplotypeCaller when it is broken?

5 comments

  • danilovkiri

    Hi Sin Lee

    No, you cannot. However, since HC can operate over regions, you can inspect the GVCF file it produced and find the last chromosome that was fully and successfully genotyped. For example, suppose HC genotyped chromosomes 1, 2, and 3 and raised an exception while calling chromosome 4. In that case, subset the VCF output to chromosomes 1, 2, and 3 and discard the rest. Then run HC again on the same data, specifying -XL chr1 -XL chr2 -XL chr3 (-XL excludes a region from analysis) and a different name for the new output. Finally, combine the first and second outputs with one of the available tools (CombineVariants in GATK3, for instance, or MergeVcfs in GATK4).

    The -L and -XL options also accept chromosomal regions such as -XL chr1:100-200; however, for simplicity, I do not recommend using them here.
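A minimal sketch of the restart described above, written as a small Python helper that only assembles the command line. The file names (ref.fasta, sample.bam, sample.rest.g.vcf.gz) and the function name are hypothetical, and it assumes the original run used GVCF mode (-ERC GVCF); adjust the arguments to match your original command.

```python
import shlex


def build_resume_command(reference, bam, output, completed_contigs):
    """Build a HaplotypeCaller command that excludes already-called contigs."""
    cmd = [
        "gatk", "HaplotypeCaller",
        "-R", reference,
        "-I", bam,
        "-O", output,      # use a new output name, not the interrupted file
        "-ERC", "GVCF",    # assumption: the original run produced a GVCF
    ]
    for contig in completed_contigs:
        cmd += ["-XL", contig]  # -XL excludes the contig from analysis
    return cmd


if __name__ == "__main__":
    cmd = build_resume_command(
        "ref.fasta", "sample.bam", "sample.rest.g.vcf.gz",
        ["chr1", "chr2", "chr3"],
    )
    print(shlex.join(cmd))
```

Printing the command rather than running it lets you check it against the original invocation before committing more cluster time.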

    The most important thing is to find out what caused the server error you mention. Besides, as far as I recall, GATK4 does not offer stable built-in parallelization for HC (the Spark version of HC is in beta), and parallelization is what you need to speed up the analysis. I suggest you find out how many CPUs/threads your server has and run multiple HC commands in parallel (using screen or tmux sessions), one per chromosome. That is, the commands should be identical except that each has a -L argument with its own chrN value. You can then combine the resulting files using GATK CombineGVCFs.
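The per-chromosome approach could also be driven from a script instead of separate screen or tmux sessions. The sketch below is one possible way to do that, not an official GATK workflow: the helper names, the output naming scheme, and the input file names are assumptions, and max_workers should be chosen to fit the CPUs and memory actually available.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def per_contig_commands(reference, bam, prefix, contigs):
    """Return {contig: command}, one HaplotypeCaller command per contig."""
    return {
        contig: [
            "gatk", "HaplotypeCaller",
            "-R", reference,
            "-I", bam,
            "-O", f"{prefix}.{contig}.g.vcf.gz",  # one output per contig
            "-ERC", "GVCF",
            "-L", contig,                          # restrict to this contig
        ]
        for contig in contigs
    }


def run_all(commands, max_workers):
    """Run the commands concurrently; return {contig: exit code}."""
    def run(item):
        contig, cmd = item
        return contig, subprocess.run(cmd).returncode

    # Threads are enough here: each one just waits on an external process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run, commands.items()))
```

A call such as run_all(per_contig_commands("ref.fasta", "sample.bam", "sample", ["chr1", "chr2"]), max_workers=2) would then report which contigs finished with a non-zero exit code, so only those need to be rerun.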

  • Bhanu Gandham

    Thank you for your input, danilovkiri!

  • Wan-Ting Huang

    Hi,

    I have the same issue. My HaplotypeCaller process was terminated because it hit the time limit on the cluster. I tried to subset the successfully called variants using SelectVariants as you suggested, but it requires an index file, which I don't have because the process did not finish properly, and I can't create one separately with IndexFeatureFile or samtools either.

    What can I do? Is the only option to discard the generated gVCF files, which took 3 days to produce, and run again with multiple jobs working on non-overlapping regions?

    Thank you for your time and suggestions.

    Best regards,

  • Gökalp Çelik

    Hi. 

    We recommend splitting variant calling into parallel jobs over non-overlapping segments, preferably split at long stretches of Ns in the reference genome. Each segment gets its own index file once it completes, and when all parts are complete you should be able to gather them into a single GVCF file. SelectVariants is not needed for this task.
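To illustrate what splitting at long stretches of Ns means, here is a small sketch that scans one contig's sequence and returns the intervals between N runs. Picard's ScatterIntervalsByNs tool exists for this purpose on real references; the function names and the min_n_run threshold below are assumptions chosen for illustration only.

```python
import re


def split_at_n_runs(contig, sequence, min_n_run=100):
    """Return 1-based inclusive (contig, start, end) intervals between N runs.

    Only runs of at least min_n_run consecutive N bases act as split points;
    shorter runs stay inside an interval.
    """
    intervals = []
    pos = 0  # 0-based start of the current non-N segment
    for match in re.finditer(rf"[Nn]{{{min_n_run},}}", sequence):
        if match.start() > pos:
            intervals.append((contig, pos + 1, match.start()))
        pos = match.end()
    if pos < len(sequence):
        intervals.append((contig, pos + 1, len(sequence)))
    return intervals


def as_interval_string(interval):
    """Format an interval as contig:start-end, the form accepted by -L."""
    contig, start, end = interval
    return f"{contig}:{start}-{end}"
```

Each resulting interval can be passed to its own HaplotypeCaller job with -L, and because the segments do not overlap, the per-segment outputs can be gathered in order afterwards.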

    If you are sure which parts were completed in your prematurely stopped run, you may be able to recover those variants from that file (although you may need to do a little bash/python/awk scripting), then restart from where the run left off and exclude the already completed parts. HaplotypeCaller has no resume function for interrupted runs, so it is up to you to restart from where it stopped.
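The scripting mentioned above might look something like the following sketch. It rests on several assumptions that you should verify for your own run: the interrupted output is an uncompressed text GVCF (a truncated compressed file may not be readable this way), records were written in reference order, and every contig except the last one seen was finished. The function names are hypothetical.

```python
def completed_contigs(lines):
    """List contigs in order of appearance, dropping the last one seen.

    The last contig is treated as possibly incomplete. A final line without
    a trailing newline is ignored, since it may have been cut off mid-write.
    """
    order = []
    for line in lines:
        if line.startswith("#") or not line.endswith("\n"):
            continue
        contig = line.split("\t", 1)[0]
        if not order or order[-1] != contig:
            order.append(contig)
    return order[:-1]


def keep_completed(lines, completed):
    """Yield header lines and the records that belong to completed contigs."""
    completed = set(completed)
    for line in lines:
        if line.startswith("#") or line.split("\t", 1)[0] in completed:
            yield line


def recover(src_path, dst_path):
    """Two passes over the file, so no records are held in memory."""
    with open(src_path) as src:
        done = completed_contigs(src)
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(keep_completed(src, done))
    return done
```

The list returned by recover is the set of contigs to exclude with -XL when restarting. It is worth checking that the last record kept for each contig actually reaches the end of that contig before trusting the recovered file.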

    Regards. 

  • Wan-Ting Huang

    Thank you for the fast reply!

