Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

HaplotypeCaller Input Inquiry

  • Pamela Bretscher

    Hi Arosato,

    As far as I know, there isn't any "preference" of HaplotypeCaller for a certain file type. CRAM/BAM/SAM files should all work just fine depending on which one you have/would like to use. I have seen some previous posts comparing the results from different input file types, and it seems that there can be some minor discrepancies in results due to changes in the files during conversion. However, HaplotypeCaller doesn't inherently work better or worse with a certain file type. Please let me know if this helps answer your question.

    Kind regards,

    Pamela

  • Arosato

    Hi Pamela,

    Thanks for the response. Unfortunately, this answer leaves me with more questions. I am running the https://app.terra.bio/#workspaces/help-gatk/GATK4-Germline-Preprocessing-VariantCalling-JointCalling/workflows/help-gatk/1-2-Haplotypecaller-HG38 WDL pipeline. When a CRAM rather than a BAM file is provided as input to the workflow, the pipeline runs three major tasks in the following order (if a BAM is provided, the CramToBam step is not run):

    call-CramToBamTask, call-HaplotypeCaller, call-MergeGVCFs

    During the pipeline run, the CramToBam task takes the majority of the execution time (~6 hrs). HaplotypeCaller takes 1-2 hrs per shard (the shards can run in parallel, so it's not a bottleneck), and MergeGVCFs takes minutes. I checked the MergeGVCFs command and it does not use the BAM file as input, so I'm wondering why the pipeline would run an unnecessary CRAM-to-BAM conversion step if HaplotypeCaller has no preference for input type?

    Best, 

    Andrew

  • Pamela Bretscher

    Hi Arosato,

    I'm sorry for any confusion that my response may have caused. SAM, BAM, and CRAM files can all be used as input for HaplotypeCaller and should produce the same end results. However, there are performance issues when HaplotypeCaller works directly from a CRAM file. Therefore, if you start from a CRAM file, the CramToBam step is necessary so that HaplotypeCaller can work directly from a BAM file and perform optimally. Based on what you are asking, I guess this means that HaplotypeCaller does "prefer" a BAM file, although all three formats can technically be used as input; a CRAM input just needs the extra CramToBam step. This article might help explain a little bit more. Please let me know if this is clearer.

    Kind regards,

    Pamela

  • Arosato

    Hi Pamela, 

    That's exactly the information I was looking for! Thanks for your response and the article link!

    Best, 

    Andrew

  • Pamela Bretscher

    Great! Glad I could help. Let me know if you have any further questions.

    Kind regards,

    Pamela

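For readers following the thread, the three tasks discussed above can be sketched roughly as the command lines below. This is only an illustration under stated assumptions, not the workflow's actual WDL: the helper name `build_pipeline_commands`, the file names, and the per-interval sharding are hypothetical, and the real workflow may use different options. The sketch only builds the commands (it does not run them), and it shows that the CRAM-to-BAM conversion happens once up front, that each HaplotypeCaller shard reads the resulting BAM, and that the merge step only touches GVCFs.

```python
from typing import List


def build_pipeline_commands(
    reads: str, reference: str, intervals: List[str], sample: str = "sample"
) -> List[List[str]]:
    """Return the command lines for a CRAM/BAM -> GVCF run (hypothetical helper)."""
    commands: List[List[str]] = []

    if reads.endswith(".cram"):
        # CramToBam: decode the CRAM once against the reference, then index the BAM.
        bam = f"{sample}.bam"
        commands.append(
            ["samtools", "view", "-b", "-T", reference, "-o", bam, reads]
        )
        commands.append(["samtools", "index", bam])
    else:
        # A BAM input skips the conversion step entirely.
        bam = reads

    # HaplotypeCaller: one shard per interval; shards are independent of each other.
    gvcfs = []
    for i, interval in enumerate(intervals):
        gvcf = f"{sample}.shard{i}.g.vcf.gz"
        gvcfs.append(gvcf)
        commands.append(
            ["gatk", "HaplotypeCaller", "-R", reference, "-I", bam,
             "-L", interval, "-O", gvcf, "-ERC", "GVCF"]
        )

    # Merge step: combines the per-shard GVCFs; the BAM is not an input here.
    merge = ["gatk", "MergeVcfs"]
    for gvcf in gvcfs:
        merge += ["-I", gvcf]
    merge += ["-O", f"{sample}.g.vcf.gz"]
    commands.append(merge)

    return commands


if __name__ == "__main__":
    for cmd in build_pipeline_commands(
        "sample.cram", "ref.fasta", ["chr1", "chr2"]
    ):
        print(" ".join(cmd))
```

Passing a `.bam` path instead of a `.cram` path drops the two samtools commands, which mirrors the behaviour described in the thread where the conversion task is skipped for BAM input.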
