HaplotypeCaller Input Inquiry
Hi GATK Team,
I was looking at the Tool page for the 4.2.4.1 HaplotypeCaller found at https://gatk.broadinstitute.org/hc/en-us/articles/4414586765723?page=1#comment_4638698103451
I noticed the message "Input bam file(s) from which to make variant calls" in the Input subsection. Further down, I noticed in the Haplotype Specific Arguments section that the --input argument can take CRAM/SAM/BAM files as input. I was wondering if the program has a preference for input file type? By that I mean, "under the hood" does the program have to do some kind of conversion between file types if one input is offered versus another? Is it computationally more efficient to give the program one type over the others?
Best,
Andrew Rosato
Partners Personal Medicine (MGB)
-
Hi Arosato,
As far as I know, there isn't any "preference" of HaplotypeCaller for a certain file type. CRAM/BAM/SAM files should all work just fine depending on which one you have/would like to use. I have seen some previous posts comparing the results from different input file types, and it seems that there can be some minor discrepancies in results due to changes in the files during conversion. However, HaplotypeCaller doesn't inherently work better or worse with a certain file type. Please let me know if this helps answer your question.
Kind regards,
Pamela
-
Hi Pamela,
Thanks for the response. Unfortunately, this answer leaves me with more questions. I am running the https://app.terra.bio/#workspaces/help-gatk/GATK4-Germline-Preprocessing-VariantCalling-JointCalling/workflows/help-gatk/1-2-Haplotypecaller-HG38 WDL pipeline. When a CRAM file is provided as input to the workflow, the pipeline runs three major tasks in the following order (if a BAM file is provided instead, the CramToBam step is not run):
call-CramToBamTask, call-HaplotypeCaller, call-MergeGVCFs
During the pipeline run, the CramToBam task takes the majority of the execution time (~6 hrs). HaplotypeCaller takes 1-2 hrs per shard (the shards can run in parallel, so it's not a bottleneck), and MergeGVCFs takes minutes. I checked the MergeGVCFs command and it does not use the BAM file as input, so I'm wondering why the pipeline would run an unnecessary CRAM-to-BAM conversion step if HaplotypeCaller has no preference for input type?
Best,
Andrew
-
Hi Arosato,
I'm sorry for any confusion that my response may have caused. SAM, BAM, or CRAM files can all be used as input for HaplotypeCaller and should produce the same end results. However, there are performance issues when HaplotypeCaller works directly from a CRAM file. Therefore, if you do use a CRAM file as input, the CramToBam step is necessary because HaplotypeCaller needs to work directly from a BAM file for optimal performance. Based on what you are asking, I guess this would mean that HaplotypeCaller does "prefer" a BAM file, even though all three formats can technically be used as input. It's just that the extra CramToBam step will be necessary for a CRAM file. This article might help explain a little bit more. Please let me know if this is more clear.
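To make the branching concrete, here is a minimal sketch in Python of the decision described above: convert first when the input is a CRAM, otherwise call variants straight from the BAM. It is not the workflow's actual WDL. The function name plan_commands and all file names are hypothetical, the use of samtools for the conversion is an assumption, and the per-shard interval arguments the real pipeline passes are left out.

```python
from pathlib import Path


def plan_commands(reads, reference, out_gvcf):
    """Return the commands to run, in order, as argument lists.

    Hypothetical helper for illustration only; nothing is executed here.
    """
    reads = Path(reads)
    commands = []

    if reads.suffix.lower() == ".cram":
        bam = reads.with_suffix(".bam")
        # Assumption: samtools does the conversion. -T names the reference
        # needed to decode the CRAM, -b asks for BAM output.
        commands.append(
            ["samtools", "view", "-b", "-T", str(reference),
             "-o", str(bam), str(reads)]
        )
        # HaplotypeCaller needs an index alongside the BAM.
        commands.append(["samtools", "index", str(bam)])
        reads = bam

    # Call variants from the BAM in GVCF mode (the pipeline merges GVCFs later).
    commands.append(
        ["gatk", "HaplotypeCaller", "-R", str(reference),
         "-I", str(reads), "-O", str(out_gvcf), "-ERC", "GVCF"]
    )
    return commands


if __name__ == "__main__":
    for cmd in plan_commands("sample.cram", "ref.fasta", "sample.g.vcf.gz"):
        print(" ".join(cmd))
```

With a CRAM input the plan has three commands (convert, index, call); with a BAM input only the HaplotypeCaller command remains, which mirrors why the conversion task is skipped when a BAM is supplied.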
Kind regards,
Pamela
-
Hi Pamela,
That's exactly the information I was looking for! Thanks for your response and the article link!
Best,
Andrew
-
Great! Glad I could help. Let me know if you have any further questions.
Kind regards,
Pamela