I am new to variant calling and could use some guidance on a pipeline I've been working on for SNP calling from Hi-C data in our lab.
We have five biological replicates, each resulting in a BAM file after alignment. Our objective is to pool information from these five files to enhance SNP calling and avoid losing variants due to low coverage or data sparsity in individual files.
After reviewing the GATK documentation, my initial approach involved performing joint variant calling (using `-ERC GVCF`) after processing the BAM files separately. This included steps like marking duplicates, applying BQSR, sorting, indexing, and ensuring consistent SM tags across the five files.
However, at the HaplotypeCaller step I hit this error: "A USER ERROR has occurred: Argument emit-ref-confidence has a bad value: Can only be used in single-sample mode currently. Use the --sample-name argument to run on a single sample out of a multi-sample BAM file." I took this to mean that joint variant calling needs more than one sample, not just multiple files for a single sample. Is my understanding correct?
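Since the error mentions a multi-sample BAM, one thing I can check is which SM values the read groups actually carry. Below is a minimal sketch, assuming the header of each BAM has been dumped to text with `samtools view -H`; the function name `sample_names` and the example header are made up for illustration and are not taken from my real files.

```python
def sample_names(header_text):
    """Return the set of distinct SM values found on @RG header lines."""
    names = set()
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue  # only read-group lines carry the SM tag
        for field in line.split("\t")[1:]:
            if field.startswith("SM:"):
                names.add(field[3:])
    return names


# Hypothetical header: two read groups share one sample name, one differs.
example_header = "\n".join([
    "@HD\tVN:1.6\tSO:coordinate",
    "@RG\tID:rep1\tSM:sampleA\tPL:ILLUMINA",
    "@RG\tID:rep2\tSM:sampleA\tPL:ILLUMINA",
    "@RG\tID:rep3\tSM:sampleB\tPL:ILLUMINA",
])

print(sorted(sample_names(example_header)))
```

As far as I understand, if more than one name comes back, the input counts as multi-sample from the point of view of the SM tag, even if all the reads come from the same biological source.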
In response, I took a step back and merged all five BAM files into one. With a single merged BAM, however, I am unsure whether running HaplotypeCaller in GVCF mode still offers any advantage or is simply unnecessary.
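To make the merged approach concrete, here is a rough sketch of the commands I have in mind, written as a small Python helper that only builds and prints the command lines without running them. The file names (`rep1.bam` to `rep5.bam`, `ref.fasta`, `merged.bam`, `merged.vcf.gz`) and the helper name `merged_calling_commands` are placeholders, not my actual paths, and this is not the full pipeline.

```python
import shlex


def merged_calling_commands(bams, reference, merged_bam, out_vcf, gvcf=False):
    """Build (but do not run) the merge, index, and calling commands."""
    merge = ["samtools", "merge", merged_bam, *bams]
    index = ["samtools", "index", merged_bam]
    call = ["gatk", "HaplotypeCaller",
            "-R", reference, "-I", merged_bam, "-O", out_vcf]
    if gvcf:
        # GVCF mode is the part I am unsure about for a single merged sample.
        call += ["-ERC", "GVCF"]
    return [merge, index, call]


# Placeholder replicate file names.
bams = [f"rep{i}.bam" for i in range(1, 6)]
for cmd in merged_calling_commands(bams, "ref.fasta", "merged.bam", "merged.vcf.gz"):
    print(shlex.join(cmd))
```

Passing `gvcf=True` appends `-ERC GVCF` to the HaplotypeCaller call, which is the variant of the command my question is about.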
For your reference, here are the versions of the tools I've used:
- The Genome Analysis Toolkit (GATK) v18.104.22.168
I have added the final version of my pipeline.
I would greatly appreciate your insights, any adjustment suggestions, or clarifications regarding the steps I've followed.