How do I call variants from somatic RNA data?
AnsweredHi,
I have fish liver and brain mRNAseq reads that I would like to call variants from. I am following the "RNAseq short variant discovery (SNPs + Indels)" GATK Best practices workflow which can be found here.
I wonder:
1. If I should use HaplotypeCaller, or Mutect2 to call variants? HaplotypeCaller is designed for germline samples, while Mutect2 is designed for somatic samples, but tumor samples must be present in the dataset. So technically my dataset does not fit 100% to either tool's assumptions.
2. If using Mutect2, should I perform joint GVCF calling (with -ERC GVCF), or per-sample variant calling (with all bam files input at once in the command)?
3. If using HaplotypeCaller in per-sample mode, as suggested by the Best practices workflow, what is the command like when I have multiple bam inputs? e.g., Should I do:
gatk --java-options "-Xmx4g" HaplotypeCaller \ -R referece.fasta \ -I input1.bam \
-I input2.bam \
-I input3.bam \ -O output.vcf.gz
Thank you so much!
Peiwen
-
Hi Peiwen Li,
I am going to move your post into our Community Discussions -> Special GATK Use Cases topic, as the Non-Human topic is for reporting bugs and issues with GATK.
You can read more about our forum guidelines and the topics here: Forum Guidelines.
Best,
Genevieve
-
Hi Peiwen Li, it depends on if you are looking to find differences between your samples and a reference? Or, your samples to a normal + reference? You can read more about the distinction in this article: Somatic calling is NOT simply a difference between two callsets
-
Hi Genevieve Brandt (she/her),
Thank you for your replies and I am sorry I posted my questions to the wrong place.
I am looking to find differences between my samples and a reference. And after reading the article you shared, I think my situation is more fitted into the HaplotypeCaller's algorithm?
Thank you,
Peiwen
-
I think so! Glad it helped to clarify!
Have a good weekend,
Genevieve
-
Hi Genevieve Brandt (she/her),
I have a confusion on the "RNAseq short variant discovery (SNPs + Indels)" GATK Best practices workflow. It suggests to use per-sample variant calling for RNAseq data. What does that mean exactly? If I have multiple BAM inputs, should I run HaplotypeCaller for each of them one at a time, and get multiple VCF outputs? Then how can I get ONE single concordant multi-sample VCF for all VCF outputs?
Thank you so much and I hope you have a great weekend!
Peiwen
-
Hi Peiwen,
Joint calling is not supported yet for this method. With this best practices pipeline you will get multiple VCF outputs if you have multiple BAM inputs.
Genevieve
Please sign in to leave a comment.
6 comments