Somatic - Not cancer - RNA
I have RNA-seq data from ~30 related mice (C57Bl/6 strain). I am not studying cancer. Instead I have RNA from a specific brain region, plus RNA that is enriched for a specific cell type in this region.
I want to identify the somatic mutations present in each sample.
I have followed the Best Practices workflow for preprocessing RNA seq data, up to the production of base-recalibrated BAM files. I have run haplotypecaller on all samples to identify germline variants to exclude from a somatic analysis. I will perform hard filtering on these.
To find somatic variants, I came across this workflow for comparing two samples. Is this the right route to go down for such an analysis? I am confused as to how to structure my comparisons to get somatic mutation information for every sample.
Peter Kilfeather Mutect2 is the correct tool for this. Several groups at the Broad routinely use it for somatic mosaicism and other non-cancer low-AF applications.
The details of the workflow will depend on what data you have. How many samples do you have for each mouse? What is the depth of your RNA sequencing? Do you have any DNA sequencing, and if so, from where and what depth? Do you have any RNA data from outside the brain?
Thanks for your help. I have
- I have two types of sample: Cell type-specific and brain region-derived (both samples are from the same brain region). For cell type-specific samples, I have one per mouse, with 32 mice in total, spread over 8 conditions (4 mice per condition). For brain region-derived samples, I have only pooled samples; one sample per 4 mice in each condition.
- The samples are sequenced a depth of 40 million.
- I have no DNA sequencing data for these mice.
- I have no RNA data from outside the brain.
- If this helps, I know the identity of the mice, and can annotate which mice are siblings or cousins.
So far, I have run haplotypecaller on these samples, using dbSNP v150, and have done a first pass of each file with Mutect2, in accordance with the workflow for comparing two samples.
I am trying to work out exactly how I should compare groups. Do I select a control sample(s)/condition against which I compare every other sample?
EDIT: I should add that I do have tissue from other brain regions available that could be DNA sequenced, if you think there is a strong need.
It's important to distinguish between two kinds of comparison. Comparing between different individuals is an analysis that comes after Mutect2. The responsibility of Mutect2 is to compare between different samples from the same individual. Therefore, I think the optimal workflow here might be simply to run Mutect2 in tumor-only mode for each mouse independently and then post-process however you want.
The great risk of tumor-only mode, as usual, is that germline variants may appear as somatic variants. With humans, you can catch variants that appear in gnomAD but can't do much for rare variants. Laboratory mice, however, might be an easier situation. For example, if your mice are extremely inbred then germline variants are easy to identify because they are homozygous and because variants would be shared by multiple mice.
Assuming that these mice are from the same inbred or cross-bred population, I think the best thing to do is to generate the -germline-resource argument by running the HaplotypeCaller joint calling workflow on all mice in order to get a VCF with the allele frequency (AF) INFO field annotation. If your mice are not inbred then I would hope there exists such a germline resource VCF for the population that they are drawn from. Running Mutect2 with the -germline-resource will go a long way toward removing germline variants.
Since you know which mice are related, an additional step may be to consider any variant that occurs in multiple siblings or cousins to be germline. If somatic variants aren't expected to recur in exactly the same location then this filter is probably not too strict. In fact, in order to catch sequencing and mapping artifacts you might even want to filter out any variant that occurs in multiple unrelated samples.
To summarize, my advice (probably) is: 1) run HaplotypeCaller on all samples to make a -germline-resource. 2) Run Mutect2 is tumor-only mode with this germline resource on all samples. 3) Filter variants that occur in multiple related samples, unrelated samples if you're not worried about losing sensitivity.
Hi David Benjamin, thank you very much for this information!
Since my original post, I have been gradually coming to a similar conclusion, which is reassuring! I am documenting what I have done below and would be grateful if you could let me know your thoughts.
- For now, I am not trying to identify somatic mutations that occur in multiple individuals, so I am happy to filter sites that occur across multiple individuals.
- The mice are inbred (C57BL6) and all mice are either siblings, cousins, or second cousins, within a family that consists of parents and grandparents that were each siblings themselves.
- I came across a paper profiling the extent of germline variation in inbred littermates compared to the reference genome and between littermates. It was interesting to note that a small fraction of between-littermate variation exists within coding exons and UTRs; the regions I am restricting my investigation to, on account of using RNA-Seq data. So I am somewhat reassured that at least the true germline variation between my samples should be small.
- I performed joint germline calling on the whole cohort to generate a germline-resource with an allele frequency that I can use in Mutect2 tumor-only mode. I have reached the stage with this approach where I am performing hard filtering to remove artifacts. I am looking at the raw alignment data in IGV to decide whether filtered in/out calls look appropriate.
- In parallel, I have constructed a PoN from tumor-only Mutect2 runs on every sample, thinking that this would do the same job, albeit with a loss of sensitivity. Should I expect a PoN to behave similarly to the joint called germline-resource without hard filtering?
- On account of low expected germline variation, I am testing running Mutect2 in tumor-normal mode with the constructed PoN using one control sample as a reference normal sample for all runs. Would this be any better than running in tumor-only with a PoN or germline-resource, as you suggest?
I'm glad to hear that inbreeding is really working in your favor. A few things to add:
- Any artifact called by HaplotypeCaller is worth filtering because it will be an artifact if called by Mutect2 as well. Therefore, don't worry too much about hard filtering in the germline part of your pipeline. Not to say you shouldn't do it, because it's better to know exactly why variants are filtered, but don't stress over getting it exactly right and definitely err on the side of filtering too little.
- The PoN will behave differently from the germline resource, even without hard filtering, because HaplotypeCaller makes diploid calls and excludes variants and artifacts with low allele fractions that Mutect2 emits. The similarity will increase as you include less and less confident germline calls.
- The germline resource and panel of normals are complementary and if you have both you should use both. In your case hard-filtering variants that appear in more than one sample may render the PoN redundant, however. I would still use the PoN since it can't hurt and makes for a more generalizable pipeline.
- [In case others are reading this thread, I am not endorsing the use of an unmatched normal for somatic calling in humans.] Since your mice are so inbred I think using an unmatched (really an almost-matched) normal is justified. A normal sample complements the panel of normals in that it helps Mutect2 detect some artifacts that don't make it into the PoN. I would run Mutect2 with your germline resource, your PoN, and this control sample (just to be absolutely explicit, I mean all at once in a single command). As this is not exactly off-the-shelf I would sanity-check your results in case I have overlooked some pitfall. For example, if it differs greatly from results without the control sample, I would be worried that this approach is wrong.
- If you feel like experimenting, Mutect2 actually lets you specify an arbitrary number of normal samples — just put eg -I normal1.bam -normal normal_sample_1 -I normal2.bam -normal normal_sample_2 etc in your command. It's possible that including more than one mouse as the control "normal" would improve precision a bit.
It's worth stating that you now appear to be in a position where all of the options you are considering will yield reasonable results, so please don't let my verbose comments distract from that.
Please sign in to leave a comment.