What variant calling method should I use?
I have WES data from a highly mutagenic tumor cell line. I have the data from day zero cells, and also from daughter cells after some time.
I would like to find out what variants are in my daughter cells, but not present in my day zero cells.
Do I need to use germline variant calling (HaplotypeCaller), or somatic variant calling (Mutect2), or both?
And how does joint genotyping come into play here? I have multiple samples from the two conditions (day zero vs daughter).
I feel like what I need to do is maybe something like:
1. Run germline variant calling (HaplotypeCaller) individually on each of my daughter cell samples and output gVCF files
2. Joint genotype these gVCF files into a single VCF file
3. Do the same thing for my day zero samples (or alternatively, run CreateSomaticPanelOfNormals after running Mutect2 in "tumor-only" mode individually on my day zero samples)
4. Run somatic variant calling (Mutect2) in tumor-only mode individually on each of my daughter cell samples, using the joint genotyped day zero VCF file from above as my Panel of Normals.
5. And then what would I do here? Take the union between the germline and somatic call set? The intersect? Subtract one from the other?
Does this make sense?
-
Chris It's a lot simpler than that. Before suggesting an answer, I have a few unsolicited (sorry!) public service announcements.
PSA #1: Only make your own panel of normals if you have both hundreds of samples and a very good reason to do so. Otherwise, use one from our best practices google bucket gs://gatk-best-practices/somatic-hg38 (there's also one for the hg19 reference).
PSA #2: The panel of normals captures technical artifacts, not germline variants.
PSA #3: Tumor-only calling is always much less reliable than tumor-normal.
Anyway, you have two options, depending on whether you want to resolve the difference between daughter cells. Both involve Mutect2.
Option #1: If you want to discover all mutations that exist in any daughter cell (ie you want to pool the data from all daughter cells) and do not exist in the zero-day cells you should run Mutect2 as follows
gatk Mutect2 -R ref.fasta \
-I zero-day1.bam -I zero-day2.bam \
-I daughter1.bam -I daughter2.bam \
--normal zero-day-sample1 --normal zero-day-sample2 \
--germline-resource gs://gatk-best-practices/somatic-hg38/af-only-gnomad.hg38.vcf.gz \
-pon gs://gatk-best-practices/somatic-hg38/1000g_pon.hg38.vcf.gz \
-O pooled-daughters.vcf
Option #2: If you want to treat the daughters as distinct populations run Mutect2 separately for each daughter, the only difference between the above command being that you only input one daughter at a time.
Note that in both cases it is essential to run FilterMutectCalls afterwards. "zero-day-sample1" and "zero-day-sample2" should be replaced by their sample names from the BAM header.
Please sign in to leave a comment.
1 comment