Mutect2 resulting variant numbers
Hello,
I am using gatk/4.5.0.0. I am getting a really large number of variants between my crispr edited and non-edited sample (same cell line), which seems off. I wanted to check in about this before proceeding to ensure I am not going wrong elsewhere. After preparing my Bam files according to Gatk best practices, I ran the following:
#this is the ncbi reference genome
ref="GRCh38_latest_genomic.fna"
#this is the tumor sample (with ncbi naming)
tumor="wgs_edited/edited_CKDN230037261-1A_HCT72DSX7_aligned_scratch_threaded_2_sorted_faster_dup_rg.bam"
#this is the pre-edited sample with ncbi naming
norm="wgs_pre_edit/pre_edit_CKDN230037259-1A_HG57MDSX7-aligned_scratch_threaded_2_sorted_faster_dup_rg.bam"
#the norm is pre_edit
norm_name="pre_edit"
#germ + table of pons, converted to ncbi format naming
germ="somatic-hg38_af-only-gnomad_decomp_ncbi.hg38.vcf.gz"
pon="somatic-hg38_1000g_pon_decomp_ncbi.hg38.vcf.gz"
#outout is the variant file but neglecting the base quality score recalibration
out="somatic_mt_wo_bqsr.vcf.gz"
#mutect2 is designed to call somatic variants only
gatk Mutect2 -R "$ref" -I "$tumor" -I "$norm" -normal "$norm_name" --germline-resource "$germ" --panel-of-normals "$pon" -O "$out"
The summary stats of my resulting VCF is as follows:
# SN [2]id [3]key [4]value
SN 0 number of samples: 2
SN 0 number of records: 461585
SN 0 number of no-ALTs: 0
SN 0 number of SNPs: 222327
SN 0 number of MNPs: 7316
SN 0 number of indels: 238359
SN 0 number of others: 0
SN 0 number of multiallelic sites: 91655
SN 0 number of multiallelic SNP sites: 2620
This seems way too large considering the cell line is exactly the same. Is there anything I'm doing clearly incorrectly. Thanks!
-
Hi
Mutect2 is highly sensitive and ploidy independent variant caller which will call any possible variants possible with given proper parameters. Since you are dealing with a whole genome sample the number of variants you receive without any pre-filtration is quite plausible. Also since you have not provided any intervals to call variants from you will have lots of variation especially from intergenic regions with many repeats and structural changes.
Also you mentioned that you have not performed bqsr on your bam files which may result in increased number of over/under scored systematic errors which will clutter all the variant calls.
Even if you think 2 samples are derived from the same cell line you can never have 100% pure homogenous collection of cells within your samples to begin with due to characteristics of cell lines. One possible solution might be to have multiple technical and/or biological replicates to call variants from and see common patterns of errors and true variants.
These are just the ones that came to my mind but I will also ask the team to see if they have any further suggestions.
Regards.
-
Thank you, further when I run filter mutect2 I see no changes. I am also running mutect with bqsr corrected, and though it is still running, based off the current numbers I don't expect the order of the number of identified variants to drastically change. Below are my commands for the filtering
#! /usr/bin/env bash
# # lines starting with #$ is an instruction to the job scheduler
#$ -S /bin/bash # the shell language when run via the job scheduler [IMPORTANT]
#$ -cwd # job should run in the current working directory
#$ -j y # STDERR and STDOUT should be joined
#$ -R yes
#$ -l h_rt=300:00:00 # job requires up to 3 hours of runtime
#$ -r y # if job crashes, it should be restarted
#A simple script to align read files (fastq) to a reference file (.fna) resulting in a (bam) format.
#this is currently for the non-recalibrated files
module load CBI
module load gatk
input="somatic_mt_wo_bqsr.vcf.gz"
output="somatic_mt_wo_bqsr_PE_filtered.vcf.gz"
ref="GRCh38_latest_genomic.fna"
gatk FilterMutectCalls -R "$ref" -V "$input" -O "$output"
The stats for the "somatic_mt_wo_bqsr_PE_filtered.vcf.gz" are the same
-
What are the sequencing depths of these samples?
Depending on the depth of the samples certain variant calls will have increased or decreased sensitivity.
Also without a proper filtering applied to those calls it would be quite early to talk about the validity of those calls. We would suggest running a thorough variant filtration using FilterMutectCalls and may be using some other thresholds that you may think of depending on the results you get.
Regards.
-
Thank you for the reply I'll have to check the depth, in my above post I was attempting to do the FilterMutectCalls with the pasted commands, however, the number of resulting records in the outputted vcf seems to be the same
-
Number of records will be the same but with Filters applied this time you can select the Filter=PASS variants to reduce the numbers.
Please sign in to leave a comment.
5 comments