Genome Analysis Toolkit

richtege · January 09, 2020 12:38

Dear GATK support team,

I just ran the GATK VariantRecalibrator (in GATK 4.1.4.0) on whole exome data for 9 samples and 21 ethnically matched 1000 Genomes samples, all processed together according to the GATK germline short variant discovery pipeline.

My command:

gatk VariantRecalibrator \
--output ParoWES2020_1000Genomes_VariantRecal.recal \
--resource:hapmap,known=false,training=true,truth=true,prior=15.0 hg38_v0_hapmap_3.3.hg38.vcf.gz \
--resource:omni,known=false,training=true,truth=truth,prior=12.0 hg38_v0_1000G_omni2.5.hg38.vcf.gz \
--resource:1000G,known=false,training=true,truth=false,prior=10.0 resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.vcf.gz \
--resource:dbsnp,known=true,training=false,truth=false,prior=2.0 hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf \
--tranches-file ParoWES2020_1000Genomes_VariantRecal.tranches \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
--max-gaussians 4 \
-titv 3 \
--variant ParoWES2020_incl1000Genomes_GenotypeGVCFs_output.g.vcf \
-R resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta \
--rscript-file ParoWES2020_1000Genomes_VariantRecal.plots.R

At a truth sensitivity threshold of 90, a Ti/Tv ratio of 1.35 was estimated for the combined dataset. I also ran VariantCallingMetrics and got the following Ti/Tv ratios (averaged separately for each of the two datasets): for my own samples, 2.49 for known (dbSNP) SNPs and 1.44 for novel SNPs; for the 21 1000Genomes samples, 2.41 (known SNPs) and 0.91 (novel SNPs). These numbers deviate considerably from the expected Ti/Tv ratio of around 3 for exome sequencing data. Also, VariantCallingMetrics reported fewer than half as many SNPs (known and novel) as VariantRecalibrator. What could be the reason for this?
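As a rough cross-check on numbers like these, Ti/Tv can be recomputed directly from a VCF. The following is only a minimal sketch, not what VariantRecalibrator or VariantCallingMetrics do internally: the function name titv_ratio is hypothetical, it reads an uncompressed VCF on stdin, counts only biallelic SNPs, and assumes that dbSNP IDs have already been annotated in the ID column (so that "." means novel).

```shell
# Hypothetical helper: Ti/Tv over biallelic SNPs in an uncompressed VCF on stdin.
# Optional argument: all (default), known, or novel.
# Assumption: "known" means the ID column (field 3) is not ".".
titv_ratio() {
  awk -F'\t' -v subset="${1:-all}" '
    /^#/ { next }                                   # skip header lines
    subset == "known" && $3 == "." { next }         # keep only records with an ID
    subset == "novel" && $3 != "." { next }         # keep only records without an ID
    length($4) == 1 && length($5) == 1 {            # single-base REF and ALT only
      pair = toupper($4 $5)
      if (pair !~ /^[ACGT][ACGT]$/ || $4 == $5) next
      # Transitions are A<->G and C<->T; every other base change is a transversion.
      if (pair == "AG" || pair == "GA" || pair == "CT" || pair == "TC") ti++
      else tv++
    }
    END {
      if (tv > 0) printf "%.2f\n", ti / tv
      else print "NA"                               # ratio undefined without transversions
    }
  '
}
```

For example, titv_ratio novel < my_calls.vcf (file name hypothetical) would print the Ti/Tv ratio of the records without an ID. Multiallelic sites and indels are skipped, so the counts will not match tools that split or normalise such records first.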

Moreover, in the tranches plot, a large number of false positives is already estimated for the first tranche, and in the third tranche the bars are completely disarranged. The plot of specificity against tranche truth sensitivity also looks particularly strange.

Several threads in the (old) forum suggest that under certain circumstances, a lower Ti/Tv ratio can be expected (see below).

Among the circumstances leading to lowered TiTv ratios that are discussed in these threads, the following apply to my experimental setup:

analysing samples represented in dbSNP (i.e., the samples from 1000Genomes; link 1)
subsetting the data to coding regions (i.e., using an intersected interval list plus 100bp padding for targeted exome; link 1)
using dbSNP138 instead of dbSNP135 (for novel SNPs; links 2, 3, and 4)
different capture targets across datasets (i.e., for my own samples and the 1000Genomes samples; link 5)

Is it possible that these technical “flaws” of my analysis pipeline really create such a heavy deviation from expected Ti/Tv ratios? Or might there be another problem that I am not aware of, as the tranches plots might also suggest? Concerning the plots, as recommended in thread no. 4 by @Geraldine_VdAuwera, I did a re-run excluding the mapping-quality annotations. This gave somewhat neater plots and slightly higher Ti/Tv ratios (see the two latter plots), but still a large number of estimated false positives.

Also, using dbSNP138 in particular seems to cause problems, as also stated in thread no. 4: "...the VQSR plotting routines use hard-coded expectations about novel vs known TiTv and variant counts that were formulated before the 1000G results were added to dbsnp. It only affects the plots, so the underlying data may be perfectly ok even if the tranche plots look terrible."

Beyond that, I found another, more recent discussion on VQSR that puzzled me (https://gatkforums.broadinstitute.org/gatk/discussion/21345/new-to-the-forum-ask-your-questions-here):

“1.) VQSR should not be run on two different exome kits combined, so VQSR should be run on these exome kits separately. 2.) 15 exomes is a small number of samples, and may not be enough data for VQSR. Is there any way to increase the number of samples inside each kit?” (…)

“Unfortunately I can't increase the number of samples. I could add 1000 genomes data, but these exomes wouldn't use the same kit either. It seems like hard filtering might be the way to go here.” (…)

“The expectations for TiTv are based on what parts of the genome are being examined. If a different part of the exome is captured by one kit versus another, it would be impossible to have a basis for predicting what the value should be. This is true of any protocol where two different types of preparations, kits or sequencing are used. It is difficult to cross-compare them based on the assumptions made in the kit itself.”

In the VQSR documentation, however, it is explicitly recommended to pad small datasets (including exomes) with publicly available data.

What are the current suggestions for small scale experiments? I can imagine that general recommendations are hard to make. But how can I judge if my experimental setup and my data meet the requirements for VQSR or if I should just leave the whole step out and go for hard filtering instead?

Thank you so much, also for all the efforts you guys make. I really appreciate your work.

Best, Gesa


Using VQSR for small scale experiments
