Using VQSR for small scale experiments
Dear GATK support team,
I just ran the GATK VariantRecalibrator (in GATK 4.1.4.0) on whole exome data for 9 samples and 21 ethnically matched 1000Genome samples, all processed together according to the GATK germline short variant discovery pipeline.
My command:
gatk VariantRecalibrator \
--output ParoWES2020_1000Genomes_VariantRecal.recal \
--resource:hapmap,known=false,training=true,truth=true,prior=15.0 hg38_v0_hapmap_3.3.hg38.vcf.gz \
--resource:omni,known=false,training=true,truth=truth,prior=12.0 hg38_v0_1000G_omni2.5.hg38.vcf.gz \
--resource:1000G,known=false,training=true,truth=false,prior=10.0 resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.vcf.gz \
--resource:dbsnp,known=true,training=false,truth=false,prior=2.0 hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf \
--tranches-file ParoWES2020_1000Genomes_VariantRecal.tranches \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
--max-gaussians 4 \
-titv 3 \
--variant ParoWES2020_incl1000Genomes_GenotypeGVCFs_output.g.vcf \
-R resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta \
--rscript-file ParoWES2020_1000Genomes_VariantRecal.plots.R
At a truth sensitivity threshold of 90, a Ti/Tv ratio of 1.35 was estimated for the combined dataset. I also ran VariantCallingMetrics and got the following Ti/Tv ratios (averaged over the two different datasets each): for my own samples 2.49 for known (dbSNP) SNPs and 1.44 for novel SNPs, for the 21 1000Genomes samples 2.41 (known SNPs) and 0.91 (novel SNPs). These numbers deviate considerably from the expected TiTv ratio of around 3 for exome seq data. Also, VariantCallingMetrics estimated less than half as many SNPs (known and novel) than VariantRecalibrator. What could be the reason for this?
Moreover, in the tranches plot, a large number of false positives is estimated already for the first tranche, and in the third tranche, the bars are completely disarranged. The specificity/tranch truth sensitivity also plot looks particularly strange.
Several threads in the (old) forum suggest that under certain circumstances, a lower Ti/Tv ratio can be expected (see below).
- https://gatkforums.broadinstitute.org/gatk/discussion/23388/understanding-my-vqsr-results
- https://gatkforums.broadinstitute.org/gatk/discussion/5421/after-calling-and-variant-recalibration-data-shows-very-low-titv-ratios-what-could-be-the-cause
- https://gatkforums.broadinstitute.org/gatk/discussion/1259/which-training-sets-arguments-should-i-use-for-running-vqsr
- https://gatkforums.broadinstitute.org/gatk/discussion/5221/tranches-plot-issue
- https://gatkforums.broadinstitute.org/gatk/discussion/5564/collected-questions-about-vqsr
Among the circumstances leading to lowered TiTv ratios that are discussed in these threads, the following apply to my experimental setup:
- analysing samples represented in dbSNP (i.e., the samples from 1000Genomes); link1)
- subsetting the data to coding regions (i.e., using an intersected interval list plus 100bp padding for targeted exome; link 1)
- using dbSNP138 (instead of dbSNP135) (for novel SNPs; link 2, 3, and 4)
- different capture targets across datasets (i.e., for my own samples and the 1000Genomes sampes; link 5)
Is it possible that these technical “flaws” of my analysis pipeline really create such a heavy deviation from expected Ti/Tv ratios? Or might there be another problem that I am not aware of, as might also be suggested by the tranches plots? Concerning the plots, as recommended in thread no. 4 by @Geraldine_VdAuwera, I did a re-run excluding mapping quality-annotations, with some neater plots and slightly higher TiTv ratios (see the two latter plots), however still with a large number of estimated FP.
Also, expecially using dbSNP138 seems to cause problems, also stated in thread no. 4: "...the VQSR plotting routines use hard-coded expectations about novel vs known TiTv and variant counts that were formulated before the 1000G results were added to dbsnp. It only affects the plots, so the underlying data may be perfectly ok even if the tranche plots look terrible."
Beyond that, I found another, more recent discussion on VQSR that puzzled me (https://gatkforums.broadinstitute.org/gatk/discussion/21345/new-to-the-forum-ask-your-questions-here):
“1.) VQSR should not be run on two different exome kits combined, so VQSR should be run on these exome kits separately. 2.) 15 exomes is a small number of samples, and may not be enough data for VQSR. Is there any way to increase the number of samples inside each kit?” (…)
“Unfortunately I can't increase the number of samples. I could add 1000 genomes data, but these exomes wouldn't use the same kit either. It seems like hard filtering might be the way to go here.” (…)
“The expectations for TiTv are based on what parts of the genome are being examined. If a different part of the exome is captured by one kit versus another, it would be impossible to have a basis for predicting what the value should be. This is true of any protocol where two different types of preparations, kits or sequencing are used. It is difficult to cross-compare them based on the assumptions made in the kit itself.”
In the VQSR documentation, is it explicitly recommended to pad small datasets (also exome) with publicly available data.
What are the current suggestions for small scale experiments? I can imagine that general recommendations are hard to make. But how can I judge if my experimental setup and my data meet the requirements for VQSR or if I should just leave the whole step out and go for hard filtering instead?
Thank you so much, also for all the efforts you guys make. I really appreciate your work.
Best, Gesa




-
Hi @Gesa ,
I will have to follow up with the team on this one.
If anyone in the community has any comments, please add!
-
Hi @richtege
"What are the current suggestions for small scale experiments? I can imagine that general recommendations are hard to make. But how can I judge if my experimental setup and my data meet the requirements for VQSR or if I should just leave the whole step out and go for hard filtering instead?"
Is available the CNN pipeline for single sample analysis or in general when you have few samples, it is a deep learning method to filter germline variants. Personally I think that it is better than the hard filtering. link1 and link2
-
Thank you @manolis for your helpful comment!
Would this be part of the current “official” suggestions for small-scale whole exome pipelines, because of the above mentioned downsides of padding datasets with available data generated with different protocols?
However, I am currently trying to figure out what might be the cause for my high FP rate (with indeed some strange plots regarding the QD), I guess I need to fix that before dealing with the CNN pipeline...
-
Hi richtege - I imagine you already read this article and this one which recommends at least 30 exomes for VQSR and hard filtering for smaller datasets. In general, the info I've gathered from the team is that VQSR isn't that great for your number of samples, so you could try hard-filtering or running the CNN pipeline on single samples. I wish I had better advice!
Thanks, manolis for chiming in.
-
Hi Tiffany,
thanks a lot for reconfirming. However, I am still a bit puzzled. In the VQSR documentation, it is recommended to pad small exome datasets, preferably with 1000Genomes data.
In contradiction to this, @AdelaideR stated that this should be avoided when dealing with datasets that were generated using different exome capture kits and explained it as follows:
“The expectations for TiTv are based on what parts of the genome are being examined. If a different part of the exome is captured by one kit versus another, it would be impossible to have a basis for predicting what the value should be. This is true of any protocol where two different types of preparations, kits or sequencing are used. It is difficult to cross-compare them based on the assumptions made in the kit itself.”
I found this important remark only once - to be specific in the thread “New to the forum? Ask your questions here!”, among many other different issues. However, it holds a lot of implications if I understood it correctly.
Hence, I have two major questions regarding the combination of datasets from different experiments:
- I used an intersection of the interval lists, I assume this wouldn’t minimize a potential bias from different targets?
- the 1000Genomes data itself was generated in different sequencing centres using different exome kits (see here), so this should accordingly be taken into account when choosing extra samples for padding?
And, if I take the above recommendation seriously, the only (theoretical) approach for a small scale whole exome experiment involving VQSR would thus be:
- before starting, choose samples from 1000Genomes that were sequenced in only one centre with the same exome capture kit to later pad own experiment
- then sequence own samples with the same exome kit (given it is still available)
- in any other scenario, VQSR is not suitable for small whole exome datasets? Or did I get the whole discussion terribly wrong?
Best,
Gesa
-
Hi richtege
One of the solution you could try is to remove the 100bp padding. This might be extending the genomic regions to fall in intronic space causing this weird Ti/Tv ration.
If this solution does not work, then you should fall back on using the CNN tools on single samples or hard filtering as suggested above.
Please sign in to leave a comment.
6 comments