Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data



Using VQSR for small scale experiments

6 comments

  • Tiffany Miller

    Hi @Gesa ,

    I will have to follow up with the team on this one. 

    If anyone in the community has any comments, please add!

  • manolis

    Hi @richtege

    "What are the current suggestions for small scale experiments? I can imagine that general recommendations are hard to make. But how can I judge if my experimental setup and my data meet the requirements for VQSR or if I should just leave the whole step out and go for hard filtering instead?"

     

    The CNN pipeline is available for single-sample analysis, or more generally when you have only a few samples. It is a deep learning method for filtering germline variants. Personally, I think it is better than hard filtering. link1 and link2
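
The single-sample CNN route mentioned above could be sketched roughly as follows. This is a minimal Python sketch that only assembles the two command lines (CNNScoreVariants, then FilterVariantTranches); it does not run them. The file names, the tranche values and the helper function names are illustrative assumptions, not values from this thread, and the flags should be checked against the tool documentation for your GATK version.

```python
import shlex
from typing import List, Sequence


def cnn_score_command(reference: str, vcf_in: str, vcf_out: str) -> List[str]:
    """Argument list for annotating a VCF with 1D CNN scores (hypothetical helper)."""
    return ["gatk", "CNNScoreVariants", "-R", reference, "-V", vcf_in, "-O", vcf_out]


def tranche_filter_command(
    vcf_in: str,
    vcf_out: str,
    resources: Sequence[str],
    snp_tranche: float = 99.95,    # placeholder sensitivity, tune for your data
    indel_tranche: float = 99.4,   # placeholder sensitivity, tune for your data
    info_key: str = "CNN_1D",      # INFO key written by the 1D model
) -> List[str]:
    """Argument list for filtering the CNN-scored VCF by tranche (hypothetical helper)."""
    cmd = [
        "gatk", "FilterVariantTranches",
        "-V", vcf_in,
        "-O", vcf_out,
        "--info-key", info_key,
        "--snp-tranche", str(snp_tranche),
        "--indel-tranche", str(indel_tranche),
    ]
    for resource in resources:
        # Known-sites resources (for example HapMap or Mills) used to define tranches.
        cmd += ["--resource", resource]
    return cmd


if __name__ == "__main__":
    # All paths below are placeholders.
    score = cnn_score_command("ref.fasta", "sample.vcf.gz", "sample.cnn.vcf.gz")
    filt = tranche_filter_command(
        "sample.cnn.vcf.gz", "sample.filtered.vcf.gz", ["hapmap.vcf.gz", "mills.vcf.gz"]
    )
    print(shlex.join(score))
    print(shlex.join(filt))
```

Building the commands as lists keeps them easy to inspect before handing them to `subprocess.run`, which is one reasonable way to execute them once the paths are real.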

     

  • richtege

    Thank you @manolis for your helpful comment!

    Would this be part of the current “official” suggestions for small-scale whole exome pipelines, because of the above-mentioned downsides of padding datasets with available data generated with different protocols?

    However, I am currently trying to figure out what might be causing my high FP rate (with, indeed, some strange plots for QD). I guess I need to fix that before dealing with the CNN pipeline...

  • Tiffany Miller

    Hi richtege - I imagine you have already read this article and this one, which recommend at least 30 exomes for VQSR and hard filtering for smaller datasets. In general, the info I've gathered from the team is that VQSR does not work well with your number of samples, so you could try hard filtering or running the CNN pipeline on single samples. I wish I had better advice!

    Thanks, manolis, for chiming in.
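
To make the rule of thumb above concrete, here is a minimal Python sketch: it picks a filtering strategy from the number of exomes (using the 30-exome figure quoted in the comment) and checks one SNP's annotations against a set of hard-filter thresholds. The thresholds are the commonly cited generic GATK starting points for SNPs and are an assumption here, not values from this thread; they are meant to be tuned per dataset. The function names are hypothetical.

```python
import operator
from typing import Dict, List

MIN_EXOMES_FOR_VQSR = 30  # figure quoted in the comment above

# (annotation, comparison, threshold): a SNP fails when comparison(value, threshold) is true.
# Assumed generic starting points; adjust after inspecting your own annotation plots.
SNP_HARD_FILTERS = [
    ("QD", operator.lt, 2.0),
    ("FS", operator.gt, 60.0),
    ("MQ", operator.lt, 40.0),
    ("SOR", operator.gt, 3.0),
    ("MQRankSum", operator.lt, -12.5),
    ("ReadPosRankSum", operator.lt, -8.0),
]


def choose_filtering(n_exomes: int) -> str:
    """Return the suggested strategy for a cohort of n_exomes samples."""
    if n_exomes >= MIN_EXOMES_FOR_VQSR:
        return "VQSR"
    return "hard filtering or single-sample CNN"


def failed_snp_filters(annotations: Dict[str, float]) -> List[str]:
    """Names of the annotations on which a SNP fails; missing annotations are skipped."""
    failed = []
    for name, fails, threshold in SNP_HARD_FILTERS:
        value = annotations.get(name)
        if value is not None and fails(value, threshold):
            failed.append(name)
    return failed


if __name__ == "__main__":
    print(choose_filtering(8))
    print(failed_snp_filters({"QD": 1.5, "FS": 10.0, "MQ": 60.0}))
```

Reporting which annotation failed, rather than a single pass/fail flag, can help when one annotation such as QD looks odd across the whole call set.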

  • richtege

    Hi Tiffany,

    thanks a lot for reconfirming. However, I am still a bit puzzled. In the VQSR documentation, it is recommended to pad small exome datasets, preferably with 1000Genomes data.

    In contradiction to this, @AdelaideR stated that this should be avoided when dealing with datasets that were generated using different exome capture kits and explained it as follows:

    “The expectations for TiTv are based on what parts of the genome are being examined. If a different part of the exome is captured by one kit versus another, it would be impossible to have a basis for predicting what the value should be. This is true of any protocol where two different types of preparations, kits or sequencing are used. It is difficult to cross-compare them based on the assumptions made in the kit itself.”

    I found this important remark only once, specifically in the thread “New to the forum? Ask your questions here!”, among many unrelated issues. However, it has major implications if I understood it correctly.

    Hence, I have two major questions regarding the combination of datasets from different experiments:

    • I used an intersection of the interval lists. I assume this would not minimize a potential bias from different targets?
    • The 1000Genomes data itself was generated in different sequencing centres using different exome kits (see here), so should this be taken into account when choosing extra samples for padding?

     

    And, if I take the above recommendation seriously, the only (theoretical) approach for a small scale whole exome experiment involving VQSR would thus be:

    1. Before starting, choose samples from 1000Genomes that were sequenced in only one centre with the same exome capture kit, to pad my own experiment later.
    2. Then sequence my own samples with the same exome kit (given it is still available).

    In any other scenario, is VQSR unsuitable for small whole exome datasets? Or did I get the whole discussion terribly wrong?

     

    Best,

    Gesa
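
For the interval-list question above, one way to see how much two capture designs actually share is to intersect the target lists and compare covered bases. The sketch below is a minimal Python illustration under stated assumptions: intervals are half-open `(start, end)` pairs on a single contig, and the two kits and their coordinates are made up. It says nothing about whether the shared region removes kit-specific bias, only how large that region is.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one contig (assumption)


def merge(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and merge any that overlap or touch."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Regions present in both interval lists, via a two-pointer sweep."""
    a, b = merge(a), merge(b)
    i = j = 0
    shared: List[Interval] = []
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            shared.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared


def covered_bases(intervals: List[Interval]) -> int:
    """Total number of bases covered after merging."""
    return sum(end - start for start, end in merge(intervals))


if __name__ == "__main__":
    kit_a = [(100, 200), (300, 400)]  # hypothetical targets of one kit
    kit_b = [(150, 350)]              # hypothetical targets of another kit
    shared = intersect(kit_a, kit_b)
    print(shared, covered_bases(shared) / covered_bases(kit_a))
```

A small shared fraction would suggest that most of each kit's targets are being discarded by the intersection.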

  • Bhanu Gandham

    Hi richtege

     

    One solution you could try is to remove the 100 bp padding. The padding might be extending the target regions into intronic space, causing this weird Ti/Tv ratio.

    If this solution does not work, then you should fall back on using the CNN tools on single samples or hard filtering as suggested above.
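
To check whether padding shifts the Ti/Tv ratio as suggested above, the ratio can be computed separately for SNPs inside the unpadded targets and for SNPs in the padded flanks. Below is a minimal Python sketch of the ratio itself; it assumes biallelic SNPs given as `(ref, alt)` base pairs, skips anything that is not a single A/C/G/T substitution, and uses a hypothetical function name.

```python
from typing import Iterable, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def titv_ratio(snps: Iterable[Tuple[str, str]]) -> float:
    """Transitions divided by transversions for (ref, alt) SNP pairs."""
    transitions = transversions = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # not a simple SNP, ignore
        pair = {ref, alt}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1   # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1  # purine<->pyrimidine
    if transversions == 0:
        raise ValueError("no transversions observed; Ti/Tv is undefined")
    return transitions / transversions


if __name__ == "__main__":
    example = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "A")]  # made-up SNPs
    print(titv_ratio(example))
```

Comparing the value for the target-only set with the value for the flank-only set gives a quick indication of whether the padded regions are pulling the overall ratio down.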


