Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

About RNA seq snp filtration

0

4 comments

  • Avatar
    Gökalp Çelik

    Hi 周尤佳

    You seem to be missing 2 additional parameters included in the best practices workflow. These parameters are included to filter out clustered events that may clutter your files. 

    --cluster-size,-cluster <Integer>
                                  The number of SNPs which make up a cluster. Must be at least 2  Default value: 3. 

    --cluster-window-size,-window <Integer>
                                  The window size (in bases) in which to evaluate clustered SNPs  Default value: 0.

     

    Below is the sample wdl code piece for VariantFiltration tool. 

    command <<<
             ${gatk_path} \
                VariantFiltration \
                --R ${ref_fasta} \
                --V ${input_vcf} \
                --window 35 \
                --cluster 3 \
                --filter-name "FS" \
                --filter "FS > 30.0" \
                --filter-name "QD" \
                --filter "QD < 2.0" \
                -O ${base_name}
        >>>

    Reverse transcription, multiple PCR enrichment steps and possible RNA modifications make SNP calling from RNASEQ samples a unique approach therefore filtering criteria is also different compared to DNA sequencing. As you can see we cannot include read position bias to these filters since those reads you call variants from are also generated from raw RNA reads by splitting therefore read position becomes not feasible to track. One thing that you may need to pay attention is that although HaplotypeCaller parameters are set to avoid regions beyond exon intron boundaries towards introns due to internal calculations for SNPs and INDELs you may be observing superficial SNPs around exon-intron boundaries in your results. You may be able to filter them out using a bed file for selecting against those regions. However pay attention to those sites that are real SNPS around those boundaries due to gene model or real splice events. 

    Regards. 

    0
    Comment actions Permalink
  • Avatar
    周尤佳

    Hi, Gökalp Çelik

    thanks for the advices. I added these 2 parameters and run again, and still get 965719 out of 1113553 passed filtration, also my rna-seq includes many intronic reads that I don't want to discard, how can I differentiate superficial SNPs and real mutations in intron regions?

     

    Thanks!

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi 

    If you are not discarding intronic variants that number is just about right since you are almost calling all genomic regions that are covered by reads. If you have an enriched library for mRNAs with Poly-A tails then I would say most of the intronic calls are superficial. But if you are sequencing total RNA without any particular enrichment of a subgroup then you need to try prioritizing and filtering them based on depth, MQ etc. However those MQ filters may not be as useful as in DNA sequencing because those filters a more likely to be tried mostly on BWA mapped reads. Our RNASeq variant calling workflows are optimized for calling within coding regions so for intronic RNASeq variants your mileage may vary for any kind of filtering you might apply. 

    Regards. 

    0
    Comment actions Permalink
  • Avatar
    周尤佳

    Hi Gökalp Çelik ,

    Thank you for your help, i will consider filtering out low reads sites and calculate mutation rate at each mutation site that have higher reads in vcf to validate.

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk