About RNA seq snp filtration
REQUIRED for all errors and issues:
a) GATK version used:v4.4.0.0
b) Exact command used: gatk VariantFiltration \
-R GRCh38.p14.genome.fa \
-V wt1.rdup.SNC.RG.filtered.vcf \
-O wt1.rdup.SNC.RG.filtered.ann.vcf \
--filter-name "my_filter1" \
--filter-expression "FS > 30.0" \
--filter-name "my_filter2" \
--filter-expression "QD < 2.0"
c) Entire program log:
hi,I am using this pipeline for RNA seq 'RNAseq short variant discovery (SNPs + Indels) – GATK (broadinstitute.org)'. And I think maybe this hard filtration is too wide? My wt control group also generates a lot of snps.
Is there any better options for snp filtration?
Thanks!
-
Hi 周尤佳
You seem to be missing 2 additional parameters included in the best practices workflow. These parameters are included to filter out clustered events that may clutter your files.
--cluster-size,-cluster <Integer>
The number of SNPs which make up a cluster. Must be at least 2 Default value: 3.
--cluster-window-size,-window <Integer>
The window size (in bases) in which to evaluate clustered SNPs Default value: 0.Below is the sample wdl code piece for VariantFiltration tool.
command <<<
${gatk_path} \
VariantFiltration \
--R ${ref_fasta} \
--V ${input_vcf} \
--window 35 \
--cluster 3 \
--filter-name "FS" \
--filter "FS > 30.0" \
--filter-name "QD" \
--filter "QD < 2.0" \
-O ${base_name}
>>>Reverse transcription, multiple PCR enrichment steps and possible RNA modifications make SNP calling from RNASEQ samples a unique approach therefore filtering criteria is also different compared to DNA sequencing. As you can see we cannot include read position bias to these filters since those reads you call variants from are also generated from raw RNA reads by splitting therefore read position becomes not feasible to track. One thing that you may need to pay attention is that although HaplotypeCaller parameters are set to avoid regions beyond exon intron boundaries towards introns due to internal calculations for SNPs and INDELs you may be observing superficial SNPs around exon-intron boundaries in your results. You may be able to filter them out using a bed file for selecting against those regions. However pay attention to those sites that are real SNPS around those boundaries due to gene model or real splice events.
Regards.
-
Hi, Gökalp Çelik
thanks for the advices. I added these 2 parameters and run again, and still get 965719 out of 1113553 passed filtration, also my rna-seq includes many intronic reads that I don't want to discard, how can I differentiate superficial SNPs and real mutations in intron regions?
Thanks!
-
Hi
If you are not discarding intronic variants that number is just about right since you are almost calling all genomic regions that are covered by reads. If you have an enriched library for mRNAs with Poly-A tails then I would say most of the intronic calls are superficial. But if you are sequencing total RNA without any particular enrichment of a subgroup then you need to try prioritizing and filtering them based on depth, MQ etc. However those MQ filters may not be as useful as in DNA sequencing because those filters a more likely to be tried mostly on BWA mapped reads. Our RNASeq variant calling workflows are optimized for calling within coding regions so for intronic RNASeq variants your mileage may vary for any kind of filtering you might apply.
Regards.
-
Hi Gökalp Çelik ,
Thank you for your help, i will consider filtering out low reads sites and calculate mutation rate at each mutation site that have higher reads in vcf to validate.
Please sign in to leave a comment.
4 comments