GVCF vs VCF INFO tags, relevance for filtering RNAseq SNPs
Hello all,
I am a relatively new user of GATK, so my question may be considered somewhat basic. Can someone explain why the tags in the INFO field of GVCF files (as output by HaplotypeCaller) and standard VCF files (as output by GenotypeGVCFs) differ? I'd be unsurprised if the answer was to some extent in my question, but I'm struggling to find any information online as to how these parameters are derived in each case.
The reason I ask is that I am currently attempting to use GATK to conduct SNP and indel calling from RNAseq data from pathogen-infected plants. I am at a stage where I need to conduct hard filtering (I am working on a non-model organism with no prior data, so VQSR is not an option), and I am confused by the following circumstances. GATK best practices for variant calling from RNAseq data seem to dictate that I run VariantFiltration directly after HaplotypeCaller (i.e. without using GenotypeGVCFs to generate a standard VCF file). However, guidance from the GATK website for such filtering discusses filtering by many parameters that are not present in GVCF files, such as FisherStrand (FS) and StrandOddsRatio (SOR). Since these tags are not present in GVCF files, I'm assuming that running a command like the one below would not actually do anything to the data?
gatk VariantFiltration \
-R reference.fasta \
-V sorted_dupsmarked.g.vcf \
-O sorted_dupsmarked_filtered.g.vcf \
--filter-name "FS60" \
--filter-expression "FS > 60" \
--filter-name "SOR" \
--filter-expression "SOR > 3"
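One way to check which annotations a file declares before filtering on them is to read its header. The following is a minimal sketch, assuming an uncompressed VCF or GVCF whose header uses the usual `##INFO=<ID=...` lines; the helper names `list_info_tags` and `has_info_tag` are made up for illustration and are not part of GATK.

```shell
# List the INFO tag IDs declared in the header of an uncompressed VCF/GVCF.
# Header lines look like: ##INFO=<ID=FS,Number=1,Type=Float,Description="...">
list_info_tags() {
    grep '^##INFO=<ID=' "$1" | sed -e 's/^##INFO=<ID=//' -e 's/,.*$//' | sort
}

# Succeed (exit status 0) only if the given tag is declared in the header.
has_info_tag() {
    list_info_tags "$1" | grep -Fqx -- "$2"
}

# Hypothetical usage, with the file name from the command above:
# for tag in FS SOR; do
#     if has_info_tag sorted_dupsmarked.g.vcf "$tag"; then
#         echo "$tag is declared"
#     else
#         echo "$tag is not declared"
#     fi
# done
```

A declared tag is not necessarily present on every record, so this only answers whether a filter on that tag could match anything at all.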
Furthermore, examples of methods in papers by other researchers frequently filter by these parameters, usually after using HaplotypeCaller and GenotypeGVCFs to generate their multi-sample VCF. So I wonder, are they doing something wrong by using GenotypeGVCFs on GVCFs derived from RNAseq data?
What am I missing here?
Thanks in advance for any attempts to help.
-
Thank you for your post, thomas welch! I want to let you know we have received your question. Our GATK support team goes through the Community Discussion questions in the order they are received - we'll get back to you if we have any updates or follow up questions.
Please see our Support Policy for more details.
-
Hi thomas welch,
Thanks for your question! Yes, GVCF files and VCF files have different INFO tags. You can find examples of these files in our documents about them: GVCF - Genomic Variant Call Format & VCF - Variant Call Format.
Yes, the RNAseq Best Practices involve running HaplotypeCaller with the default parameter -ERC NONE, which outputs a VCF instead of a GVCF. So, the steps of combining GVCFs and genotyping GVCFs are not needed.
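As a rough sketch of that single-sample route: HaplotypeCaller is run without any -ERC argument, and VariantFiltration is applied to the resulting VCF. The file names and the helper name `call_and_filter_single_sample` are placeholders, and the FS and SOR thresholds are simply the ones from the question, not recommended values. By default the script only prints the commands.

```shell
# Launcher; defaults to "echo gatk" so the commands are only printed.
# Set GATK=gatk to actually run them. Printed commands lose their quoting.
GATK="${GATK:-echo gatk}"

# Hypothetical helper: call variants for one RNAseq sample, then hard-filter.
# Arguments: reference FASTA, input BAM, output prefix (all placeholders).
call_and_filter_single_sample() {
    ref="$1"; bam="$2"; prefix="$3"

    # No -ERC argument, so the default (NONE) applies and a regular VCF is written.
    $GATK HaplotypeCaller -R "$ref" -I "$bam" -O "$prefix.vcf.gz"

    # FS and SOR thresholds are the ones from the question, not recommendations.
    $GATK VariantFiltration -R "$ref" -V "$prefix.vcf.gz" \
        -O "$prefix.filtered.vcf.gz" \
        --filter-name "FS60" --filter-expression "FS > 60" \
        --filter-name "SOR3" --filter-expression "SOR > 3"
}

call_and_filter_single_sample reference.fasta sample.bam sample
```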
If you want to genotype multiple samples, you can do joint calling by running HaplotypeCaller in GVCF mode with this argument: -ERC GVCF. Then, you will combine your GVCFs using GenomicsDBImport or CombineGVCFs. Finally, you will run GenotypeGVCFs to get a multi-sample VCF. Then, you can run VariantFiltration on this VCF.
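The joint-calling route could look roughly like the sketch below, which uses CombineGVCFs for the combining step. The helper name `joint_call`, the `cohort` output names, and the input file names are placeholders, the filter thresholds are again the ones from the question, and BAM paths are assumed to contain no spaces. By default the script only prints the commands.

```shell
# Launcher; defaults to "echo gatk" so the commands are only printed.
# Set GATK=gatk to actually run them. Printed commands lose their quoting.
GATK="${GATK:-echo gatk}"

# Hypothetical helper. Arguments: reference FASTA, then one BAM per sample.
joint_call() {
    ref="$1"; shift
    variant_args=""

    for bam in "$@"; do
        sample=$(basename "$bam" .bam)
        # GVCF mode: one per-sample GVCF for later joint genotyping.
        $GATK HaplotypeCaller -R "$ref" -I "$bam" -O "$sample.g.vcf.gz" -ERC GVCF
        variant_args="$variant_args --variant $sample.g.vcf.gz"
    done

    # $variant_args is left unquoted on purpose so it splits into separate words.
    $GATK CombineGVCFs -R "$ref" $variant_args -O cohort.g.vcf.gz

    # Joint genotyping produces the multi-sample VCF.
    $GATK GenotypeGVCFs -R "$ref" -V cohort.g.vcf.gz -O cohort.vcf.gz

    # Hard filtering is applied to the genotyped VCF, not to the GVCFs.
    $GATK VariantFiltration -R "$ref" -V cohort.vcf.gz -O cohort.filtered.vcf.gz \
        --filter-name "FS60" --filter-expression "FS > 60" \
        --filter-name "SOR3" --filter-expression "SOR > 3"
}

joint_call reference.fasta sample1.bam sample2.bam
```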
Let me know if this answers your question or if you have any other further questions.
Best,
Genevieve
-
Hello Genevieve,
Thank you very much for clarifying this for me.
Kind Regards,
Tom
-
You're welcome!