Seeking Clarification on RNA-seq Variant Calling Pipeline
I am currently working on variant calling from RNA-seq data and have encountered some discrepancies between the pipeline I am using and the one recommended by GATK. I am reaching out to seek clarification and guidance on these differences.
The pipeline I am currently using was adapted from a script that a colleague used for their exome data. However, upon comparing it with the recommended GATK protocol, I noticed several differences. For example, my script includes steps like VariantRecalibrator and GenotypeGVCFs, which are not part of the recommended GATK pipeline for RNA-seq data. On the other hand, the GATK pipeline includes steps like VariantFiltration, which are not present in my script.
Below is the pipeline I am currently using:
1. AddOrReplaceReadGroups to name reads in bam files (not included in the GATK recommended pipeline)
2. MarkDuplicates
3. AddOrReplaceReadGroups to organize reads and index the bam file (not included in the GATK recommended pipeline)
4. SplitNCigarReads (originally I was not doing this step and then one of the GATK people recommended it for RNAseq data).
5. BaseRecalibrator
6. ApplyBQSR
7. HaplotypeCaller
8. GenomicsDBImport and GenotypeGVCFs (not included in the GATK recommended pipeline)
9. VariantAnnotator (not included in the GATK recommended pipeline)
10. VariantRecalibrator (not included in the GATK recommended pipeline)
11. ApplyVQSR (not recommended to run twice in the GATK pipeline)
9. VariantAnnotator (not included in the GATK recommended pipeline).
As you can see, there are several steps that I'm not running that are included in GATK recommended pipeline, like VariantFiltration, MergeVCFs, and ScatterIntervalList.
So I don't really know how to harmonize my script with the one recommended by GATK. Which steps should I not follow? and which steps I should start running?
Hopefully, you can help me with these questions.
REQUIRED for all errors and issues:
a) GATK version used: v4.2.2.0
-
Exome sequencing and transcriptome sequencing have quite different fundamentals in terms of data quality, handling and variant metrics. We recommend against using any DNA sequencing workflow to be applied onto variant calling from RNA-seq data as is.
We do have a wdl in github that contains all the steps that are in our current practice.
For the variant filtering part, due to lack of extensive data to train specifically for RNASeq variant calling we recommend single sample hard filtering with thresholds given in the wdl.
I hope this helps.
Please sign in to leave a comment.
1 comment