Identify short variants (SNPs and Indels) in RNAseq data.
Reference Implementations
Pipeline | Summary | Notes | Github | Terra |
---|---|---|---|---|
RNAseq short variant per-sample calling | BAM to VCF | universal (expected) | yes | TBD |
Expected input
This workflow is designed to operate on a set of samples (uBAM files) one-at-a-time; joint calling RNAseq is not supported.
Main Steps
Mapping to the Reference
Tools involved: STAR
We begin with mapping RNA reads to a reference, we recommend using STAR aligner because it increased sensitivity compared to TopHat (especially for INDELS). We use STAR’s two-pass mode to get better alignments around novel splice junctions.
Data Cleanup
Tools involved:
MergeBamAlignment MarkDuplicates
We use MergeBamAlignment and MarkDuplicates (similarly to our DNA pre-processing best practices pipeline
SplitNCigarReads
Tools involved: SplitNCigarReads
Because RNA aligners have different conventions than DNA aligners, we need to reformat some of the alignments that span introns for HaplotypeCaller. This step splits reads with N in the cigar into multiple supplementary alignments and hard clips mismatching overhangs. By default this step also reassigns mapping qualities for good alignments to match DNA conventions.
Base Quality Recalibration
Tools involved:
BaseRecalibrator, Apply Recalibration, AnalyzeCovariates
This step is performed per-sample and consists of applying machine learning to detect and correct for patterns of systematic errors in the base quality scores, which are confidence scores emitted by the sequencer for each base. Base quality scores play an important role in weighing the evidence for or against possible variant alleles during the variant discovery process, so it's important to correct any systematic bias observed in the data. Biases can originate from biochemical processes during library preparation and sequencing, from manufacturing defects in the chips, or instrumentation defects in the sequencer. The recalibration procedure involves collecting covariate statistics from all base calls in the dataset, building a model from those statistics, and applying base quality adjustments to the dataset based on the resulting model. The initial statistics collection can be parallelized by scattering across genomic coordinates, typically by chromosome or batches of chromosomes but this can be broken down further to boost throughput if needed. Then the per-region statistics must be gathered into a single genome-wide model of covariation; this cannot be parallelized but it is computationally trivial, and therefore not a bottleneck. Finally, the recalibration rules derived from the model are applied to the original dataset to produce a recalibrated dataset. This is parallelized in the same way as the initial statistics collection, over genomic regions, then followed by a final file merge operation to produce a single analysis-ready file per sample.
Variant Calling
Tools involved: HaplotypeCaller
HaplotypeCaller doesn’t need any specific changes to run with RNA once the bam has been run through SplitNCigarReads. We do adjust the minimum phred-scaled confidence threshold for calling variants to 20, but this value will depend on your specific use case.
Variant Filtering
Tools involved: VariantFiltration
We recommend specific hard filters, since VQSR and CNNScoreVariants require truth data for training that we don’t yet have for RNA.
13 comments
Hi Team,
Thanks for your incredible work. I'm looking for RNA-Seq somatic SNPs and indels calling pipeline and I found this article. However when I enter the github pages (https://github.com/gatk-workflows/gatk3-4-rnaseq-germline-snps-indels) it shows "germline short variant discovery", not somatic. Do you have somatic short variant discovery pipeline? Thank you.
Zheng
Hi All,
I was wondering if this pipeline has been moved to Terra yet? I am a Terra user (still new) and would love to use this via Terra on a bunch of my data!!
Thanks,
Britt
An Zheng curious if you found a solution to your question regarding somatic calls?
https://github.com/gatk-workflows/gatk4-rnaseq-germline-snps-indels
Hi,
I am using gatk for somatic cell mutation using RNAseq data, I have download reference genome fasta and gtf from the ensemble and as I cannot find known site variation in vcf format there, on ensemble variation file are in the gvf folder so I take the vcf from the gatk resource bundle. But when am trying to run a baserecalibrator it shoes
A USER ERROR has occurred: Input files reference and features have incompatible contigs: No overlapping contigs found.
The error message you are seeing indicates that the reference genome fasta file and the GTF file you are using have different contigs or chromosome names than the VCF file you are using as a known site variation file. The BaseRecalibrator tool requires all of these files to have the same contig names in order to proceed.
To resolve this issue, you will need to ensure that all of your input files have the same contig names. One way to do this is to use the same reference genome fasta file and GTF file that were used to generate the VCF file in the GATK resource bundle.
If you are unable to obtain the original reference genome fasta and GTF files, you may need to modify your input files to match the contig names in the VCF file. One way to do this is to use the Picard tools utility "ReorderSam" to modify your RNA-seq BAM file to match the contig names in the reference genome fasta file. Then, you can use the modified BAM file as input for the BaseRecalibrator tool.
Alternatively, you can use the "FastaAlternateReferenceMaker" tool in GATK to create a new reference genome fasta file that includes only the contigs present in your VCF file. Then, you can use this new reference genome fasta file as input for the BaseRecalibrator tool.
It is important to ensure that all of your input files have the same contig names in order to avoid compatibility issues like the one you are experiencing.
I have downloaded the hg38 fasta and gtf files from the ensemble for the STAR index. After that when I reached at BQSR step I need known variations VCF file but I cannot find the vcf file for the same hg38 on fasta and if I try to use the resource bundle's vcf file it shows incompatible contigs. So, I decided to even use the reference genome fasta file from the resource bundle but this time I can't find it file for the resource bundles' reference genome in the bundle which I require for STAR index formation, and again if I try to use ensemble's gtf file for hg38 with resource bundle's fasta file error comes up that incompatible contigs.
Can anyone help me with where can I get all 3 file fasta, gtf, and vcf with the same contigs?
Hello,
I am trying to run this workflow ("RNAseq short variant discovery (SNPs + Indels)") on a plant species I work with; raspberry (Rubus idaeus). I have RNAseq reads from fruits at various maturity stages. Now, I want to use these reads to find snps and indels. I have reads from four cultivars. I have concatenated reads from each cultivar and aligned them with STAR to a reference genome. This is what I do:
STAR
MarkDuplicates
SortSam
SplitNCigarReads
But when I come to BaseRecalibrator an error is thrown: "Argument known-sites was missing: Argument 'known-sites' is required"
However, this is information that I am not in the posession of. So, is there a way around this one or should I just omit the BaseRecalibrator step and move on to HaplotypeCaller?
Feedback is very much appreciated.
Thanks,
jahn
The GATK BaseRecalibrator tool is used to recalibrate the base quality scores of a sequencing dataset, based on known variant sites in a VCF file. If you do not have a known sites VCF file, you can still run the BaseRecalibrator tool, but the resulting recalibration may not be as accurate as if you had used a known sites file.
In the absence of a known sites VCF file, you can still use the tool by using the
--unsorted
or--interval
options. The--unsorted
option tells BaseRecalibrator to use all sites in the input BAM file for recalibration, while the--interval
option allows you to specify a genomic interval for recalibration.It is important to note that using a known sites VCF file is highly recommended for accurate base quality score recalibration. If you do not have a known sites file, you may consider using an alternative tool such as the GATK VariantRecalibrator, which can perform recalibration without a known sites file by modeling the distribution of variant quality scores.
I have RNAseq reads from four plant cultivars. I want to identify SNPs using this material. The reads come from various experiments done with these cultivars. However, for the SNP discovery purpose I want to treat cultivar#1 as one sample, cultivar#2 as another. When I concatenate all reads from cultivar#1 and do the alignment to the reference genom with STAR I use this option
--outSAMattrRGline "ID:readgroup1 LB:lib1 SM:Veten PL:ILLUMINA PU:unit1".
The samtools view command confirms this readgroup. However, when I run gatk SortSam, this errer is thrown "Error parsing SAM header. @RG line missing SM tag."
I anticipate that this may cause trouble further down the pipeline so I would like to correct my apparent mistake.
Thank you,
jahn
Hello,
I am unsure if I should run Base quality recalibration in my case. I have RNA-seq data without matching DNA-seq data want to try to find novel RNA-editing sites. In the overview section on BaseRecalibrator, there is written "We assume that all reference mismatches we see are therefore errors and indicative of poor base quality." Will this mark any potential RNA-editing variant as an error? In that case I'm not sure if I should run base quality recalibration as is or not.
I am also running RNA-seq data without matching DNA-seq data and am stuck on the BaseRecalibrator portion because I do not have anything to put in the --known-sites portion. I have RNA-seq of parent 1 and parent 2 and I also have their offsprings. Anyone who can help with this?
We're planning to use this workflow, but (like some others) would like to use GRCh38. Are there plans to adapt the workflow for that? (I realize that GRCh38 has more contigs than GRCh37, so this may not be so easy.)
In the meantime, it sounds like we should be able to modify the input files, to use files from gs://gcp-public-data--broad-references/hg38/v0/, such as the reference files mentioned in the GATK4 germline SNPs and indels workflow:
Homo_sapiens_assembly38.fasta
and variant files such as
Homo_sapiens_assembly38.dbsnp138.vcf.gz
plus indels from
Homo_sapiens_assembly38.known_indels.vcf.gz
Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
Does this sound right?
Thanks,
Josh Burdick
Please sign in to leave a comment.