Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

RNAseq short variant discovery (SNPs + Indels) Follow

13 comments

  • Avatar
    An Zheng

    Hi Team,

     

    Thanks for your incredible work. I'm looking for RNA-Seq somatic SNPs and indels calling pipeline and I found this article. However when I enter the github pages (https://github.com/gatk-workflows/gatk3-4-rnaseq-germline-snps-indels) it shows "germline short variant discovery", not somatic. Do you have somatic short variant discovery pipeline? Thank you.

     

    Zheng

    6
    Comment actions Permalink
  • Avatar
    Brittany Goods

    Hi All, 

    I was wondering if this pipeline has been moved to Terra yet? I am a Terra user (still new) and would love to use this via Terra on a bunch of my data!! 

     

    Thanks, 

    Britt 

    1
    Comment actions Permalink
  • Avatar
    Owen Wilkins

    An Zheng curious if you found a solution to your question regarding somatic calls?

    5
    Comment actions Permalink
  • 0
    Comment actions Permalink
  • Avatar
    Rahul Yadav

    Hi,

    I am using gatk for somatic cell mutation using RNAseq data, I have download reference genome fasta and gtf from the ensemble and as I cannot find known site variation in vcf format there, on ensemble variation file are in the gvf folder so I take the vcf from the gatk resource bundle. But when am trying to run a baserecalibrator it shoes 

    A USER ERROR has occurred: Input files reference and features have incompatible contigs: No overlapping contigs found.

    1
    Comment actions Permalink
  • Avatar
    Ambu

    The error message you are seeing indicates that the reference genome fasta file and the GTF file you are using have different contigs or chromosome names than the VCF file you are using as a known site variation file. The BaseRecalibrator tool requires all of these files to have the same contig names in order to proceed.

    To resolve this issue, you will need to ensure that all of your input files have the same contig names. One way to do this is to use the same reference genome fasta file and GTF file that were used to generate the VCF file in the GATK resource bundle.

    If you are unable to obtain the original reference genome fasta and GTF files, you may need to modify your input files to match the contig names in the VCF file. One way to do this is to use the Picard tools utility "ReorderSam" to modify your RNA-seq BAM file to match the contig names in the reference genome fasta file. Then, you can use the modified BAM file as input for the BaseRecalibrator tool.

    Alternatively, you can use the "FastaAlternateReferenceMaker" tool in GATK to create a new reference genome fasta file that includes only the contigs present in your VCF file. Then, you can use this new reference genome fasta file as input for the BaseRecalibrator tool.

    It is important to ensure that all of your input files have the same contig names in order to avoid compatibility issues like the one you are experiencing.

    0
    Comment actions Permalink
  • Avatar
    Rahul Yadav

    I have downloaded the hg38 fasta and gtf files from the ensemble for the STAR index. After that when I reached at BQSR step I need known variations VCF file but I cannot find the vcf file for the same hg38 on fasta and if I try to use the resource bundle's vcf file it shows incompatible contigs. So, I decided to even use the reference genome fasta file from the resource bundle but this time I can't find it file for the resource bundles' reference genome in the bundle which I require for STAR index formation, and again if I try to use ensemble's gtf file for hg38 with resource bundle's fasta file error comes up that incompatible contigs.

    Can anyone help me with where can I get all 3 file fasta, gtf, and vcf with the same contigs?

     

    2
    Comment actions Permalink
  • Avatar
    Jahn Davik

    Hello,

    I am trying to run this workflow  ("RNAseq short variant discovery (SNPs + Indels)") on a plant species I work with; raspberry (Rubus idaeus). I have RNAseq reads from fruits at various maturity stages. Now, I want to use these reads to find snps and indels. I have reads from four cultivars. I have concatenated reads from each cultivar and aligned them with STAR to a reference genome. This is what I do:

    STAR

    MarkDuplicates

    SortSam

    SplitNCigarReads

    But when I come to BaseRecalibrator an error is thrown: "Argument known-sites was missing: Argument 'known-sites' is required"

    However, this is information that I am not in the posession of. So, is there a way around this one or should I just omit the BaseRecalibrator step and move on to HaplotypeCaller?

    Feedback is very much appreciated.

    Thanks,

    jahn

     

    0
    Comment actions Permalink
  • Avatar
    Rahul Yadav

    The GATK BaseRecalibrator tool is used to recalibrate the base quality scores of a sequencing dataset, based on known variant sites in a VCF file. If you do not have a known sites VCF file, you can still run the BaseRecalibrator tool, but the resulting recalibration may not be as accurate as if you had used a known sites file.

    In the absence of a known sites VCF file, you can still use the tool by using the --unsorted or --interval options. The --unsorted option tells BaseRecalibrator to use all sites in the input BAM file for recalibration, while the --interval option allows you to specify a genomic interval for recalibration.

    It is important to note that using a known sites VCF file is highly recommended for accurate base quality score recalibration. If you do not have a known sites file, you may consider using an alternative tool such as the GATK VariantRecalibrator, which can perform recalibration without a known sites file by modeling the distribution of variant quality scores.

    0
    Comment actions Permalink
  • Avatar
    Jahn Davik

    I have RNAseq reads from four plant cultivars. I want to identify SNPs using this material. The reads come from various experiments done with these cultivars. However, for the SNP discovery purpose I want to treat cultivar#1 as one sample, cultivar#2 as another. When I concatenate all reads from cultivar#1 and do the alignment to the reference genom with STAR I use this option

    --outSAMattrRGline "ID:readgroup1 LB:lib1 SM:Veten PL:ILLUMINA PU:unit1".

    The samtools view command confirms this readgroup. However, when I run gatk SortSam, this errer is thrown "Error parsing SAM header. @RG line missing SM tag."

    I anticipate that this may cause trouble further down the pipeline so I would like to correct my apparent mistake.

    Thank you,

    jahn

     

     

    0
    Comment actions Permalink
  • Avatar
    Jordan Mifsud

    Hello,

    I am unsure if I should run Base quality recalibration in my case. I have RNA-seq data without matching DNA-seq data want to try to find novel RNA-editing sites. In the overview section on BaseRecalibrator, there is written "We assume that all reference mismatches we see are therefore errors and indicative of poor base quality." Will this mark any potential RNA-editing variant as an error? In that case I'm not sure if I should run base quality recalibration as is or not.    

    0
    Comment actions Permalink
  • Avatar
    Mae Mercado

    I am also running RNA-seq data without matching DNA-seq data and am stuck on the BaseRecalibrator portion because I do not have anything to put in the --known-sites portion. I have RNA-seq of parent 1 and parent 2 and I also have their offsprings. Anyone who can help with this?

    0
    Comment actions Permalink
  • Avatar
    Joshua Burdick

    We're planning to use this workflow, but (like some others) would like to use GRCh38. Are there plans to adapt the workflow for that? (I realize that GRCh38 has more contigs than GRCh37, so this may not be so easy.)

    In the meantime, it sounds like we should be able to modify the input files, to use files from gs://gcp-public-data--broad-references/hg38/v0/, such as the reference files mentioned in the GATK4 germline SNPs and indels workflow:
    Homo_sapiens_assembly38.fasta

    and variant files such as
    Homo_sapiens_assembly38.dbsnp138.vcf.gz

    plus indels from
    Homo_sapiens_assembly38.known_indels.vcf.gz
    Mills_and_1000G_gold_standard.indels.hg38.vcf.gz

    Does this sound right?

    Thanks,
    Josh Burdick

    1
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk