Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Inconsistent output from SplitNCigarReads

6 comments

  • SkyWarrior

    Hi Mickaël Mendez

    Can you check those soft-clipped reads to see whether their mate actually maps beyond its starting point?

    This is a known practice for many aligners, so your issue could be related to it.

    Also, what tag are you using to filter your BAM file?

  • Mickaël Mendez

    Thanks for your response.

    I'm a bit unclear about the known practice and what I should be checking in the read's mate.

    I'm filtering the BAM file for reads with the STAR-assigned WASP tag vW:1. This might remove some read mates. If proper pairs end up missing their mates, would that affect SplitNCigarReads' output?
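One way to check whether the filter leaves proper pairs without their mates is to count, per read name, how many primary records survive it. The following is a minimal sketch working on SAM text lines, assuming the tag is serialized as `vW:i:1`; the helper names and the example records are hypothetical and only for illustration.

```python
from collections import Counter


def passes_wasp(sam_line: str) -> bool:
    """True if the record carries the WASP tag with value 1 (vW:i:1)."""
    fields = sam_line.rstrip("\n").split("\t")
    # Optional tags start at column 12 (index 11) of a SAM record.
    return "vW:i:1" in fields[11:]


def orphaned_after_filter(sam_lines):
    """Names of paired reads for which only one mate survives the filter."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@") or not passes_wasp(line):
            continue  # skip header lines and filtered-out records
        fields = line.split("\t")
        name, flag = fields[0], int(fields[1])
        # Count paired reads (0x1), ignoring secondary (0x100) and
        # supplementary (0x800) alignments.
        if flag & 0x1 and not flag & 0x900:
            counts[name] += 1
    return sorted(name for name, n in counts.items() if n == 1)


# Hypothetical records: the second mate of readA fails the filter (vW:i:2).
records = [
    "readA\t99\tchr1\t100\t255\t10M\t=\t200\t110\tACGTACGTAC\tIIIIIIIIII\tNH:i:1\tvW:i:1",
    "readA\t147\tchr1\t200\t255\t10M\t=\t100\t-110\tACGTACGTAC\tIIIIIIIIII\tNH:i:1\tvW:i:2",
    "readB\t99\tchr1\t300\t255\t10M\t=\t400\t110\tACGTACGTAC\tIIIIIIIIII\tNH:i:1\tvW:i:1",
    "readB\t147\tchr1\t400\t255\t10M\t=\t300\t-110\tACGTACGTAC\tIIIIIIIIII\tNH:i:1\tvW:i:1",
]
orphans = orphaned_after_filter(records)
print(orphans)
```

A non-empty result only shows that the filter broke up some pairs; it does not by itself say whether SplitNCigarReads treats those reads differently.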

  • SkyWarrior

    Hi again. 

    Below is such an example 

    Though both reads are properly mapped and their protruding ends match the reference genome perfectly, those ends are clipped to prevent the mates from extending past each other's starting point.

    Can you share the view of your read as well as the insert size from IGV?

     

  • Mickaël Mendez

    Thanks,

    here is the original read

  • Mickaël Mendez

    To summarize my previous observation, I noticed that SplitNCigarReads performs soft clipping on one read when I don't filter the BAM file. Surprisingly, when I filter the BAM file, the same read isn't subjected to soft clipping.

    To further understand this behavior, I conducted experiments using SplitNCigarReads on specific reads—particularly focusing on read A, which was originally soft clipped.

    In these experiments:

    • Running SplitNCigarReads on the two reads mentioned did not soft clip read A.
    • Specifically running SplitNCigarReads on the first read in the pair (read A) also resulted in no soft clipping.

    Thus, the absence or presence of the read's mate does not explain why read A was initially soft clipped.

    Additionally, I performed an ablation study on 200 reads from the unfiltered original BAM file. I ran SplitNCigarReads 200 times, removing a different single read each time, and the behavior was consistent:

    • In 199 out of 200 runs, read A was soft clipped.
    • In just one run, read A was not soft clipped. Repeating the experiment twice gave the same outcome.

    My question is: Is this behavior expected from SplitNCigarReads?
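The leave-one-out experiment described above can be organized roughly as follows. This is only a sketch: `is_clipped` is a hypothetical callback that would write the subset to a BAM file, run SplitNCigarReads on it, and inspect the target read's CIGAR in the output. Here it is replaced by a stand-in so the loop can run on its own; the stand-in's behavior is invented and is not a result from the thread.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def leave_one_out(read_names: Sequence[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (removed_name, remaining_names) for every read in turn."""
    for i, removed in enumerate(read_names):
        yield removed, list(read_names[:i]) + list(read_names[i + 1:])


def reads_that_matter(
    read_names: Sequence[str],
    target: str,
    is_clipped: Callable[[str, List[str]], bool],
) -> List[str]:
    """Names whose removal leaves the target read unclipped."""
    return [
        removed
        for removed, remaining in leave_one_out(read_names)
        if removed != target and not is_clipped(target, remaining)
    ]


# Stand-in for the real pipeline step (hypothetical): pretend the target
# is clipped only while "read_3" is present in the input.
def fake_is_clipped(target: str, subset: List[str]) -> bool:
    return "read_3" in subset


names = [f"read_{i}" for i in range(5)]
influential = reads_that_matter(names, "read_0", fake_is_clipped)
print(influential)
```

Knowing which single read flips the outcome makes it possible to look at that read's alignment directly, for example whether it is the one spanning a splice junction near read A.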

  • James Emery

    Hello Mickaël Mendez. SplitNCigarReads has a few extra steps that you might not know about that can result in the sort of behavior you are seeing here. Specifically, SplitNCigarReads internally builds a consensus of where the splice points are, and it soft clips reads that reach into the resulting overhang based on conditions such as the number of bases a read overhangs by and how many of those bases mismatch the reference. Exactly how the overhang is produced depends on all of the reads present at that site, which is a possible explanation for why one or two reads can make such a big difference to the outcome.
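To illustrate why the set of reads at a site matters, here is a deliberately simplified toy model. It is not GATK's implementation and it ignores the base-count and mismatch conditions entirely. It only assumes that splice positions come from the `N` operators in the CIGARs of the reads that are present, and that a read becomes a clipping candidate when it starts or ends inside an intron defined by another read. All function names and coordinates are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = "MDN=X"  # CIGAR operators that advance the reference position


def introns(pos, cigar):
    """Reference intervals (start, end), 1-based inclusive, covered by N ops."""
    found, ref = [], pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            found.append((ref, ref + length - 1))
        if op in REF_CONSUMING:
            ref += length
    return found


def ref_span(pos, cigar):
    """Reference interval (start, end) covered by the whole alignment."""
    length = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)
    return pos, pos + length - 1


def overhang_candidates(reads):
    """reads maps name -> (pos, cigar); returns names that overhang an intron."""
    splice_sites = {iv for pos, cigar in reads.values() for iv in introns(pos, cigar)}
    flagged = set()
    for name, (pos, cigar) in reads.items():
        start, end = ref_span(pos, cigar)
        for s, e in splice_sites - set(introns(pos, cigar)):
            starts_inside = s <= start <= e < end   # overhang on the left
            ends_inside = start < s <= end <= e     # overhang on the right
            if starts_inside or ends_inside:
                flagged.add(name)
    return flagged


# Toy input: one spliced read defines the intron 120..619; readA starts
# five bases before the intron ends.
reads = {"spliced": (100, "20M500N30M"), "readA": (615, "50M")}
with_junction = overhang_candidates(reads)
without_junction = overhang_candidates({"readA": reads["readA"]})
print(with_junction, without_junction)
```

In this toy setup, dropping the single spliced read removes the only evidence for the junction, so `readA` is no longer a candidate. That is the kind of dependency that could make one read out of 200 change the result.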

    If you would like to disable this behavior altogether, the tool has the argument `--do-not-fix-overhangs`. If you would rather adjust it, you can use the arguments `--max-mismatches-in-overhang` and `--max-bases-in-overhang` to reduce the strictness for soft clipping your reads.
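As a sketch of how these arguments might be passed, the helper below assembles a SplitNCigarReads command line as an argument list without running it. The function name, file names, and threshold values are placeholders chosen for illustration, not recommendations; check the tool documentation for the defaults in your GATK version.

```python
import shlex


def split_n_cigar_reads_cmd(reference, bam_in, bam_out,
                            fix_overhangs=True,
                            max_mismatches=None,
                            max_bases=None):
    """Build the argument list for a SplitNCigarReads run."""
    cmd = ["gatk", "SplitNCigarReads", "-R", reference, "-I", bam_in, "-O", bam_out]
    if not fix_overhangs:
        # Turns the overhang fixing off; the thresholds then no longer apply.
        cmd.append("--do-not-fix-overhangs")
        return cmd
    if max_mismatches is not None:
        cmd += ["--max-mismatches-in-overhang", str(max_mismatches)]
    if max_bases is not None:
        cmd += ["--max-bases-in-overhang", str(max_bases)]
    return cmd


# Placeholder file names and values.
disabled = split_n_cigar_reads_cmd("ref.fasta", "in.bam", "out.bam", fix_overhangs=False)
tuned = split_n_cigar_reads_cmd("ref.fasta", "in.bam", "out.bam",
                                max_mismatches=2, max_bases=20)
print(shlex.join(disabled))
print(shlex.join(tuned))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`. Comparing read A's CIGAR between the default run and the run with overhang fixing disabled would show whether this step is what clips it.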

    Soft clipping reads is expected behavior, as RNA pipelines can become confused by reads that overhang the splice site due to mapping errors or artifacts. Without seeing more information about the reads in your sample, it's hard to tell whether the tool is behaving as it should here or whether this behavior is somewhat flawed. I hope this answers your question.

     

