Inconsistent output from SplitNCigarReads
Hey there,
I've got a puzzler about SplitNCigarReads and bam file filters.
In my original bam file, I've got a read with a CIGAR string of `76M`
. But when I use `SplitNCigarReads` with the default settings, that string changes to `38S38M
`. There aren't any N
s in the original read, so I wasn't expecting it to change.
But here's the twist: When I filter the original bam file for a specific tag and then use SplitNCigarReads, that same read keeps its original CIGAR string (76M
).
Any ideas on what's going on here? I'd love to hear your thoughts!
Thanks.
The Genome Analysis Toolkit (GATK) v4.4.0.0
HTSJDK Version: 3.0.5
Picard Version: 3.0.0
-
Can you check whether those softclipped reads to see if their pair is actually is mapping beyond it's starting point?
This is a known practice for many aligners therefore your issue could be related to this.
Also what tag are you using to filter your bam file?.
-
Thanks for your response.
I'm a bit unclear about the known practice and what I should be checking in the read's mate.
I'm filtering the BAM file for reads with the STAR-assigned WASP tag vW:1. This might remove some read mates. If proper pairs end up missing their mates, would that affect SplitNCigarReads' output?
-
Hi again.
Below is such an example
Though both reads are properly mapped and their protruding ends match the reference genome perfectly those ends are clipped to prevent pairs to exceed each other's starting point.
Can you share the view of your read as well as the insert size from IGV?
-
Thanks,
here is the original read
-
To summarize my previous observation, I noticed that
SplitNCigarReads
performs soft clipping on one read when I don't filter the BAM file. Surprisingly, when I filter the BAM file, the same read isn't subjected to soft clipping.To further understand this behavior, I conducted experiments using
SplitNCigarReads
on specific reads—particularly focusing on read A, which was originally soft clipped.In these experiments:
- Running
SplitNCigarReads
on the two reads mentioned did not soft clip read A. - Specifically running
SplitNCigarReads
on the first read in the pair (read A) also resulted in no soft clipping.
Thus, the absence or presence of the read's mate does not explain why read A was initially soft clipped.
Additionally, I performed an ablation study on 200 reads from the unfiltered original BAM file. In each run of
SplitNCigarReads
, removing only one read, there was consistent behavior:- In 199 out of 200 runs, read A was soft clipped.
- In just one run, read A was not soft clipped. This pattern was replicated twice with the same consistent outcome.
My question is: Is this behavior expected from
SplitNCigarReads
? - Running
-
Hello Mickaël Mendez. SplitNCigar reads has a few extra steps that you might not know about that can result in the sort of behavior you are seeing here. Specifically SplitNCigar reads is internally building a consensus for where there are splice points and it will softclip reads that reach into that overhang based on some conditions like the number of bases it overhangs by and how many of those bases mismatch with the reference. Exactly how the overhang is produced depends on all of the reads present at that site which is a possible explanation for why one or two reads can make such a big difference on the outcome.
If you would like to disable this behavior, the tool has this argument `--do-not-fix-overhangs` to disable the behavior alltogether. If you would like to adjust its behavior you can use the arguments `--max-mismatches-in-overhang` and `--max-bases-in-overhang` to reduce the strictness for softclipping your reads.
It is expected behavior to softclip reads as RNA pipelines can become confused by reads reads that overhang the splice site due to mapping errors/artifacts. Without seeing more information about the reads in your sample its hard to tell exactly if this is how it should be behaving or if this behavior is somewhat flawed here. I hope this answers your question.
Please sign in to leave a comment.
6 comments