SplitNCigarReads does not seem to split reads properly as it was before 3.8-1?
Hi
I have a question regarding the definition given in the SplitNCigarReads tool.
This tool should generate K+1 reads where K is the number of N Cigars in the mapped read for RNAseq data. However the bam file that is generated by the tool tells me that those reads are not split but rather softclipped after N Cigar as if they were mapped by a mapper that was unaware of splice sites such as BWA.
Here is the look of the bam file
What should I do?
-
I tried the same command with gatk 3.8-1 and 4.2.0.0 as well and gatk 3.8-1 works but 4.2.0.0 does not work.
Here is the look of the split files.
-
Hello,
I believe SplitNCigarReads should still be splitting reads, it just might be soft clipping instead of hard clipping the ends in GATK 4 vs 3. Do you see supplemental alignments for these reads that now have soft clipped bases in your bam? We do recommend using -dont-use-soft-clipped-bases in HaplotypeCaller after running SplitNCigarReads. I see that the documentation for this tool is incorrect about what is hard clipped vs soft clipped so I'll make an issue to address that. Does this solve your problem or do you need the reads to be hard clipped for another purpose?
Thanks,
Megan
-
Hi.
Thanks for the reply Megan. However the issue is not the softclips. I am posting the split read from 3.8-1 and 4.2.0.0 here.
This is the split by 3.8-1 (Check the red colored read. TOP is the untouched STAR alignment and bottom is split.)
As you can see the read is properly split as if it maps exactly the same way as it was before and coverage histograms are not affected yet coverages are still good.
This is the same read split by 4.2.0.0. After splitting, the right arm after the intron does not start at the right spot and some of its bases seemed to be clipped as oppossed to how it was mapped by STAR, yet the coverages are lacking at the left side of exons and pileups are drastically affected. I am already not making use of any softclips from RNAseq data as they tend to generate lots of false positive calls especially around splice sites. I am using bcftools to call SNPs from RNAseq data since I noticed some problems with HaplotypeCaller at the splice junctions even when softclipped bases are ignored.
It was my mistake to not to post the title properly as I did not pay attention to the supplemental reads before however even if the splitting is still there, the way it is done does not seem to be normal. Normally you would not want to make any changes to the alignment positions and change the coverages per base after splitting but gatk 4.2.0.0 and 4.1.9.0 seems to be doing exactly the opposite of this expectation.
-
Ah, thank you I didn't understand the issue, but I think I see what you're saying now. I'm not sure exactly why this would have changed, but can I ask you to provide the cigar strings for that read? Looking at the arguments for SplitNCigarReads I'm wondering if you've tried the --do-not-fix-overhangs argument. If you've already tried that and it still changes where the alignment was, then I think I'll need to ask you for a small bam as a test case if you have one that you are able to share.
-
Of course. I did not use the --do-not-fix-overhangs option by the way as it was not in the gatk4 wdl for the best practice script. I was trying to adjust my workflow according to what was on the github but I can try with that option as well.
Before trying with that option here are the CIGARS produced by STAR vs 3.8-1 split and 4.2.0.0 split.
This is the STAR read with its pair
V350007172L4C004R0430876168 99 chr17 7673835 255 3M343N110M568N37M = 7674214 1097 CCACTGGAGTCTTCCAGTGTGATGATGGTGAGGATGGGCCTCCGGTTCATGCCGCCCATGCAGGAACTGTTACACATGTAGTTGTAGTGGATGGTGGTACAGTCAGAGCCAACCTCAGGCGGCTCATAGGGCACCACCACACTATGTCGA EFFFFFF@FFFFFFFF1FFFFDFFFFEFFFFFFEEFFFFEFFFFEEFEEFFFFFDFFFFFFFFFBF<FFFFEDFFFFFBFFFFFFFFFFFFFFEBFBAGFGFFEGF5FFF?FFFFF@FBFDEEFF?EFFFFFF;FFFFFFFCFGFFFFEF MC:Z:77M568N73M MD:Z:150 PG:Z:MarkDuplicates.3 RG:Z:RG2 NH:i:1 NM:i:0 MQ:i:255 UQ:i:0 AS:i:303
V350007172L4C004R0430876168 147 chr17 7674214 255 77M568N73M = 7673835 -1097 GGCCTCCGGTTCATGCCGCCCATGCAGGAACTGTTACACATGTAGTTGTAGTGGATGGTGGTACAGTCAGAGCCAACCTCAGGCGGCTCATAGGGCACCACCACACTATGTCGAAAAGTGTTTCTGTCATCCAAATACTCCACACGCAAA >GGEGFGGGF*GGFGFFFGBEGFGDGGGFFDFFFFGFFEGEFE@GGGFFGFGFGFFFFGFGGGFGFFGFGFGFGG@FGFFFGGFFGF<EFGFFFFGFGGFFFGFGGGGGFFGGFFFFGEGFFFGGGFFAFGFFAGEFFFFFFFFGGFFFF MC:Z:3M343N110M568N37M MD:Z:150 PG:Z:MarkDuplicates.3 RG:Z:RG2 NH:i:1 NM:i:0 MQ:i:255 UQ:i:0 AS:i:303This is 3.8-1 split.
V350007172L4C004R0430876168 99 chr17 7673835 60 3M1058H = 7674214 1097 CCA EFF MC:Z:77M568N73M MD:Z:150 PG:Z:MarkDuplicates.3 RG:Z:RG2 NH:i:1 NM:i:0 MQ:i:255UQ:i:0 AS:i:303
V350007172L4C004R0430876168 99 chr17 7674181 60 346H110M605H = 7674214 1097 CTGGAGTCTTCCAGTGTGATGATGGTGAGGATGGGCCTCCGGTTCATGCCGCCCATGCAGGAACTGTTACACATGTAGTTGTAGTGGATGGTGGTACAGTCAGAGCCAAC FFFF@FFFFFFFF1FFFFDFFFFEFFFFFFEEFFFFEFFFFEEFEEFFFFFDFFFFFFFFFBF<FFFFEDFFFFFBFFFFFFFFFFFFFFEBFBAGFGFFEGF5FFF?FF MC:Z:77M568N73M MD:Z:150 PG:Z:MarkDuplicates.3 RG:Z:RG2 NH:i:1 NM:i:0 MQ:i:255 UQ:i:0 AS:i:303
V350007172L4C004R0430876168 147 chr17 7674214 60 77M641H = 7673835 -1097 GGCCTCCGGTTCATGCCGCCCATGCAGGAACTGTTACACATGTAGTTGTAGTGGATGGTGGTACAGTCAGAGCCAAC >GGEGFGGGF*GGFGFFFGBEGFGDGGGFFDFFFFGFFEGEFE@GGGFFGFGFGFFFFGFGGGFGFFGFGFGFGG@F MC:Z:3M343N110M568N37M MD:Z:150 PG:Z:MarkDuplicates.3 RG:Z:RG2 NH:i:1 NM:i:0 MQ:i:255 UQ:i:0 AS:i:303
V350007172L4C004R0430876168 99 chr17 7674859 60 1024H37M = 7674214 1097 CTCAGGCGGCTCATAGGGCACCACCACACTATGTCGA FFF@FBFDEEFF?EFFFFFF;FFFFFFFCFGFFFFEF MC:Z:77M568N73M MD:Z:150PG:Z:MarkDuplicates.3 RG:Z:RG2 NH:i:1 NM:i:0 MQ:i:255 UQ:i:0 AS:i:303
V350007172L4C004R0430876168 147 chr17 7674859 60 645H73M = 7673835 -1097 CTCAGGCGGCTCATAGGGCACCACCACACTATGTCGAAAAGTGTTTCTGTCATCCAAATACTCCACACGCAAA GFFFGGFFGF<EFGFFFFGFGGFFFGFGGGGGFFGGFFFFGEGFFFGGGFFAFGFFAGEFFFFFFFFGGFFFF MC:Z:3M343N110M568N37M MD:Z:150 PG:Z:MarkDuplicates.3 RG:Z:RG2 NH:i:1 NM:i:0 MQ:i:255 UQ:i:0 AS:i:303This is the 4.2.0.0 split
V350007172L4C004R0430876168 99 chr17 7673835 60 3M147S = 7674214 1097 CCACTGGAGTCTTCCAGTGTGATGATGGTGAGGATGGGCCTCCGGTTCATGCCGCCCATGCAGGAACTGTTACACATGTAGTTGTAGTGGATGGTGGTACAGTCAGAGCCAACCTCAGGCGGCTCATAGGGCACCACCACACTATGTCGA EFFFFFF@FFFFFFFF1FFFFDFFFFEFFFFFFEEFFFFEFFFFEEFEEFFFFFDFFFFFFFFFBF<FFFFEDFFFFFBFFFFFFFFFFFFFFEBFBAGFGFFEGF5FFF?FFFFF@FBFDEEFF?EFFFFFF;FFFFFFFCFGFFFFEF SA:Z:chr17,7674181,+,3S110M37S,60,*;chr17,7674859,+,113S37M,60,*; MC:Z:77M73S PG:Z:MarkDuplicates.3 RG:Z:RG2 MQ:i:255 UQ:i:0 AS:i:303
V350007172L4C004R0430876168 2147 chr17 7674181 60 3S110M37S = 7674214 1097 CCACTGGAGTCTTCCAGTGTGATGATGGTGAGGATGGGCCTCCGGTTCATGCCGCCCATGCAGGAACTGTTACACATGTAGTTGTAGTGGATGGTGGTACAGTCAGAGCCAACCTCAGGCGGCTCATAGGGCACCACCACACTATGTCGA EFFFFFF@FFFFFFFF1FFFFDFFFFEFFFFFFEEFFFFEFFFFEEFEEFFFFFDFFFFFFFFFBF<FFFFEDFFFFFBFFFFFFFFFFFFFFEBFBAGFGFFEGF5FFF?FFFFF@FBFDEEFF?EFFFFFF;FFFFFFFCFGFFFFEF SA:Z:chr17,7673835,+,3M147S,60,*;chr17,7674859,+,113S37M,60,*; MC:Z:77M73S PG:Z:MarkDuplicates.3 RG:Z:RG2 MQ:i:255 UQ:i:0 AS:i:303
V350007172L4C004R0430876168 147 chr17 7674214 60 77M73S = 7673835 -1097 GGCCTCCGGTTCATGCCGCCCATGCAGGAACTGTTACACATGTAGTTGTAGTGGATGGTGGTACAGTCAGAGCCAACCTCAGGCGGCTCATAGGGCACCACCACACTATGTCGAAAAGTGTTTCTGTCATCCAAATACTCCACACGCAAA >GGEGFGGGF*GGFGFFFGBEGFGDGGGFFDFFFFGFFEGEFE@GGGFFGFGFGFFFFGFGGGFGFFGFGFGFGG@FGFFFGGFFGF<EFGFFFFGFGGFFFGFGGGGGFFGGFFFFGEGFFFGGGFFAFGFFAGEFFFFFFFFGGFFFF SA:Z:chr17,7674859,-,77S73M,60,*; MC:Z:3M147S PG:Z:MarkDuplicates.3 RG:Z:RG2 MQ:i:255 UQ:i:0 AS:i:303
V350007172L4C004R0430876168 2147 chr17 7674859 60 113S37M = 7674214 1097 CCACTGGAGTCTTCCAGTGTGATGATGGTGAGGATGGGCCTCCGGTTCATGCCGCCCATGCAGGAACTGTTACACATGTAGTTGTAGTGGATGGTGGTACAGTCAGAGCCAACCTCAGGCGGCTCATAGGGCACCACCACACTATGTCGA EFFFFFF@FFFFFFFF1FFFFDFFFFEFFFFFFEEFFFFEFFFFEEFEEFFFFFDFFFFFFFFFBF<FFFFEDFFFFFBFFFFFFFFFFFFFFEBFBAGFGFFEGF5FFF?FFFFF@FBFDEEFF?EFFFFFF;FFFFFFFCFGFFFFEF SA:Z:chr17,7673835,+,3M147S,60,*;chr17,7674181,+,3S110M37S,60,*; MC:Z:77M73S PG:Z:MarkDuplicates.3 RG:Z:RG2 MQ:i:255 UQ:i:0 AS:i:303
V350007172L4C004R0430876168 2195 chr17 7674859 60 77S73M = 7673835 -1097 GGCCTCCGGTTCATGCCGCCCATGCAGGAACTGTTACACATGTAGTTGTAGTGGATGGTGGTACAGTCAGAGCCAACCTCAGGCGGCTCATAGGGCACCACCACACTATGTCGAAAAGTGTTTCTGTCATCCAAATACTCCACACGCAAA >GGEGFGGGF*GGFGFFFGBEGFGDGGGFFDFFFFGFFEGEFE@GGGFFGFGFGFFFFGFGGGFGFFGFGFGFGG@FGFFFGGFFGF<EFGFFFFGFGGFFFGFGGGGGFFGGFFFFGEGFFFGGGFFAFGFFAGEFFFFFFFFGGFFFF SA:Z:chr17,7674214,-,77M73S,60,*; MC:Z:3M147S PG:Z:MarkDuplicates.3 RG:Z:RG2 MQ:i:255 UQ:i:0 AS:i:303 -
Oops big oops. My mistake.
So I see the biggest change now. 3.8-1 splits the reads and hardclips edges but does not mark the split read as supplementary but rather keeps it as primary read. 4.2.0.0 splits and softclips edges and marks new splits as supplementary. My IGV was set to filter supplementary reads so I am sorry this was totally my mistake.
Now I am wondering if keeping them supplementary has any effect on the variant calling metrics. I will test that and see if I observe anything unusual.
Sorry for bothering you all with this issue.
Thanks again.
-
Ah gotcha, great. For what it's worth, HaplotypeCaller does look at supplementary reads. If you're using a different caller it would definitely be a good idea to make sure that supplementary reads are used as evidence. I believe this change was in an effort to make the BAM files coming out of SplitNCigar reads spec compliant.
-
Hi,
I am a bit of a novice here so I apologize in advance if I am not understanding. I am observing the same thing with soft clipping as mentioned above. However I am not clear on why this is the expected behavior. The read depth on the left of exons still seems to be lower than expected from the number of reads providing support for it. Also, given the longer read lengths now available, wouldnt one want to be able to use that data? It seems like tossing all of the supplementary reads out of consideration for SNV calling or calculating read depth is not optimal for processing RNAseq data.
-
Hi Rich Roberts,
SplitNCigarReads should only be used for variant calling. HaplotypeCaller should be looking at supplementary reads so the read depth should be unaffected. You will want to run HaplotypeCaller with the -dont-use-soft-clipped-bases argument so that the soft clipped bases will be ignored. The bases will still be accounted for in the corresponding supplementary alignment which will be used by HaplotypeCaller. Let me know if that doesn't address your question (and if not please include a test case example if possible to help me understand the issue you're seeing).
Thanks!
Please sign in to leave a comment.
9 comments