Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

SplitNCigarReads does not seem to split reads properly as it was before 3.8-1?

0

9 comments

  • Avatar
    SkyWarrior

    I tried the same command with gatk 3.8-1 and 4.2.0.0 as well and gatk 3.8-1 works but 4.2.0.0 does not work. 

     

    Here is the look of the split files. 

    0
    Comment actions Permalink
  • Avatar
    Megan Shand

    Hello,

    I believe SplitNCigarReads should still be splitting reads, it just might be soft clipping instead of hard clipping the ends in GATK 4 vs 3. Do you see supplemental alignments for these reads that now have soft clipped bases in your bam? We do recommend using -dont-use-soft-clipped-bases in HaplotypeCaller after running SplitNCigarReads. I see that the documentation for this tool is incorrect about what is hard clipped vs soft clipped so I'll make an issue to address that. Does this solve your problem or do you need the reads to be hard clipped for another purpose? 

    Thanks,

    Megan

    0
    Comment actions Permalink
  • Avatar
    SkyWarrior

    Hi. 

    Thanks for the reply Megan. However the issue is not the softclips. I am posting the split read from 3.8-1 and 4.2.0.0 here. 

    This is the split by 3.8-1 (Check the red colored read. TOP is the untouched STAR alignment and bottom is split.)

    As you can see the read is properly split as if it maps exactly  the same way as it was before and coverage histograms are not affected yet coverages are still good. 

    This is the same read split by 4.2.0.0. After splitting, the right arm after the intron does not start at the right spot and some of its bases seemed to be clipped as oppossed to how it was mapped by STAR,  yet the coverages are lacking at the left side of exons and pileups are drastically affected. I am already not making use of any softclips from RNAseq data as they tend to generate lots of false positive calls especially around splice sites. I am using bcftools to call SNPs from RNAseq data since I noticed some problems with HaplotypeCaller at the splice junctions even when softclipped bases are ignored. 

    It was my mistake to not to post the title properly as I did not pay attention to the supplemental reads before however even if the splitting is still there, the way it is done does not seem to be normal. Normally you would not want to make any changes to the alignment positions and change the coverages per base after splitting but gatk 4.2.0.0 and 4.1.9.0 seems to be doing exactly the opposite of this expectation. 

    0
    Comment actions Permalink
  • Avatar
    Megan Shand

    Ah, thank you I didn't understand the issue, but I think I see what you're saying now. I'm not sure exactly why this would have changed, but can I ask you to provide the cigar strings for that read? Looking at the arguments for SplitNCigarReads I'm wondering if you've tried the --do-not-fix-overhangs argument. If you've already tried that and it still changes where the alignment was, then I think I'll need to ask you for a small bam as a test case if you have one that you are able to share.

    0
    Comment actions Permalink
  • Avatar
    SkyWarrior

    Of course. I did not use the --do-not-fix-overhangs option by the way as it was not in the gatk4 wdl for the best practice script. I was trying to adjust my workflow according to what was on the github but I can try with that option as well. 

    Before trying with that option here are the CIGARS produced by STAR vs 3.8-1 split and 4.2.0.0 split. 

    This is the STAR read with its pair

    V350007172L4C004R0430876168	99	chr17	7673835	255	3M343N110M568N37M	=	7674214	1097	CCACTGGAGTCTTCCAGTGTGATGATGGTGAGGATGGGCCTCCGGTTCATGCCGCCCATGCAGGAACTGTTACACATGTAGTTGTAGTGGATGGTGGTACAGTCAGAGCCAACCTCAGGCGGCTCATAGGGCACCACCACACTATGTCGA	EFFFFFF@FFFFFFFF1FFFFDFFFFEFFFFFFEEFFFFEFFFFEEFEEFFFFFDFFFFFFFFFBF<FFFFEDFFFFFBFFFFFFFFFFFFFFEBFBAGFGFFEGF5FFF?FFFFF@FBFDEEFF?EFFFFFF;FFFFFFFCFGFFFFEF	MC:Z:77M568N73M	MD:Z:150	PG:Z:MarkDuplicates.3	RG:Z:RG2	NH:i:1	NM:i:0	MQ:i:255	UQ:i:0	AS:i:303
    V350007172L4C004R0430876168 147 chr17 7674214 255 77M568N73M = 7673835 -1097 GGCCTCCGGTTCATGCCGCCCATGCAGGAACTGTTACACATGTAGTTGTAGTGGATGGTGGTACAGTCAGAGCCAACCTCAGGCGGCTCATAGGGCACCACCACACTATGTCGAAAAGTGTTTCTGTCATCCAAATACTCCACACGCAAA >GGEGFGGGF*GGFGFFFGBEGFGDGGGFFDFFFFGFFEGEFE@GGGFFGFGFGFFFFGFGGGFGFFGFGFGFGG@FGFFFGGFFGF<EFGFFFFGFGGFFFGFGGGGGFFGGFFFFGEGFFFGGGFFAFGFFAGEFFFFFFFFGGFFFF MC:Z:3M343N110M568N37M MD:Z:150 PG:Z:MarkDuplicates.3 RG:Z:RG2 NH:i:1 NM:i:0 MQ:i:255 UQ:i:0 AS:i:303

    This is 3.8-1 split.

    V350007172L4C004R0430876168	99	chr17	7673835	60	3M1058H	=	7674214	1097	CCA	EFF	MC:Z:77M568N73M	MD:Z:150	PG:Z:MarkDuplicates.3	RG:Z:RG2	NH:i:1	NM:i:0	MQ:i:255UQ:i:0	AS:i:303
    V350007172L4C004R0430876168 99 chr17 7674181 60 346H110M605H = 7674214 1097 CTGGAGTCTTCCAGTGTGATGATGGTGAGGATGGGCCTCCGGTTCATGCCGCCCATGCAGGAACTGTTACACATGTAGTTGTAGTGGATGGTGGTACAGTCAGAGCCAAC FFFF@FFFFFFFF1FFFFDFFFFEFFFFFFEEFFFFEFFFFEEFEEFFFFFDFFFFFFFFFBF<FFFFEDFFFFFBFFFFFFFFFFFFFFEBFBAGFGFFEGF5FFF?FF MC:Z:77M568N73M MD:Z:150 PG:Z:MarkDuplicates.3 RG:Z:RG2 NH:i:1 NM:i:0 MQ:i:255 UQ:i:0 AS:i:303
    V350007172L4C004R0430876168 147 chr17 7674214 60 77M641H = 7673835 -1097 GGCCTCCGGTTCATGCCGCCCATGCAGGAACTGTTACACATGTAGTTGTAGTGGATGGTGGTACAGTCAGAGCCAAC >GGEGFGGGF*GGFGFFFGBEGFGDGGGFFDFFFFGFFEGEFE@GGGFFGFGFGFFFFGFGGGFGFFGFGFGFGG@F MC:Z:3M343N110M568N37M MD:Z:150 PG:Z:MarkDuplicates.3 RG:Z:RG2 NH:i:1 NM:i:0 MQ:i:255 UQ:i:0 AS:i:303
    V350007172L4C004R0430876168 99 chr17 7674859 60 1024H37M = 7674214 1097 CTCAGGCGGCTCATAGGGCACCACCACACTATGTCGA FFF@FBFDEEFF?EFFFFFF;FFFFFFFCFGFFFFEF MC:Z:77M568N73M MD:Z:150PG:Z:MarkDuplicates.3 RG:Z:RG2 NH:i:1 NM:i:0 MQ:i:255 UQ:i:0 AS:i:303
    V350007172L4C004R0430876168 147 chr17 7674859 60 645H73M = 7673835 -1097 CTCAGGCGGCTCATAGGGCACCACCACACTATGTCGAAAAGTGTTTCTGTCATCCAAATACTCCACACGCAAA GFFFGGFFGF<EFGFFFFGFGGFFFGFGGGGGFFGGFFFFGEGFFFGGGFFAFGFFAGEFFFFFFFFGGFFFF MC:Z:3M343N110M568N37M MD:Z:150 PG:Z:MarkDuplicates.3 RG:Z:RG2 NH:i:1 NM:i:0 MQ:i:255 UQ:i:0 AS:i:303

    This is the 4.2.0.0 split

    V350007172L4C004R0430876168	99	chr17	7673835	60	3M147S	=	7674214	1097	CCACTGGAGTCTTCCAGTGTGATGATGGTGAGGATGGGCCTCCGGTTCATGCCGCCCATGCAGGAACTGTTACACATGTAGTTGTAGTGGATGGTGGTACAGTCAGAGCCAACCTCAGGCGGCTCATAGGGCACCACCACACTATGTCGA	EFFFFFF@FFFFFFFF1FFFFDFFFFEFFFFFFEEFFFFEFFFFEEFEEFFFFFDFFFFFFFFFBF<FFFFEDFFFFFBFFFFFFFFFFFFFFEBFBAGFGFFEGF5FFF?FFFFF@FBFDEEFF?EFFFFFF;FFFFFFFCFGFFFFEF	SA:Z:chr17,7674181,+,3S110M37S,60,*;chr17,7674859,+,113S37M,60,*;	MC:Z:77M73S	PG:Z:MarkDuplicates.3	RG:Z:RG2	MQ:i:255	UQ:i:0	AS:i:303
    V350007172L4C004R0430876168 2147 chr17 7674181 60 3S110M37S = 7674214 1097 CCACTGGAGTCTTCCAGTGTGATGATGGTGAGGATGGGCCTCCGGTTCATGCCGCCCATGCAGGAACTGTTACACATGTAGTTGTAGTGGATGGTGGTACAGTCAGAGCCAACCTCAGGCGGCTCATAGGGCACCACCACACTATGTCGA EFFFFFF@FFFFFFFF1FFFFDFFFFEFFFFFFEEFFFFEFFFFEEFEEFFFFFDFFFFFFFFFBF<FFFFEDFFFFFBFFFFFFFFFFFFFFEBFBAGFGFFEGF5FFF?FFFFF@FBFDEEFF?EFFFFFF;FFFFFFFCFGFFFFEF SA:Z:chr17,7673835,+,3M147S,60,*;chr17,7674859,+,113S37M,60,*; MC:Z:77M73S PG:Z:MarkDuplicates.3 RG:Z:RG2 MQ:i:255 UQ:i:0 AS:i:303
    V350007172L4C004R0430876168 147 chr17 7674214 60 77M73S = 7673835 -1097 GGCCTCCGGTTCATGCCGCCCATGCAGGAACTGTTACACATGTAGTTGTAGTGGATGGTGGTACAGTCAGAGCCAACCTCAGGCGGCTCATAGGGCACCACCACACTATGTCGAAAAGTGTTTCTGTCATCCAAATACTCCACACGCAAA >GGEGFGGGF*GGFGFFFGBEGFGDGGGFFDFFFFGFFEGEFE@GGGFFGFGFGFFFFGFGGGFGFFGFGFGFGG@FGFFFGGFFGF<EFGFFFFGFGGFFFGFGGGGGFFGGFFFFGEGFFFGGGFFAFGFFAGEFFFFFFFFGGFFFF SA:Z:chr17,7674859,-,77S73M,60,*; MC:Z:3M147S PG:Z:MarkDuplicates.3 RG:Z:RG2 MQ:i:255 UQ:i:0 AS:i:303
    V350007172L4C004R0430876168 2147 chr17 7674859 60 113S37M = 7674214 1097 CCACTGGAGTCTTCCAGTGTGATGATGGTGAGGATGGGCCTCCGGTTCATGCCGCCCATGCAGGAACTGTTACACATGTAGTTGTAGTGGATGGTGGTACAGTCAGAGCCAACCTCAGGCGGCTCATAGGGCACCACCACACTATGTCGA EFFFFFF@FFFFFFFF1FFFFDFFFFEFFFFFFEEFFFFEFFFFEEFEEFFFFFDFFFFFFFFFBF<FFFFEDFFFFFBFFFFFFFFFFFFFFEBFBAGFGFFEGF5FFF?FFFFF@FBFDEEFF?EFFFFFF;FFFFFFFCFGFFFFEF SA:Z:chr17,7673835,+,3M147S,60,*;chr17,7674181,+,3S110M37S,60,*; MC:Z:77M73S PG:Z:MarkDuplicates.3 RG:Z:RG2 MQ:i:255 UQ:i:0 AS:i:303
    V350007172L4C004R0430876168 2195 chr17 7674859 60 77S73M = 7673835 -1097 GGCCTCCGGTTCATGCCGCCCATGCAGGAACTGTTACACATGTAGTTGTAGTGGATGGTGGTACAGTCAGAGCCAACCTCAGGCGGCTCATAGGGCACCACCACACTATGTCGAAAAGTGTTTCTGTCATCCAAATACTCCACACGCAAA >GGEGFGGGF*GGFGFFFGBEGFGDGGGFFDFFFFGFFEGEFE@GGGFFGFGFGFFFFGFGGGFGFFGFGFGFGG@FGFFFGGFFGF<EFGFFFFGFGGFFFGFGGGGGFFGGFFFFGEGFFFGGGFFAFGFFAGEFFFFFFFFGGFFFF SA:Z:chr17,7674214,-,77M73S,60,*; MC:Z:3M147S PG:Z:MarkDuplicates.3 RG:Z:RG2 MQ:i:255 UQ:i:0 AS:i:303
    0
    Comment actions Permalink
  • Avatar
    SkyWarrior

    Oops big oops. My mistake. 

    So I see the biggest change now. 3.8-1 splits the reads and hardclips edges but does not mark the split read as supplementary but rather keeps it as primary read. 4.2.0.0 splits and softclips edges and marks new splits as supplementary. My IGV was set to filter supplementary reads so I am sorry this was totally my mistake. 

    Now I am wondering if keeping them supplementary has any effect on the variant calling metrics. I will test that and see if I observe anything unusual. 

    Sorry for bothering you all with this issue. 

    Thanks again. 

    0
    Comment actions Permalink
  • Avatar
    Megan Shand

    Ah gotcha, great. For what it's worth, HaplotypeCaller does look at supplementary reads. If you're using a different caller it would definitely be a good idea to make sure that supplementary reads are used as evidence. I believe this change was in an effort to make the BAM files coming out of SplitNCigar reads spec compliant. 

    0
    Comment actions Permalink
  • Avatar
    Rich Roberts

    Hi,

    I am a bit of a novice here so I apologize in advance if I am not understanding. I am observing the same thing with soft clipping as mentioned above. However I am not clear on why this is the expected behavior. The read depth on the left of exons still seems to be lower than expected from the number of reads providing support for it. Also, given the longer read lengths now available, wouldnt one want to be able to use that data? It seems like tossing all of the supplementary reads out of consideration for SNV calling or calculating read depth is not optimal for processing RNAseq data. 

    0
    Comment actions Permalink
  • Avatar
    Megan Shand

    Hi Rich Roberts,

    SplitNCigarReads should only be used for variant calling. HaplotypeCaller should be looking at supplementary reads so the read depth should be unaffected. You will want to run HaplotypeCaller with the -dont-use-soft-clipped-bases argument so that the soft clipped bases will be ignored. The bases will still be accounted for in the corresponding supplementary alignment which will be used by HaplotypeCaller. Let me know if that doesn't address your question (and if not please include a test case example if possible to help me understand the issue you're seeing).

    Thanks!

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk