Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data


HaplotypeCaller and SoftClipped bases

3 comments

  • Gökalp Çelik

    Hi LucaB

    This looks like a bad assembly in that region. However, have you checked where those soft-clipped bases match on the genome, or whether they match anywhere at all? The three T sites at the ends of the reads were not clipped, so they triggered an assembly over that region. The soft clips themselves look quite bad. If they match somewhere precisely, they may indicate a larger deletion, insertion, or translocation. Worse, since they all look the same, they could be the result of adapter contamination.

    Can you confirm whether that is the case? 

  • LucaB

    Hi,

    The adapters were removed before aligning with BWA, and the clipped reads have no supplementary alignments. However, the clipped portions still align perfectly ~200 bases upstream.

    Beyond whether this represents a structural variant or not, do you know why this variant is still called with 28 alternative "T" bases despite using --dont-use-soft-clipped-bases true?

    Thanks!

  • Gökalp Çelik

    Hi LucaB

    This is due to the reassembly and local realignment performed by HaplotypeCaller. Assembly errors sometimes manifest as erroneous calls like this one, most of which can be filtered out by hard filtering or VQSR. In your case, a retro-integration product of the same gene elsewhere in the genome is contaminating your gene of interest, and those retro-integration fragments cause this particular variant to show up in your sample. If this is a single sample, hard-filtering thresholds may be enough to get rid of these kinds of calls. Otherwise, in a cohort these variants will almost certainly be eliminated by VQSR or VETS because of their inbreeding coefficients.

    Disabling the use of soft-clipped bases does not guarantee that you won't get such variants, since the other read of the pair will still provide the k-mers needed to produce the very same assembly in the region. You may need to enable the ExcessiveEndClipping read filter to remove those reads completely.

    To ultimately prevent this, you may need to remove the reads arising from those retro-integration sites, but I am not sure whether any tools exist to easily remove them from a dataset.

    I hope this helps.

    Regards. 

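
For readers who want to inspect the reads discussed in this thread, here is a minimal Python sketch of two steps mentioned above: pulling out the soft-clipped portion of a read so it can be searched against the genome, and flagging reads whose end clipping exceeds a threshold. It works on plain SAM CIGAR and SEQ strings. The function names and the `max_clipped` threshold are hypothetical, chosen for illustration; this is not GATK's read filter implementation, only an approximation of the idea under those assumptions.

```python
import re

# One CIGAR element: a length followed by a SAM operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a SAM CIGAR string such as '28S72M' into (length, op) pairs."""
    if cigar == "*":  # SAM uses '*' when no alignment is available
        return []
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Reject strings that contain anything other than valid elements.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return ops


def soft_clip_lengths(cigar):
    """Return (leading, trailing) soft-clip lengths, ignoring hard clips."""
    ops = [(n, op) for n, op in parse_cigar(cigar) if op != "H"]
    leading = ops[0][0] if ops and ops[0][1] == "S" else 0
    trailing = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return leading, trailing


def clipped_segments(seq, cigar):
    """Return the (leading, trailing) soft-clipped bases of a read.

    SEQ in a SAM record includes soft-clipped bases but not hard-clipped
    ones, so the clipped bases are simply the two ends of SEQ.
    """
    leading, trailing = soft_clip_lengths(cigar)
    return seq[:leading], seq[len(seq) - trailing:]


def has_excessive_end_clipping(cigar, max_clipped):
    """True if either end has more than max_clipped soft-clipped bases.

    max_clipped is a hypothetical threshold for this sketch.
    """
    return max(soft_clip_lengths(cigar)) > max_clipped
```

The segments returned by `clipped_segments` could be written to a FASTA file and aligned separately to see where, if anywhere, they match. In practice the CIGAR and SEQ values would come from a BAM file read with a library such as pysam rather than from hand-typed strings.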
