HaplotypeCaller and SoftClipped bases
Hi,
When analyzing a WES sample for variant calling, we observed a heterozygous SNV with an allele depth (AD) of 36,28 in the gVCF/VCF. However, upon inspection in IGV, only 3 alternative bases are visible, while multiple reads carrying the alternative allele appear as clipped.
We are aware that HaplotypeCaller can use soft-clipped bases, but in this case we explicitly specified --dont-use-soft-clipped-bases true.
Here is the command line used:
"""
/home/tools/gatk-4.6.0.0/gatk HaplotypeCaller --tmp-dir /home/temp/ -R FASTA -I $bam -ERC GVCF --output $id.snps.raw.g.vcf.gz --standard-min-confidence-threshold-for-calling 30.0 --dont-use-soft-clipped-bases true --sample-ploidy 2 -bamout HC_output.bam
"""
This is the gVCF line:
chr6 136596980 . C T,<NON_REF> 774.64 . BaseQRankSum=0.364;DP=68;ExcessHet=0.0000;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=-4.287;RAW_MQandDP=220126,68;ReadPosRankSum=-1.638 GT:AD:DP:GQ:PGT:PID:PL:PS:SB 0|1:36,28,0:64:99:0|1:136596980_C_T:782,0,1355,890,1439,2330:136596980:19,17,20,8
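As a quick sanity check on that record, the FORMAT keys can be paired with the sample values to read the allele depths directly. This is a minimal Python sketch, assuming a whitespace-separated single-sample record; the function name `parse_sample_fields` is illustrative, and the INFO column is shortened to `DP=68` for brevity.

```python
def parse_sample_fields(vcf_line):
    """Pair FORMAT keys with the first sample's values for one VCF record."""
    cols = vcf_line.split()
    fmt_keys = cols[8].split(":")      # column 9 of a VCF record is FORMAT
    sample_vals = cols[9].split(":")   # column 10 is the first sample
    return dict(zip(fmt_keys, sample_vals))


# The record from this post, with the INFO column abbreviated.
record = (
    "chr6 136596980 . C T,<NON_REF> 774.64 . DP=68 "
    "GT:AD:DP:GQ:PGT:PID:PL:PS:SB "
    "0|1:36,28,0:64:99:0|1:136596980_C_T:"
    "782,0,1355,890,1439,2330:136596980:19,17,20,8"
)

fields = parse_sample_fields(record)
# AD lists one depth per allele: REF, ALT (T), <NON_REF>.
ref_depth, alt_depth, non_ref_depth = (int(x) for x in fields["AD"].split(","))
alt_fraction = alt_depth / (ref_depth + alt_depth)
print(f"REF={ref_depth} ALT={alt_depth} ALT fraction={alt_fraction:.3f}")
```

The AD values count reads as HaplotypeCaller assigned them after local reassembly, which is why they can differ from what the original alignments show in IGV.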
This is the IGV screen displaying the softclipped bases:
This is the IGV screen without the softclipped bases:
Thanks in advance!
Luca
-
Hi LucaB
This looks like a bad assembly in that region. However, have you checked where those soft-clipped bases map on the genome, or whether they map anywhere at all? The 3 T bases at the ends of reads were not clipped and therefore triggered an assembly over that region. The soft clips look severe, though: if they match somewhere precisely, they may indicate a larger deletion, insertion, or translocation; or, worse, they could be the result of adapter contamination, since they all look the same.
Can you confirm whether that is the case?
-
Hi,
The adapters were removed before aligning with BWA, and the clipped reads have no supplementary alignments. However, the clipped portions still align perfectly ~200 bases upstream.
Regardless of whether this represents a structural variant, do you know why this variant is still called with 28 alternative "T" bases despite using --dont-use-soft-clipped-bases true?
Thanks!
-
Hi LucaB
This is due to the reassembly and local realignment performed by HaplotypeCaller. Assembly errors sometimes manifest as erroneous calls like this one, most of which can be filtered out by hard filtering or VQSR. In your case, a retro-integration product of the same gene in the genome is contaminating your gene of interest, and those retro-integration fragments cause this particular variant to show up in your sample. If this is a single sample, it may be possible to remove these kinds of calls with hard-filtering thresholds. In a cohort, these variants will almost certainly be eliminated by VQSR or VETS because of their inbreeding coefficients.
Disabling soft-clipped bases does not guarantee that you won't get such variants, since the other read of the pair will still provide the k-mers needed to build the very same assembly in the region. You may need to enable the ExcessiveEndClippedReadFilter read filter to remove such reads completely.
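To illustrate the idea behind filtering on end clipping, here is a minimal Python sketch that counts soft-clipped bases at each end of a read from its CIGAR string and flags heavily clipped reads. It is not GATK's implementation. The function names and the `max_clipped=20` threshold are hypothetical and chosen only for the example.

```python
import re

# One CIGAR operation: a length followed by an operator from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def end_soft_clips(cigar):
    """Return (leading, trailing) soft-clipped base counts for a CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips (H) can only sit outside soft clips, so drop them first.
    core = [(n, op) for n, op in ops if op != "H"]
    if not core:
        return 0, 0
    lead = core[0][0] if core[0][1] == "S" else 0
    trail = core[-1][0] if len(core) > 1 and core[-1][1] == "S" else 0
    return lead, trail


def heavily_clipped(cigar, max_clipped=20):
    """True if either end has more than max_clipped soft-clipped bases.

    max_clipped is a hypothetical threshold for this sketch.
    """
    return max(end_soft_clips(cigar)) > max_clipped


for cigar in ["150M", "60S90M", "100M50S", "5H10S80M"]:
    print(cigar, end_soft_clips(cigar), heavily_clipped(cigar))
```

Reads flagged this way are dropped entirely, so their unclipped portions no longer contribute k-mers to the local assembly, unlike with --dont-use-soft-clipped-bases, which only ignores the clipped part.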
To ultimately prevent this, you may need to remove the reads arising from those retro-integration sites, but I am not sure whether any tools exist to easily remove them from a dataset.
I hope this helps.
Regards.