HaplotypeCaller: data generated from amplicon sequencing
Hello,
I'm using GATK 4.1.4.1.
I performed:
1. alignment with BWA mem
2. GATK BQSR
3. HaplotypeCaller GVCF mode
gatk --java-options "-Xmx4G" HaplotypeCaller -R hg19.fa -I file_recal_reads.bam --emit-ref-confidence GVCF -L /interval.bed --dbsnp dbsnp_138.hg19.vcf.gz -O file.g.vcf --bamout file.bam
4. GenomicsDBImport
5. GenotypeGVCFs
We have data generated from amplicon sequencing (MIP, Molecular Inversion Probes) and thus we cannot perform duplicate marking or filtering steps like end-distance bias or strand-bias on called variants, because these sites are generally covered by reads in only one direction.
What I observed is that the depth (AD and DP) is lower in the GVCF and in the bamout than in the original BAM.
Example for one site:
original bam: total count 222, Allele A 222
vcf:
AC=2;AF=1.00;AN=2;DP=50;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=35.61;SOR=7.864 GT:AD:DP:GQ:PL 1/1:0,50:50:99:1823,150,0
This happens at many loci.
I saw some older posts (2016) describing some validated variants missed by HC using MIP data.
Have there been any improvements since then?
I was wondering whether there are options I should add when analysing amplicon sequencing data. Are there any guidelines?
Thanks
-
Data like this usually deserves a different approach, such as pileup-based calling.
I would strongly recommend using ABRA2 for SW realignment and freebayes or bcftools to capture variants.
HaplotypeCaller seems to need more time to work out how to handle data like this.
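A minimal sketch of the pileup route with bcftools, reusing the file names from the original post. The output name file.bcftools.vcf.gz and the depth cap of 10000 are hypothetical values chosen for illustration, and the ABRA2 realignment step is not shown. bcftools mpileup limits per-file depth by default, so the cap is raised here with -d to keep deep amplicon sites from being truncated.

```shell
#!/usr/bin/env bash
set -euo pipefail

# File names taken from the original post.
ref=hg19.fa
bam=file_recal_reads.bam
targets=/interval.bed
max_depth=10000   # hypothetical cap, chosen to exceed the expected amplicon depth

# Pileup restricted to the target regions, with per-sample AD and DP annotated.
mpileup_cmd=(bcftools mpileup -f "$ref" -R "$targets" -a AD,DP -d "$max_depth" -Ou "$bam")
# Multiallelic caller, variant sites only, bgzip-compressed VCF output.
call_cmd=(bcftools call -mv -Oz -o file.bcftools.vcf.gz)

# Dry run by default: print the pipeline. Set DRY_RUN=0 to execute it.
if [[ "${DRY_RUN:-1}" == "1" ]]; then
  printf '%s | %s\n' "${mpileup_cmd[*]}" "${call_cmd[*]}"
else
  "${mpileup_cmd[@]}" | "${call_cmd[@]}"
fi
```

The BAM must be coordinate-sorted and indexed for -R to work.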
-
Hi Erika Salvi,
I faced the same problem and solved it by setting:
--dont-use-soft-clipped-bases true
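As a sketch, here is the HaplotypeCaller command from the original post with this option added. The second extra option, --max-reads-per-alignment-start 0, is an assumption and not something confirmed in this thread: reads from one amplicon share the same start position, HaplotypeCaller downsamples reads per alignment start, and the reported DP of 50 matches that option's documented default. Setting it to 0 disables the downsampling.

```shell
#!/usr/bin/env bash
set -euo pipefail

# File names taken from the original post.
ref=hg19.fa
bam=file_recal_reads.bam
intervals=/interval.bed

hc_cmd=(
  gatk --java-options "-Xmx4G" HaplotypeCaller
  -R "$ref"
  -I "$bam"
  -L "$intervals"
  --emit-ref-confidence GVCF
  --dbsnp dbsnp_138.hg19.vcf.gz
  --dont-use-soft-clipped-bases true   # the setting suggested in this reply
  --max-reads-per-alignment-start 0    # assumption, not from the thread: disables downsampling
  -O file.g.vcf
  --bamout file.bam
)

# Dry run by default: print the command. Set DRY_RUN=0 to execute it.
if [[ "${DRY_RUN:-1}" == "1" ]]; then
  printf '%s ' "${hc_cmd[@]}"
  printf '\n'
else
  "${hc_cmd[@]}"
fi
```

Comparing AD and DP at the example site before and after each change should show which option, if either, recovers the missing depth.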