Dear GATK developers,
I’m analysing 1549 inbred lines using a Genotyping By Sequencing approach. I called the variants in the cohort using HaplotypeCaller v188.8.131.52 and UnifiedGenotyper v3.8-1-0 and compared the output. Briefly, with HaplotypeCaller I first generated a gVCF file for each accession, then created a database with all 1549 samples using GenomicsDBImport, and then performed joint genotyping using GenotypeGVCFs; with UnifiedGenotyper I joint-called variants using all the BAM files. Both VCFs were filtered using the GATK hard filters reported on this page:
I noticed that a position is filtered out with HaplotypeCaller (SOR filter) while it is retained with UnifiedGenotyper. The first record below is from HaplotypeCaller and the second from UnifiedGenotyper:
ctg40_segment0_pilon_pilon_pilon 1480093 . C T 14881.59 SOR3 AC=56;AF=0.549;AN=102;DP=243;ExcessHet=0.0000;FS=0.000;InbreedingCoeff=0.3469;MLEAC=1591;MLEAF=1.00;MQ=60.00;QD=29.58;SOR=3.524
ctg40_segment0_pilon_pilon_pilon 1480093 . C T 5761.19 PASS AC=90;AF=0.662;AN=136;BaseQRankSum=5.812;DP=243;Dels=0.00;ExcessHet=-0.0000;FS=42.489;HaplotypeScore=0.1212;InbreedingCoeff=0.6242;MLEAC=91;MLEAF=0.669;MQ=60.00;MQ0=0;MQRankSum=0.000;QD=28.64;ReadPosRankSum=-2.569;SOR=0.005
I would expect differences in the allele counts, because the two callers genotype the samples differently, but I did not expect such a large difference in the SOR value, which causes the position to be excluded.
So, my first question is whether the two tools use different methods to calculate this value.
I obtained another unexpected result when I added the WGS data of the parental genomes of these inbred lines to the analysis. The two parental gVCFs were imported into the database that I had already created, and then I performed joint genotyping with GenotypeGVCFs on all 1551 samples. I compared the HaplotypeCaller results for the inbred lines alone with those for the inbred lines plus the parental genomes. I noticed a position that was filtered out (SOR filter) before the parental genomes were added but retained once they were included.
#Without parental genomes
ctg40_segment0_pilon_pilon_pilon 1470012 . C T 918220.22 SOR3 AC=1480;AF=0.665;AN=2226;BaseQRankSum=0.456;DP=34409;ExcessHet=-0.0000;FS=0.000;InbreedingCoeff=0.5472;MLEAC=2103;MLEAF=0.945;MQ=60.00;MQRankSum=0.00;QD=28.68;ReadPosRankSum=0.00;SOR=3.771
#With parental genomes
ctg40_segment0_pilon_pilon_pilon 1470012 . C T 919099.66 PASS AC=1482;AF=0.665;AN=2230;BaseQRankSum=0.456;DP=34445;ExcessHet=-0.0000;FS=0.000;InbreedingCoeff=0.5479;MLEAC=2105;MLEAF=0.944;MQ=60.00;MQRankSum=0.00;QD=27.29;ReadPosRankSum=0.00;SOR=0.354
In this case too, the SOR value changes substantially. This was unexpected, considering that I added just two more samples and that the coverage at this position was already very high.
How can I obtain the SB_TABLE annotation so that I can calculate the SOR value myself for some positions of interest? And is such a change in SOR expected after including only two more samples?
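For reference, this is how I currently understand the SOR calculation, written as a minimal Python sketch. It assumes the formula from the StrandOddsRatio description: a pseudocount of 1 is added to each cell of the 2x2 strand table, and SOR is the natural log of the symmetric odds ratio scaled by the ref and alt strand ratios. The function name and the example counts are my own and are not GATK output, so please correct me if this does not match what the tools do.

```python
import math


def strand_odds_ratio(ref_fw, ref_rev, alt_fw, alt_rev, pseudocount=1.0):
    """Sketch of SOR from a 2x2 strand table (my assumption of the formula).

    Inputs are read counts: reference forward/reverse, alternate forward/reverse.
    """
    # Add a pseudocount to every cell so that zero counts do not divide by zero.
    rf = ref_fw + pseudocount
    rr = ref_rev + pseudocount
    af = alt_fw + pseudocount
    ar = alt_rev + pseudocount

    # Symmetric odds ratio: R + 1/R, with R = (rf * ar) / (rr * af).
    r = (rf * ar) / (rr * af)
    symmetric_ratio = r + 1.0 / r

    # Scale by how balanced each allele is across the two strands.
    ref_ratio = min(rf, rr) / max(rf, rr)
    alt_ratio = min(af, ar) / max(af, ar)

    return math.log(symmetric_ratio * ref_ratio / alt_ratio)


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the behaviour.
    print(round(strand_odds_ratio(10, 10, 10, 10), 3))   # balanced strands
    print(round(strand_odds_ratio(50, 50, 100, 0), 3))   # alt reads on one strand only
```

If this is right, a balanced table gives a value near ln(2), while alt reads confined to one strand push the value well above the hard-filter threshold of 3. What I am missing is the pooled strand table that GenotypeGVCFs actually used at these positions.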
Any insight would be much appreciated,
Thanks in advance,