Inconsistant QD values
We have used GATK v4.4.0.0 to halpotype call and then joint call 10 Whole Genome in a Bottle samples. The workflow was run twice with the only difference being that one call was called and then combined, using genomicDBimport and genotypeGVCF for each interval whereas the other call was called then the gVCF merged for each sample before running genomicDBimport and genotypeGVCF. The intervals were the same across the two calls.
The resulting number of variants was the same between the two different runs but from the 9372067 total variants there were 211675 differences between the variant annotation. Close inspection of the INFO field revealed that all the differences were limited to the AS_QD and QD values which, although similar were different. However, when we investigated, the GATK help describes the QD as being calculated from the QUAL value and the AD value. These values were the same between the two calls so why were the QD values different?
example command lines:
Pipeline using combined gVCFs-
gatk --java-options "-Xmx6553M -XX:-UsePerfData -XX:+UseSerialGC" \
HaplotypeCaller \
--input 33783267.DSXY.paired316.c0e8a07d87.cram \
--output RefStds_PCR-free8023507.haplotypecaller.chr1_120450582-120654461.g.vcf.gz \
--reference hs38DH.fa \
--dbsnp dbsnp_146.hg38.vcf.gz \
--intervals chr1_120450582-120654461.bed \
\
\
--tmp-dir . \
-ERC GVCF -G StandardAnnotation -G AS_StandardAnnotation -G StandardHCAnnotation
gatk --java-options "-Xmx6553M -XX:+UseSerialGC -XX:-UsePerfData" \
GenomicsDBImport \
--variant RefStds_PCR-free8023443.haplotypecaller.g.vcf.gz --variant RefStds_PCR-free8023459.haplotypecaller.g.vcf.gz --variant RefStds_PCR-free8023475.haplotypecaller.g.vcf.gz --variant RefStds_PCR-free8023491.haplotypecaller.g.vcf.gz --variant RefStds_PCR-free8023507.haplotypecaller.g.vcf.gz --variant RefStds_PCR8021331.haplotypecaller.g.vcf.gz --variant RefStds_PCR8021347.haplotypecaller.g.vcf.gz --variant RefStds_PCR8021363.haplotypecaller.g.vcf.gz --variant RefStds_PCR8021379.haplotypecaller.g.vcf.gz --variant RefStds_PCR8021395.haplotypecaller.g.vcf.gz \
--genomicsdb-workspace-path $WORKSPACE \
--intervals chr1_120450582-120654461.bed \
--batch-size 50 --reader-threads 1 -ip 500
gatk --java-options "-Xmx6553M -XX:+UseSerialGC -XX:-UsePerfData" \
GenotypeGVCFs \
--variant gendb://$WORKSPACE \
--output chr5_96781655-96895012.vcf.gz \
--reference hs38DH.fa \
\
--dbsnp dbsnp_146.hg38.vcf.gz \
-G StandardAnnotation -G AS_StandardAnnotation
Pipeline using intervals, start to finish-
gatk --java-options "-Xmx8g -XX:+UseSerialGC" HaplotypeCaller \
--input 33783267.DSXY.paired316.c0e8a07d87.cram \
--output RefStds_PCR-free8023507.haplotypecaller.chr1_120450582-120654461.g.vcf.gz \
--reference hs38DH.fa \
--dbsnp dbsnp_146.hg38.vcf.gz \
--intervals chr1_120450582-120654461.bed \
\
--tmp-dir . \
-ERC GVCF -G StandardAnnotation -G AS_StandardAnnotation -G StandardHCAnnotation
gatk --java-options "-Xmx8g -XX:+UseSerialGC" GenomicsDBImport \
--variant RefStds_PCR-free8023507.haplotypecaller.chr8_144581198-144807204.g.vcf.gz --variant RefStds_PCR-free8023491.haplotypecaller.chr8_144581198-144807204.g.vcf.gz --variant RefStds_PCR-free8023459.haplotypecaller.chr8_144581198-144807204.g.vcf.gz --variant RefStds_PCR-free8023443.haplotypecaller.chr8_144581198-144807204.g.vcf.gz --variant RefStds_PCR8021379.haplotypecaller.chr8_144581198-144807204.g.vcf.gz --variant RefStds_PCR-free8023475.haplotypecaller.chr8_144581198-144807204.g.vcf.gz --variant RefStds_PCR8021347.haplotypecaller.chr8_144581198-144807204.g.vcf.gz --variant RefStds_PCR8021331.haplotypecaller.chr8_144581198-144807204.g.vcf.gz --variant RefStds_PCR8021395.haplotypecaller.chr8_144581198-144807204.g.vcf.gz --variant RefStds_PCR8021363.haplotypecaller.chr8_144581198-144807204.g.vcf.gz \
--genomicsdb-workspace-path ${WORKSPACE} \
--intervals chr8_144581198-144807204.bed \
--batch-size 50 \
--reader-threads 1 \
-ip 500
gatk --java-options "-Xmx12g -XX:+UseSerialGC" GenotypeGVCFs \
--variant gendb://$WORKSPACE \
--output chr8_144581198-144807204.vcf.gz \
--reference hs38DH.fa \
\
--dbsnp dbsnp_146.hg38.vcf.gz \
--tmp-dir . \
-G StandardAnnotation -G AS_StandardAnnotation
Examples of differences between variant annotations -
(First line,in each case, is the joint calling without gVCF merging )
< chr1 96347263 rs10875002 G A 4670.53 . AC=4;AF=0.2;AN=20;AS_BaseQRankSum=.;AS_FS=0;AS_InbreedingCoeff=1;AS_MQ=60;AS_MQRankSum=.;AS_QD=25.36;AS_ReadPosRankSum=.;AS_SOR=0.761;DB;DP=486;ExcessHet=0;FS=0;InbreedingCoeff=1;MLEAC=4;MLEAF=0.2;MQ=60;QD=28.73;SOR=0.761 GT:AD:DP:GQ:PL 0/0:53,0:53:99:0,120,1800 0/0:42,0:42:99:0,105,1575 0/0:44,0:44:99:0,113,1755 0/0:39,0:39:99:0,100,1617 1/1:0,54:54:99:2163,162,0 0/0:52,0:52:99:0,120,1800 0/0:49,0:49:99:0,101,1800 0/0:46,0:46:99:0,99,1800 0/0:39,0:39:99:0,105,1575 1/1:0,66:66:99:2540,197,0
> chr1 96347263 rs10875002 G A 4670.53 . AC=4;AF=0.2;AN=20;AS_BaseQRankSum=.;AS_FS=0;AS_InbreedingCoeff=1;AS_MQ=60;AS_MQRankSum=.;AS_QD=30.97;AS_ReadPosRankSum=.;AS_SOR=0.761;DB;DP=486;ExcessHet=0;FS=0;InbreedingCoeff=1;MLEAC=4;MLEAF=0.2;MQ=60;QD=27.24;SOR=0.761 GT:AD:DP:GQ:PL 0/0:53,0:53:99:0,120,1800 0/0:42,0:42:99:0,105,1575 0/0:44,0:44:99:0,113,1755 0/0:39,0:39:99:0,100,1617 1/1:0,54:54:99:2163,162,0 0/0:52,0:52:99:0,120,1800 0/0:49,0:49:99:0,101,1800 0/0:46,0:46:99:0,99,1800 0/0:39,0:39:99:0,105,1575 1/1:0,66:66:99:2540,197,0
****************************************************
< chr1 96370367 rs695201 C G 4448.53 . AC=4;AF=0.2;AN=20;AS_BaseQRankSum=.;AS_FS=0;AS_InbreedingCoeff=1;AS_MQ=60;AS_MQRankSum=.;AS_QD=25.41;AS_ReadPosRankSum=.;AS_SOR=0.729;DB;DP=505;ExcessHet=0;FS=0;InbreedingCoeff=1;MLEAC=4;MLEAF=0.2;MQ=60;QD=28.5;SOR=0.729 GT:AD:DP:GQ:PL 0/0:59,0:59:99:0,116,1800 0/0:54,0:54:99:0,105,1575 0/0:58,0:58:99:0,105,1575 0/0:33,0:33:93:0,93,1395 1/1:0,52:52:99:2093,156,0 0/0:37,0:37:99:0,99,1485 0/0:48,0:48:99:0,111,1800 0/0:58,0:58:99:0,107,1755 0/0:45,0:45:99:0,110,1800 1/1:0,60:60:99:2388,180,0
> chr1 96370367 rs695201 C G 4448.53 . AC=4;AF=0.2;AN=20;AS_BaseQRankSum=.;AS_FS=0;AS_InbreedingCoeff=1;AS_MQ=60;AS_MQRankSum=.;AS_QD=35.19;AS_ReadPosRankSum=.;AS_SOR=0.729;DB;DP=505;ExcessHet=0;FS=0;InbreedingCoeff=1;MLEAC=4;MLEAF=0.2;MQ=60;QD=30.73;SOR=0.729 GT:AD:DP:GQ:PL 0/0:59,0:59:99:0,116,1800 0/0:54,0:54:99:0,105,1575 0/0:58,0:58:99:0,105,1575 0/0:33,0:33:93:0,93,1395 1/1:0,52:52:99:2093,156,0 0/0:37,0:37:99:0,99,1485 0/0:48,0:48:99:0,111,1800 0/0:58,0:58:99:0,107,1755 0/0:45,0:45:99:0,110,1800 1/1:0,60:60:99:2388,180,0
*****************************************************
< chr2 48066431 rs1878081 T C 23527.80 . AC=18;AF=0.9;AN=20;AS_BaseQRankSum=-1.55;AS_FS=2.014;AS_InbreedingCoeff=-0.1111;AS_MQ=60;AS_MQRankSum=0;AS_QD=30.02;AS_ReadPosRankSum=0.85;AS_SOR=0.55;BaseQRankSum=-1.46;DB;DP=673;ExcessHet=0.2348;FS=2.014;InbreedingCoeff=-0.1111;MLEAC=18;MLEAF=0.9;MQ=60;MQRankSum=0;QD=31.98;ReadPosRankSum=0.944;SOR=0.55 GT:AD:DP:GQ:PL 1/1:0,65:65:99:2553,195,0 1/1:0,64:64:99:2592,193,0 1/1:0,75:75:99:3094,226,0 1/1:0,55:55:99:2263,166,0 0/1:27,23:50:99:783,0,977 1/1:0,79:79:99:3172,238,0 1/1:0,57:57:99:2276,171,0 1/1:0,71:71:99:2923,214,0
1/1:0,71:71:99:2860,213,0 0/1:36,30:66:99:974,0,1279
> chr2 48066431 rs1878081 T C 23527.80 . AC=18;AF=0.9;AN=20;AS_BaseQRankSum=-1.55;AS_FS=2.014;AS_InbreedingCoeff=-0.1111;AS_MQ=60;AS_MQRankSum=0;AS_QD=27.51;AS_ReadPosRankSum=0.85;AS_SOR=0.55;BaseQRankSum=-1.46;DB;DP=673;ExcessHet=0.2348;FS=2.014;InbreedingCoeff=-0.1111;MLEAC=18;MLEAF=0.9;MQ=60;MQRankSum=0;QD=29.11;ReadPosRankSum=0.944;SOR=0.55 GT:AD:DP:GQ:PL 1/1:0,65:65:99:2553,195,0 1/1:0,64:64:99:2592,193,0 1/1:0,75:75:99:3094,226,0 1/1:0,55:55:99:2263,166,0 0/1:27,23:50:99:783,0,977 1/1:0,79:79:99:3172,238,0 1/1:0,57:57:99:2276,171,0 1/1:0,71:71:99:2923,214,0
1/1:0,71:71:99:2860,213,0 0/1:36,30:66:99:974,0,1279
-
Hi Allan Daly,
This is not unexpected. There's some (perhaps undocumented) random behavior of the QD annotation that applies here: https://github.com/broadinstitute/gatk/blob/cfd4d87ec29ac45a68f13a37f30101f326546b7d/src/main/java/org/broadinstitute/hellbender/tools/walkers/annotator/QualByDepth.java#L93. Specifically, we expect the quality of each read to be on the order of 30, which is a single mismatch with good base quality according to Illumina's quality estimates. So het variants will get QD values centered around 15 (half qual 30 read, half qual zero for ref, divide by the full depth) and hom-var variants will get QD values around 30. When there is more than one variant within the span of a read, then there are two mismatches that support the alternate haplotype and look different from the reference. The pairHMM algorithm determines that the quality for reads like that is much higher, i.e. we're much more confident they didn't come from the reference haplotype. However, these cases are relatively rare and the resulting QD values are obvious outliers because they're much higher. In order to prevent variants in phase with other variants from being filtered out, we "correct" their QD values by assigning them a value drawn from a Gaussian distribution centered around 30. Unfortunately because we're replacing these annotations with random values any additional call to the (pseudo) random number generator will change the values. It's hard to say what that additional call was in your case, but we've definitely encountered this previously in similar scenarios.
Please sign in to leave a comment.
1 comment