Germline pipeline - incorrect read depth reported after filtering variants.
After filtering variants, the read depth is incorrect:
a) GATK version used: 4.5.0.0
b) Exact command used:
HaplotypeCaller on one sample (C419T_Matrigel)
export OMP_NUM_THREADS=4; gatk --java-options \"-Xmx14g -XX:ParallelGCThreads=4\" HaplotypeCaller -I BSQR.dir/C419T_Matrigel_sorted.bsqr.bam -R /data/genome/hg38ucsc/hg38_no_alt.fa -ERC GVCF -O /tmp/tmpogpvl_qf.vcf.gz 2> variants.dir/C419T_Matrigel.log && mv /tmp/tmpogpvl_qf.vcf.gz variants.dir/C419T_Matrigel.vcf.gz
I get the following (outputting only the relevant variant):
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT LH00129_122_22JJGMLT3_2_CGTAGTAA
chr1 94903018 . T TA,TAA,TAAA,TAAAA,<NON_REF> 1145.60 . BaseQRankSum=-0.382;DP=106;ExcessHet=0.0000;MLEAC=0,0,1,0,0;MLEAF=0.00,0.00,0.500,0.00,0.00;MQRankSum=0.000;RAW_MQandDP=381600,106;ReadPosRankSum=0.303 GT:AD:DP:GQ:PL:SB 0/3:34,16,14,26,6,0:96:99:1153,565,1158,287,766,1244,0,426,842,1378,571,915,1240,1423,2153,859,1230,1429,1356,1879,2040:22,12,47,15
I am not entirely sure how to interpret this line, but I think that the depth is 106 (DP=106).
CombineGVCFs with other samples
export OMP_NUM_THREADS=8; gatk --java-options \"-Xmx30g -XX:ParallelGCThreads=8\" CombineGVCFs -R /data/genome/hg38ucsc/hg38_no_alt.fa -V variants.dir/C419T_Matrigel.vcf.gz -V variant_sample2.vcf.gz -V variant_sample3.vcf.gz -O /tmp/tmp48ygy3gg.vcf.gz 2> variants.dir/mergedSamples.log && mv /tmp/tmp48ygy3gg.vcf.gz variants.dir/mergedSamples.vcf.gz
I get the following result for that variant and sample in this file:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT LH00129_122_22JJGMLT3_2_CGTAGTAA
chr1 94903018 . TAAAA TAAA,TAAAAA,TAAAAAA,T,TA,TAA,TAAAAAAA,TAAAAAAAA,<NON_REF> .. BaseQRankSum=0.720;DP=2381;ExcessHet=0.00;MQRankSum=0.00;RAW_MQandDP=8130676,2260;ReadPosRankSum=0.298 GT:AD:DP:GQ:MIN_DP:PGT:PID:PL:PS:SB ./.:34,0,16,14,0,0,0,26,6,0:96:99:.:.:.:1153,859,2040,565,1230,1158,287,1429,766,1244,859,2040,1230,1429,2040,859,2040,1230,1429,2040,2040,859,2040,1230,1429,2040,2040,2040,0,1356,426,842,1356,1356,1356,1378,571,1879,915,1240,1879,1879,1879,1423,2153,859,2040,1230,1429,2040,2040,2040,1356,1879,2040:.:22,12,47,15
I am already seeing here the REF allele: TAAAA, the alternative (8 assuming <NON_REF> doesn't mean anything): TAAA,TAAAAA,TAAAAAA,T,TA,TAA,TAAAAAAA,TAAAAAAAA,<NON_REF>
So 9 total alleles, but I see 10 numbers, then ":" and what I am assuming is the total: 96:
34,0,16,14,0,0,0,26,6,0:96
Which, if I am assuming correctly, the sum is correct: 34 + 0 + 16 + 14 + 0 + 0 + 0 + 26 + 6 = 96.
What is the extra number standing for?
After this, Variant Quality Score Recalibration is performed as per guidelines (first on SNPs and then on INDELs + MIXED):
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT LH00129_122_22JJGMLT3_2_CGTAGTAA
chr1 94903018 . TAAA TAA,TAAAA,TAAAAA,T,TA,TAAAAAA 9176.09 PASS AC=4,5,5,1,2,1;AF=0.067,0.083,0.083,0.017,0.033,0.017;AN=60;BaseQRankSum=0.720;DP=2381;ExcessHet=16.1063;FS=0.000;InbreedingCoeff=-0.4380;MLEAC=4,5,5,1,2,1;MLEAF=0.067,0.083,0.083,0.017,0.033,0.017;MQ=59.98;MQRankSum=0.00;QD=5.69;ReadPosRankSum=0.298;SOR=0.719;VQSLOD=2.42;culprit=QD GT:AD:DP:GQ:PGT:PID:PL:PS 0/6:34,0,16,14,0,0,26:96:99:.:.:1153,859,2040,565,1230,1158,287,1429,766,1244,859,2040,1230,1429,2040,859,2040,1230,1429,2040,2040,0,1356,426,842,1356,1356,1378
Now the reference allele has changed from TAAAA to TAAA
How is this possible? Are we know considering a shorter reference sequencing (removing one A)?
The alternate alleles seem to have changed from 8:
TAAA,TAAAAA,TAAAAAA,T,TA,TAA,TAAAAAAA,TAAAAAAAA,<NON_REF>
To 6:
TAA,TAAAA,TAAAAA,T,TA,TAAAAAA
Consistently with this, now we have 7 comma separated fields, but the total depth is still 96:
34,0,16,14,0,0,26:96
Shouldn't this be corrected after disregarding alleles to reflect the correct sum: 34 + 0 + 16 + 14 + 0 + 0 + 26 = 90 instead of 96?
This is carried on in the hardfiltering, annotation and output to table.
c) Entire program log:
-
The numbers are quite correct and although they may not match exactly what you observe under IGV, they are derived from the local reassembly and realignment operation. We have a short writing to explain that in the link below.
Looking at those numbers in the GVCF which also includes reference confidence values inside the distribution of those values are as follows
|----|----|-----|------|-------|-----------|
| T | TA | TAA | TAAA | TAAAA | <NON_REF> |
|----|----|-----|------|-------|-----------|
| 34 | 16 | 14 | 26 | 6 | 0 |
|----|----|-----|------|-------|-----------|T is the REF allele but that is also included and <NON_REF> is for anything that is not REFERENCE but also not those chosen alternates.
Looking at those values in the combined GVCF
|-------|------|--------|---------|---|----|-----|----------|-----------|-----------|
| TAAAA | TAAA | TAAAAA | TAAAAAA | T | TA | TAA | TAAAAAAA | TAAAAAAAA | <NON_REF> |
|-------|------|--------|---------|---|----|-----|----------|-----------|-----------|
| 34 | 0 | 16 | 14 | 0 | 0 | 0 | 26 | 6 | 0 |
|-------|------|--------|---------|---|----|-----|----------|-----------|-----------|Values match those alleles like the above. The reason why the reference allele changes when combined is due to other samples showing more tendency towards the deletion alelles rather than insertion allele found in the sample you showed. Since all sites are represented as the left-most aligned versions in the GVCF/VCF file once the sample is alone you have T to TAAA insertion picked for your sample originally. When combined with others which prefer deletion there then your insertion alelle get shifted towards right side and becomes TAAAA to TAAAAAAA which is not a wrong representation for that single sample.
Once genotyped some more of those alleles are prunned therefore the reference allele may show slight changes in the final VCF so whatever you observe here is the correct and the expected behavior for GATK.
I hope this helps.
Please sign in to leave a comment.
1 comment