Difference in AD/GT output for HaplotypeCaller GVCF vs VCF
Hi,
I've been trying to process some planarian whole genome sequencing data and wanted to take advantage of the CNN filtration tool. I initially did the multi-sample pipeline using HaplotypeCaller with the -ERC GVCF flag (first command below), but then wanted to go back and run the samples independently so they can be used for CNNScoreVariants. I noticed that some of the sites disagree between the HaplotypeCaller outputs in these two steps. For instance, here are some calls at the same site for the sample sample (JG1) from each of the two runs.
The single-sample GVCF has this line
dd_Smes_g4_1 45120 . C <NON_REF> . . END=45120 GT:DP:GQ:MIN_DP:PL 0/0:20:0:20:0,0,0
The merged VCF after using GenomicsDBImport and GenotypeGVCFs has this line, where JG1 is the first sample
dd_Smes_g4_1 45120 . C T 425.36 . AC=2;AF=0.5;AN=4;BaseQRankSum=4.81;DP=380;ExcessHet=0.9691;FS=1.04;InbreedingCoeff=-0.0185;MLEAC=5;MLEAF=1;MQ=39.98;MQRankSum=1.14;QD=7.33;ReadPosRankSum=1.38;SOR=0.965 GT:AD:DP:GQ:PGT:PID:PL:PS ./.:20,0:20:0:.:.:0,0,0:../.:20,0:20:0:.:.:0,0,283:. ./.:27,0:27:0:.:.:0,0,126:. ./.:30,0:30:0:.:.:0,0,0:. ./.:20,0:20:0:.:.:0,0,58:. ./.:27,0:27:0:.:.:0,0,0:../.:25,0:25:0:.:.:0,0,35:. 0|1:16,16:32:76:0|1:45091_T_C:283,0,76:45091 0|1:17,9:26:99:0|1:45091_T_C:137,0,104:45091 ./.:23,0:23:0:.:.:0,0,0:. ./.:23,0:23:0:.:.:0,0,0:. ./.:27,0:27:0:.:.:0,0,0:. ./.:20,0:20:0:.:.:0,0,105:. ./.:30,0:30:0:.:.:0,0,105:. ./.:27,0:27:0:.:.:0,0,0:.
These two seem to totally agree on the allelic depth, total depth, and genotype of sample JG1, which makes sense. When I re-run JG1, without the -ERC GVCF flag, and providing a set of candidate variants (second command below) however, I find this line in the output
dd_Smes_g4_1 45120 . C T 160.64 . AC=1;AF=0.5;AN=2;BaseQRankSum=2.964;DP=20;ExcessHet=0;FS=0;MLEAC=1;MLEAF=0.5;MQ=39.41;MQRankSum=0.541;QD=8.03;ReadPosRankSum=-1.363;SOR=0.693 GT:AD:DP:GQ:PL 0/1:10,10:20:66:168,0,66
Somehow, although HaplotypeCaller is consistent in finding the depth to be 20, it has now assigned 10 reads to the alternate allele. Am I misinterpreting something or could this be some kind of bug?
REQUIRED for all errors and issues:
a) GATK version used: Running GATK v4.6.0.0 through the broadinstitute Docker image on Dockerhub
b) Exact command used: Two different versions of HaplotypeCaller
1. gatk HaplotypeCaller -R $GROUP_SCRATCH/Refs/dd_Smes_g4/final_dd_Smed_g4.fa -I $SCRATCH/Planarian_Genotyping/Alignments/JG1.recal.bam -O $SCRATCH/Planarian_Genotyping/Variants/JG1.hc_init.g.vcf.gz -ERC GVCF --max-reads-per-alignment-start 0
2. gatk HaplotypeCaller -I /scratch/data/PlanarianWGS20240117/Alignments/JG1.recal.bam --alleles /scratch/data/PlanarianWGS20240117/Variants/hc_merged_top_variants.vcf.gz -L /scratch/data/PlanarianWGS20240117/Variants/top_variants.bed -R /scratch/data/Refs/dd_Smes_g4/final_dd_Smed_g4.fa -O /scratch/data/PlanarianWGS20240117/Variants/hapcaller/JG1.hc.vcf.gz
These were done on different machines, both using the same Docker image, but the JG1.recal.bam file is the same in both cases and has been processed following best practices as best as possible (bwa-mem -> MarkDuplicates -> BQSR). After running the first command, I combined 15 samples with GenomicsDBImport and GenotypeGVCFs, and took the most promising candidate variants to make the hc_merged_top_variants.vcf.gz and top_variants.bed hoping to genotype those specific sites independently to then filter with CNNScoreVariants.
c) Entire program log: Not sure which specific output would be useful here since it's comparing different runs, please let me know.
-
Hi Jesse Gibson
Here you are using HaplotypeCaller in 2 different ways. One is without any interval restrictions and all reads are accessible to determine active regions and another is forcing certain alleles to be called and reads are restricted with an interval file.
Looking at the original site in the GVCF and joint genotyped VCF it is pretty much a NO_CALL due to ties seen in the PL field. It is possible that the region is a repeat rich region therefore assembly could not decide either haplotypes and resulted in a tie yet a NO_CALL.
Once you limit the number of reads with an interval file and force calling within that site it is possible that HaplotypeCaller gets a restricted amount of reference and reads which in turn favors for one of the haplotypes that the original call called as tie.
This is the known non-deterministic behavior of HaplotypeCaller and this result could totally be expected.
You might want to call the very same region without a bed file and without an allele file to see if there is a call coming up as a resulting VCF without the -ERC GVCF mode. Alternatively you may want to add padding you your intervals using -ip parameter. Preferably adding around 150 to 200bps may change the result that you are observing with the original bed file and alleles file.
I hope this helps.
Regards.
-
Hi Gökalp Çelik, thanks for the quick reply! Could you clarify how you're interpreting the PL field there? I don't totally understand what's supposed to be stored there.
I tried re-running the last command but just adding -ip 200 as you suggested and it gave a brand new result with allele depths of 6 and 9, but still closest to the heterozygous call I suppose. Here's the complete line
dd_Smes_g4_1 45120 . C T 170.64 . AC=1;AF=0.5;AN=2;BaseQRankSum=2.423;DP=15;ExcessHet=0;FS=0;MLEAC=1;MLEAF=0.5;MQ=40.62;MQRankSum=-0.577;QD=11.38;ReadPosRankSum=-1.1;SOR=0.418 GT:AD:DP:GQ:PL 0/1:6,9:15:46:178,0,46
I can also try just doing the non-GVCF HaplotypeCaller as the initial round of calling, but I really would like to be able to genotype the same set of sites in all samples and that seems most straightforward by doing the joint genotyping. Adding a bit more confusion to the pot, I tried running Mutect2 on these samples since there could be some level of heterogeneity in the cells from each (although if I can only call the ones that are nearly heterozygous or homozygous alt for a subset of samples that wouldn't be the worst), and it seems like it agrees with the heterozygous 10/10 call. Here's the command and the output at that line. Since Mutect2 was run on all 15 samples, the one I've been focusing on is the first entry here
gatk Mutect2 -R $GROUP_SCRATCH/Refs/dd_Smes_g4/final_dd_Smed_g4.fa \
-O dd_Smes_g4_1.vcf.gz \
-L dd_Smes_g4_1 \
--f1r2-tar-gz dd_Smes_g4_1.f1r2.tar.gz \
-max-af 0.5 --max-reads-per-alignment-start 0 \
-I $SCRATCH/Planarian_Genotyping_20240118/Alignments/JG10.recal.bam \
-I $SCRATCH/Planarian_Genotyping_20240118/Alignments/JG11.recal.bam \
-I $SCRATCH/Planarian_Genotyping_20240118/Alignments/JG12.recal.bam \
-I $SCRATCH/Planarian_Genotyping_20240118/Alignments/JG13.recal.bam \
-I $SCRATCH/Planarian_Genotyping_20240118/Alignments/JG14.recal.bam \
-I $SCRATCH/Planarian_Genotyping_20240118/Alignments/JG15.recal.bam \
-I $SCRATCH/Planarian_Genotyping_20240118/Alignments/JG17.recal.bam \
-I $SCRATCH/Planarian_Genotyping_20240118/Alignments/JG1.recal.bam \
-I $SCRATCH/Planarian_Genotyping_20240118/Alignments/JG2.recal.bam \
-I $SCRATCH/Planarian_Genotyping_20240118/Alignments/JG3.recal.bam \
-I $SCRATCH/Planarian_Genotyping_20240118/Alignments/JG4.recal.bam \
-I $SCRATCH/Planarian_Genotyping_20240118/Alignments/JG6.recal.bam \
-I $SCRATCH/Planarian_Genotyping_20240118/Alignments/JG7.recal.bam \
-I $SCRATCH/Planarian_Genotyping_20240118/Alignments/JG8.recal.bam \
-I $SCRATCH/Planarian_Genotyping_20240118/Alignments/JG9.recal.bamdd_Smes_g4_1 45120 . C T . . AS_SB_TABLE=84,115|69,97;DP=378;ECNT=3;ECNTH=3;MBQ=16,18;MFRL=345,346;MMQ=40,40;MPOS=36;POPAF=7.3;TLOD=281.18 GT:AD:AF:DP:F1R2:F2R1:FAD:PGT:PID:PS:SB 0|1:10,10:0.537:20:0,1:0,2:10,10:0|1:45091_T_C:45091:6,4,6,4 0|1:16,3:0.216:19:0,0:0,0:15,3:0|1:45091_T_C:45091:7,9,2,1 0|1:17,10:0.421:27:0,1:0,0:17,10:0|1:45091_T_C:45091:10,7,7,3 0|1:13,15:0.57:28:0,0:0,0:13,14:0|1:45091_T_C:45091:5,8,7,8 0|1:11,8:0.487:19:0,0:0,0:10,8:0|1:45091_T_C:45091:5,6,3,5 0|1:14,13:0.536:27:0,1:0,0:14,13:0|1:45091_T_C:45091:7,7,6,7 0|1:12,11:0.533:23:0,0:0,0:12,11:0|1:45091_T_C:45091:5,7,8,3 0|1:16,16:0.535:32:0,1:0,1:16,15:0|1:45091_T_C:45091:7,9,6,10 0|1:17,9:0.405:26:0,0:0,0:16,9:0|1:45091_T_C:45091:7,10,3,6 0|1:9,12:0.615:21:0,0:0,0:9,12:0|1:45091_T_C:45091:2,7,1,11 0|1:12,11:0.543:23:0,0:0,0:11,11:0|1:45091_T_C:45091:2,10,4,7 0|1:13,14:0.571:27:0,0:0,0:12,13:0|1:45091_T_C:45091:9,4,4,10 0|1:11,7:0.43:18:0,1:0,0:11,7:0|1:45091_T_C:45091:4,7,3,4 0|1:17,12:0.466:29:0,0:0,0:17,12:0|1:45091_T_C:45091:7,10,2,10 0|1:11,15:0.635:26:0,0:0,0:11,15:0|1:45091_T_C:45091:1,10,7,8
-
Hi Jesse Gibson
PL values are Phred scaled (-10*log(P(Genotype | Data))). So the lower the value the higher the probability of a genotype to be. You may refer to the below document for a more detailed explanation.
When you see a PL array as 0,0,0 for a diploid locus it means that variant caller could not decide whether that locus is HOMREF or HET or HOMVAR since all probabilities are equal. You can also notice that GQ value is 0 for that site which means the quality of the genotype present at that locus is low. This could be due to a region with high repeat count, homopolymers, high number of read or basecalling errors present at that region.
Forcing calling a particular variant at the loci may show the presence of it but it does not remove the probability of some other variants to be present at the same time.
Using Mutect2 may help getting allele fractions for potentially contaminated regions however Mutect2 is so sensitive that it may return any difference from the reference contig as a variant in the file and it does not report any zygosity at all. GT field is pretty much useless when we are speaking about Mutect2. Also we do not recommend using Mutect2 alone as it requires advanced filtering settings provided by FilterMutectCalls tool.
We do not want to interfere with your interpretation of the data however you may need to perform a proper variant filtration to remove any sites with low GQ and ambiguous PL results.
Regards.
Please sign in to leave a comment.
3 comments