Genome Analysis Toolkit

Lorena · December 10, 2021 15:53

Hello! I am aware of this previous question (https://gatk.broadinstitute.org/hc/en-us/community/posts/360056098271-Missing-variant-in-VCF-from-GenotypeGVCFs) with a similar theme, and of [this explanation of AD vs DP](https://gatk.broadinstitute.org/hc/en-us/articles/360035532252-Allele-Depth-AD-is-lower-than-expected). However, I still do not know how to deal with the situation I found.

I have 159 Illumina HiSeq samples that are almost all pool sequencing of yeast cultures, except for two homozygous samples that represent the parental genotypes of the cultures. I am interested in seeing how the allele frequencies changes in the cultures. I mapped all samples with BWA v. 0.7.17 and GATK v. 4.1.4.1 with Java v. 1.8.0_151 to obtain the vcf file.

The variant calling was originally done like this:

gatk --java-options "-Xmx20g" HaplotypeCaller -R {GENOME} -I {sample}.bam -O {sample}.g.vcf.gz -ERC GVCF

gatk GenomicsDBImport --java-options "-Xmx{JavaMem}G" --tmp-dir=tmp --genomicsdb-workspace-path {output.rawvcf} -L {input.intervals} {variantlist}

Where `{variantlist}` is the list of samples in the style `--variant {sample.g.vcf.gz}`.

gatk --java-options "-Xmx{JavaMem}G" GenotypeGVCFs -R {input.ref} -V gendb://{input.my_database} -O {output.rawvcf} -G StandardAnnotation --new-qual

While analyzing the results, I encountered some variants that look unexpectedly "homozygous" (it's poolseq, so in other words one allele is fixed), but when I go back to the original BAM file in IGV, the sites looks very much heterozygous.

In the attached screen shot of IGV, I have three relevant variants with the two parents on the top and two samples on the bottom, let's call them sample A and B.

These two samples are very similar to each other (A is one time point in the experiment and B is the next time point). In the fist sample (A) the sites are called as heterozygous, but in the second sample (B) the AD values for the alternative allele are all 0. Here are the three sites in the vcf file (only A and B shown for simplicity):

BK006935.2 30066 . T A 364362 PASS AC=117;AF=0.368;AN=318;BaseQRankSum=0.071;DP=31584;ExcessHet=2.3616;FS=6.140;InbreedingCoeff=0.0129;MLEAC=117;MLEAF=0.368;MQ=59.95;MQRankSum=0.00;QD=26.28;ReadPosRankSum=-2.440e-01;SOR=0.405 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:177,92:269:99:0|1:30051_C_T:3480,0,7395:30051 0/0:451,0:451:99:.:.:0,120,1800

BK006935.2 30071 . G A 402006 PASS AC=117;AF=0.368;AN=318;BaseQRankSum=-1.092e+00;DP=32115;ExcessHet=2.3616;FS=5.226;InbreedingCoeff=0.0129;MLEAC=117;MLEAF=0.368;MQ=59.94;MQRankSum=0.00;QD=27.97;ReadPosRankSum=-4.400e-01;SOR=0.415 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:179,98:277:99:0|1:30051_C_T:3939,0,7404:30051 0/0:451,0:451:99:.:.:0,120,1800

BK006935.2 30091 . A G 453104 PASS AC=117;AF=0.368;AN=318;BaseQRankSum=-1.762e+00;DP=33347;ExcessHet=2.3616;FS=0.000;InbreedingCoeff=0.0129;MLEAC=117;MLEAF=0.368;MQ=59.89;MQRankSum=0.00;QD=29.07;ReadPosRankSum=0.171;SOR=0.664 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:169,110:279:99:0|1:30051_C_T:4254,0,7041:30051 0/0:451,0:451:99:.:.:0,120,1800

In the case of B, the variant was not even recorded in its gvcf file. So, from what I understand, the three sites got the AD value (451) from their hom-ref block. However, in IGV I can clearly see that the site has the two alleles in more or less balanced frequencies, just like the A sample. The only difference I can see is that the coverage of the B sample is much higher than in A.

I produced the bamout file for the offended sample (in the IGV screenshot above), and indeed the result is monomorphic, but I don't quite understand why the reads of one allele were discarded or considered uninformative in the sample B but not in A.

I wondered if the issue was the ploidy specified in the original call, since we used the default and hence obtained diploid calls. Because it's a culture, we don't know the actual ploidy of the samples. Hence, I rerun HaplotypeCaller with `-ploidy` set to 20, 50 or 100.

With these new settings, the three sites were called in the gvcf files only when `-ploidy 20`, but not with 50 or 100. In the 20n case, the site in the gvcf looks like this (for site `BK006935.2:30066`):

BK006935.2 30066 . T A 14003.2 . AC=20;AF=1;AN=20;DP=291;FS=0;MLEAC=20;MLEAF=1;MQ=60;QD=33.12;SOR=1.105 GT:AD:DP:GQ:PL 1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1:0,291:291:69:13943,4032,3100,2554,2167,1866,1621,1413,1234,1075,933,805,688,580,480,387,300,219,142,69,0

So although it's called as variable, the AD for the reference is 0. Please find attached also an IGV screen shot of the sites with the different ploidies.

The area where the SNPs are doesn't seem to be super problematic, although there are several indels around the SNPs. Here there is a zoomout of the same area, with arrow heads pointing at the focal SNPs.

Notice that other sites are affected too. For example, the big 33bp INDELs to the left of the focal 3 SNPs above is gone in 50n and 100n (but not 2n or 20n).

So it seems that the different ploidies returned different alignments in the BAM file of the B sample. Why were the focal sites present only in 20n, but not in 2n, 50n or 100n? The alignments of the reads in the bamout files are also different in between ploidies, and I am not sure how to interpret that (can you explain the ups and downs in coverage in the bamout files?). I am not ready to accept that the original BAM file is wrong (i.e. that some reads are uninformative), because the mapping is consistent with the genotype of the parents.

So in other words, I get sites called sometimes (20n), and the allele frequencies are typically 0 for one of the alleles in the gvcf output of HaplotypeCaller, even though the site looks polymorphic with the original mapping.

Admittedly, other variants in more pristine mapping areas are consistent across different ploidy levels and samples, so those look good. But certainly there are some sites that seem to be behaving weirdly. If those sites cannot be rescued because the genomic region is problematic, please advice as to how to filter such sites out.

I hope my explanation is clear enough. Thank you very much in advance for any insights.

Lorena

Genome Analysis Toolkit

Need Help?

Community Forum

Inconsistent variant call in some pool-seq samples

3 comments

Welcome

Didn't find what you were looking for?

Quick Links

Recent GATK News

About the GATK community