Inconsistent variant call in some pool-seq samples
AnsweredHello! I am aware of this previous question (https://gatk.broadinstitute.org/hc/en-us/community/posts/360056098271-Missing-variant-in-VCF-from-GenotypeGVCFs) with a similar theme, and of [this explanation of AD vs DP](https://gatk.broadinstitute.org/hc/en-us/articles/360035532252-Allele-Depth-AD-is-lower-than-expected). However, I still do not know how to deal with the situation I found.
I have 159 Illumina HiSeq samples that are almost all pool sequencing of yeast cultures, except for two homozygous samples that represent the parental genotypes of the cultures. I am interested in seeing how the allele frequencies changes in the cultures. I mapped all samples with BWA v. 0.7.17 and GATK v. 4.1.4.1 with Java v. 1.8.0_151 to obtain the vcf file.
The variant calling was originally done like this:
gatk --java-options "-Xmx20g" HaplotypeCaller -R {GENOME} -I {sample}.bam -O {sample}.g.vcf.gz -ERC GVCF
gatk GenomicsDBImport --java-options "-Xmx{JavaMem}G" --tmp-dir=tmp --genomicsdb-workspace-path {output.rawvcf} -L {input.intervals} {variantlist}
Where `{variantlist}` is the list of samples in the style `--variant {sample.g.vcf.gz}`.
gatk --java-options "-Xmx{JavaMem}G" GenotypeGVCFs -R {input.ref} -V gendb://{input.my_database} -O {output.rawvcf} -G StandardAnnotation --new-qual
While analyzing the results, I encountered some variants that look unexpectedly "homozygous" (it's poolseq, so in other words one allele is fixed), but when I go back to the original BAM file in IGV, the sites looks very much heterozygous.
In the attached screen shot of IGV, I have three relevant variants with the two parents on the top and two samples on the bottom, let's call them sample A and B.
These two samples are very similar to each other (A is one time point in the experiment and B is the next time point). In the fist sample (A) the sites are called as heterozygous, but in the second sample (B) the AD values for the alternative allele are all 0. Here are the three sites in the vcf file (only A and B shown for simplicity):
BK006935.2 30066 . T A 364362 PASS AC=117;AF=0.368;AN=318;BaseQRankSum=0.071;DP=31584;ExcessHet=2.3616;FS=6.140;InbreedingCoeff=0.0129;MLEAC=117;MLEAF=0.368;MQ=59.95;MQRankSum=0.00;QD=26.28;ReadPosRankSum=-2.440e-01;SOR=0.405 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:177,92:269:99:0|1:30051_C_T:3480,0,7395:30051 0/0:451,0:451:99:.:.:0,120,1800
BK006935.2 30071 . G A 402006 PASS AC=117;AF=0.368;AN=318;BaseQRankSum=-1.092e+00;DP=32115;ExcessHet=2.3616;FS=5.226;InbreedingCoeff=0.0129;MLEAC=117;MLEAF=0.368;MQ=59.94;MQRankSum=0.00;QD=27.97;ReadPosRankSum=-4.400e-01;SOR=0.415 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:179,98:277:99:0|1:30051_C_T:3939,0,7404:30051 0/0:451,0:451:99:.:.:0,120,1800
BK006935.2 30091 . A G 453104 PASS AC=117;AF=0.368;AN=318;BaseQRankSum=-1.762e+00;DP=33347;ExcessHet=2.3616;FS=0.000;InbreedingCoeff=0.0129;MLEAC=117;MLEAF=0.368;MQ=59.89;MQRankSum=0.00;QD=29.07;ReadPosRankSum=0.171;SOR=0.664 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:169,110:279:99:0|1:30051_C_T:4254,0,7041:30051 0/0:451,0:451:99:.:.:0,120,1800
In the case of B, the variant was not even recorded in its gvcf file. So, from what I understand, the three sites got the AD value (451) from their hom-ref block. However, in IGV I can clearly see that the site has the two alleles in more or less balanced frequencies, just like the A sample. The only difference I can see is that the coverage of the B sample is much higher than in A.
I produced the bamout file for the offended sample (in the IGV screenshot above), and indeed the result is monomorphic, but I don't quite understand why the reads of one allele were discarded or considered uninformative in the sample B but not in A.
I wondered if the issue was the ploidy specified in the original call, since we used the default and hence obtained diploid calls. Because it's a culture, we don't know the actual ploidy of the samples. Hence, I rerun HaplotypeCaller with `-ploidy` set to 20, 50 or 100.
With these new settings, the three sites were called in the gvcf files only when `-ploidy 20`, but not with 50 or 100. In the 20n case, the site in the gvcf looks like this (for site `BK006935.2:30066`):
BK006935.2 30066 . T A 14003.2 . AC=20;AF=1;AN=20;DP=291;FS=0;MLEAC=20;MLEAF=1;MQ=60;QD=33.12;SOR=1.105 GT:AD:DP:GQ:PL 1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1:0,291:291:69:13943,4032,3100,2554,2167,1866,1621,1413,1234,1075,933,805,688,580,480,387,300,219,142,69,0
So although it's called as variable, the AD for the reference is 0. Please find attached also an IGV screen shot of the sites with the different ploidies.
The area where the SNPs are doesn't seem to be super problematic, although there are several indels around the SNPs. Here there is a zoomout of the same area, with arrow heads pointing at the focal SNPs.
Notice that other sites are affected too. For example, the big 33bp INDELs to the left of the focal 3 SNPs above is gone in 50n and 100n (but not 2n or 20n).
So it seems that the different ploidies returned different alignments in the BAM file of the B sample. Why were the focal sites present only in 20n, but not in 2n, 50n or 100n? The alignments of the reads in the bamout files are also different in between ploidies, and I am not sure how to interpret that (can you explain the ups and downs in coverage in the bamout files?). I am not ready to accept that the original BAM file is wrong (i.e. that some reads are uninformative), because the mapping is consistent with the genotype of the parents.
So in other words, I get sites called sometimes (20n), and the allele frequencies are typically 0 for one of the alleles in the gvcf output of HaplotypeCaller, even though the site looks polymorphic with the original mapping.
Admittedly, other variants in more pristine mapping areas are consistent across different ploidy levels and samples, so those look good. But certainly there are some sites that seem to be behaving weirdly. If those sites cannot be rescued because the genomic region is problematic, please advice as to how to filter such sites out.
I hope my explanation is clear enough. Thank you very much in advance for any insights.
Lorena
-
Hi Lorena,
I am a bit confused regarding your post. Do you think we could start with one question and example at a time?
Generally for pool seq we recommend Mutect2 because it can handle variable ploidies much better than HaplotypeCaller. Have you tried out Mutect2?
Best,
Genevieve
-
Hi Genevieve, yes! Apologies if it's confusing. I guess the question boils down to why is it that some samples get called as variable, and some don't, even though they are so similar.
I have not tried Mutect2, I became aware of it only recently. I wanted to see if I could rescue the variants that we already have rather than spending a lot of time learning and preparing a whole new pipeline for Mutect2. Also, a few aspects of it made me feel like it was a significant amount of work to re-start the variant calling (please correct me if I am wrong!):
1) It seems to me that if I input the parents, for example, as "normal" samples, and the pool-seq samples as "tumors", I would only get the de novo mutations, to the exclusion of the sites in the parents ("germline"). If instead I choose to run everything in tumor-only mode to get all sites, then I have to run each of the 159 samples independently and then join them? or can Mutect2 do joint called now of only "tumors"?
2) My impression was also that the Mutect2 output has different metrics for filtering, so that is something I would need time to get familiar with. Is it a normal vcf format?
Thanks!
Lorena
-
Hi Lorena,
Ok, thanks for summarizing your question. Could we go over one variant site of interest and see why it was called in some samples and not in others? I would recommend first taking a look at this article: When HaplotypeCaller and Mutect2 do not call an expected variant. The article contains troubleshooting recommendations and common debug options. If the article does not help understand better, please provide details for the site of interest including the read depth support of the variant and the output VCF line if you call the variant with the --alleles option.
I am interested to see if you get better results if you use Mutect2 for pool seq, but again, its not a recommended use case for Mutect2. We don't have many resources regarding how to use it and you would have to implement your own filtering rules because FilterMutectCalls is meant for tumor analysis.
1) You would have to run everything individually, so it would be time intensive.
2) The output VCF from Mutect2 is a normal VCF with a few changes that make sense for tumor analysis.
Best,
Genevieve
Please sign in to leave a comment.
3 comments