YES! In general, most GATK tools don't care about ploidy.
The major exception is, of course, the variant calling step in germline short variant discovery: the variant caller needs to know the expected ploidy of a given sample in order to perform the appropriate calculations. For somatic short variant discovery this is not necessary, since we have no way to estimate ploidy in that context.
Ploidy-related capabilities
The HaplotypeCaller and GenotypeGVCFs assume that a sample is diploid by default, but they are able to deal with non-diploid organisms if you tell them otherwise (whether the sample is haploid or more or less exotically polyploid). In the case of HaplotypeCaller, you need to specify the ploidy of your non-diploid sample with the -ploidy argument. HC can only deal with one ploidy at a time, so if you want to process different chromosomes with different ploidies (e.g. to call X and Y in males) you need to run them separately. On the bright side, you can combine the resulting files afterward. In particular, if you’re running the -ERC GVCF workflow, you’ll find that both CombineGVCFs and GenotypeGVCFs are able to handle mixed ploidies (between locations and between samples). Both tools are able to correctly work out the ploidy of any given sample at a given site based on the composition of the GT field, so they don’t require you to specify the -ploidy argument.
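To make the idea of reading ploidy from the GT field concrete, here is a minimal Python sketch. It is not GATK's implementation: the function name `ploidy_from_gt` is hypothetical, and the only thing it relies on is the VCF convention that alleles in GT are separated by `/` (unphased) or `|` (phased).

```python
import re


def ploidy_from_gt(gt: str) -> int:
    """Return the number of allele slots in a VCF GT string.

    Hypothetical helper for illustration only; not part of GATK.
    """
    if not gt:
        raise ValueError("empty GT field")
    # Alleles are separated by '/' (unphased) or '|' (phased),
    # so the number of separated entries is the ploidy at this site.
    return len(re.split(r"[/|]", gt))
```

Under this convention, a GT of 0/1 yields 2, a haploid call such as 1 yields 1, and 0/0/1/1 yields 4. A no-call written as ./. still has two allele slots, so it is counted as diploid.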
Cases where ploidy needs to be specified
- Native variant calling in haploid or polyploid organisms.
- Pooled calling where many pooled organisms share a single barcode and hence are treated as a single "sample".
- Pooled validation/genotyping at known sites.
For normal organism ploidy, you just set the -ploidy argument to the desired number of chromosomes per organism. In the case of pooled sequencing experiments, this argument should be set to the number of chromosomes per barcoded sample, i.e. (Ploidy per individual) * (Individuals in pool).
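The arithmetic for the pooled case can be sketched in a few lines of Python. The helper names `pooled_ploidy` and `ploidy_args` are hypothetical and only illustrate the formula above; the one GATK-specific piece is the -ploidy argument name, taken from this article.

```python
from typing import List


def pooled_ploidy(ploidy_per_individual: int, individuals_in_pool: int = 1) -> int:
    """Chromosomes per barcoded sample: ploidy per individual times pool size."""
    if ploidy_per_individual < 1 or individuals_in_pool < 1:
        raise ValueError("ploidy and pool size must both be at least 1")
    return ploidy_per_individual * individuals_in_pool


def ploidy_args(ploidy_per_individual: int, individuals_in_pool: int = 1) -> List[str]:
    """Build the -ploidy argument pair to append to a HaplotypeCaller command."""
    value = pooled_ploidy(ploidy_per_individual, individuals_in_pool)
    return ["-ploidy", str(value)]
```

For example, a pool of 10 diploid individuals sharing one barcode gives a value of 20, while a single tetraploid organism gives 4.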
Important limitations
Several variant annotations are not appropriate for use with non-diploid cases. In particular, InbreedingCoeff will not be annotated on non-diploid calls. Annotations that do work and are supported in non-diploid use cases are the following: QUAL, QD, SB, FS, AC, AF, and Genotype annotations such as PL, AD, GT, and so on.
You should also be aware of the fundamental accuracy limitations of high-ploidy calling. Calling low-frequency variants in a pool or in an organism with high ploidy is hard because these rare variants become almost indistinguishable from sequencing errors.
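A rough back-of-the-envelope sketch shows why this gets hard. It assumes reads are sampled evenly from every chromosome copy in the sample (no allele or mapping bias), and the 1% per-base error rate is an illustrative assumption, not a GATK parameter. The function names are hypothetical.

```python
def expected_alt_reads(depth: float, ploidy: int, variant_copies: int = 1) -> float:
    """Expected number of reads carrying the variant at a site.

    Assumes each chromosome copy contributes reads equally.
    """
    if ploidy < 1 or not 1 <= variant_copies <= ploidy:
        raise ValueError("need 1 <= variant_copies <= ploidy")
    return depth * variant_copies / ploidy


def expected_error_reads(depth: float, error_rate: float) -> float:
    """Expected number of reads with a sequencing error at a site."""
    return depth * error_rate


ASSUMED_ERROR_RATE = 0.01  # illustrative assumption only
DEPTH = 100

for ploidy in (2, 20, 100):
    signal = expected_alt_reads(DEPTH, ploidy)
    noise = expected_error_reads(DEPTH, ASSUMED_ERROR_RATE)
    print(f"ploidy={ploidy}: ~{signal:.1f} variant reads vs ~{noise:.1f} error reads")
```

At 100x depth, a variant present on a single chromosome copy is expected in about 50 reads for a diploid sample, about 5 reads at ploidy 20, and about 1 read at ploidy 100, which is on the same order as the roughly 1 erroneous read per site expected under the assumed error rate.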
2 comments
I had a quick question regarding the calculation of P(D|G) for polyploids. In GATK's article on how HaplotypeCaller calculates genotype likelihoods, it states that you assume the organism is diploid (two haplotypes) when calculating P(D|G) and that this feature will be generalized in future versions. Has this been resolved? If not, how might this bias the PL values?
On GitHub, you mentioned that the problem with haploid GenomicsDBImport has already been solved: https://github.com/broadinstitute/gatk/issues/3342.
Has that news just not been updated here?