Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Has too many alleles in the combined VCF record


1 comment

  • Louis Bergelson

    Hi rq m,

    There are good reasons to limit the number of alleles at any given site.

    1.  Any site with 50+ alleles is likely to be in a repetitive region and very hard to call accurately. There's probably little information in the extra alleles, which are all variants of AAAAAC, AAAAAAAAC, etc.

    2.  The PLs in the VCF file become intractably large, since their number grows superlinearly with the number of alleles and the ploidy. It becomes impossible to store them in memory.
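
To put rough numbers on point 2, here is a minimal sketch. It assumes the usual VCF convention of one PL value per possible unordered genotype, so a site with n alleles (reference included) at ploidy p carries C(n + p - 1, p) PLs per sample. The function name `genotype_count` is made up for illustration and is not part of GATK.

```python
from math import comb


def genotype_count(num_alleles: int, ploidy: int = 2) -> int:
    """Number of unordered genotypes, i.e. PL values per sample at one site.

    num_alleles counts the reference allele plus all alternate alleles.
    This is the count of multisets of size `ploidy` drawn from the alleles.
    """
    return comb(num_alleles + ploidy - 1, ploidy)


# Compare diploid and tetraploid PL counts as the allele count grows.
for n in (2, 6, 50, 100):
    print(n, genotype_count(n), genotype_count(n, ploidy=4))
```

For a diploid sample this goes from 3 PLs at a biallelic site to 1275 at 50 alleles, and for a tetraploid sample from 5 to 292825. That is per sample and per site, so it adds up quickly in a large cohort.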

    So I don't recommend increasing the number of alleles at a given site unless you want to spend a lot of time and compute cost on data of extremely questionable value.

    The various options are indeed confusing. They're intended to allow LOWERING the limit on alleles, not raising it. There is a hardcoded 50-allele limit in GenotypeGVCFs that takes effect no matter how high you set the other values. I'm not sure about GnarlyGenotyper, but it's very possible that it restricts the limit to a much lower value to save space and time.

    We could definitely improve the documentation and UI around these options, but in general it's recommended not to use them unless you have a very specific need.
