Has too many alleles in the combined VCF record
I used GATK 4.3.0.0 to call SNPs and INDELs according to the following process:
HaplotypeCaller + GenomicsDBImport + GenotypeGVCFs
gatk GenotypeGVCFs \
--reference $ref_path \
--include-non-variant-sites \
--variant gendb://$variant_dir \
--output $out_dir/1_WGS.vcf.gz
When I run GenotypeGVCFs, I get the following warnings:
Chromosome NC_001133.9 position 31504 (TileDB column 31503) has too many alleles in the combined VCF record : 86 : current limit : 50. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.
22:47:54.846 WARN MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location NC_001133.9:31504
Chromosome NC_001133.9 position 206811 (TileDB column 206810) has too many alleles in the combined VCF record : 65 : current limit : 50. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.
22:58:56.038 WARN MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location NC_001133.9:206811
Chromosome NC_001134.8 position 35603 (TileDB column 265820) has too many alleles in the combined VCF record : 54 : current limit : 50. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.
Sample/Callset 112y4A( TileDB row idx 0) at Chromosome NC_001134.8 position 35605 (TileDB column 265822) has too many genotypes in the combined VCF record : 1176 : current limit : 1024 (num_alleles, ploidy) = (48, 2). Fields, such as PL, with length equal to the number of genotypes will NOT be added for this sample for this location.
............
12:15:49.745 WARN MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location NC_001224.1:1905
............
I then tried GnarlyGenotyper instead of GenotypeGVCFs:
gatk GnarlyGenotyper \
--reference $ref_path \
--keep-all-sites \
--variant gendb://$variant_dir \
--output $out_dir/1_WGS_2.vcf.gz
This time I received the following warning:
Chromosome NC_001133.9 position 137 (TileDB column 136) has too many alleles in the combined VCF record : 8 : current limit : 7. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.
...........
What should I do to include all the alleles?
What are the differences among genomicsdb-max-alternate-alleles, max-alternate-alleles and max-genotype-count in GenotypeGVCFs and GnarlyGenotyper?
I set the parameters as follows:
gatk GenotypeGVCFs \
--reference $ref_path \
--include-non-variant-sites \
--variant gendb://$variant_dir \
--genomicsdb-max-alternate-alleles 151 \
--max-alternate-alleles 150 \
--max-genotype-count 4096 \
--output $out_dir/1_test_alleles_100.vcf
Although I increased the values of these parameters, I still received the following warnings:
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),78.93753611099947,Cpu time(s),78.13693609399903
16:07:44.365 WARN MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location NC_001133.9:31504
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),15.848507972000096,Cpu time(s),15.796389876000026
16:08:04.902 WARN MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location NC_001133.9:206811
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),70.24454592800014,Cpu time(s),69.78471311800051
16:09:30.104 WARN MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location NC_001134.8:35603
-
Hi rq m,
There are good reasons to limit the number of alleles at any given site.
1. Any site with 50+ alleles is likely to be in a repetitive region, where accurate calls are very hard to make. There is probably little information in the extra alleles, which tend to all be variants of AAAAAC, AAAAAAAAC, etc.
2. The PLs in the VCF file become intractably large, since their number grows superlinearly with the number of alleles and the ploidy. It becomes impossible to store them in memory.
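To give a feel for that growth, here is a rough sketch (not GATK code) that computes the number of genotypes, and so the number of PL values per sample, as C(alleles + ploidy - 1, ploidy). The function name `genotype_count` is just for illustration. The first call uses the (num_alleles, ploidy) = (48, 2) pair from your log and reproduces the 1176 reported there. The second call treats 86 as the total allele count at the site, which is an assumption about how that warning counts alleles.

```shell
#!/usr/bin/env bash
# Number of unordered genotypes for a given allele count and ploidy:
# C(alleles + ploidy - 1, ploidy). One PL value is stored per genotype.
# genotype_count is an illustrative helper, not part of GATK.
genotype_count() {
    local alleles=$1 ploidy=$2
    local n=$((alleles + ploidy - 1))
    local result=1 i
    for ((i = 1; i <= ploidy; i++)); do
        # Multiply before dividing; each partial product is divisible by i.
        result=$((result * (n - ploidy + i) / i))
    done
    echo "$result"
}

genotype_count 48 2   # prints 1176, matching the warning in the log
genotype_count 86 2   # prints 3741 (assumes 86 is the total allele count)
```

Every sample carries that many PL values at the site, and the count rises much faster at higher ploidy.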
So I don't recommend increasing the number of alleles at a given site unless you want to spend a lot of time and compute cost on data of extremely questionable value.
The various options are indeed confusing. They're intended to allow LOWERING the limit on alleles, not raising it. There is a hardcoded 50-allele limit in GenotypeGVCFs that takes effect no matter how high you set the other values. I'm not sure about GnarlyGenotyper, but it's very possible that it restricts the limit to a much lower value to save space and time.
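If you want to see how many sites are affected by that limit, one option is to pull the skipped locations out of the warnings. This is a minimal sketch that assumes you saved the GenotypeGVCFs log output to a file. The helper name `skipped_sites` and the file name in the usage note are hypothetical. It relies only on the "Site will be skipped at location" text shown in your log.

```shell
#!/usr/bin/env bash
# Print the unique contig:position values from the
# "Site will be skipped at location ..." warnings in a saved log.
# skipped_sites is an illustrative helper, not part of GATK.
skipped_sites() {
    grep -o 'Site will be skipped at location [^[:space:]]*' "$1" \
        | awk '{print $NF}' \
        | sort -u
}
```

For example, `skipped_sites genotypegvcfs.log | wc -l` would count the distinct skipped sites, where `genotypegvcfs.log` stands in for wherever you redirected the tool's output.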
We could definitely improve the documentation / UI around these options, but in general we recommend not using them unless you have a very specific need.